
CN116301007A - A multi-quadrotor UAV assembly task path planning method based on reinforcement learning - Google Patents

A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Info

Publication number
CN116301007A
CN116301007A (application CN202310454330.7A)
Authority
CN
China
Prior art keywords
unmanned aerial vehicle
rotor
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310454330.7A
Other languages
Chinese (zh)
Inventor
罗俊海
严泽成
田雨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority claimed from CN202310454330.7A
Publication of CN116301007A
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08 Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808 Control of attitude specially adapted for aircraft
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a reinforcement-learning-based assembly task path planning method for multiple quadrotor unmanned aerial vehicles (UAVs). The method first constructs a PyBullet-based multi-quadrotor Gym environment and sets a reward function mechanism by abstracting the state space and action space of the quadrotor UAVs; it then makes path planning decisions with an improved deep reinforcement learning algorithm, and finally trains the improved deep reinforcement learning network and controls the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time. The method introduces N-step returns into the TD3 algorithm to obtain more accurate return estimates, faster learning, and better generalization; it uses prioritized experience replay to reduce sampling bias, lessening the model bias caused by unbalanced sampling and improving the stability of the algorithm. This makes the TD3 algorithm better suited to continuous multidimensional decision-making, so that an optimal route can be planned in the shortest time while meeting specified real-time and accuracy requirements.

Description

Assembly task path planning method for multiple quadrotor unmanned aerial vehicles based on reinforcement learning
Technical Field
The invention belongs to the technical field of path planning for multiple quadrotor unmanned aerial vehicles, and particularly relates to a reinforcement-learning-based assembly task path planning method for multiple quadrotor UAVs.
Background
A quadrotor unmanned aerial vehicle balances its weight with the lift generated by several rotors, can hover and take off and land vertically, has low requirements on the take-off site, and flies at a relatively low speed. Multi-rotor UAVs are therefore suitable for complex, small-range application scenarios such as aerial photography, surveillance, and building modeling. With the continuous development of UAV technology, UAVs have been widely applied in the civil field, and the complexity of the tasks they perform keeps increasing. Because a single UAV has limited payload and flight capability, cooperation among multiple UAVs is required to improve task performance and range.
Since most UAV tasks involve shortest path planning, the shortest path planning problem has been the focus and difficulty of UAV path planning research in recent years. Within this problem, tasks can be further divided into assembly tasks and distributed tasks according to their characteristics. An assembly task aims to plan the optimal path for each UAV from its own origin to the same destination point. The goal of such tasks is typically to have all UAVs reach the target point simultaneously and complete the task as soon as possible; in this case, the objective is generally to minimize the total task time or total path length. The assembly task is more general than the distributed task.
Compared with existing rule-based or heuristic-search algorithms, the reinforcement-learning-based path planning method has better adaptability and extensibility. Existing methods require rules to be designed and adjusted manually for each environment, whereas a reinforcement learning agent adapts to the environment through autonomous learning. Because the agent in reinforcement learning has autonomous decision-making capability, it can learn optimal behavior by interacting with the environment. Moreover, deep learning has strong perceptual capability, so deep reinforcement learning, which combines the two, can process higher-dimensional inputs and is better suited to the multi-UAV problem considered here. Deep reinforcement learning algorithms cope better with unknowns and change than existing methods, and the agent can perform continuous decision tasks in a complex environment.
Current solutions for multi-UAV assembly tasks still face many challenges, including environment modeling, low learning efficiency, and complex action and state spaces. First, for agents based on deep reinforcement learning, constructing the simulation environment is the foundation of the whole experiment, and the design of a UAV system must rely on simulation tools; establishing an appropriate UAV simulator is therefore critical for academic research and safety-critical applications. However, many current environments for deep-reinforcement-learning simulation experiments lack real-world portability, since many sacrifice realism to achieve high sample throughput. In addition, the training efficiency of deep reinforcement learning for multi-UAV path planning is generally low: in most simulation environments the rewards for path planning tasks are sparse, agents only receive reward signals after a task ends, and efficient exploration in complex environments is difficult, so training struggles to get started in its early stages. Finally, multi-UAV path planning problems usually involve multiple agents and multiple obstacles, so their state spaces, action spaces, and reward functions tend to be high-dimensional and complex, which increases the difficulty of modeling and solving them. Because the action space of multi-UAV path planning is large, an effective search strategy is needed to handle the high-dimensional action space. In summary, assembly path planning is of great significance for multi-UAV task execution.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a reinforcement-learning-based assembly task path planning method for multiple quadrotor unmanned aerial vehicles, aimed at solving assembly tasks in multi-UAV path planning.
The technical scheme of the invention is as follows: a reinforcement-learning-based assembly task path planning method for multiple quadrotor UAVs, comprising the following specific steps:
s1, constructing a PyBullet-based multi-quad-rotor unmanned aerial vehicle Gym environment;
s2, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle, and setting a reward function mechanism to enable the unmanned aerial vehicle to interact with the environment;
s3, performing path planning decisions using an improved deep reinforcement learning algorithm, planning a path for each quadrotor UAV under the assembly task;
s4, training the improved deep reinforcement learning network, and controlling the angular velocity and linear velocity of the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target point in the shortest time.
Further, in the step S1, the specific steps are as follows:
s11, constructing a dynamics simulation model of the multiple quadrotor UAVs;
The quadrotor UAV dynamics equations are formed from the equations of motion and the aerodynamic effects of the quadrotor UAV, completing the construction of the quadrotor dynamics simulation model, as follows:
PyBullet is used to build force and torque models for each quadrotor UAV in Gym, and the dynamics equations of all quadrotor UAVs are calculated and updated using the physics engine.
The arm length of each quadrotor UAV is set to L, the mass to m, and the inertial properties to J; the physical constants and convex collision shapes are described by separate URDF files, using the "x"-type quadrotor configuration.
First, the gravitational acceleration g and the physics step frequency are set in PyBullet. The force F_i applied by each of the 4 motors and the torque T_i about the drone's Z-axis are proportional to the square of the motor speed P_i:
F_i = k_F · P_i^2 (1)
T_i = k_T · P_i^2 (2)
where k_F and k_T are predetermined constants.
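As a rough illustration of equations (1)-(2), the per-motor thrust and torque can be computed directly from the motor speeds; the coefficient values below are illustrative placeholders, not constants from the patent:

```python
# Per-motor thrust and Z-axis torque of a quadrotor, eqs. (1)-(2):
#   F_i = k_F * P_i^2,  T_i = k_T * P_i^2
# K_F and K_T are illustrative placeholder constants, not patent values.
K_F = 3.16e-10  # thrust coefficient
K_T = 7.94e-12  # torque coefficient

def motor_forces_and_torques(rpms):
    """Map the four motor speeds P_i to thrusts F_i and yaw torques T_i."""
    forces = [K_F * p ** 2 for p in rpms]
    torques = [K_T * p ** 2 for p in rpms]
    return forces, torques

forces, torques = motor_forces_and_torques([10000.0, 10000.0, 10000.0, 10000.0])
```

In a PyBullet-style simulator these per-motor quantities would then be applied to the vehicle body at every physics step.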
Setting the model to real-time control, the dynamics equation of the quadrotor UAV is expressed as follows:
J^T · T_o = M·a + h (3)
where J is the Jacobian matrix, M the inertia matrix, a the generalized acceleration, h the Coriolis and gravity effects, and the superscript T denotes the transpose operation.
In practice, flying near the ground or near other UAVs creates additional aerodynamic effects, which are modeled separately and used in combination in PyBullet. They include: the propeller drag D, the ground effect G_i acting on each motor, and the downwash effect W acting on the centroid.
The rotating propellers of the quadrotor UAV produce a drag force D that is proportional to the UAV's linear velocity Xdot, the propellers' angular velocity, and a constant drag coefficient matrix k_D:
D = -k_D · (2π Σ_i P_i / 60) · Xdot (4)
where 2π P_i / 60 converts the motor speed P_i from revolutions per minute into the propeller's angular velocity. The constant drag coefficient matrix k_D has the specific form:
k_D = diag(k_⊥, k_⊥, k_∥) (5)
where k_⊥ is the perpendicular drag coefficient, k_∥ the parallel drag coefficient, and the matrix k_D is fitted to data using least squares.
When hovering at very low altitude there is a ground effect; its contribution G_i to each motor is proportional to the propeller radius r_P, the motor speed P_i, the altitude h_i, and a constant k_G:
G_i = k_G · (r_P / (4·h_i))^2 · P_i^2 (6)
when two quadrotor robots pass through a path at the same position at different heights, there is a down-wash effect, the effect of which is reduced to a single force applied to the centre of mass of the unmanned aerial vehicle, the magnitude W of which depends on the distance (delta) between the two robots in the coordinate system x, y, z xyz ) And a constant k determined experimentally D1 ,k D2 ,k D3 The expression of W is as follows:
Figure BDA0004198555500000034
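As a hedged sketch of the three aerodynamic effects of equations (4), (6), and (7), the following functions evaluate them for given motor speeds and geometry; every coefficient value here is an illustrative placeholder, not a constant from the patent:

```python
import math

# Illustrative aerodynamic-effect models for eqs. (4), (6), (7); all
# coefficient values are placeholders, not constants from the patent.
K_D = (9.17e-7, 9.17e-7, 10.3e-7)        # diag(k_perp, k_perp, k_par), eq. (5)
K_G = 11.4                               # ground-effect constant
R_P = 0.023                              # propeller radius [m]
K_D1, K_D2, K_D3 = 2267.2, 0.16, -0.11   # downwash constants

def drag(rpms, lin_vel):
    """Eq. (4): D = -k_D * (2*pi*sum(P_i)/60) * Xdot (rpm -> rad/s)."""
    omega = 2.0 * math.pi * sum(rpms) / 60.0
    return tuple(-k * omega * v for k, v in zip(K_D, lin_vel))

def ground_effect(rpm, height):
    """Eq. (6): G_i = k_G * (r_P / (4*h_i))**2 * P_i**2."""
    return K_G * (R_P / (4.0 * height)) ** 2 * rpm ** 2

def downwash(dx, dy, dz):
    """Eq. (7): single force on the lower UAV's centroid (dz > 0)."""
    alpha = K_D1 * (R_P / (4.0 * dz)) ** 2
    beta = K_D2 * dz + K_D3
    return -alpha * math.exp(-0.5 * ((dx * dx + dy * dy) ** 0.5 / beta) ** 2)
```

Note the expected behavior: drag opposes the velocity, the ground effect grows as altitude shrinks, and the downwash force on the lower vehicle is negative (downward).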
s12, constructing the observation space and action space of the multiple quadrotor UAVs;
in the constructed Gym environment, each quadrotor UAV outputs an observation vector after executing each action. The observation space of the quadrotor UAVs is expressed as follows:
{n: [X_n, q_n, r_n, p_n, y_n, Xdot_n, ω_n, P_n]} (8)
where n ∈ [0..N) is the index of the quadrotor UAV; X_n = [x, y, z]_n is its position; q_n is the quaternion used for quadrotor attitude control; r_n, p_n, y_n are the Roll, Pitch, and Yaw angles, i.e. the three angles used for attitude estimation; Xdot_n = [v_x, v_y, v_z]_n is the linear velocity of the n-th quadrotor UAV; ω_n = [ω_x, ω_y, ω_z]_n is its angular velocity; and P_n = [P_0, P_1, P_2, P_3]_n holds its motor speeds.
In the invention, the quadrotor UAVs use lidar to detect obstacles: k lidar rays are set in the model, and the environment is observed using these k rays.
The k rays span a scanning angle range of π, with an angle of 2π/k between two rays; (d_1, ..., d_k) are the ray lengths of the k rays on the horizontal plane. The ray length d_i of the i-th ray is expressed as follows:
d_i = { distance to the detected point, if ray i detects an object; d_max, otherwise } (9)
Then the environment information s_E is defined as follows:
s_E = [ρ_i, d_i]^T, i = 1...k (10)
where ρ_i is the one-hot code of the i-th radar, indicating whether that ray has detected anything:
ρ_i = { 1, if ray i detects an object; 0, otherwise } (11)
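A minimal sketch of the lidar observation of equations (9)-(11) follows; the maximum detectable distance d_max and the hit-indicator reading of ρ_i are assumptions made for illustration:

```python
# Sketch of the lidar environment observation s_E of eqs. (9)-(11).
# D_MAX and the hit-indicator interpretation of rho_i are assumptions.
K_RAYS = 24   # number of lidar rays k (value used in the embodiment)
D_MAX = 5.0   # maximum detectable distance (placeholder)

def ray_length(hit_distance, d_max=D_MAX):
    """Eq. (9): d_i is the hit distance, or d_max when nothing is detected."""
    return d_max if hit_distance is None else min(hit_distance, d_max)

def hit_indicator(hit_distance):
    """Eq. (11): rho_i = 1 if ray i detects an object, else 0."""
    return 0.0 if hit_distance is None else 1.0

def environment_state(hits):
    """Eq. (10): s_E stacks [rho_i, d_i] for every ray i = 1..k."""
    s_e = []
    for h in hits:  # one entry per ray: None, or the hit distance
        s_e.extend([hit_indicator(h), ray_length(h)])
    return s_e

s_E = environment_state([None] * K_RAYS)  # no obstacles detected anywhere
```

With this layout s_E has 2k entries, one (ρ_i, d_i) pair per ray.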
then for any one quadrotor UAV, the action space expression is as follows:
{n: [v_x, v_y, v_z, v_M]_n} (12)
where [v_x, v_y, v_z, v_M]_n is the velocity command input to the quadrotor UAV: v_x, v_y, v_z are the components of a unit direction vector and v_M is the magnitude of the desired velocity. The action space can also be represented by the rotation speeds of the four motors:
{n: [P_0, P_1, P_2, P_3]_n} (13)
Finally, the conversion of the input into pulse width modulation (PWM) signals and motor speeds is delegated to a controller consisting of position and attitude control subroutines.
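To make the action interpretation of equation (12) concrete, the sketch below turns an action [v_x, v_y, v_z, v_M] into a desired velocity vector; the normalization step is an assumption added for robustness to non-unit direction inputs:

```python
# Sketch of the action interpretation of eq. (12): the first three entries
# form a (unit) direction vector and v_M scales it to the desired speed.
# Explicit normalization is an assumption, for robustness to non-unit input.
def desired_velocity(action):
    vx, vy, vz, v_m = action
    norm = (vx * vx + vy * vy + vz * vz) ** 0.5 or 1.0  # guard zero vector
    return (v_m * vx / norm, v_m * vy / norm, v_m * vz / norm)
```

The resulting velocity target would then be handed to the position/attitude controller mentioned above, which produces the PWM and motor-speed commands.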
Further, the step S2 specifically includes the following steps:
s21, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle;
the states of the quad-rotor unmanned helicopter include: position and quaternion q of four-rotor unmanned aerial vehicle n Roll angle roller n Pitch angle pitch n Yaw angle Yawy n Linear velocity
Figure BDA0004198555500000056
Angular velocity omega n Motor speed P of all unmanned aerial vehicles n =[P 0 ,P 1 ,P 2 ,P 3 ]Angle beta between first viewing angle direction of unmanned plane and target link n And the global coordinates (x, y, z) of the nth unmanned aerial vehicle and the global coordinates (x) of the target t ,y t ,z t ) Difference d between 0n
Replacing the global position of the unmanned aerial vehicle with the relative position DeltaX of the unmanned aerial vehicle and the target in the state n I.e. [ Deltax, deltay, deltaz ]] n Then the status s of the unmanned aerial vehicle U The following are provided:
Figure BDA0004198555500000051
from the UAV state s_U and the environment state s_E detected by the lidar, the state space s of the quadrotor UAV is obtained, with the following expression:
s = [s_U, s_E]^T (15)
the action space of the quadrotor UAV environment consists of the velocities input to the quadrotor UAVs; referring to formula (12), the action space expression for any one quadrotor UAV is as follows:
a = [v_x, v_y, v_z, v_M]^T (16)
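The state assembly of equations (14)-(15) can be sketched as follows; the exact field ordering is an assumption consistent with the text, and all inputs are illustrative:

```python
# Minimal sketch of eqs. (14)-(15): the UAV state s_U uses target-relative
# position, and the full state s concatenates s_U with the lidar state s_E.
# The field layout is an assumption consistent with the description.
def uav_state(pos, target, quat, rpy, lin_vel, ang_vel, rpms, beta):
    delta = [t - p for p, t in zip(pos, target)]   # DeltaX_n = [dx, dy, dz]
    d0 = sum(d * d for d in delta) ** 0.5          # distance d_0n to target
    return delta + list(quat) + list(rpy) + list(lin_vel) \
        + list(ang_vel) + list(rpms) + [beta, d0]

def full_state(s_u, s_e):
    """Eq. (15): s = [s_U, s_E]^T."""
    return list(s_u) + list(s_e)
```

With 3 relative-position, 4 quaternion, 3 attitude, 3 linear-velocity, 3 angular-velocity, and 4 motor-speed entries plus β_n and d_0n, s_U has 22 entries here.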
S22, setting a reward function mechanism to enable the quadrotor unmanned aerial vehicle to interact with the environment;
the reward function R (s, a) represents the environmental feedback resulting from taking action a in state s; setting a reward function consisting of three parts to enable the quadrotor unmanned aerial vehicle to reach an aggregation task target point as soon as possible, wherein the reward function is specifically as follows:
First, a distance reward R_t between the quadrotor UAV and the target point forces the quadrotor UAV toward the target; R_t is expressed as follows:
R_t = d_0^n - d_0^{n'} (17)
where d_0 denotes the distance of a quadrotor UAV from the target, d_0^n the distance of the n-th quadrotor UAV from the target, and d_0^{n'} the distance of the n-th quadrotor UAV from the target at the next time step.
Second, a distance reward R_o between the quadrotor UAV and obstacles is set to keep the UAV away from obstacles; R_o is expressed as follows:
R_o = { -(d_safe - min_i d_i), if min_i d_i < d_safe; 0, otherwise } (18)
where d_i is the ray length of the i-th radar, i.e. the detected distance from the quadrotor UAV to an obstacle or to another quadrotor UAV, and d_safe is the prescribed safety distance between the UAV and obstacles.
Finally, an angle reward R_a between the quadrotor UAV and the target point is set to drive the UAV toward the target direction; the larger β_n is, the larger the penalty. R_a is expressed as follows:
R_a = -β_n (19)
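A hedged sketch of the three-part reward described above follows; the piecewise obstacle penalty, the unscaled angle penalty, and the safety-distance value are assumptions made for illustration:

```python
# Hedged sketch of the three-part reward of eqs. (17)-(19); the piecewise
# obstacle penalty and the safety distance value are assumptions.
D_SAFE = 0.3  # safety distance d_safe (placeholder)

def distance_reward(d_prev, d_next):
    """Eq. (17): positive when the UAV moved closer to the target."""
    return d_prev - d_next

def obstacle_reward(ray_lengths, d_safe=D_SAFE):
    """Eq. (18): penalize any lidar ray shorter than the safety distance."""
    closest = min(ray_lengths)
    return -(d_safe - closest) if closest < d_safe else 0.0

def angle_reward(beta):
    """Eq. (19): the larger the heading error beta_n, the larger the penalty."""
    return -beta

def reward(d_prev, d_next, ray_lengths, beta):
    """Total reward R = R_t + R_o + R_a."""
    return distance_reward(d_prev, d_next) + obstacle_reward(ray_lengths) \
        + angle_reward(beta)
```

Dense shaping terms of this form are one common way to mitigate the sparse-reward problem discussed in the Background section.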
further, the step S3 specifically includes the following steps:
the ITD3 algorithm is obtained by N-step report and priority experience playback improvement TD3 algorithm, the ITD3 algorithm is composed of four sub-networks, namely two critics networks and two actors networks, and the improved deep reinforcement learning algorithm is realized by the ITD3 algorithm.
Firstly, introducing N-step returns into a TD3 algorithm, wherein the N-step returns add the returns of N time steps in the future to provide more comprehensive information than single-step returns;
in the case of sparse rewards, most state transitions p (s' |s, a) have no rewards information, and the one-step rewards will not be valid; n-step rewards alleviate the problem of rewarding sparseness by sampling N transfers.
The critic-network equation of the TD3 algorithm is modified by adding the N-step return; in the j-th round of sampling, the modified temporal-difference error δ_j is as follows:
δ_j = Σ_{k=0}^{N-1} γ^k · r_k + γ^N · Q'(s_N, a_N; φ') - Q(s, a; φ) (20)
where φ and φ' are the parameters of the critic networks and their targets, k indexes the k-th step of the return, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the target state and action after N steps, Q(s, a; φ) is the value function of the critic network, Q'(s_N, a_N; φ') that of the target critic network, and γ is the discount factor.
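Equation (20) can be sketched directly; here the critic and target-critic values are stand-in scalars rather than network outputs, and the discount factor is an illustrative value:

```python
# Sketch of the N-step temporal-difference error of eq. (20). q_sa and
# q_target_sN_aN stand in for Q(s, a; phi) and Q'(s_N, a_N; phi').
GAMMA = 0.99  # discount factor (illustrative value)

def n_step_td_error(rewards, q_sa, q_target_sN_aN, gamma=GAMMA):
    """delta = sum_{k=0}^{N-1} gamma^k r_k + gamma^N Q'(s_N, a_N) - Q(s, a)."""
    n = len(rewards)  # N, the number of lookahead steps
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** n) * q_target_sN_aN - q_sa
```

With N = 1 this reduces to the standard one-step TD error, which is the sense in which the N-step return generalizes the original TD3 target.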
Second, prioritized experience replay is used in the original TD3 algorithm. At sampling time, the sampling probability of the j-th transition is defined as P(j), expressed as follows:
P(j) = p_j^α / Σ_i p_i^α (21)
where p_j is the priority of the j-th experience and α is a constant that adjusts the sampling weight: it determines how much prioritization is used, and when α equals 0, uniform random sampling is employed.
Then, the sampling weight w_j of each transition used to update the network is calculated by the following formula, which represents the importance of each transition; M is the size of the minibatch, β is the importance-sampling exponent, and max_i w_i is the maximum sampling weight, used for normalization:
w_j = (M · P(j))^{-β} / max_i w_i (22)
finally, using proportion prioritization, updating the transferred priority according to the time difference error, wherein the priority is shown in the following formula:
p j =|δ j |+∈ (23)
wherein delta j Representing a time difference error, e represents a small value preset to avoid a 0 priority.
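The prioritized-replay bookkeeping of equations (21)-(23) can be sketched as plain functions over a list of priorities; the α, β, and ε values below are illustrative, not taken from the patent:

```python
# Minimal prioritized-experience-replay sketch of eqs. (21)-(23);
# ALPHA, BETA and EPS are illustrative values, not from the patent.
ALPHA, BETA, EPS = 0.6, 0.4, 1e-6

def sampling_probabilities(priorities, alpha=ALPHA):
    """Eq. (21): P(j) = p_j^alpha / sum_i p_i^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta=BETA):
    """Eq. (22): w_j = (M * P(j))^-beta, normalized by max_i w_i."""
    m = len(probs)
    raw = [(m * p) ** -beta for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]

def updated_priority(td_error, eps=EPS):
    """Eq. (23): p_j = |delta_j| + eps."""
    return abs(td_error) + eps
```

Setting alpha=0 recovers uniform sampling, matching the remark above, and the rarest (lowest-probability) transition receives the largest importance weight.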
Further, the step S4 specifically includes the following steps:
the ITD3 network training network is implemented by a two-part neural network: an actor network consisting of three fully connected layers performs mapping of states to actions, and a reviewer network that uses four fully connected layers to estimate Q-values.
In the ITD3 network, for both actor networks, the input is state and the output is action. The reviewer network takes as its input the state-action pairs and generates a state-action value function (Q-value). The ITD3 algorithm training process is specifically as follows:
First a small set of samples (s, a, s ', r) is preferentially extracted from the experience playback buffer, s ' is input into the actor's target network. Then, a ' is obtained next time, and the state-action pair (s ', a ') is input to the critique target network.
After obtaining the two target Q-values Q'_1(s', a') and Q'_2(s', a'), the smaller one is selected to compute the target value function y(r, s'), whose expression is as follows:
y(r, s') = r + γ · min_{i=1,2} Q'_i(s', a'; φ'_i) (24)
where r is the reward, the discount factor γ takes the same value as in formula (20), and φ'_i are the parameters of the target critic networks.
On the other hand, (s, a) is input into the two critic networks to obtain two Q-values, Q_1(·) and Q_2(·). These are used to compute the mean-squared error against y(r, s'), and the sum of the two mean-squared errors is back-propagated to update the parameters of both critic networks, with the N-step return incorporated into the temporal-difference-error update.
Next, the Q-value obtained from the first critic network is input into the actor network, and the actor network's parameters are updated in the direction in which the Q-value increases (once every two iterations).
Finally, all target networks are updated by the soft-update method.
After training, the angular velocity and linear velocity of the quadrotor UAVs are controlled through the output action information, so that each quadrotor UAV reaches the specified target point in the shortest time, completing the assembly task path planning.
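The core quantities of one ITD3 update step described above can be condensed into a few functions; the critics are stand-in scalars and the soft-update rate τ is an illustrative value:

```python
# Condensed sketch of one ITD3 update: the target of eq. (24), the summed
# critic loss, and the soft target update. TAU is an illustrative value.
GAMMA, TAU = 0.99, 0.005

def target_value(r, q1_next, q2_next, gamma=GAMMA):
    """Eq. (24): y = r + gamma * min(Q1'(s', a'), Q2'(s', a'))."""
    return r + gamma * min(q1_next, q2_next)

def critic_loss(q1, q2, y):
    """Sum of the two critics' squared errors against the target y."""
    return (q1 - y) ** 2 + (q2 - y) ** 2

def soft_update(target_params, params, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```

Taking the minimum of the two target critics is the clipped double-Q trick that TD3, and hence ITD3, uses to curb overestimation of the value function.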
The beneficial effects of the invention are as follows: the method first constructs a PyBullet-based multi-quadrotor Gym environment and sets a reward function mechanism by abstracting the state space and action space of the quadrotor UAVs; it then makes path planning decisions with an improved deep reinforcement learning algorithm, and finally trains the improved deep reinforcement learning network and controls the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time. The method introduces N-step returns into the TD3 algorithm to obtain more accurate return estimates, faster learning, and better generalization; it uses prioritized experience replay to reduce sampling bias, lessening the model bias caused by unbalanced sampling and improving the stability of the algorithm. This makes the TD3 algorithm better suited to continuous multidimensional decision-making, so that an optimal route can be planned in the shortest time while meeting specified real-time and accuracy requirements.
Drawings
Fig. 1 is a flow chart of the reinforcement-learning-based multi-quadrotor UAV assembly task path planning method.
Fig. 2 is a model view of an "x"-type quadrotor UAV in an embodiment of the invention.
Fig. 3 is a schematic diagram of laser radar detecting environmental information in a horizontal direction in an embodiment of the present invention.
Fig. 4 is a state diagram of a quad-rotor unmanned helicopter in an embodiment of the invention.
Fig. 5 is a schematic diagram of an ITD3 algorithm in an embodiment of the invention.
Fig. 6 is a diagram of a neural network in ITD3 according to an embodiment of the present invention.
Detailed Description
The process according to the invention is further described below with reference to the figures and examples.
As shown in fig. 1, the invention is a reinforcement-learning-based assembly task path planning method for multiple quadrotor UAVs, with the following specific steps:
s1, constructing a PyBullet-based multi-quad-rotor unmanned aerial vehicle Gym environment;
s2, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle, and setting a reward function mechanism to enable the unmanned aerial vehicle to interact with the environment;
s3, performing path planning decisions using an improved deep reinforcement learning algorithm, planning a path for each quadrotor UAV under the assembly task;
s4, training the improved deep reinforcement learning network, and controlling the angular velocity and linear velocity of the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target point in the shortest time.
In this embodiment, the step S1 is specifically as follows:
s11, constructing a dynamics simulation model of the multi-quad-rotor unmanned helicopter;
PyBullet is used to build a force and torque model for each quadrotor UAV in Gym, and the dynamics equations of all drones are calculated and updated using the physics engine.
As shown in fig. 2, this embodiment constructs a simplified dynamics model of the "x"-type quadrotor UAV: the arm length of each UAV is set to L, the mass to m, and the inertial properties to J; the physical constants and convex collision shapes are described by separate URDF files, using the "x"-type quadrotor configuration.
First, the gravitational acceleration g and the physics step frequency (finer than the control frequency of the Gym steps) are set in PyBullet; in addition to physical properties and constants, the URDF information can be used in PyBullet to load CAD models of the quadrotors. The force F_i applied by each of the 4 motors and the torque T_i about the drone's Z-axis are proportional to the square of the motor speed P_i:
F_i = k_F · P_i^2 (1)
T_i = k_T · P_i^2 (2)
where k_F and k_T are predetermined constants.
F_i and P_i depend linearly on the input pulse width modulation (Pulse Width Modulation, PWM). Setting the model to real-time control, the dynamics equation of the quadrotor UAV is expressed as follows:
J^T · T_o = M·a + h (3)
where J is the Jacobian matrix, M the inertia matrix, a the generalized acceleration, h the Coriolis and gravity effects, and the superscript T denotes the transpose operation.
In practice, flying near the ground or near other UAVs may create additional aerodynamic effects, such as the forces G_{i=0,1,2,3}, D, and W in FIG. 2. They can be modeled separately and used in combination in PyBullet, and include: the propeller drag D, the ground effect G_i acting on each motor, and the downwash effect W acting on the centroid.
The rotating propellers of the quadrotor UAV create a drag force D that acts opposite to the direction of motion. D is proportional to the UAV's linear velocity Xdot, the propellers' angular velocity, and the drag coefficient matrix k_D:
D = -k_D · (2π Σ_i P_i / 60) · Xdot (4)
where 2π P_i / 60 converts the motor speed P_i from revolutions per minute into the propeller's angular velocity. To simulate cross-coupling, a matrix k_D containing 9 coefficients would have to be fitted, but the fit requires some symmetry: the drag coefficients and the cross-coupling between the x and y axes should be the same; by symmetry, wind speed in the z direction produces the same force in the x and y directions; and the drag in the z direction caused by speed in the x direction should equal that caused by speed in the y direction. Thus, the drag coefficient matrix k_D has the specific form:
k_D = diag(k_⊥, k_⊥, k_∥) (5)
where k_⊥ is the perpendicular drag coefficient, k_∥ the parallel drag coefficient, and the matrix k_D is fitted to data using least squares.
When hovering at very low altitude there is a ground effect: the thrust experienced by the quadrotor UAV increases due to the interaction of the propeller airflow with the ground. Its contribution G_i to each motor is proportional to the propeller radius r_P, the motor speed P_i, the altitude h_i, and a constant k_G:
G_i = k_G · (r_P / (4·h_i))^2 · P_i^2 (6)
When two quadrotors pass through the same horizontal position at different heights, a downwash effect exists: the downwash reduces the lift of the lower aircraft. The influence of the downwash is simplified to a single force applied at the centre of mass of the lower drone, whose magnitude W depends on the displacement (δ_x, δ_y, δ_z) between the two drones in the x, y, z coordinate system and on experimentally determined constants k_D1, k_D2, k_D3. The expression of W is as follows:

W = k_D1 · (r_P/(4·δ_z))² · exp(−(1/2)·(δ_xy/(k_D2·δ_z + k_D3))²) (7)

where δ_xy = √(δ_x² + δ_y²) is the horizontal distance between the two drones.
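The downwash model just described (a single force on the lower drone's centre of mass, largest directly underneath and decaying with horizontal offset) can be sketched as follows. The closed form and the constants k_D1, k_D2, k_D3, r_P used here are assumptions in the style of common quadrotor simulators, not the patent's fitted values.

```python
import numpy as np

# Hedged sketch of a downwash force: a Gaussian decay in the horizontal
# offset delta_xy, scaled by an inverse-square term in the vertical
# separation delta_z. All constants are illustrative stand-ins.
def downwash_force(delta_x, delta_y, delta_z, r_P=0.023,
                   k_D1=2267.18, k_D2=0.16, k_D3=-0.11):
    delta_xy = np.hypot(delta_x, delta_y)
    strength = k_D1 * (r_P / (4.0 * delta_z)) ** 2
    spread = k_D2 * delta_z + k_D3
    return strength * np.exp(-0.5 * (delta_xy / spread) ** 2)

# Directly below (delta_xy = 0) the loss of lift is largest; it decays laterally.
w_center = downwash_force(0.0, 0.0, 1.0)
w_offset = downwash_force(0.5, 0.0, 1.0)
```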
s12, constructing an observation space and an action space of the multi-four-rotor unmanned aerial vehicle;
In the constructed Gym environment, the quadrotor outputs an observation vector after executing each action; the observation space of the quadrotors is expressed as follows:

{n: [X_n, q_n, r_n, p_n, y_n, ẋ_n, ω_n, P_n]} (8)

where n ∈ [0, ..., N) is the index of the quadrotor; X_n = [x, y, z]_n is the position of the quadrotor; q_n is the quaternion used for attitude control of the quadrotor; r_n, p_n, y_n are the Roll, Pitch and Yaw angles respectively, i.e. the three angles used for attitude estimation; ẋ_n = [v_x, v_y, v_z]_n is the linear velocity of the nth quadrotor; ω_n = [ω_x, ω_y, ω_z]_n is the angular velocity of the nth quadrotor; and P_n = [P_0, P_1, P_2, P_3]_n gives the motor speeds of each drone.
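A minimal sketch of packing the per-drone observation of formula (8) into one flat vector; the function name, argument names, and ordering are illustrative assumptions, not the patent's code.

```python
import numpy as np

# Illustrative packing of one drone's observation: position, quaternion,
# Euler angles, linear/angular velocity, and the four motor speeds.
def make_observation(pos, quat, rpy, lin_vel, ang_vel, motor_rpm):
    obs = np.concatenate([pos, quat, rpy, lin_vel, ang_vel, motor_rpm])
    assert obs.shape == (3 + 4 + 3 + 3 + 3 + 4,)  # 20 values per drone
    return obs

obs = make_observation(pos=np.zeros(3), quat=np.array([0, 0, 0, 1.0]),
                       rpy=np.zeros(3), lin_vel=np.zeros(3),
                       ang_vel=np.zeros(3), motor_rpm=np.full(4, 14468.0))
```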
As shown in fig. 3, in the present embodiment, the quadrotor unmanned aerial vehicle uses the lidar to detect the obstacle, k lidars are set in the model for the quadrotor unmanned aerial vehicle, and the environment is observed using the k lidars.
The k lidars scan over an angular range of π (k = 24 in this embodiment), with an angle of 2π/k between two lasers; (d_1, ..., d_k) are the ray lengths of the k radars in the horizontal plane. If a sensor detects no object within the limited distance, the ray length is the maximum detectable distance d_max; otherwise it is the distance between the drone and the point detected by the radar. The ray length d_i of the ith radar is expressed as follows:

d_i = min(d_i^obs, d_max) (9)

where d_i^obs is the distance to the nearest object along ray i (taken as infinite when nothing is detected within d_max).
then the environment information s E The definition is as follows:
s E =[ρ i ,d i ] T ,i=1...k (10)
where ρ_i is the one-hot code of the ith radar: ρ_i is 1 if the radar detects an object within the limited distance and 0 otherwise, expressed as follows:

ρ_i = { 1, if d_i < d_max; 0, otherwise } (11)
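The lidar processing of formulas (9) to (11) can be sketched as follows, assuming a maximum detectable distance d_max and using np.inf to mark rays that hit nothing; the function name and layout are illustrative.

```python
import numpy as np

# Sketch of Eqs. (9)-(11): each of k rays reports either the hit distance or
# the maximum range, plus a one-hot detection flag rho_i.
def lidar_environment(hit_distances, d_max=5.0):
    """hit_distances: length-k array with np.inf where nothing was detected."""
    d = np.minimum(hit_distances, d_max)          # Eq. (9)
    rho = (hit_distances < d_max).astype(float)   # Eq. (11)
    return np.stack([rho, d], axis=0)             # s_E, Eq. (10)

s_E = lidar_environment(np.array([1.2, np.inf, 3.7, np.inf]))
```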
then for any one quadrotor unmanned aerial vehicle, the action space expression is as follows:
{n:[v x ,v y ,v z ,v M ] n } (12)
wherein [ v ] x ,v y ,v z ,v M ] n Representing the speed of input to a quadrotor drone, v x ,v y ,v z Representing the components of a unit vector, v M Indicating the magnitude of the desired velocity; and the action space can be represented by the rotating speeds of four motors, and the expression is as follows:
{n:[P 0 ,P 1 ,P 2 ,P 3 ] n } (13)
finally, the conversion of the input into pulse width modulation (Pulse Width Modulation, PWM) and motor speed is delegated to a controller consisting of position and attitude control subroutines.
In this embodiment, the step S2 is specifically as follows:
s21, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle;
The state of quadrotor n includes: the position of the quadrotor; the quaternion q_n (used for attitude control); the roll angle r_n, pitch angle p_n and yaw angle y_n; the linear velocity ẋ_n; the angular velocity ω_n; the motor speeds P_n = [P_0, P_1, P_2, P_3]; the angle β_n between the first-person view direction of drone n and the line to the target, as shown in fig. 4; and the distance d_0n between the global coordinates (x, y, z) of the nth drone and the global coordinates (x_t, y_t, z_t) of the target.
In order to enable the unmanned aerial vehicle to reach the target faster, the convergence speed is improved, and the global position of the unmanned aerial vehicle is replaced by the relative position delta X of the unmanned aerial vehicle and the target in a state n I.e. [ Deltax, deltay, deltaz ]] n Then the status s of the unmanned aerial vehicle U The following are provided:
Figure BDA0004198555500000121
from the state s of the unmanned aerial vehicle U And the laser radar detected environmental state s E The state space s of the quadrotor unmanned aerial vehicle can be obtained, and the expression is as follows:
Figure BDA0004198555500000122
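A hedged sketch of assembling the state of formula (15) by concatenating the drone state s_U and the flattened lidar state s_E; the k = 24 rays follow the embodiment, while the function signature and exact ordering are invented for illustration.

```python
import numpy as np

# Sketch of Eq. (15): target-relative position Delta X_n replaces the global
# position, then s_U and the flattened lidar state s_E are concatenated.
def build_state(delta_x, quat, rpy, lin_vel, ang_vel, motor_rpm,
                beta, d0, s_E):
    s_U = np.concatenate([delta_x, quat, rpy, lin_vel, ang_vel,
                          motor_rpm, [beta, d0]])      # 22 values
    return np.concatenate([s_U, s_E.ravel()])          # + 2*k lidar values

k = 24
state = build_state(np.zeros(3), np.array([0, 0, 0, 1.0]), np.zeros(3),
                    np.zeros(3), np.zeros(3), np.zeros(4),
                    beta=0.1, d0=2.5, s_E=np.zeros((2, k)))
```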
The action space of the quadrotor environment consists of the velocities input to the quadrotor; referring to formula (12), the action-space expression for any one quadrotor is as follows:
a=[v x ,v y ,v z ,v M ] T (16)
s22, setting a reward function mechanism to enable the quadrotor unmanned aerial vehicle to interact with the environment;
The design of the reward function strongly influences the performance of the deep reinforcement learning model and determines the drone's policy. The reward function R(s, a) is the environmental feedback obtained by taking action a in state s and is used to evaluate the quality of the action taken in the current state. If R(s, a) is large, taking action a in the current state s is beneficial to achieving the goal, so the probability of taking action a in state s will increase at the next policy update; otherwise, that probability will decrease.
In order to make the quadrotor unmanned aerial vehicle reach the gathering task target point as soon as possible, in this embodiment, a reward function composed of three parts is set to make the quadrotor unmanned aerial vehicle reach the gathering task target point as soon as possible, which is specifically as follows:
firstly, setting a distance between a quadrotor unmanned plane and a target pointLeave rewards R t Forcing the quadrotor unmanned aerial vehicle to reach the target, R t The settings were as follows: if the unmanned plane approaches the target, the rewards are positive, and the rewards are maximum when the unmanned plane reaches the target point; if the unmanned aerial vehicle is far away from the target, the reward is negative, and if the unmanned aerial vehicle does not reach the target for more than the preset time, the punishment is maximally-5. R is R t The expression is as follows:
Figure BDA0004198555500000123
wherein d 0 Representing the distance of the quadrotor unmanned from the target,
Figure BDA0004198555500000124
represents the distance of the nth quad-rotor unmanned helicopter from the target,/->
Figure BDA0004198555500000125
Indicating the distance from the target to the quadrotor drone at the nth next time.
Secondly, distance between the quadrotor unmanned plane and the obstacle is set to be rewarded R o Make unmanned aerial vehicle keep away from barrier setting, set up R o The following are provided: if the distance between the drone and the nearest obstacle is less than d safe The drone will be penalized; if the unmanned aerial vehicle collides with the obstacle, punishment is-3; if the distance between the drone and the nearest obstacle is less than d safe The unmanned aerial vehicle is safe and cannot be punished. R is R o The expression is as follows:
Figure BDA0004198555500000131
wherein d i The ray length of the ith radar is represented, namely the detection distance from the quadrotor unmanned aerial vehicle to an obstacle or other quadrotor unmanned aerial vehicle,
Figure BDA0004198555500000132
indicating the distance from the nth four-rotor unmanned plane to the target, and settingSafety distance d between unmanned aerial vehicle and obstacle safe
Finally, setting an angle reward R between the quadrotor unmanned aerial vehicle and the target point a To promote the unmanned plane to approach to the target direction, if beta n The larger (as shown in FIG. 4) the larger the penalty, the more R a The expression is as follows:
Figure BDA0004198555500000133
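The three-part reward can be sketched as one function. The −5 timeout penalty and −3 collision penalty follow the text; the remaining branch values, the 0.1 angle-penalty weight, and the thresholds are illustrative stand-ins.

```python
# Hedged sketch of the three-part reward R_t + R_o + R_a (Eqs. 17-19).
def reward(d_prev, d_curr, d_obs_min, beta,
           d_safe=0.5, reached_eps=0.1, collided=False, timed_out=False):
    # Distance reward R_t: positive when approaching, largest at the target,
    # -5 when the episode times out before reaching it.
    if d_curr < reached_eps:
        r_t = 5.0
    elif timed_out:
        r_t = -5.0
    else:
        r_t = d_prev - d_curr
    # Obstacle reward R_o: -3 on collision, penalised inside d_safe, else 0.
    if collided:
        r_o = -3.0
    elif d_obs_min < d_safe:
        r_o = -(d_safe - d_obs_min)
    else:
        r_o = 0.0
    # Angle reward R_a: larger heading error beta gives a larger penalty.
    r_a = -0.1 * abs(beta)
    return r_t + r_o + r_a

r = reward(d_prev=3.0, d_curr=2.8, d_obs_min=1.0, beta=0.0)
```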
in this embodiment, the step S3 is specifically as follows:
In this embodiment, the ITD3 algorithm is used to realise centralised multi-drone task path planning in an unknown environment. The TD3 algorithm addresses the overestimation bias of the Deep Deterministic Policy Gradient (DDPG) algorithm; it is a deterministic-policy reinforcement learning algorithm suited to high-dimensional continuous action spaces.
The TD3 algorithm addresses Q-value estimation errors and excessive variance by using twin Q-networks and a delayed-update strategy. However, when the environment has delayed rewards, TD3 may need more experience to learn how to make correct decisions. To address this problem, this embodiment introduces the N-step return into the TD3 algorithm.
The ITD3 algorithm is obtained by N-step report and priority experience playback improvement TD3 algorithm, the ITD3 algorithm is composed of four sub-networks, namely two critics networks and two actors networks, and the improved deep reinforcement learning algorithm is realized by the ITD3 algorithm.
Firstly, introducing N-step returns into a TD3 algorithm, wherein the N-step returns add the returns of N time steps in the future to provide more comprehensive information than single-step returns; therefore, the algorithm can better utilize the delayed return and improve the learning efficiency.
In the case of sparse rewards, most state transitions (State Transitions) p (s' |s, a) have no rewards information, and the one-step rewards will not be valid; n-step rewards environmental rewards sparsity by sampling N transitions (here set to the instance value n=4).
Adding the N-step return to TD3 increases the chance of finding and learning from rewarded transitions, thus improving learning efficiency. The critic-network equation of the TD3 algorithm is modified by adding the N-step return; in the jth round of sampling, the modified temporal-difference (TD) error δ_j is as follows:

δ_j = Σ_{k=0}^{N−1} γ^k · r_k + γ^N · min_{i=1,2} Q′(s_N, a_N; φ′_i) − Q(s, a; φ) (20)

where φ and φ′ are the parameters of the twin critic networks and their targets, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the state and action N steps later, Q(s, a; φ) is the value function of the critic network, Q′(s_N, a_N; φ′) is the value function of the target critic network, and γ is the discount factor.
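A hedged numeric sketch of the N-step temporal-difference target: accumulate N discounted rewards, bootstrap with the smaller of the two target-critic values (the TD3 clipped double-Q trick), and subtract the current Q estimate. The Q-values here are stand-in numbers, not outputs of a trained network.

```python
# Sketch of the N-step TD error in Eq. (20).
def n_step_td_error(rewards, q_current, q1_target_N, q2_target_N, gamma=0.99):
    n = len(rewards)
    g = sum(gamma ** k * r for k, r in enumerate(rewards))   # N-step return
    target = g + gamma ** n * min(q1_target_N, q2_target_N)  # clipped bootstrap
    return target - q_current

delta = n_step_td_error(rewards=[0.0, 0.0, 0.0, 1.0],
                        q_current=0.5, q1_target_N=1.0, q2_target_N=1.2)
```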
Second, each experience in the original TD3 algorithm is uniformly sampled, but without the weight score, learning efficiency is low, and adding priority can solve this problem. Priority experience playback is a technique to enhance the performance of DRL, which prioritizes samples based on the importance of experience, allowing important samples to be sampled more frequently, thereby improving the learning efficiency and performance of the model. At the beginning of a sample, the sampling probability of the jth transition is defined as P (j), expressed as follows:
Figure BDA0004198555500000142
wherein p is j Representing the priority of the j-th experience; alpha denotes a constant for adjusting the sampling weight, which determines how much priority to use, and when alpha is equal to 0, uniform random sampling will be employed.
Then, the sampling weight w for each transition of the update network j Calculated from the following formula, which represents the importance of each transfer data, M represents the size of a small batch (mini-batch), max i w i Representing the maximized sampling weight for normalization:
Figure BDA0004198555500000143
Finally, proportional prioritisation is used: the priority of a transition is updated according to its temporal-difference error, as shown below:

p_j = |δ_j| + ∈ (23)

where δ_j is the temporal-difference error and ∈ is a small preset positive value that prevents a zero priority.
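Formulas (21) to (23) can be sketched together as follows; the values of α, β, and the TD errors are illustrative, with β playing the usual importance-sampling role of prioritised replay.

```python
import numpy as np

# Sketch of proportional prioritised replay: priorities from |TD error| + eps,
# sampling probabilities p^alpha, and max-normalised importance weights.
def per_quantities(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    p = np.abs(td_errors) + eps                 # Eq. (23)
    P = p ** alpha / np.sum(p ** alpha)         # Eq. (21)
    M = len(td_errors)
    w = (M * P) ** (-beta)                      # Eq. (22), before normalisation
    return P, w / w.max()

P, w = per_quantities(np.array([0.1, 1.0, 2.0]))
```

Larger TD errors get larger sampling probabilities but smaller importance weights, which corrects the bias the non-uniform sampling introduces.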
In this embodiment, the step S4 is specifically as follows:
The ITD3 training network is implemented by a neural network with two parts: an actor network consisting of three fully connected layers, which maps states to actions, and a critic network that uses four fully connected layers to estimate Q-values.
In the ITD3 network, for both actor networks, the input is state and the output is action. The reviewer network takes as its input the state-action pairs and generates a state-action value function (Q-value). Then, as shown in fig. 5, the ITD3 algorithm training process is specifically as follows:
first a small set of samples (s, a, s ', r) is preferentially extracted from the experience playback buffer, s ' is input into the actor's target network. Then, a ' is obtained next time, and the state-action pair (s ', a ') is input to the critique target network.
After obtaining two target Q-values
Figure BDA0004198555500000151
And->
Figure BDA0004198555500000152
) Then, the smaller one is selected to calculate the target value function y (r, s'), the target value function expression is as follows:
Figure BDA0004198555500000153
Wherein r represents the return, and the discount factor gamma is the same as the value of formula (20), phi i Is a random parameter of the critics network.
On the other hand, (s, a) is input into the criticizing network to obtain two Q-values (Q 1 (. Cndot.) and Q 2 (. Cndot.)). Then, they are used to calculate the mean square error of y (r, s'), and the sum of the mean square errors is back-propagated to update the parameters of the two commentator networks, and N-step playback is added to the time difference error update.
Next, the Q-value obtained from the first reviewer network is input into the actor model network, and the parameters of the actor network are updated in the direction in which the Q-value increases (once per two iterations).
And finally, updating all target networks by adopting a soft updating method.
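The soft ("Polyak") update in the last step can be sketched as follows; τ = 0.005 is a typical value, not one stated in the text.

```python
import numpy as np

# Soft target update: each target parameter moves a small fraction tau
# toward its online counterpart.
def soft_update(target_params, online_params, tau=0.005):
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]

target = [np.zeros(3)]
online = [np.ones(3)]
target = soft_update(target, online)
```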
The complexity of the neural network is related to the number of samples, following the principle of empirical risk minimisation. Therefore, according to the state space and action space (i.e. the size of the observation) obtained in the preceding steps of this embodiment, the network structure of ITD3 is designed as in fig. 6.
In this embodiment, the ITD3 actor network consists of three FC neural-network layers with 512, 128 and 4 nodes. The first and second layers are activated by Rectified Linear Units (ReLU) and the third by the hyperbolic tangent (tanh), ensuring that the actor output lies within the range [−1, 1]. After the three FCs, the actor maps the input s_t to the drone action command a_t. The critic network estimates the Q-value with four FCs: the input s_t first passes through an FC layer with ReLU activation (1024 nodes), and the resulting vector is then concatenated with the action a_t into a 1028-dimensional vector. The critic network transforms this vector into the Q-value through two further ReLU-activated layers and one tanh-activated layer.
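An illustrative numpy forward pass matching the actor layer sizes described above (512, 128, 4 with ReLU, ReLU, tanh); the random weights stand in for trained parameters, and the assumed input size of 70 is an example, not a value from the text.

```python
import numpy as np

# Stand-in forward pass of the actor: state in, action in [-1, 1]^4 out.
rng = np.random.default_rng(0)
obs_dim = 70                                   # assumed state-vector size

relu = lambda x: np.maximum(x, 0.0)
W1, b1 = rng.normal(scale=0.05, size=(512, obs_dim)), np.zeros(512)
W2, b2 = rng.normal(scale=0.05, size=(128, 512)), np.zeros(128)
W3, b3 = rng.normal(scale=0.05, size=(4, 128)), np.zeros(4)

def actor(s):
    h = relu(W1 @ s + b1)
    h = relu(W2 @ h + b2)
    return np.tanh(W3 @ h + b3)   # [v_x, v_y, v_z, v_M] in [-1, 1]

a = actor(rng.normal(size=obs_dim))
```

A real implementation would use a deep learning framework and train the weights; the point here is only the layer shapes and activations.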
After training, the angular speed and the linear speed of the four-rotor unmanned aerial vehicle are controlled through the output action information, so that each four-rotor unmanned aerial vehicle successfully reaches a specified target task in the shortest time, and the collective task path planning is completed.
In summary, the method of the invention firstly models Gym environment of the four-rotor unmanned aerial vehicle based on PyBullet, and improves portability of the invention. Secondly, N steps of return are utilized in a dual-delay depth deterministic strategy gradient (Twin Delayed Deep Deterministic PolicyGradient, TD 3) algorithm to obtain more accurate return estimation, faster learning speed and better generalization performance; meanwhile, the sampling deviation is reduced by using preferential experience playback, the model deviation caused by unbalanced sampling is reduced, and the stability of the algorithm is improved, so that the TD3 algorithm is more suitable for the continuous multidimensional decision problem. And finally, training an Improved TD3 (ITD 3) network, and controlling the angular speed and the linear speed of the quadrotor unmanned aerial vehicle to enable the quadrotor unmanned aerial vehicle to successfully reach an aggregation task target point in the shortest time under the condition of unknown environment.
The embodiments described above are intended to help the reader understand the principles of the invention; it should be understood that the scope of the invention is not limited to these specific statements and embodiments. Those skilled in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (5)

1. A multi-quad-rotor unmanned helicopter collective task path planning method based on reinforcement learning comprises the following specific steps:
s1, constructing a PyBullet-based multi-quad-rotor unmanned aerial vehicle Gym environment;
s2, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle, and setting a reward function mechanism to enable the unmanned aerial vehicle to interact with the environment;
s3, performing path planning decision by using an improved deep reinforcement learning algorithm, and performing path planning for each four-rotor unmanned aerial vehicle under an aggregation task;
s4, training the improved deep reinforcement learning network, and controlling the angular speed and the linear speed of the four-rotor unmanned aerial vehicle through the output action information, so that each four-rotor unmanned aerial vehicle successfully reaches a specified target task in the shortest time.
2. The method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein in the step S1, the method is specifically as follows:
S11, constructing a dynamics simulation model of the multi-quad-rotor unmanned helicopter;
the four-rotor unmanned aerial vehicle dynamic equation is formed by the motion equation and the aerodynamic effect of the four-rotor unmanned aerial vehicle, so that the construction of the four-rotor unmanned aerial vehicle dynamic simulation model is completed, and the method is as follows:
using PyBullet to establish a force and torque model acting on each quadrotor unmanned aerial vehicle in Gym, and calculating and updating a kinetic equation of all the quadrotor unmanned aerial vehicles by using a physical engine;
setting the arm length of each four-rotor unmanned aerial vehicle to be L, the mass to be m, the inertial property to be J, the physical constant and the convex collision shape to be described through a separate URDF file, and configuring the 'x' -shaped four-rotor unmanned aerial vehicle;
First, the gravitational acceleration g and the physical step frequency are set in PyBullet. The force F_i applied by each of the 4 motors and the torque T_o about the Z-axis of the drone are proportional to the square of the motor speed P_i; F_i and T_o are expressed as follows:

F_i = k_F · P_i² (1)

T_o = k_T · (P_0² − P_1² + P_2² − P_3²) (2)

where k_F and k_T are predetermined constants;
setting the real-time control of the model, the kinetic equation of the quadrotor unmanned plane is expressed as follows:
J T T o =Ma+h (3)
where J is the Jacobian matrix, M the inertia matrix, a the generalised acceleration, and h the Coriolis and gravity effects; the superscript T denotes transposition;
In practice, flying near the ground or near other drones creates additional aerodynamic effects; these are modelled individually and used in combination in PyBullet, and include: the propeller drag D, the ground effect G_i acting on each single motor, and the downwash effect W acting on the centre of mass;
The rotating propellers of the quadrotor produce a drag D, which is proportional to the quadrotor's linear velocity ẋ_n, the angular velocity of the propellers, and the constant drag-coefficient matrix k_D, expressed as follows:

D = −k_D · (Σ_{i=0}^{3} 2π·P_i/60) · ẋ_n (4)

where Σ_{i=0}^{3} 2π·P_i/60 denotes the total angular velocity of the propellers, the factor 2π/60 converting each motor speed P_i from revolutions per minute (60 s) to radians per second; the constant drag-coefficient matrix k_D has the specific expression:

k_D = diag(k_⊥, k_⊥, k_∥) (5)

where k_⊥ denotes the perpendicular (vertical) drag coefficient and k_∥ the parallel drag coefficient, and the matrix k_D is fitted to the data by the least squares method;
When hovering at very low altitude there is a ground effect, whose contribution G_i at each motor is related to the propeller radius r_P, the motor speed P_i, the height h_i and a constant k_G by the following proportional equation:

G_i = k_G · (r_P/(4·h_i))² · P_i² (6)
When two quadrotors pass through the same horizontal position at different heights, a downwash effect exists; its influence is simplified to a single force applied at the centre of mass of the lower drone, whose magnitude W depends on the displacement (δ_x, δ_y, δ_z) between the two drones and on experimentally determined constants k_D1, k_D2, k_D3. The expression of W is as follows:

W = k_D1 · (r_P/(4·δ_z))² · exp(−(1/2)·(δ_xy/(k_D2·δ_z + k_D3))²) (7)

where δ_xy = √(δ_x² + δ_y²);
s12, constructing an observation space and an action space of the multi-four-rotor unmanned aerial vehicle;
In the constructed Gym environment, the quadrotor outputs an observation vector after executing each action; the observation space of the quadrotors is expressed as follows:

{n: [X_n, q_n, r_n, p_n, y_n, ẋ_n, ω_n, P_n]} (8)

where n ∈ [0, ..., N) is the index of the quadrotor; X_n = [x, y, z]_n is the position of the quadrotor; q_n is the quaternion used for attitude control of the quadrotor; r_n, p_n, y_n are the Roll, Pitch and Yaw angles respectively, i.e. the three angles used for attitude estimation; ẋ_n = [v_x, v_y, v_z]_n is the linear velocity of the nth quadrotor; ω_n = [ω_x, ω_y, ω_z]_n is the angular velocity of the nth quadrotor; and P_n = [P_0, P_1, P_2, P_3]_n gives the motor speeds of each drone;
in the invention, the four-rotor unmanned aerial vehicle uses the laser radar to detect the obstacle, k laser radars are set in the model, and the k laser radars are used for observing the environment;
The k lidars scan over an angular range of π, with an angle of 2π/k between two lasers; (d_1, ..., d_k) are the ray lengths of the k radars in the horizontal plane; the ray length d_i of the ith radar is expressed as follows:

d_i = min(d_i^obs, d_max) (9)

where d_i^obs is the distance to the nearest object along ray i (taken as infinite when nothing is detected within the maximum detectable distance d_max);
Then the environment information s E The definition is as follows:
s E =[ρ i ,d i ] T ,i=1...k (10)
where ρ_i is the one-hot code of the ith radar, expressed as follows:

ρ_i = { 1, if d_i < d_max; 0, otherwise } (11)
then for any one quadrotor unmanned aerial vehicle, the action space expression is as follows:
{n:[v x ,v y ,v z ,v M ] n } (12)
wherein [ v ] x ,v y ,v z ,v M ] n Representing the speed of input to a quadrotor drone, v x ,v y ,v z Representing the components of a unit vector, v M Indicating the magnitude of the desired velocity; and the action space can be represented by the rotating speeds of four motors, and the expression is as follows:
{n:[P 0 ,P 1 ,P 2 ,P 3 ] n } (13)
finally, the conversion of the input into pulse width modulation PWM and motor speed is delegated to a controller consisting of position and attitude control subroutines.
3. The method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein the step S2 is specifically as follows:
s21, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle;
The state of the quadrotor includes: the position and quaternion q_n of the quadrotor; the roll angle r_n, pitch angle p_n and yaw angle y_n; the linear velocity ẋ_n; the angular velocity ω_n; the motor speeds P_n = [P_0, P_1, P_2, P_3]; the angle β_n between the first-person view direction of the drone and the line to the target; and the distance d_0n between the global coordinates (x, y, z) of the nth drone and the global coordinates (x_t, y_t, z_t) of the target;
Replacing the global position of the unmanned aerial vehicle with the relative position DeltaX of the unmanned aerial vehicle and the target in the state n I.e. [ Deltax, deltay, deltaz ]] n Then the status s of the unmanned aerial vehicle U The following are provided:
Figure FDA0004198555480000041
from the state s of the unmanned aerial vehicle U And the laser radar detected environmental state s E The state space s of the quadrotor unmanned aerial vehicle can be obtained, and the expression is as follows:
Figure FDA0004198555480000042
The action space of the quadrotor environment consists of the velocities input to the quadrotor; referring to formula (12), the action-space expression for any one quadrotor is as follows:

a = [v_x, v_y, v_z, v_M]^T (16)
s22, setting a reward function mechanism to enable the quadrotor unmanned aerial vehicle to interact with the environment;
the reward function R (s, a) represents the environmental feedback resulting from taking action a in state s; setting a reward function consisting of three parts to enable the quadrotor unmanned aerial vehicle to reach an aggregation task target point as soon as possible, wherein the reward function is specifically as follows:
First, a distance reward R_t between the quadrotor and the target point is set to drive the quadrotor toward the target; the expression of R_t is as follows:

R_t = { d_0^n − d_0'^n, before the target is reached; the maximum reward, when the target point is reached; −5, if the preset time is exceeded without reaching the target } (17)

where d_0 denotes the distance of the quadrotor from the target, d_0^n the distance of the nth quadrotor from the target, and d_0'^n the distance of the nth quadrotor from the target at the next time step;
Second, a distance reward R_o between the quadrotor and obstacles is set to keep the drone away from obstacles; the expression of R_o is as follows:

R_o = { −3, if a collision occurs; min_i d_i − d_safe, if min_i d_i < d_safe; 0, otherwise } (18)

where d_i is the ray length of the ith radar, i.e. the detected distance from the quadrotor to an obstacle or to another quadrotor, and d_safe is the preset safety distance between the drone and obstacles;
Finally, setting an angle reward R between the quadrotor unmanned aerial vehicle and the target point a To promote the unmanned plane to approach to the target direction, if beta n The larger the penalty the larger, R a The expression is as follows:
Figure FDA0004198555480000052
4. the method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein the step S3 is specifically as follows:
The TD3 algorithm is improved with the N-step return and prioritised experience replay to obtain the ITD3 algorithm; ITD3 consists of four sub-networks, namely two critic networks and two actor networks, and the improved deep reinforcement learning algorithm is realised by ITD3;
firstly, introducing N-step returns into a TD3 algorithm, wherein the N-step returns add the returns of N time steps in the future to provide more comprehensive information than single-step returns;
in the case of sparse rewards, most state transitions p (s' |s, a) have no rewards information, and the one-step rewards will not be valid; n-step rewards are used for relieving the problem of sparse rewards by sampling N transfers;
The critic-network equation of the TD3 algorithm is modified by adding the N-step return; in the jth round of sampling, the modified temporal-difference error δ_j is as follows:

δ_j = Σ_{k=0}^{N−1} γ^k · r_k + γ^N · min_{i=1,2} Q′(s_N, a_N; φ′_i) − Q(s, a; φ) (20)

where φ and φ′ are the parameters of the twin critic networks and their targets, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the state and action N steps later, Q(s, a; φ) is the value function of the critic network, Q′(s_N, a_N; φ′) is the value function of the target critic network, and γ is the discount factor;
Second, prioritised experience replay is used in the original TD3 algorithm; at sampling time, the sampling probability of the jth transition is defined as P(j), expressed as follows:

P(j) = p_j^α / Σ_i p_i^α (21)

where p_j is the priority of the jth experience and α is a constant adjusting the sampling weight; α determines how strongly priorities are used, and when α equals 0 uniform random sampling is employed;
Then, the sampling weight w_j of each transition used to update the network is computed from the following formula, which expresses the importance of each transition; M is the size of the mini-batch, β is the importance-sampling exponent, and max_i w_i is the maximum sampling weight, used for normalisation:

w_j = (1/(M · P(j)))^β / max_i w_i (22)
finally, using proportion prioritization, updating the transferred priority according to the time difference error, wherein the priority is shown in the following formula:
p j =|δ j |+∈ (23)
Wherein delta j Representing a time difference error, e represents a small value preset to avoid a 0 priority.
5. The method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein the step S4 is specifically as follows:
The ITD3 training network is implemented by a neural network with two parts: an actor network consisting of three fully connected layers, which maps states to actions, and a critic network that uses four fully connected layers to estimate Q-values;
In the ITD3 network, the input of both actor networks is the state and the output is the action; the critic networks take state-action pairs as input and produce the state-action value function (Q-value); the ITD3 training process is as follows:

First, a mini-batch of samples (s, a, s′, r) is drawn by priority from the experience replay buffer, and s′ is input into the actor target network; the next action a′ is thus obtained, and the state-action pair (s′, a′) is input into the critic target networks;

After the two target Q-values Q′_1(·) and Q′_2(·) are obtained, the smaller one is selected to compute the target value function y(r, s′), whose expression is as follows:

y(r, s′) = r + γ · min_{i=1,2} Q′(s′, a′; φ′_i) (24)
wherein r represents the return, and the discount factor gamma is the same as the value of formula (20), phi i Random parameters of the critics network;
on the other hand, (s, a) is input into the criticizing network to obtain two Q-values (Q 1 (. Cndot.) and Q 2 (. Cndot.); then, using them to calculate the mean square error of y (r, s'), and back-propagating the sum of the mean square errors to update the parameters of two critic networks, and adding N-step playback in the time difference error update;
next, the Q-value obtained from the first critic network is input into the actor model network, and the parameters of the actor network are updated in the direction that increases the Q-value (once every two iterations);
finally, updating all target networks by adopting a soft updating method;
after training is completed, the output action information is used to control the angular velocity and linear velocity of each quadrotor UAV, so that every quadrotor UAV reaches its designated target in the shortest time, thereby completing the assembly task path planning.
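The last two steps of the claim, the delayed actor update and the soft update of the target networks, can be sketched as follows. This is an illustrative Python fragment; the coefficient tau = 0.005 is an assumed value, as the claim only names the soft updating method without giving numbers.

```python
# Sketch of delayed policy updates and soft (Polyak) target updates:
# the actor and the target networks are refreshed only once every two critic
# iterations, and each target parameter drifts slowly toward its model
# parameter. tau = 0.005 is an assumed coefficient, not from the claim.

def soft_update(target_params, model_params, tau=0.005):
    """theta_target <- tau * theta_model + (1 - tau) * theta_target."""
    return [(1 - tau) * t + tau * m for t, m in zip(target_params, model_params)]

def actor_update_due(iteration, delay=2):
    """Delayed policy update: the actor trains once every `delay` iterations."""
    return iteration % delay == 0

target = soft_update([0.0, 0.0], [1.0, 2.0])  # each target moves 0.5% toward its model
```

The small tau keeps the targets nearly fixed between updates, which stabilizes the bootstrapped target y(r, s') that the critics regress toward.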
CN202310454330.7A 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning Pending CN116301007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454330.7A CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310454330.7A CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN116301007A true CN116301007A (en) 2023-06-23

Family

ID=86788905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310454330.7A Pending CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116301007A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117215197A (en) * 2023-10-23 2023-12-12 Nankai University Four-rotor aircraft online track planning method, four-rotor aircraft online track planning system, electronic equipment and medium
CN117215197B (en) * 2023-10-23 2024-03-29 Nankai University Quadrotor aircraft online trajectory planning method, system, electronic equipment and media
CN118733901A (en) * 2024-06-25 2024-10-01 Unit 32133 of the Chinese People's Liberation Army PTZ control path search method based on big data

Similar Documents

Publication Publication Date Title
Abdelmaksoud et al. Control strategies and novel techniques for autonomous rotorcraft unmanned aerial vehicles: A review
CN111596684B (en) Fixed-wing unmanned aerial vehicle dense formation and anti-collision obstacle avoidance semi-physical simulation system and method
CN112947572B (en) Terrain following-based four-rotor aircraft self-adaptive motion planning method
Kose et al. Simultaneous design of morphing hexarotor and autopilot system by using deep neural network and SPSA
CN111538255B (en) Aircraft control method and system for an anti-swarm drone
CN113885549B (en) Quadrotor attitude trajectory control method based on dimensionally clipped PPO algorithm
Patel et al. An intelligent hybrid artificial neural network-based approach for control of aerial robots
CN112161626B (en) High-flyability route planning method based on route tracking mapping network
CN115857546B (en) A modular reconfigurable flight array dynamics model and fixed-time sliding mode control method
CN116301007A (en) A multi-quadrotor UAV assembly task path planning method based on reinforcement learning
Houghton et al. Path planning: Differential dynamic programming and model predictive path integral control on VTOL aircraft
Chen Research on AI application in the field of quadcopter UAVs
Basescu et al. Precision post-stall landing using NMPC with learned aerodynamics
Shen et al. Multibody-dynamic modeling and stability analysis for a bird-scale flapping-wing aerial vehicle
Hamissi et al. A new nonlinear control design strategy for fixed wing aircrafts piloting
Flores et al. Implementation of a neural network for nonlinearities estimation in a tail-sitter aircraft
Annamalai et al. Design, modeling and simulation of a control surface-less tri-tilt-rotor UAV
De Almeida et al. Controlling tiltrotors unmanned aerial vehicles (UAVs) with deep reinforcement learning
Abrougui et al. Flight Controller Design Based on Sliding Mode Control for Quadcopter Waypoints Tracking
Adamski Development and deployment of a dynamic soaring capable UAV using reinforcement learning
Kadhim et al. Improving the Size of the Propellers of the Parrot Mini-Drone and an Impact Study on its Flight Controller System.
Yu et al. A novel Brain-inspired architecture and flight experiments for autonomous maneuvering flight of unmanned aerial vehicles
Ekechi Intelligent control of a Swarm of Unmanned Aerial Vehicles in Turbulent Environments Using Clustering-PPO Algorithm
Guo et al. Comparative Study and Airspeed Sensitivity Analysis of Full‐Wing Solar‐Powered UAVs Using Rigid‐Body, Multibody, and Rigid‐Flexible Combo Models
Adamski et al. Towards development of a dynamic soaring capable UAV using reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination