
CN116301007A - A multi-quadrotor UAV assembly task path planning method based on reinforcement learning - Google Patents

A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Info

Publication number
CN116301007A
CN116301007A (application CN202310454330.7A)
Authority
CN
China
Prior art keywords
unmanned aerial vehicle
rotor
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310454330.7A
Other languages
Chinese (zh)
Inventor
罗俊海
严泽成
田雨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority claimed from CN202310454330.7A
Publication of CN116301007A
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08 Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808 Control of attitude specially adapted for aircraft
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a reinforcement-learning-based assembly task path planning method for multiple quadrotor unmanned aerial vehicles (UAVs). The method first constructs a PyBullet-based multi-quadrotor Gym environment and sets a reward function mechanism by abstracting the state space and action space of the quadrotor UAVs; it then makes path planning decisions with an improved deep reinforcement learning algorithm, and finally trains the improved deep reinforcement learning network and controls the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time. The method introduces N-step returns into the TD3 algorithm to obtain more accurate return estimates, faster learning, and better generalization; it uses prioritized experience replay to reduce sampling bias, lessening the model bias caused by unbalanced sampling and improving the stability of the algorithm. This makes the TD3 algorithm better suited to continuous multidimensional decision-making, so that an optimal route can be planned in the shortest time while meeting specified real-time and accuracy requirements.

Description

Assembly task path planning method for multiple quadrotor unmanned aerial vehicles based on reinforcement learning
Technical Field
The invention belongs to the technical field of path planning for multiple quadrotor unmanned aerial vehicles, and particularly relates to a reinforcement-learning-based assembly task path planning method for multiple quadrotor UAVs.
Background
A quadrotor unmanned aerial vehicle balances its weight with the lift generated by several rotors, can hover and take off and land vertically, has low requirements on the take-off site, and flies at a relatively low speed. Multi-rotor UAVs are therefore suitable for complex, small-range application scenarios such as aerial photography, surveillance, and building modeling. With the continuous development of UAV technology, UAVs have been widely applied in the civil field, and the complexity of the tasks they perform keeps increasing. Because a single UAV has limited payload and flight capability, cooperation among multiple UAVs is required to improve task performance and range.
Since most UAV tasks involve shortest path planning, the shortest path planning problem has been the focus and difficulty of UAV path planning research in recent years. Within this problem, tasks can be further divided into assembly tasks and distributed tasks according to their characteristics. An assembly task aims to plan the optimal path for each UAV from its own origin to the same destination point. The goal of such tasks is typically to have all UAVs reach the target point simultaneously and complete the task as soon as possible; in this case, the objective is generally to minimize the total task time or total path length. The assembly task is more general than the distributed task.
Compared with existing rule-based or heuristic-search algorithms, the reinforcement-learning-based path planning method has better adaptability and extensibility. Existing methods require rules to be designed and adjusted manually for each environment, whereas a reinforcement learning agent adapts to the environment through autonomous learning. Because the agent in reinforcement learning has autonomous decision-making capability, it can learn optimal behavior by interacting with the environment. Moreover, deep learning has strong perceptual capability, so deep reinforcement learning, which combines the two, can process higher-dimensional inputs and is better suited to the multi-UAV problem considered here. Deep reinforcement learning algorithms cope better with unknowns and change than existing methods, and the agent can perform continuous decision tasks in a complex environment.
Current solutions for multi-UAV assembly tasks still face many challenges, including environment modeling, low learning efficiency, and complex action and state spaces. First, for agents based on deep reinforcement learning, constructing the simulation environment is the foundation of the whole experiment, and the design of a UAV system must rely on simulation tools; establishing an appropriate UAV simulator is therefore critical for academic research and safety-critical applications. However, many current environments for deep-reinforcement-learning simulation experiments lack real-world portability, since many sacrifice realism to achieve high sample throughput. In addition, the training efficiency of deep reinforcement learning for multi-UAV path planning is generally low: in most simulation environments the rewards for path planning tasks are sparse, agents only receive reward signals after a task ends, and efficient exploration in complex environments is difficult, so training struggles to get started in its early stages. Finally, multi-UAV path planning problems usually involve multiple agents and multiple obstacles, so their state spaces, action spaces, and reward functions tend to be high-dimensional and complex, which increases the difficulty of modeling and solving them. Because the action space of multi-UAV path planning is large, an effective search strategy is needed to handle the high-dimensional action space. In summary, assembly path planning is of great significance for multi-UAV task execution.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a reinforcement-learning-based assembly task path planning method for multiple quadrotor unmanned aerial vehicles, aimed at solving assembly tasks in multi-UAV path planning.
The technical scheme of the invention is as follows: a reinforcement-learning-based assembly task path planning method for multiple quadrotor UAVs, comprising the following specific steps:
s1, constructing a PyBullet-based multi-quad-rotor unmanned aerial vehicle Gym environment;
s2, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle, and setting a reward function mechanism to enable the unmanned aerial vehicle to interact with the environment;
s3, performing path planning decisions using an improved deep reinforcement learning algorithm, planning a path for each quadrotor UAV under the assembly task;
s4, training the improved deep reinforcement learning network, and controlling the angular velocity and linear velocity of the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target point in the shortest time.
Further, in the step S1, the specific steps are as follows:
s11, constructing a dynamics simulation model of the multiple quadrotor UAVs;
The quadrotor UAV dynamics equations are formed from the equations of motion and the aerodynamic effects of the quadrotor UAV, completing the construction of the quadrotor dynamics simulation model, as follows:
PyBullet is used to build force and torque models for each quadrotor UAV in Gym, and the dynamics equations of all quadrotor UAVs are calculated and updated using the physics engine.
The arm length of each quadrotor UAV is set to L, the mass to m, and the inertial properties to J; the physical constants and convex collision shapes are described by separate URDF files, using the "x"-type quadrotor configuration.
First, the gravitational acceleration g and the physics step frequency are set in PyBullet. The force F_i applied by each of the 4 motors and the torque T_i about the drone's Z-axis are proportional to the square of the motor speed P_i:
F_i = k_F · P_i^2 (1)
T_i = k_T · P_i^2 (2)
where k_F and k_T are predetermined constants.
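As a rough illustration of equations (1)-(2), the per-motor thrust and torque can be computed directly from the motor speeds; the coefficient values below are illustrative placeholders, not constants from the patent:

```python
# Per-motor thrust and Z-axis torque of a quadrotor, eqs. (1)-(2):
#   F_i = k_F * P_i^2,  T_i = k_T * P_i^2
# K_F and K_T are illustrative placeholder constants, not patent values.
K_F = 3.16e-10  # thrust coefficient
K_T = 7.94e-12  # torque coefficient

def motor_forces_and_torques(rpms):
    """Map the four motor speeds P_i to thrusts F_i and yaw torques T_i."""
    forces = [K_F * p ** 2 for p in rpms]
    torques = [K_T * p ** 2 for p in rpms]
    return forces, torques

forces, torques = motor_forces_and_torques([10000.0, 10000.0, 10000.0, 10000.0])
```

In a PyBullet-style simulator these per-motor quantities would then be applied to the vehicle body at every physics step.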
Setting the model to real-time control, the dynamics equation of the quadrotor UAV is expressed as follows:
J^T · T_o = M·a + h (3)
where J is the Jacobian matrix, M the inertia matrix, a the generalized acceleration, h the Coriolis and gravity effects, and the superscript T denotes the transpose operation.
In practice, flying near the ground or near other UAVs creates additional aerodynamic effects, which are modeled separately and used in combination in PyBullet. They include: the propeller drag D, the ground effect G_i acting on each motor, and the downwash effect W acting on the centroid.
The rotating propellers of the quadrotor UAV produce a drag force D that is proportional to the UAV's linear velocity Xdot, the propellers' angular velocity, and a constant drag coefficient matrix k_D:
D = -k_D · (2π Σ_i P_i / 60) · Xdot (4)
where 2π P_i / 60 converts the motor speed P_i from revolutions per minute into the propeller's angular velocity. The constant drag coefficient matrix k_D has the specific form:
k_D = diag(k_⊥, k_⊥, k_∥) (5)
where k_⊥ is the perpendicular drag coefficient, k_∥ the parallel drag coefficient, and the matrix k_D is fitted to data using least squares.
When hovering at very low altitude there is a ground effect; its contribution G_i to each motor is proportional to the propeller radius r_P, the motor speed P_i, the altitude h_i, and a constant k_G:
G_i = k_G · (r_P / (4·h_i))^2 · P_i^2 (6)
when two quadrotor robots pass through a path at the same position at different heights, there is a down-wash effect, the effect of which is reduced to a single force applied to the centre of mass of the unmanned aerial vehicle, the magnitude W of which depends on the distance (delta) between the two robots in the coordinate system x, y, z xyz ) And a constant k determined experimentally D1 ,k D2 ,k D3 The expression of W is as follows:
Figure BDA0004198555500000034
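As a hedged sketch of the three aerodynamic effects of equations (4), (6), and (7), the following functions evaluate them for given motor speeds and geometry; every coefficient value here is an illustrative placeholder, not a constant from the patent:

```python
import math

# Illustrative aerodynamic-effect models for eqs. (4), (6), (7); all
# coefficient values are placeholders, not constants from the patent.
K_D = (9.17e-7, 9.17e-7, 10.3e-7)        # diag(k_perp, k_perp, k_par), eq. (5)
K_G = 11.4                               # ground-effect constant
R_P = 0.023                              # propeller radius [m]
K_D1, K_D2, K_D3 = 2267.2, 0.16, -0.11   # downwash constants

def drag(rpms, lin_vel):
    """Eq. (4): D = -k_D * (2*pi*sum(P_i)/60) * Xdot (rpm -> rad/s)."""
    omega = 2.0 * math.pi * sum(rpms) / 60.0
    return tuple(-k * omega * v for k, v in zip(K_D, lin_vel))

def ground_effect(rpm, height):
    """Eq. (6): G_i = k_G * (r_P / (4*h_i))**2 * P_i**2."""
    return K_G * (R_P / (4.0 * height)) ** 2 * rpm ** 2

def downwash(dx, dy, dz):
    """Eq. (7): single force on the lower UAV's centroid (dz > 0)."""
    alpha = K_D1 * (R_P / (4.0 * dz)) ** 2
    beta = K_D2 * dz + K_D3
    return -alpha * math.exp(-0.5 * ((dx * dx + dy * dy) ** 0.5 / beta) ** 2)
```

Note the expected behavior: drag opposes the velocity, the ground effect grows as altitude shrinks, and the downwash force on the lower vehicle is negative (downward).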
s12, constructing the observation space and action space of the multiple quadrotor UAVs;
in the constructed Gym environment, each quadrotor UAV outputs an observation vector after executing each action. The observation space of the quadrotor UAVs is expressed as follows:
{n: [X_n, q_n, r_n, p_n, y_n, Xdot_n, ω_n, P_n]} (8)
where n ∈ [0..N) is the index of the quadrotor UAV; X_n = [x, y, z]_n is its position; q_n is the quaternion used for quadrotor attitude control; r_n, p_n, y_n are the Roll, Pitch, and Yaw angles, i.e. the three angles used for attitude estimation; Xdot_n = [v_x, v_y, v_z]_n is the linear velocity of the n-th quadrotor UAV; ω_n = [ω_x, ω_y, ω_z]_n is its angular velocity; and P_n = [P_0, P_1, P_2, P_3]_n holds its motor speeds.
In the invention, the quadrotor UAVs use lidar to detect obstacles: k lidar rays are set in the model, and the environment is observed using these k rays.
The k rays span a scanning angle range of π, with an angle of 2π/k between two rays; (d_1, ..., d_k) are the ray lengths of the k rays on the horizontal plane. The ray length d_i of the i-th ray is expressed as follows:
d_i = { distance to the detected point, if ray i detects an object; d_max, otherwise } (9)
Then the environment information s_E is defined as follows:
s_E = [ρ_i, d_i]^T, i = 1...k (10)
where ρ_i is the one-hot code of the i-th radar, indicating whether that ray has detected anything:
ρ_i = { 1, if ray i detects an object; 0, otherwise } (11)
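A minimal sketch of the lidar observation of equations (9)-(11) follows; the maximum detectable distance d_max and the hit-indicator reading of ρ_i are assumptions made for illustration:

```python
# Sketch of the lidar environment observation s_E of eqs. (9)-(11).
# D_MAX and the hit-indicator interpretation of rho_i are assumptions.
K_RAYS = 24   # number of lidar rays k (value used in the embodiment)
D_MAX = 5.0   # maximum detectable distance (placeholder)

def ray_length(hit_distance, d_max=D_MAX):
    """Eq. (9): d_i is the hit distance, or d_max when nothing is detected."""
    return d_max if hit_distance is None else min(hit_distance, d_max)

def hit_indicator(hit_distance):
    """Eq. (11): rho_i = 1 if ray i detects an object, else 0."""
    return 0.0 if hit_distance is None else 1.0

def environment_state(hits):
    """Eq. (10): s_E stacks [rho_i, d_i] for every ray i = 1..k."""
    s_e = []
    for h in hits:  # one entry per ray: None, or the hit distance
        s_e.extend([hit_indicator(h), ray_length(h)])
    return s_e

s_E = environment_state([None] * K_RAYS)  # no obstacles detected anywhere
```

With this layout s_E has 2k entries, one (ρ_i, d_i) pair per ray.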
then for any one quadrotor UAV, the action space expression is as follows:
{n: [v_x, v_y, v_z, v_M]_n} (12)
where [v_x, v_y, v_z, v_M]_n is the velocity command input to the quadrotor UAV: v_x, v_y, v_z are the components of a unit direction vector and v_M is the magnitude of the desired velocity. The action space can also be represented by the rotation speeds of the four motors:
{n: [P_0, P_1, P_2, P_3]_n} (13)
Finally, the conversion of the input into pulse width modulation (PWM) signals and motor speeds is delegated to a controller consisting of position and attitude control subroutines.
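To make the action interpretation of equation (12) concrete, the sketch below turns an action [v_x, v_y, v_z, v_M] into a desired velocity vector; the normalization step is an assumption added for robustness to non-unit direction inputs:

```python
# Sketch of the action interpretation of eq. (12): the first three entries
# form a (unit) direction vector and v_M scales it to the desired speed.
# Explicit normalization is an assumption, for robustness to non-unit input.
def desired_velocity(action):
    vx, vy, vz, v_m = action
    norm = (vx * vx + vy * vy + vz * vz) ** 0.5 or 1.0  # guard zero vector
    return (v_m * vx / norm, v_m * vy / norm, v_m * vz / norm)
```

The resulting velocity target would then be handed to the position/attitude controller mentioned above, which produces the PWM and motor-speed commands.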
Further, the step S2 specifically includes the following steps:
s21, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle;
the states of the quad-rotor unmanned helicopter include: position and quaternion q of four-rotor unmanned aerial vehicle n Roll angle roller n Pitch angle pitch n Yaw angle Yawy n Linear velocity
Figure BDA0004198555500000056
Angular velocity omega n Motor speed P of all unmanned aerial vehicles n =[P 0 ,P 1 ,P 2 ,P 3 ]Angle beta between first viewing angle direction of unmanned plane and target link n And the global coordinates (x, y, z) of the nth unmanned aerial vehicle and the global coordinates (x) of the target t ,y t ,z t ) Difference d between 0n
Replacing the global position of the unmanned aerial vehicle with the relative position DeltaX of the unmanned aerial vehicle and the target in the state n I.e. [ Deltax, deltay, deltaz ]] n Then the status s of the unmanned aerial vehicle U The following are provided:
Figure BDA0004198555500000051
from the UAV state s_U and the environment state s_E detected by the lidar, the state space s of the quadrotor UAV is obtained, with the following expression:
s = [s_U, s_E]^T (15)
the action space of the quadrotor UAV environment consists of the velocities input to the quadrotor UAVs; referring to formula (12), the action space expression for any one quadrotor UAV is as follows:
a = [v_x, v_y, v_z, v_M]^T (16)
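The state assembly of equations (14)-(15) can be sketched as follows; the exact field ordering is an assumption consistent with the text, and all inputs are illustrative:

```python
# Minimal sketch of eqs. (14)-(15): the UAV state s_U uses target-relative
# position, and the full state s concatenates s_U with the lidar state s_E.
# The field layout is an assumption consistent with the description.
def uav_state(pos, target, quat, rpy, lin_vel, ang_vel, rpms, beta):
    delta = [t - p for p, t in zip(pos, target)]   # DeltaX_n = [dx, dy, dz]
    d0 = sum(d * d for d in delta) ** 0.5          # distance d_0n to target
    return delta + list(quat) + list(rpy) + list(lin_vel) \
        + list(ang_vel) + list(rpms) + [beta, d0]

def full_state(s_u, s_e):
    """Eq. (15): s = [s_U, s_E]^T."""
    return list(s_u) + list(s_e)
```

With 3 relative-position, 4 quaternion, 3 attitude, 3 linear-velocity, 3 angular-velocity, and 4 motor-speed entries plus β_n and d_0n, s_U has 22 entries here.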
S22, setting a reward function mechanism to enable the quadrotor unmanned aerial vehicle to interact with the environment;
the reward function R (s, a) represents the environmental feedback resulting from taking action a in state s; setting a reward function consisting of three parts to enable the quadrotor unmanned aerial vehicle to reach an aggregation task target point as soon as possible, wherein the reward function is specifically as follows:
First, a distance reward R_t between the quadrotor UAV and the target point forces the quadrotor UAV toward the target; R_t is expressed as follows:
R_t = d_0^n - d_0^{n'} (17)
where d_0 denotes the distance of a quadrotor UAV from the target, d_0^n the distance of the n-th quadrotor UAV from the target, and d_0^{n'} the distance of the n-th quadrotor UAV from the target at the next time step.
Second, a distance reward R_o between the quadrotor UAV and obstacles is set to keep the UAV away from obstacles; R_o is expressed as follows:
R_o = { -(d_safe - min_i d_i), if min_i d_i < d_safe; 0, otherwise } (18)
where d_i is the ray length of the i-th radar, i.e. the detected distance from the quadrotor UAV to an obstacle or to another quadrotor UAV, and d_safe is the prescribed safety distance between the UAV and obstacles.
Finally, an angle reward R_a between the quadrotor UAV and the target point is set to drive the UAV toward the target direction; the larger β_n is, the larger the penalty. R_a is expressed as follows:
R_a = -β_n (19)
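A hedged sketch of the three-part reward described above follows; the piecewise obstacle penalty, the unscaled angle penalty, and the safety-distance value are assumptions made for illustration:

```python
# Hedged sketch of the three-part reward of eqs. (17)-(19); the piecewise
# obstacle penalty and the safety distance value are assumptions.
D_SAFE = 0.3  # safety distance d_safe (placeholder)

def distance_reward(d_prev, d_next):
    """Eq. (17): positive when the UAV moved closer to the target."""
    return d_prev - d_next

def obstacle_reward(ray_lengths, d_safe=D_SAFE):
    """Eq. (18): penalize any lidar ray shorter than the safety distance."""
    closest = min(ray_lengths)
    return -(d_safe - closest) if closest < d_safe else 0.0

def angle_reward(beta):
    """Eq. (19): the larger the heading error beta_n, the larger the penalty."""
    return -beta

def reward(d_prev, d_next, ray_lengths, beta):
    """Total reward R = R_t + R_o + R_a."""
    return distance_reward(d_prev, d_next) + obstacle_reward(ray_lengths) \
        + angle_reward(beta)
```

Dense shaping terms of this form are one common way to mitigate the sparse-reward problem discussed in the Background section.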
further, the step S3 specifically includes the following steps:
the ITD3 algorithm is obtained by N-step report and priority experience playback improvement TD3 algorithm, the ITD3 algorithm is composed of four sub-networks, namely two critics networks and two actors networks, and the improved deep reinforcement learning algorithm is realized by the ITD3 algorithm.
Firstly, introducing N-step returns into a TD3 algorithm, wherein the N-step returns add the returns of N time steps in the future to provide more comprehensive information than single-step returns;
in the case of sparse rewards, most state transitions p (s' |s, a) have no rewards information, and the one-step rewards will not be valid; n-step rewards alleviate the problem of rewarding sparseness by sampling N transfers.
The critic-network equation of the TD3 algorithm is modified by adding the N-step return; in the j-th round of sampling, the modified temporal-difference error δ_j is as follows:
δ_j = Σ_{k=0}^{N-1} γ^k · r_k + γ^N · Q'(s_N, a_N; φ') - Q(s, a; φ) (20)
where φ and φ' are the parameters of the critic networks and their targets, k indexes the k-th step of the return, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the target state and action after N steps, Q(s, a; φ) is the value function of the critic network, Q'(s_N, a_N; φ') that of the target critic network, and γ is the discount factor.
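Equation (20) can be sketched directly; here the critic and target-critic values are stand-in scalars rather than network outputs, and the discount factor is an illustrative value:

```python
# Sketch of the N-step temporal-difference error of eq. (20). q_sa and
# q_target_sN_aN stand in for Q(s, a; phi) and Q'(s_N, a_N; phi').
GAMMA = 0.99  # discount factor (illustrative value)

def n_step_td_error(rewards, q_sa, q_target_sN_aN, gamma=GAMMA):
    """delta = sum_{k=0}^{N-1} gamma^k r_k + gamma^N Q'(s_N, a_N) - Q(s, a)."""
    n = len(rewards)  # N, the number of lookahead steps
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** n) * q_target_sN_aN - q_sa
```

With N = 1 this reduces to the standard one-step TD error, which is the sense in which the N-step return generalizes the original TD3 target.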
Second, prioritized experience replay is used in the original TD3 algorithm. At sampling time, the sampling probability of the j-th transition is defined as P(j), expressed as follows:
P(j) = p_j^α / Σ_i p_i^α (21)
where p_j is the priority of the j-th experience and α is a constant that adjusts the sampling weight: it determines how much prioritization is used, and when α equals 0, uniform random sampling is employed.
Then, the sampling weight w_j of each transition used to update the network is calculated by the following formula, which represents the importance of each transition; M is the size of the minibatch, β is the importance-sampling exponent, and max_i w_i is the maximum sampling weight, used for normalization:
w_j = (M · P(j))^{-β} / max_i w_i (22)
finally, using proportion prioritization, updating the transferred priority according to the time difference error, wherein the priority is shown in the following formula:
p j =|δ j |+∈ (23)
wherein delta j Representing a time difference error, e represents a small value preset to avoid a 0 priority.
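The prioritized-replay bookkeeping of equations (21)-(23) can be sketched as plain functions over a list of priorities; the α, β, and ε values below are illustrative, not taken from the patent:

```python
# Minimal prioritized-experience-replay sketch of eqs. (21)-(23);
# ALPHA, BETA and EPS are illustrative values, not from the patent.
ALPHA, BETA, EPS = 0.6, 0.4, 1e-6

def sampling_probabilities(priorities, alpha=ALPHA):
    """Eq. (21): P(j) = p_j^alpha / sum_i p_i^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta=BETA):
    """Eq. (22): w_j = (M * P(j))^-beta, normalized by max_i w_i."""
    m = len(probs)
    raw = [(m * p) ** -beta for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]

def updated_priority(td_error, eps=EPS):
    """Eq. (23): p_j = |delta_j| + eps."""
    return abs(td_error) + eps
```

Setting alpha=0 recovers uniform sampling, matching the remark above, and the rarest (lowest-probability) transition receives the largest importance weight.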
Further, the step S4 specifically includes the following steps:
the ITD3 network training network is implemented by a two-part neural network: an actor network consisting of three fully connected layers performs mapping of states to actions, and a reviewer network that uses four fully connected layers to estimate Q-values.
In the ITD3 network, for both actor networks, the input is state and the output is action. The reviewer network takes as its input the state-action pairs and generates a state-action value function (Q-value). The ITD3 algorithm training process is specifically as follows:
First a small set of samples (s, a, s ', r) is preferentially extracted from the experience playback buffer, s ' is input into the actor's target network. Then, a ' is obtained next time, and the state-action pair (s ', a ') is input to the critique target network.
After obtaining the two target Q-values Q'_1(s', a') and Q'_2(s', a'), the smaller one is selected to compute the target value function y(r, s'), whose expression is as follows:
y(r, s') = r + γ · min_{i=1,2} Q'_i(s', a'; φ'_i) (24)
where r is the reward, the discount factor γ takes the same value as in formula (20), and φ'_i are the parameters of the target critic networks.
On the other hand, (s, a) is input into the two critic networks to obtain two Q-values, Q_1(·) and Q_2(·). These are used to compute the mean-squared error against y(r, s'), and the sum of the two mean-squared errors is back-propagated to update the parameters of both critic networks, with the N-step return incorporated into the temporal-difference-error update.
Next, the Q-value obtained from the first critic network is input into the actor network, and the actor network's parameters are updated in the direction in which the Q-value increases (once every two iterations).
Finally, all target networks are updated by the soft-update method.
After training, the angular velocity and linear velocity of the quadrotor UAVs are controlled through the output action information, so that each quadrotor UAV reaches the specified target point in the shortest time, completing the assembly task path planning.
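The core quantities of one ITD3 update step described above can be condensed into a few functions; the critics are stand-in scalars and the soft-update rate τ is an illustrative value:

```python
# Condensed sketch of one ITD3 update: the target of eq. (24), the summed
# critic loss, and the soft target update. TAU is an illustrative value.
GAMMA, TAU = 0.99, 0.005

def target_value(r, q1_next, q2_next, gamma=GAMMA):
    """Eq. (24): y = r + gamma * min(Q1'(s', a'), Q2'(s', a'))."""
    return r + gamma * min(q1_next, q2_next)

def critic_loss(q1, q2, y):
    """Sum of the two critics' squared errors against the target y."""
    return (q1 - y) ** 2 + (q2 - y) ** 2

def soft_update(target_params, params, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```

Taking the minimum of the two target critics is the clipped double-Q trick that TD3, and hence ITD3, uses to curb overestimation of the value function.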
The beneficial effects of the invention are as follows: the method first constructs a PyBullet-based multi-quadrotor Gym environment and sets a reward function mechanism by abstracting the state space and action space of the quadrotor UAVs; it then makes path planning decisions with an improved deep reinforcement learning algorithm, and finally trains the improved deep reinforcement learning network and controls the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time. The method introduces N-step returns into the TD3 algorithm to obtain more accurate return estimates, faster learning, and better generalization; it uses prioritized experience replay to reduce sampling bias, lessening the model bias caused by unbalanced sampling and improving the stability of the algorithm. This makes the TD3 algorithm better suited to continuous multidimensional decision-making, so that an optimal route can be planned in the shortest time while meeting specified real-time and accuracy requirements.
Drawings
Fig. 1 is a flow chart of the reinforcement-learning-based multi-quadrotor UAV assembly task path planning method.
Fig. 2 is a model view of an "x"-type quadrotor UAV in an embodiment of the invention.
Fig. 3 is a schematic diagram of laser radar detecting environmental information in a horizontal direction in an embodiment of the present invention.
Fig. 4 is a state diagram of a quad-rotor unmanned helicopter in an embodiment of the invention.
Fig. 5 is a schematic diagram of an ITD3 algorithm in an embodiment of the invention.
Fig. 6 is a diagram of a neural network in ITD3 according to an embodiment of the present invention.
Detailed Description
The process according to the invention is further described below with reference to the figures and examples.
As shown in fig. 1, the invention is a reinforcement-learning-based assembly task path planning method for multiple quadrotor UAVs, with the following specific steps:
s1, constructing a PyBullet-based multi-quad-rotor unmanned aerial vehicle Gym environment;
s2, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle, and setting a reward function mechanism to enable the unmanned aerial vehicle to interact with the environment;
s3, performing path planning decisions using an improved deep reinforcement learning algorithm, planning a path for each quadrotor UAV under the assembly task;
s4, training the improved deep reinforcement learning network, and controlling the angular velocity and linear velocity of the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target point in the shortest time.
In this embodiment, the step S1 is specifically as follows:
s11, constructing a dynamics simulation model of the multi-quad-rotor unmanned helicopter;
PyBullet is used to build a force and torque model for each quadrotor UAV in Gym, and the dynamics equations of all drones are calculated and updated using the physics engine.
As shown in fig. 2, this embodiment constructs a simplified dynamics model of the "x"-type quadrotor UAV: the arm length of each UAV is set to L, the mass to m, and the inertial properties to J; the physical constants and convex collision shapes are described by separate URDF files, using the "x"-type quadrotor configuration.
First, the gravitational acceleration g and the physics step frequency (finer than the control frequency of the Gym steps) are set in PyBullet; in addition to physical properties and constants, the URDF information can be used in PyBullet to load CAD models of the quadrotors. The force F_i applied by each of the 4 motors and the torque T_i about the drone's Z-axis are proportional to the square of the motor speed P_i:
F_i = k_F · P_i^2 (1)
T_i = k_T · P_i^2 (2)
where k_F and k_T are predetermined constants.
F_i and P_i depend linearly on the input pulse width modulation (Pulse Width Modulation, PWM). Setting the model to real-time control, the dynamics equation of the quadrotor UAV is expressed as follows:
J^T · T_o = M·a + h (3)
where J is the Jacobian matrix, M the inertia matrix, a the generalized acceleration, h the Coriolis and gravity effects, and the superscript T denotes the transpose operation.
In practice, flying near the ground or near other UAVs may create additional aerodynamic effects, such as the forces G_{i=0,1,2,3}, D, and W in FIG. 2. They can be modeled separately and used in combination in PyBullet, and include: the propeller drag D, the ground effect G_i acting on each motor, and the downwash effect W acting on the centroid.
The rotating propellers of the quadrotor UAV create a drag force D that acts opposite to the direction of motion. D is proportional to the UAV's linear velocity Xdot, the propellers' angular velocity, and the drag coefficient matrix k_D:
D = -k_D · (2π Σ_i P_i / 60) · Xdot (4)
where 2π P_i / 60 converts the motor speed P_i from revolutions per minute into the propeller's angular velocity. To simulate cross-coupling, a matrix k_D containing 9 coefficients would have to be fitted, but the fit requires some symmetry: the drag coefficients and the cross-coupling between the x and y axes should be the same; by symmetry, wind speed in the z direction produces the same force in the x and y directions; and the drag in the z direction caused by speed in the x direction should equal that caused by speed in the y direction. Thus, the drag coefficient matrix k_D has the specific form:
k_D = diag(k_⊥, k_⊥, k_∥) (5)
where k_⊥ is the perpendicular drag coefficient, k_∥ the parallel drag coefficient, and the matrix k_D is fitted to data using least squares.
When hovering at very low altitude there is a ground effect: the thrust experienced by the quadrotor UAV increases due to the interaction of the propeller airflow with the ground. Its contribution G_i to each motor is proportional to the propeller radius r_P, the motor speed P_i, the altitude h_i, and a constant k_G:
G_i = k_G · (r_P / (4·h_i))^2 · P_i^2 (6)
When two quadrotors pass through the same horizontal position at different heights, a downwash effect exists: the downwash reduces the lift of the lower aircraft. The influence of the downwash is simplified to a single force applied at the centre of mass of the lower drone, whose magnitude W depends on the displacement (δ_x, δ_y, δ_z) between the two drones in the x, y, z coordinate system and on experimentally determined constants k_D1, k_D2, k_D3. The expression of W is as follows:

W = k_D1 · (r_P/(4·δ_z))² · exp(−(1/2)·(δ_xy/(k_D2·δ_z + k_D3))²) (7)

where δ_xy = √(δ_x² + δ_y²) is the horizontal distance between the two drones.
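The downwash model just described (a single force on the lower drone's centre of mass, largest directly underneath and decaying with horizontal offset) can be sketched as follows. The closed form and the constants k_D1, k_D2, k_D3, r_P used here are assumptions in the style of common quadrotor simulators, not the patent's fitted values.

```python
import numpy as np

# Hedged sketch of a downwash force: a Gaussian decay in the horizontal
# offset delta_xy, scaled by an inverse-square term in the vertical
# separation delta_z. All constants are illustrative stand-ins.
def downwash_force(delta_x, delta_y, delta_z, r_P=0.023,
                   k_D1=2267.18, k_D2=0.16, k_D3=-0.11):
    delta_xy = np.hypot(delta_x, delta_y)
    strength = k_D1 * (r_P / (4.0 * delta_z)) ** 2
    spread = k_D2 * delta_z + k_D3
    return strength * np.exp(-0.5 * (delta_xy / spread) ** 2)

# Directly below (delta_xy = 0) the loss of lift is largest; it decays laterally.
w_center = downwash_force(0.0, 0.0, 1.0)
w_offset = downwash_force(0.5, 0.0, 1.0)
```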
s12, constructing an observation space and an action space of the multi-four-rotor unmanned aerial vehicle;
In the constructed Gym environment, the quadrotor outputs an observation vector after executing each action; the observation space of the quadrotors is expressed as follows:

{n: [X_n, q_n, r_n, p_n, y_n, ẋ_n, ω_n, P_n]} (8)

where n ∈ [0, ..., N) is the index of the quadrotor; X_n = [x, y, z]_n is the position of the quadrotor; q_n is the quaternion used for attitude control of the quadrotor; r_n, p_n, y_n are the Roll, Pitch and Yaw angles respectively, i.e. the three angles used for attitude estimation; ẋ_n = [v_x, v_y, v_z]_n is the linear velocity of the nth quadrotor; ω_n = [ω_x, ω_y, ω_z]_n is the angular velocity of the nth quadrotor; and P_n = [P_0, P_1, P_2, P_3]_n gives the motor speeds of each drone.
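A minimal sketch of packing the per-drone observation of formula (8) into one flat vector; the function name, argument names, and ordering are illustrative assumptions, not the patent's code.

```python
import numpy as np

# Illustrative packing of one drone's observation: position, quaternion,
# Euler angles, linear/angular velocity, and the four motor speeds.
def make_observation(pos, quat, rpy, lin_vel, ang_vel, motor_rpm):
    obs = np.concatenate([pos, quat, rpy, lin_vel, ang_vel, motor_rpm])
    assert obs.shape == (3 + 4 + 3 + 3 + 3 + 4,)  # 20 values per drone
    return obs

obs = make_observation(pos=np.zeros(3), quat=np.array([0, 0, 0, 1.0]),
                       rpy=np.zeros(3), lin_vel=np.zeros(3),
                       ang_vel=np.zeros(3), motor_rpm=np.full(4, 14468.0))
```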
As shown in fig. 3, in the present embodiment, the quadrotor unmanned aerial vehicle uses the lidar to detect the obstacle, k lidars are set in the model for the quadrotor unmanned aerial vehicle, and the environment is observed using the k lidars.
The k lidars scan over an angular range of π (k = 24 in this embodiment), with an angle of 2π/k between two lasers; (d_1, ..., d_k) are the ray lengths of the k radars in the horizontal plane. If a sensor detects no object within the limited distance, the ray length is the maximum detectable distance d_max; otherwise it is the distance between the drone and the point detected by the radar. The ray length d_i of the ith radar is expressed as follows:

d_i = min(d_i^obs, d_max) (9)

where d_i^obs is the distance to the nearest object along ray i (taken as infinite when nothing is detected within d_max).
then the environment information s E The definition is as follows:
s E =[ρ i ,d i ] T ,i=1...k (10)
where ρ_i is the one-hot code of the ith radar: ρ_i is 1 if the radar detects an object within the limited distance and 0 otherwise, expressed as follows:

ρ_i = { 1, if d_i < d_max; 0, otherwise } (11)
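The lidar processing of formulas (9) to (11) can be sketched as follows, assuming a maximum detectable distance d_max and using np.inf to mark rays that hit nothing; the function name and layout are illustrative.

```python
import numpy as np

# Sketch of Eqs. (9)-(11): each of k rays reports either the hit distance or
# the maximum range, plus a one-hot detection flag rho_i.
def lidar_environment(hit_distances, d_max=5.0):
    """hit_distances: length-k array with np.inf where nothing was detected."""
    d = np.minimum(hit_distances, d_max)          # Eq. (9)
    rho = (hit_distances < d_max).astype(float)   # Eq. (11)
    return np.stack([rho, d], axis=0)             # s_E, Eq. (10)

s_E = lidar_environment(np.array([1.2, np.inf, 3.7, np.inf]))
```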
then for any one quadrotor unmanned aerial vehicle, the action space expression is as follows:
{n:[v x ,v y ,v z ,v M ] n } (12)
wherein [ v ] x ,v y ,v z ,v M ] n Representing the speed of input to a quadrotor drone, v x ,v y ,v z Representing the components of a unit vector, v M Indicating the magnitude of the desired velocity; and the action space can be represented by the rotating speeds of four motors, and the expression is as follows:
{n:[P 0 ,P 1 ,P 2 ,P 3 ] n } (13)
finally, the conversion of the input into pulse width modulation (Pulse Width Modulation, PWM) and motor speed is delegated to a controller consisting of position and attitude control subroutines.
In this embodiment, the step S2 is specifically as follows:
s21, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle;
The state of quadrotor n includes: the position of the quadrotor; the quaternion q_n (used for attitude control); the roll angle r_n, pitch angle p_n and yaw angle y_n; the linear velocity ẋ_n; the angular velocity ω_n; the motor speeds P_n = [P_0, P_1, P_2, P_3]; the angle β_n between the first-person view direction of drone n and the line to the target, as shown in fig. 4; and the distance d_0n between the global coordinates (x, y, z) of the nth drone and the global coordinates (x_t, y_t, z_t) of the target.
In order to enable the unmanned aerial vehicle to reach the target faster, the convergence speed is improved, and the global position of the unmanned aerial vehicle is replaced by the relative position delta X of the unmanned aerial vehicle and the target in a state n I.e. [ Deltax, deltay, deltaz ]] n Then the status s of the unmanned aerial vehicle U The following are provided:
Figure BDA0004198555500000121
from the state s of the unmanned aerial vehicle U And the laser radar detected environmental state s E The state space s of the quadrotor unmanned aerial vehicle can be obtained, and the expression is as follows:
Figure BDA0004198555500000122
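A hedged sketch of assembling the state of formula (15) by concatenating the drone state s_U and the flattened lidar state s_E; the k = 24 rays follow the embodiment, while the function signature and exact ordering are invented for illustration.

```python
import numpy as np

# Sketch of Eq. (15): target-relative position Delta X_n replaces the global
# position, then s_U and the flattened lidar state s_E are concatenated.
def build_state(delta_x, quat, rpy, lin_vel, ang_vel, motor_rpm,
                beta, d0, s_E):
    s_U = np.concatenate([delta_x, quat, rpy, lin_vel, ang_vel,
                          motor_rpm, [beta, d0]])      # 22 values
    return np.concatenate([s_U, s_E.ravel()])          # + 2*k lidar values

k = 24
state = build_state(np.zeros(3), np.array([0, 0, 0, 1.0]), np.zeros(3),
                    np.zeros(3), np.zeros(3), np.zeros(4),
                    beta=0.1, d0=2.5, s_E=np.zeros((2, k)))
```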
The action space of the quadrotor environment consists of the velocities input to the quadrotor; referring to formula (12), the action-space expression for any one quadrotor is as follows:
a=[v x ,v y ,v z ,v M ] T (16)
s22, setting a reward function mechanism to enable the quadrotor unmanned aerial vehicle to interact with the environment;
The design of the reward function strongly influences the performance of the deep reinforcement learning model and determines the drone's policy. The reward function R(s, a) is the environmental feedback obtained by taking action a in state s and is used to evaluate the quality of the action taken in the current state. If R(s, a) is large, taking action a in the current state s is beneficial to achieving the goal, so the probability of taking action a in state s will increase at the next policy update; otherwise, that probability will decrease.
In order to make the quadrotor unmanned aerial vehicle reach the gathering task target point as soon as possible, in this embodiment, a reward function composed of three parts is set to make the quadrotor unmanned aerial vehicle reach the gathering task target point as soon as possible, which is specifically as follows:
firstly, setting a distance between a quadrotor unmanned plane and a target pointLeave rewards R t Forcing the quadrotor unmanned aerial vehicle to reach the target, R t The settings were as follows: if the unmanned plane approaches the target, the rewards are positive, and the rewards are maximum when the unmanned plane reaches the target point; if the unmanned aerial vehicle is far away from the target, the reward is negative, and if the unmanned aerial vehicle does not reach the target for more than the preset time, the punishment is maximally-5. R is R t The expression is as follows:
Figure BDA0004198555500000123
wherein d 0 Representing the distance of the quadrotor unmanned from the target,
Figure BDA0004198555500000124
represents the distance of the nth quad-rotor unmanned helicopter from the target,/->
Figure BDA0004198555500000125
Indicating the distance from the target to the quadrotor drone at the nth next time.
Secondly, distance between the quadrotor unmanned plane and the obstacle is set to be rewarded R o Make unmanned aerial vehicle keep away from barrier setting, set up R o The following are provided: if the distance between the drone and the nearest obstacle is less than d safe The drone will be penalized; if the unmanned aerial vehicle collides with the obstacle, punishment is-3; if the distance between the drone and the nearest obstacle is less than d safe The unmanned aerial vehicle is safe and cannot be punished. R is R o The expression is as follows:
Figure BDA0004198555500000131
wherein d i The ray length of the ith radar is represented, namely the detection distance from the quadrotor unmanned aerial vehicle to an obstacle or other quadrotor unmanned aerial vehicle,
Figure BDA0004198555500000132
indicating the distance from the nth four-rotor unmanned plane to the target, and settingSafety distance d between unmanned aerial vehicle and obstacle safe
Finally, setting an angle reward R between the quadrotor unmanned aerial vehicle and the target point a To promote the unmanned plane to approach to the target direction, if beta n The larger (as shown in FIG. 4) the larger the penalty, the more R a The expression is as follows:
Figure BDA0004198555500000133
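The three-part reward can be sketched as one function. The −5 timeout penalty and −3 collision penalty follow the text; the remaining branch values, the 0.1 angle-penalty weight, and the thresholds are illustrative stand-ins.

```python
# Hedged sketch of the three-part reward R_t + R_o + R_a (Eqs. 17-19).
def reward(d_prev, d_curr, d_obs_min, beta,
           d_safe=0.5, reached_eps=0.1, collided=False, timed_out=False):
    # Distance reward R_t: positive when approaching, largest at the target,
    # -5 when the episode times out before reaching it.
    if d_curr < reached_eps:
        r_t = 5.0
    elif timed_out:
        r_t = -5.0
    else:
        r_t = d_prev - d_curr
    # Obstacle reward R_o: -3 on collision, penalised inside d_safe, else 0.
    if collided:
        r_o = -3.0
    elif d_obs_min < d_safe:
        r_o = -(d_safe - d_obs_min)
    else:
        r_o = 0.0
    # Angle reward R_a: larger heading error beta gives a larger penalty.
    r_a = -0.1 * abs(beta)
    return r_t + r_o + r_a

r = reward(d_prev=3.0, d_curr=2.8, d_obs_min=1.0, beta=0.0)
```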
in this embodiment, the step S3 is specifically as follows:
In this embodiment, the ITD3 algorithm is used to realise centralised multi-drone task path planning in an unknown environment. The TD3 algorithm addresses the overestimation bias of the Deep Deterministic Policy Gradient (DDPG) algorithm; it is a deterministic-policy reinforcement learning algorithm suited to high-dimensional continuous action spaces.
The TD3 algorithm addresses Q-value estimation errors and excessive variance by using twin Q-networks and a delayed-update strategy. However, when the environment has delayed rewards, TD3 may need more experience to learn how to make correct decisions. To address this problem, this embodiment introduces the N-step return into the TD3 algorithm.
The ITD3 algorithm is obtained by N-step report and priority experience playback improvement TD3 algorithm, the ITD3 algorithm is composed of four sub-networks, namely two critics networks and two actors networks, and the improved deep reinforcement learning algorithm is realized by the ITD3 algorithm.
Firstly, introducing N-step returns into a TD3 algorithm, wherein the N-step returns add the returns of N time steps in the future to provide more comprehensive information than single-step returns; therefore, the algorithm can better utilize the delayed return and improve the learning efficiency.
In the case of sparse rewards, most state transitions (State Transitions) p (s' |s, a) have no rewards information, and the one-step rewards will not be valid; n-step rewards environmental rewards sparsity by sampling N transitions (here set to the instance value n=4).
Adding the N-step return to TD3 increases the chance of finding and learning from rewarded transitions, thus improving learning efficiency. The critic-network equation of the TD3 algorithm is modified by adding the N-step return; in the jth round of sampling, the modified temporal-difference (TD) error δ_j is as follows:

δ_j = Σ_{k=0}^{N−1} γ^k · r_k + γ^N · min_{i=1,2} Q′(s_N, a_N; φ′_i) − Q(s, a; φ) (20)

where φ and φ′ are the parameters of the twin critic networks and their targets, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the state and action N steps later, Q(s, a; φ) is the value function of the critic network, Q′(s_N, a_N; φ′) is the value function of the target critic network, and γ is the discount factor.
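A hedged numeric sketch of the N-step temporal-difference target: accumulate N discounted rewards, bootstrap with the smaller of the two target-critic values (the TD3 clipped double-Q trick), and subtract the current Q estimate. The Q-values here are stand-in numbers, not outputs of a trained network.

```python
# Sketch of the N-step TD error in Eq. (20).
def n_step_td_error(rewards, q_current, q1_target_N, q2_target_N, gamma=0.99):
    n = len(rewards)
    g = sum(gamma ** k * r for k, r in enumerate(rewards))   # N-step return
    target = g + gamma ** n * min(q1_target_N, q2_target_N)  # clipped bootstrap
    return target - q_current

delta = n_step_td_error(rewards=[0.0, 0.0, 0.0, 1.0],
                        q_current=0.5, q1_target_N=1.0, q2_target_N=1.2)
```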
Second, each experience in the original TD3 algorithm is uniformly sampled, but without the weight score, learning efficiency is low, and adding priority can solve this problem. Priority experience playback is a technique to enhance the performance of DRL, which prioritizes samples based on the importance of experience, allowing important samples to be sampled more frequently, thereby improving the learning efficiency and performance of the model. At the beginning of a sample, the sampling probability of the jth transition is defined as P (j), expressed as follows:
Figure BDA0004198555500000142
wherein p is j Representing the priority of the j-th experience; alpha denotes a constant for adjusting the sampling weight, which determines how much priority to use, and when alpha is equal to 0, uniform random sampling will be employed.
Then, the sampling weight w for each transition of the update network j Calculated from the following formula, which represents the importance of each transfer data, M represents the size of a small batch (mini-batch), max i w i Representing the maximized sampling weight for normalization:
Figure BDA0004198555500000143
Finally, proportional prioritisation is used: the priority of a transition is updated according to its temporal-difference error, as shown below:

p_j = |δ_j| + ∈ (23)

where δ_j is the temporal-difference error and ∈ is a small preset positive value that prevents a zero priority.
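Formulas (21) to (23) can be sketched together as follows; the values of α, β, and the TD errors are illustrative, with β playing the usual importance-sampling role of prioritised replay.

```python
import numpy as np

# Sketch of proportional prioritised replay: priorities from |TD error| + eps,
# sampling probabilities p^alpha, and max-normalised importance weights.
def per_quantities(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    p = np.abs(td_errors) + eps                 # Eq. (23)
    P = p ** alpha / np.sum(p ** alpha)         # Eq. (21)
    M = len(td_errors)
    w = (M * P) ** (-beta)                      # Eq. (22), before normalisation
    return P, w / w.max()

P, w = per_quantities(np.array([0.1, 1.0, 2.0]))
```

Larger TD errors get larger sampling probabilities but smaller importance weights, which corrects the bias the non-uniform sampling introduces.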
In this embodiment, the step S4 is specifically as follows:
The ITD3 training network is implemented by a neural network with two parts: an actor network consisting of three fully connected layers, which maps states to actions, and a critic network that uses four fully connected layers to estimate Q-values.
In the ITD3 network, for both actor networks, the input is state and the output is action. The reviewer network takes as its input the state-action pairs and generates a state-action value function (Q-value). Then, as shown in fig. 5, the ITD3 algorithm training process is specifically as follows:
first a small set of samples (s, a, s ', r) is preferentially extracted from the experience playback buffer, s ' is input into the actor's target network. Then, a ' is obtained next time, and the state-action pair (s ', a ') is input to the critique target network.
After obtaining two target Q-values
Figure BDA0004198555500000151
And->
Figure BDA0004198555500000152
) Then, the smaller one is selected to calculate the target value function y (r, s'), the target value function expression is as follows:
Figure BDA0004198555500000153
Wherein r represents the return, and the discount factor gamma is the same as the value of formula (20), phi i Is a random parameter of the critics network.
On the other hand, (s, a) is input into the criticizing network to obtain two Q-values (Q 1 (. Cndot.) and Q 2 (. Cndot.)). Then, they are used to calculate the mean square error of y (r, s'), and the sum of the mean square errors is back-propagated to update the parameters of the two commentator networks, and N-step playback is added to the time difference error update.
Next, the Q-value obtained from the first reviewer network is input into the actor model network, and the parameters of the actor network are updated in the direction in which the Q-value increases (once per two iterations).
And finally, updating all target networks by adopting a soft updating method.
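The soft ("Polyak") update in the last step can be sketched as follows; τ = 0.005 is a typical value, not one stated in the text.

```python
import numpy as np

# Soft target update: each target parameter moves a small fraction tau
# toward its online counterpart.
def soft_update(target_params, online_params, tau=0.005):
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]

target = [np.zeros(3)]
online = [np.ones(3)]
target = soft_update(target, online)
```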
The complexity of the neural network is related to the number of samples, following the principle of empirical risk minimisation. Therefore, according to the state space and action space (i.e. the size of the observation) obtained in the preceding steps of this embodiment, the network structure of ITD3 is designed as in fig. 6.
In this embodiment, the ITD3 actor network consists of three FC neural-network layers with 512, 128 and 4 nodes. The first and second layers are activated by Rectified Linear Units (ReLU) and the third by the hyperbolic tangent (tanh), ensuring that the actor output lies within the range [−1, 1]. After the three FCs, the actor maps the input s_t to the drone action command a_t. The critic network estimates the Q-value with four FCs: the input s_t first passes through an FC layer with ReLU activation (1024 nodes), and the resulting vector is then concatenated with the action a_t into a 1028-dimensional vector. The critic network transforms this vector into the Q-value through two further ReLU-activated layers and one tanh-activated layer.
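An illustrative numpy forward pass matching the actor layer sizes described above (512, 128, 4 with ReLU, ReLU, tanh); the random weights stand in for trained parameters, and the assumed input size of 70 is an example, not a value from the text.

```python
import numpy as np

# Stand-in forward pass of the actor: state in, action in [-1, 1]^4 out.
rng = np.random.default_rng(0)
obs_dim = 70                                   # assumed state-vector size

relu = lambda x: np.maximum(x, 0.0)
W1, b1 = rng.normal(scale=0.05, size=(512, obs_dim)), np.zeros(512)
W2, b2 = rng.normal(scale=0.05, size=(128, 512)), np.zeros(128)
W3, b3 = rng.normal(scale=0.05, size=(4, 128)), np.zeros(4)

def actor(s):
    h = relu(W1 @ s + b1)
    h = relu(W2 @ h + b2)
    return np.tanh(W3 @ h + b3)   # [v_x, v_y, v_z, v_M] in [-1, 1]

a = actor(rng.normal(size=obs_dim))
```

A real implementation would use a deep learning framework and train the weights; the point here is only the layer shapes and activations.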
After training, the angular speed and the linear speed of the four-rotor unmanned aerial vehicle are controlled through the output action information, so that each four-rotor unmanned aerial vehicle successfully reaches a specified target task in the shortest time, and the collective task path planning is completed.
In summary, the method of the invention firstly models Gym environment of the four-rotor unmanned aerial vehicle based on PyBullet, and improves portability of the invention. Secondly, N steps of return are utilized in a dual-delay depth deterministic strategy gradient (Twin Delayed Deep Deterministic PolicyGradient, TD 3) algorithm to obtain more accurate return estimation, faster learning speed and better generalization performance; meanwhile, the sampling deviation is reduced by using preferential experience playback, the model deviation caused by unbalanced sampling is reduced, and the stability of the algorithm is improved, so that the TD3 algorithm is more suitable for the continuous multidimensional decision problem. And finally, training an Improved TD3 (ITD 3) network, and controlling the angular speed and the linear speed of the quadrotor unmanned aerial vehicle to enable the quadrotor unmanned aerial vehicle to successfully reach an aggregation task target point in the shortest time under the condition of unknown environment.
The embodiments described above are intended to help the reader understand the principles of the invention; it should be understood that the scope of the invention is not limited to these specific statements and embodiments. Those skilled in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (5)

1. A multi-quad-rotor unmanned helicopter collective task path planning method based on reinforcement learning comprises the following specific steps:
s1, constructing a PyBullet-based multi-quad-rotor unmanned aerial vehicle Gym environment;
s2, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle, and setting a reward function mechanism to enable the unmanned aerial vehicle to interact with the environment;
s3, performing path planning decision by using an improved deep reinforcement learning algorithm, and performing path planning for each four-rotor unmanned aerial vehicle under an aggregation task;
s4, training the improved deep reinforcement learning network, and controlling the angular speed and the linear speed of the four-rotor unmanned aerial vehicle through the output action information, so that each four-rotor unmanned aerial vehicle successfully reaches a specified target task in the shortest time.
2. The method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein in the step S1, the method is specifically as follows:
S11, constructing a dynamics simulation model of the multi-quad-rotor unmanned helicopter;
the four-rotor unmanned aerial vehicle dynamic equation is formed by the motion equation and the aerodynamic effect of the four-rotor unmanned aerial vehicle, so that the construction of the four-rotor unmanned aerial vehicle dynamic simulation model is completed, and the method is as follows:
using PyBullet to establish a force and torque model acting on each quadrotor unmanned aerial vehicle in Gym, and calculating and updating a kinetic equation of all the quadrotor unmanned aerial vehicles by using a physical engine;
setting the arm length of each four-rotor unmanned aerial vehicle to be L, the mass to be m, the inertial property to be J, the physical constant and the convex collision shape to be described through a separate URDF file, and configuring the 'x' -shaped four-rotor unmanned aerial vehicle;
First, the gravitational acceleration g and the physical step frequency are set in PyBullet. The force F_i applied by each of the 4 motors and the torque T_o about the Z-axis of the drone are proportional to the square of the motor speed P_i; F_i and T_o are expressed as follows:

F_i = k_F · P_i² (1)

T_o = k_T · (P_0² − P_1² + P_2² − P_3²) (2)

where k_F and k_T are predetermined constants;
setting the real-time control of the model, the kinetic equation of the quadrotor unmanned plane is expressed as follows:
J T T o =Ma+h (3)
where J is the Jacobian matrix, M the inertia matrix, a the generalised acceleration, and h the Coriolis and gravity effects; the superscript T denotes transposition;
In practice, flying near the ground or near other drones creates additional aerodynamic effects; these are modelled individually and used in combination in PyBullet, and include: the propeller drag D, the ground effect G_i acting on each single motor, and the downwash effect W acting on the centre of mass;
The rotating propellers of the quadrotor produce a drag D, which is proportional to the quadrotor's linear velocity ẋ_n, the angular velocity of the propellers, and the constant drag-coefficient matrix k_D, expressed as follows:

D = −k_D · (Σ_{i=0}^{3} 2π·P_i/60) · ẋ_n (4)

where Σ_{i=0}^{3} 2π·P_i/60 denotes the total angular velocity of the propellers, the factor 2π/60 converting each motor speed P_i from revolutions per minute (60 s) to radians per second; the constant drag-coefficient matrix k_D has the specific expression:

k_D = diag(k_⊥, k_⊥, k_∥) (5)

where k_⊥ denotes the perpendicular (vertical) drag coefficient and k_∥ the parallel drag coefficient, and the matrix k_D is fitted to the data by the least squares method;
When hovering at very low altitude there is a ground effect, whose contribution G_i at each motor is related to the propeller radius r_P, the motor speed P_i, the height h_i and a constant k_G by the following proportional equation:

G_i = k_G · (r_P/(4·h_i))² · P_i² (6)
When two quadrotors pass through the same horizontal position at different heights, a downwash effect exists; its influence is simplified to a single force applied at the centre of mass of the lower drone, whose magnitude W depends on the displacement (δ_x, δ_y, δ_z) between the two drones and on experimentally determined constants k_D1, k_D2, k_D3. The expression of W is as follows:

W = k_D1 · (r_P/(4·δ_z))² · exp(−(1/2)·(δ_xy/(k_D2·δ_z + k_D3))²) (7)

where δ_xy = √(δ_x² + δ_y²);
s12, constructing an observation space and an action space of the multi-four-rotor unmanned aerial vehicle;
In the constructed Gym environment, the quadrotor outputs an observation vector after executing each action; the observation space of the quadrotors is expressed as follows:

{n: [X_n, q_n, r_n, p_n, y_n, ẋ_n, ω_n, P_n]} (8)

where n ∈ [0, ..., N) is the index of the quadrotor; X_n = [x, y, z]_n is the position of the quadrotor; q_n is the quaternion used for attitude control of the quadrotor; r_n, p_n, y_n are the Roll, Pitch and Yaw angles respectively, i.e. the three angles used for attitude estimation; ẋ_n = [v_x, v_y, v_z]_n is the linear velocity of the nth quadrotor; ω_n = [ω_x, ω_y, ω_z]_n is the angular velocity of the nth quadrotor; and P_n = [P_0, P_1, P_2, P_3]_n gives the motor speeds of each drone;
in the invention, the four-rotor unmanned aerial vehicle uses the laser radar to detect the obstacle, k laser radars are set in the model, and the k laser radars are used for observing the environment;
The k lidars scan over an angular range of π, with an angle of 2π/k between two lasers; (d_1, ..., d_k) are the ray lengths of the k radars in the horizontal plane; the ray length d_i of the ith radar is expressed as follows:

d_i = min(d_i^obs, d_max) (9)

where d_i^obs is the distance to the nearest object along ray i (taken as infinite when nothing is detected within the maximum detectable distance d_max);
Then the environment information s E The definition is as follows:
s E =[ρ i ,d i ] T ,i=1...k (10)
where ρ_i is the one-hot code of the ith radar, expressed as follows:

ρ_i = { 1, if d_i < d_max; 0, otherwise } (11)
then for any one quadrotor unmanned aerial vehicle, the action space expression is as follows:
{n:[v x ,v y ,v z ,v M ] n } (12)
wherein [ v ] x ,v y ,v z ,v M ] n Representing the speed of input to a quadrotor drone, v x ,v y ,v z Representing the components of a unit vector, v M Indicating the magnitude of the desired velocity; and the action space can be represented by the rotating speeds of four motors, and the expression is as follows:
{n:[P 0 ,P 1 ,P 2 ,P 3 ] n } (13)
finally, the conversion of the input into pulse width modulation PWM and motor speed is delegated to a controller consisting of position and attitude control subroutines.
3. The method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein the step S2 is specifically as follows:
s21, abstracting a state space and an action space of the quadrotor unmanned aerial vehicle;
The state of the quadrotor includes: the position and quaternion q_n of the quadrotor; the roll angle r_n, pitch angle p_n and yaw angle y_n; the linear velocity ẋ_n; the angular velocity ω_n; the motor speeds P_n = [P_0, P_1, P_2, P_3]; the angle β_n between the first-person view direction of the drone and the line to the target; and the distance d_0n between the global coordinates (x, y, z) of the nth drone and the global coordinates (x_t, y_t, z_t) of the target;
Replacing the global position of the unmanned aerial vehicle with the relative position DeltaX of the unmanned aerial vehicle and the target in the state n I.e. [ Deltax, deltay, deltaz ]] n Then the status s of the unmanned aerial vehicle U The following are provided:
Figure FDA0004198555480000041
from the state s of the unmanned aerial vehicle U And the laser radar detected environmental state s E The state space s of the quadrotor unmanned aerial vehicle can be obtained, and the expression is as follows:
Figure FDA0004198555480000042
The action space of the quadrotor environment consists of the velocities input to the quadrotor; referring to formula (12), the action-space expression for any one quadrotor is as follows:

a = [v_x, v_y, v_z, v_M]^T (16)
s22, setting a reward function mechanism to enable the quadrotor unmanned aerial vehicle to interact with the environment;
the reward function R (s, a) represents the environmental feedback resulting from taking action a in state s; setting a reward function consisting of three parts to enable the quadrotor unmanned aerial vehicle to reach an aggregation task target point as soon as possible, wherein the reward function is specifically as follows:
First, a distance reward R_t between the quadrotor and the target point is set to drive the quadrotor toward the target; the expression of R_t is as follows:

R_t = { d_0^n − d_0'^n, before the target is reached; the maximum reward, when the target point is reached; −5, if the preset time is exceeded without reaching the target } (17)

where d_0 denotes the distance of the quadrotor from the target, d_0^n the distance of the nth quadrotor from the target, and d_0'^n the distance of the nth quadrotor from the target at the next time step;
Second, a distance reward R_o between the quadrotor and obstacles is set to keep the drone away from obstacles; the expression of R_o is as follows:

R_o = { −3, if a collision occurs; min_i d_i − d_safe, if min_i d_i < d_safe; 0, otherwise } (18)

where d_i is the ray length of the ith radar, i.e. the detected distance from the quadrotor to an obstacle or to another quadrotor, and d_safe is the preset safety distance between the drone and obstacles;
Finally, setting an angle reward R between the quadrotor unmanned aerial vehicle and the target point a To promote the unmanned plane to approach to the target direction, if beta n The larger the penalty the larger, R a The expression is as follows:
Figure FDA0004198555480000052
4. the method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein the step S3 is specifically as follows:
The TD3 algorithm is improved with the N-step return and prioritised experience replay to obtain the ITD3 algorithm; ITD3 consists of four sub-networks, namely two critic networks and two actor networks, and the improved deep reinforcement learning algorithm is realised by ITD3;
firstly, introducing N-step returns into a TD3 algorithm, wherein the N-step returns add the returns of N time steps in the future to provide more comprehensive information than single-step returns;
in the case of sparse rewards, most state transitions p (s' |s, a) have no rewards information, and the one-step rewards will not be valid; n-step rewards are used for relieving the problem of sparse rewards by sampling N transfers;
The critic-network equation of the TD3 algorithm is modified by adding the N-step return; in the jth round of sampling, the modified temporal-difference error δ_j is as follows:

δ_j = Σ_{k=0}^{N−1} γ^k · r_k + γ^N · min_{i=1,2} Q′(s_N, a_N; φ′_i) − Q(s, a; φ) (20)

where φ and φ′ are the parameters of the twin critic networks and their targets, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the state and action N steps later, Q(s, a; φ) is the value function of the critic network, Q′(s_N, a_N; φ′) is the value function of the target critic network, and γ is the discount factor;
Second, prioritised experience replay is used in the original TD3 algorithm; at sampling time, the sampling probability of the jth transition is defined as P(j), expressed as follows:

P(j) = p_j^α / Σ_i p_i^α (21)

where p_j is the priority of the jth experience and α is a constant adjusting the sampling weight; α determines how strongly priorities are used, and when α equals 0 uniform random sampling is employed;
Then, the sampling weight w_j of each transition used to update the network is computed from the following formula, which expresses the importance of each transition; M is the size of the mini-batch, β is the importance-sampling exponent, and max_i w_i is the maximum sampling weight, used for normalisation:

w_j = (1/(M · P(j)))^β / max_i w_i (22)
finally, using proportion prioritization, updating the transferred priority according to the time difference error, wherein the priority is shown in the following formula:
p j =|δ j |+∈ (23)
Wherein delta j Representing a time difference error, e represents a small value preset to avoid a 0 priority.
5. The method for planning the collective mission path of the multi-quad-rotor unmanned helicopter based on reinforcement learning according to claim 1, wherein the step S4 is specifically as follows:
The ITD3 training network is implemented by a neural network with two parts: an actor network consisting of three fully connected layers, which maps states to actions, and a critic network that uses four fully connected layers to estimate Q-values;
In the ITD3 network, the input of both actor networks is the state and the output is the action; the critic networks take state-action pairs as input and produce the state-action value function (Q-value); the ITD3 training process is as follows:

First, a mini-batch of samples (s, a, s′, r) is drawn by priority from the experience replay buffer, and s′ is input into the actor target network; the next action a′ is thus obtained, and the state-action pair (s′, a′) is input into the critic target networks;

After the two target Q-values Q′_1(·) and Q′_2(·) are obtained, the smaller one is selected to compute the target value function y(r, s′), whose expression is as follows:

y(r, s′) = r + γ · min_{i=1,2} Q′(s′, a′; φ′_i) (24)
wherein r represents the return, and the discount factor gamma is the same as the value of formula (20), phi i Random parameters of the critics network;
on the other hand, (s, a) is input into the criticizing network to obtain two Q-values (Q 1 (. Cndot.) and Q 2 (. Cndot.); then, using them to calculate the mean square error of y (r, s'), and back-propagating the sum of the mean square errors to update the parameters of two critic networks, and adding N-step playback in the time difference error update;
next, the Q-value obtained from the first critic network is input into the actor model network, and the parameters of the actor network are updated in the direction that increases the Q-value (once every two iterations);
finally, updating all target networks by adopting a soft updating method;
after training is completed, the output action information is used to control the angular velocity and linear velocity of each quadrotor UAV, so that every quadrotor UAV reaches its designated target in the shortest time, thereby completing the assembly task path planning.
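The last two steps of the claim, the delayed actor update and the soft update of the target networks, can be sketched as follows. This is an illustrative Python fragment; the coefficient tau = 0.005 is an assumed value, as the claim only names the soft updating method without giving numbers.

```python
# Sketch of delayed policy updates and soft (Polyak) target updates:
# the actor and the target networks are refreshed only once every two critic
# iterations, and each target parameter drifts slowly toward its model
# parameter. tau = 0.005 is an assumed coefficient, not from the claim.

def soft_update(target_params, model_params, tau=0.005):
    """theta_target <- tau * theta_model + (1 - tau) * theta_target."""
    return [(1 - tau) * t + tau * m for t, m in zip(target_params, model_params)]

def actor_update_due(iteration, delay=2):
    """Delayed policy update: the actor trains once every `delay` iterations."""
    return iteration % delay == 0

target = soft_update([0.0, 0.0], [1.0, 2.0])  # each target moves 0.5% toward its model
```

The small tau keeps the targets nearly fixed between updates, which stabilizes the bootstrapped target y(r, s') that the critics regress toward.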
CN202310454330.7A 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning Pending CN116301007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454330.7A CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310454330.7A CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN116301007A true CN116301007A (en) 2023-06-23

Family

ID=86788905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310454330.7A Pending CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116301007A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117215197A (en) * 2023-10-23 2023-12-12 Nankai University Four-rotor aircraft online track planning method, four-rotor aircraft online track planning system, electronic equipment and medium
CN117215197B (en) * 2023-10-23 2024-03-29 Nankai University Quadrotor aircraft online trajectory planning method, system, electronic equipment and media
CN118733901A (en) * 2024-06-25 2024-10-01 Unit 32133 of the Chinese People's Liberation Army PTZ control path search method based on big data

Similar Documents

Publication Publication Date Title
Abdelmaksoud et al. Control strategies and novel techniques for autonomous rotorcraft unmanned aerial vehicles: A review
CN111596684B (en) Fixed-wing unmanned aerial vehicle dense formation and anti-collision obstacle avoidance semi-physical simulation system and method
CN112947572B (en) Terrain following-based four-rotor aircraft self-adaptive motion planning method
Kose et al. Simultaneous design of morphing hexarotor and autopilot system by using deep neural network and SPSA
CN111538255B (en) Aircraft control method and system for an anti-swarm drone
CN113885549B (en) Quadrotor attitude trajectory control method based on dimensionally clipped PPO algorithm
Patel et al. An intelligent hybrid artificial neural network-based approach for control of aerial robots
CN112161626B (en) High-flyability route planning method based on route tracking mapping network
CN115857546B (en) A modular reconfigurable flight array dynamics model and fixed-time sliding mode control method
CN116301007A (en) A multi-quadrotor UAV assembly task path planning method based on reinforcement learning
Houghton et al. Path planning: Differential dynamic programming and model predictive path integral control on VTOL aircraft
Chen Research on AI application in the field of quadcopter UAVs
Basescu et al. Precision post-stall landing using NMPC with learned aerodynamics
Shen et al. Multibody-dynamic modeling and stability analysis for a bird-scale flapping-wing aerial vehicle
Hamissi et al. A new nonlinear control design strategy for fixed wing aircrafts piloting
Flores et al. Implementation of a neural network for nonlinearities estimation in a tail-sitter aircraft
Annamalai et al. Design, modeling and simulation of a control surface-less tri-tilt-rotor UAV
De Almeida et al. Controlling tiltrotors unmanned aerial vehicles (UAVs) with deep reinforcement learning
Abrougui et al. Flight Controller Design Based on Sliding Mode Control for Quadcopter Waypoints Tracking
Adamski Development and deployment of a dynamic soaring capable UAV using reinforcement learning
Kadhim et al. Improving the Size of the Propellers of the Parrot Mini-Drone and an Impact Study on its Flight Controller System.
Yu et al. A novel Brain-inspired architecture and flight experiments for autonomous maneuvering flight of unmanned aerial vehicles
Ekechi Intelligent control of a Swarm of Unmanned Aerial Vehicles in Turbulent Environments Using Clustering-PPO Algorithm
Guo et al. Comparative Study and Airspeed Sensitivity Analysis of Full‐Wing Solar‐Powered UAVs Using Rigid‐Body, Multibody, and Rigid‐Flexible Combo Models
Adamski et al. Towards development of a dynamic soaring capable UAV using reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination