CN115903887B - A hypersonic vehicle trajectory planning method based on reinforcement learning - Google Patents
- Publication number
- CN115903887B CN115903887B CN202211400265.1A CN202211400265A CN115903887B CN 115903887 B CN115903887 B CN 115903887B CN 202211400265 A CN202211400265 A CN 202211400265A CN 115903887 B CN115903887 B CN 115903887B
- Authority
- CN
- China
- Prior art keywords
- aircraft
- action
- track
- space
- fly zone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Abstract
The invention belongs to the technical field of hypersonic aircraft trajectory planning and discloses a hypersonic aircraft trajectory planning method based on reinforcement learning. The method constructs a trajectory planning scene; designs an action set, a state space, and a reward form; and establishes a DQN network. The current state is input to the DQN network, which outputs an evaluation value for each element of the action set. Actions infeasible in the current state are screened out on the basis of a trajectory prediction result and the evaluation values of those infeasible elements are modified; the action element with the highest evaluation value is then taken as the input of the aircraft kinematics equations, and the position, velocity, and state are updated based on the aircraft kinematics. Whether the trajectory planning task has finished is judged from the latest position, and training starts once the task is complete. The method improves the accuracy and speed of hypersonic aircraft trajectory planning and provides powerful support for the safe flight of hypersonic aircraft.
Description
Technical Field
The invention relates to the technical field of hypersonic aircraft trajectory planning, in particular to a hypersonic aircraft trajectory planning method based on reinforcement learning.
Background
During the flight of a hypersonic aircraft, positions along its route may become no-fly zones because of political boundary factors, anti-air and counter-guidance strategic deployments, and similar problems, and these zones must be avoided flexibly. Because a hypersonic aircraft has weak maneuverability, a no-fly zone that appears temporarily is difficult to avoid. Rapid and accurate trajectory planning can therefore improve the aircraft's evasion capability.
Hypersonic aircraft trajectory planning is typically performed with path planning and trajectory optimization methods. A hypersonic aircraft flies far and fast but maneuvers weakly; against this background, traditional methods struggle to account simultaneously for maneuverability constraints, trajectory planning accuracy, and trajectory planning speed.
Therefore, how to provide a hypersonic aircraft trajectory planning method that improves the accuracy and speed of trajectory planning while satisfying the maneuverability constraints has become a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The invention aims to provide a hypersonic aircraft trajectory planning method based on reinforcement learning so as to solve the problems.
The invention solves the technical problems by adopting the following technical scheme:
A hypersonic aircraft trajectory planning method based on reinforcement learning, the method comprising:
Step S1, constructing a trajectory planning scene, wherein planning is performed in the trajectory planning scene according to the acquired flight state of the aircraft and the no-fly-zone constraints;
Step S2, constructing a state space, designing an action set, and designing a reward form;
Step S3, establishing a DQN network, wherein the input variable of the DQN network is the state space constructed in step S2, the output variable is the action space obtained in step S2, and the prediction result is the evaluation value of each element in the step S2 action set;
Step S4, determining element evaluation values, namely inputting the current state into the DQN network and outputting the evaluation value of each element of the action set in the current state;
Step S5, updating element evaluation values, namely screening out the actions infeasible in the current state based on the trajectory prediction result and modifying the evaluation values of those infeasible action elements;
Step S6, inputting trajectory planning control parameters, namely selecting the action element with the highest evaluation value as the input of the aircraft kinematics equations and updating the position, velocity, and state based on the aircraft kinematics;
Step S7, judging whether the trajectory planning task has ended according to the latest position from step S6; if the task-ending condition is met, saving the trajectory and adding it to the training data set, otherwise iteratively executing steps S4 to S7.
A hypersonic aircraft trajectory planning system based on reinforcement learning, the system comprising the following units:
a trajectory planning scene construction unit, used for constructing a trajectory planning scene in which planning is performed according to the acquired flight state of the aircraft and the no-fly-zone constraints;
a learning environment construction unit, used for constructing a state space, designing an action set, and designing a reward form;
a DQN network construction unit, used for establishing a DQN network, wherein the input variable of the DQN network is the constructed state space, the output variable is the action space, and the prediction result is the evaluation value of each element in the action set;
a first execution unit, used for determining element evaluation values, namely inputting the current state into the DQN network and outputting the evaluation value of each element of the action set in the current state;
a second execution unit, used for updating element evaluation values, namely screening out the actions infeasible in the current state based on the trajectory prediction result and modifying the evaluation values of those infeasible action elements;
a third execution unit, used for inputting trajectory planning control parameters, namely selecting the action element with the highest evaluation value as the input of the aircraft kinematics equations and updating the position, velocity, and state based on the aircraft kinematics;
and a judging iteration unit, used for task judgment, namely judging whether the trajectory planning task has ended according to the latest position of the aircraft; if the task-ending condition is reached, the trajectory is saved and added to the training data set, otherwise the steps of the first through third execution units are executed iteratively.
An information storage device stores a program implementing the hypersonic aircraft trajectory planning method based on reinforcement learning.
The beneficial effects are that:
The invention discloses a hypersonic aircraft trajectory planning method based on reinforcement learning that rapidly evaluates the value of each element in the action set with a DQN network and eliminates actions that would enter a no-fly zone by modifying their evaluation values to mark them infeasible, thereby achieving rapid trajectory planning while avoiding no-fly zones.
Drawings
Fig. 1 is a flowchart of a hypersonic aircraft trajectory planning method based on reinforcement learning.
Fig. 2 is a diagram of the scene state of the present invention.
Fig. 3 is a schematic showing the meaning of the elements of the scene state of the present invention.
Fig. 4 shows the structure of the DQN network of the invention.
Fig. 5 is a diagram illustrating infeasible-action screening according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The invention discloses a hypersonic aircraft trajectory planning method based on reinforcement learning, which comprises the following steps:
Step S1, constructing a trajectory planning scene based on the aircraft's flight state and the no-fly-zone constraints;
Step S2, designing an action set, a state space and a reward form, and extracting the picture, the aircraft's initial position and the target point's position from the trajectory planning scene to construct the state space;
Step S3, establishing a DQN network, wherein the input variable is the state space obtained in step S2, the output variable is the action space obtained in step S2, and the prediction result is the evaluation value of each element in the step S2 action set;
Step S4, inputting the current state into the DQN network and outputting the evaluation value of each action element in the current state;
Step S5, screening out the actions infeasible in the current state based on the trajectory prediction result, and modifying the evaluation values of those infeasible action elements;
Step S6, selecting the action element with the highest evaluation value as the input of the aircraft kinematics equations, and updating the position, velocity and state based on the aircraft kinematics;
Step S7, judging whether the trajectory planning task has finished according to the latest position from step S6; if the task-finishing condition is met, saving the trajectory and adding it to the training data set, otherwise iteratively executing steps S4 to S7.
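Taken together, steps S4 to S7 form a greedy rollout loop. The following is a minimal sketch of that loop; `q_values_fn`, `mask_fn`, `step_fn`, and `is_done_fn` are hypothetical stand-ins for the DQN evaluation, the infeasible-action screening, the kinematics update, and the end-of-task test, and none of these names appear in the patent:

```python
import numpy as np

def plan_trajectory(q_values_fn, mask_fn, step_fn, is_done_fn, state, max_steps=1000):
    """One planning episode following steps S4-S7: evaluate every action,
    mask the infeasible ones, apply the best one, and repeat until done."""
    trajectory = [state]
    for _ in range(max_steps):
        q = q_values_fn(state)          # S4: DQN evaluation of each action element
        q = mask_fn(state, q)           # S5: infeasible actions get very low values
        action = int(np.argmax(q))      # S6: greedy action with highest evaluation
        state = step_fn(state, action)  #     advance the kinematics
        trajectory.append(state)
        if is_done_fn(state):           # S7: target reached or no-fly zone entered
            break
    return trajectory
```

Any concrete state representation works with this loop, since the four callables encapsulate all domain knowledge.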
In this embodiment, the step S2 specifically includes:
Designing the action set, state space, and reward form:
the safe areas and no-fly zones are binarized; the binarized picture forms one part of the state, and the relative position of the aircraft and the target forms the other part;
the action space uniformly takes a finite number of acceleration values within the lateral-acceleration range of the aircraft's maneuvering capability;
the reward form gives a positive reward for entering the target area, a negative reward for entering a no-fly zone, and penalties, in the form of negative rewards, for flight time, overload changes and the like;
according to the above steps, a learning environment for reinforcement learning interaction is obtained.
In this embodiment, the step S3 specifically includes:
The DQN network structure is adapted to the state input;
the inputs of the network are a fixed-size three-dimensional picture tensor and the vector of the target's position relative to the aircraft, and the network structure extracts features from the inputs and reduces their dimensionality;
the network consists of convolution layers, pooling layers and fully connected layers.
In this embodiment, the step S5 specifically includes:
Calculate the actions infeasible in the current state:
classify the no-fly zones according to their distance and direction relative to the aircraft, and judge whether normal flight would enter a no-fly zone;
if normal flight would enter a no-fly zone, select the no-fly zone most likely to be entered and the no-fly zones bordering it;
calculate the range of accelerations under which the aircraft would enter the selected no-fly zones;
compare that acceleration range with the elements of the action set, screen out the actions within the range, and reduce their evaluation values.
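The screening above rests on trajectory prediction. One hedged way to realize it is to forward-simulate each candidate acceleration for a short horizon under simple planar kinematics and flag those whose predicted path enters a circular no-fly zone; the function name, the 60 s horizon, and the 1 s step below are our illustrative assumptions, not values from the patent:

```python
import numpy as np

def infeasible_actions(x, y, psi, v, accels, zones, horizon=60.0, dt=1.0):
    """Return the indices of accelerations whose predicted path enters any
    circular no-fly zone (cx, cy, r), using the planar kinematics
    x' = V cos(psi), y' = V sin(psi), psi' = a / V."""
    bad = []
    for i, a in enumerate(accels):
        xi, yi, p = x, y, psi
        for _ in range(int(horizon / dt)):
            xi += v * np.cos(p) * dt           # forward-simulate one step
            yi += v * np.sin(p) * dt
            p += (a / v) * dt
            if any((xi - cx) ** 2 + (yi - cy) ** 2 <= r ** 2 for cx, cy, r in zones):
                bad.append(i)                  # predicted path enters a zone
                break
    return bad
```

For an aircraft flying at 1000 m/s straight toward a zone of radius 5 km centred 30 km ahead, straight flight and the mild turns are flagged, while the strongest turns escape the zone.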
In this embodiment, the step S7 specifically includes:
If the latest position from step S6 has reached the target area or entered a no-fly zone, the states, actions and reward results from the path planning process are added to the training data.
If the latest position from step S6 is in the safe area outside the target area, steps S4 to S7 continue to be executed iteratively.
Example 2
As shown in fig. 1, the invention provides a hypersonic aircraft trajectory planning method based on reinforcement learning, which comprises the following steps:
Step 1, construct the trajectory planning scene based on the aircraft's flight state and the no-fly-zone constraints: acquire the no-fly-zone positions, the no-fly-zone radii, and the start and end positions; each no-fly zone is drawn as a circle in the scene graph.
Step 2, design the action set, state space and reward form, and extract the picture, the aircraft's initial position and the target point's position from the trajectory planning scene to construct the state space.
Specifically, the design of the action set, state space and reward form is as follows:
Step 2.1, design the action space:
The highest maneuvering acceleration of the hypersonic aircraft is 20 m/s², and 9 values are taken uniformly over this range: -20 m/s², -15 m/s², -10 m/s², -5 m/s², 0 m/s², 5 m/s², 10 m/s², 15 m/s² and 20 m/s².
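This nine-element action set is simply a uniform sampling of the ±20 m/s² envelope; a short sketch (the variable names are ours, not the patent's):

```python
import numpy as np

A_MAX = 20.0  # peak lateral maneuvering acceleration, m/s^2

# Nine accelerations taken uniformly over [-A_MAX, A_MAX], in 5 m/s^2 steps.
actions = np.linspace(-A_MAX, A_MAX, 9)
```

Each index into `actions` is one discrete action of the DQN.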
Step 2.2, design the state space:
The state space consists of a binary image and a vector. The image is a thumbnail of the combat scene, as shown in fig. 2: a 2:1 rectangle whose long side is twice the aircraft's furthest flight distance and whose short side equals the furthest flight distance; the aircraft's position corresponds to the midpoint of the bottom long side, and the aircraft's heading points toward the center of the image. The meaning of the elements in fig. 2 is shown in fig. 3. The vector part of the state is the direction and distance of the target relative to the aircraft.
Step 2.3, set the reward values: reaching the target point is rewarded +1000; entering a no-fly zone is rewarded -10000; every 1 s of flight time is rewarded -1; and each selected action is additionally rewarded with the negative of the absolute difference between it and the previously selected action.
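The reward shaping of step 2.3 can be written as a single function. This is a sketch under our own conventions; in particular, whether the time and overload penalties still apply on the terminal step is an assumption, as the patent does not say:

```python
def reward(reached_target, entered_no_fly, dt, action, last_action):
    """Reward from step 2.3: +1000 for the target, -10000 for a no-fly
    zone, -1 per second of flight, and minus the magnitude of the change
    between consecutive selected actions (the overload-change penalty)."""
    r = 0.0
    if reached_target:
        r += 1000.0
    if entered_no_fly:
        r -= 10000.0
    r -= 1.0 * dt                   # -1 per second of flight time
    r -= abs(action - last_action)  # penalise overload changes
    return r
```

For example, flying 1 s into a no-fly zone after switching from 5 m/s² to 0 m/s² combines all three penalties: -10000 - 1 - 5 = -10006.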
Step 3, establish the DQN evaluation network from the state space, action space and reward values obtained in step 2; the network outputs an evaluation value for each action.
Specifically, as shown in fig. 4, the detailed description of the structure of the established DQN network prediction model is as follows:
Step 3.1, the input of the network has two parts: a picture part and a vector part.
Step 3.2, the input picture is m x n and the vector has length 2; both are fed to the DQN network.
Step 3.3, the picture is fed to the convolution-pooling layers to extract features, and these features, rather than raw pixels, are input to the fully connected network: the m x n picture passes through the convolution-pooling layers to yield k features, the k features are concatenated with the 2 elements of the state vector into a vector of length k+2, and the fully connected network maps this length-(k+2) vector to 9 output values.
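The shape flow of steps 3.1 to 3.3 can be sketched at a purely structural level. The sketch below uses untrained random weights and replaces the convolution-pooling stack with a single random projection plus ReLU, so it reproduces only the dimensions (m x n picture -> k features -> concatenation with the 2-vector -> 9 values), not the patent's actual architecture; m, n and k are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 64, 128, 32  # illustrative picture size and feature count

# Stand-in for the convolution-pooling feature extractor: a random linear
# projection of the flattened m x n picture down to k features with a ReLU.
w_feat = rng.standard_normal((k, m * n)) / np.sqrt(m * n)
# Fully connected head over [k features, 2-element target vector] -> 9 values.
w_out = rng.standard_normal((9, k + 2)) / np.sqrt(k + 2)

def forward(picture, rel_target):
    """Shape-level sketch of the DQN forward pass in steps 3.1-3.3."""
    features = np.maximum(0.0, w_feat @ picture.ravel())  # k features
    h = np.concatenate([features, rel_target])            # length k + 2
    return w_out @ h                                      # 9 action values
```

A real implementation would replace `w_feat` with stacked convolution and pooling layers and train both parts end to end.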
Step 4, input the current state into the DQN network and output the evaluation value of each action element in the current state.
Step 5, screen out the actions infeasible in the current state based on the trajectory prediction result, and modify the evaluation values of those infeasible action elements.
Specifically, the action screening method is as follows:
Step 5.1, first check whether a no-fly zone exists within 800 km; if not, no screening is performed; otherwise execute step 5.2.
Step 5.2, calculate the circular motion trajectories tangential to the no-fly zone, as shown in fig. 5, to obtain the infeasible acceleration range, where V denotes the aircraft's flight speed, R_d the radius of the no-fly zone, r_d the distance between the no-fly zone and the aircraft, and η_d the angle between the aircraft's velocity direction and the no-fly zone.
Step 5.3, set the evaluation values of the actions within that range to -∞;
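Setting the infeasible evaluations to -∞ guarantees the greedy selection in step 6 can never pick them. A minimal sketch (the function name is ours):

```python
import numpy as np

def masked_best_action(q_values, infeasible):
    """Steps 5.3 and 6: overwrite the evaluation values of infeasible
    actions with -inf, then return the greedy (argmax) action index."""
    q = np.array(q_values, dtype=float)  # copy; caller's values untouched
    if len(infeasible) > 0:
        q[list(infeasible)] = -np.inf    # masked actions can never win argmax
    return int(np.argmax(q))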
Step 6, select the action element with the highest evaluation value as the input of the aircraft kinematics equations, and update the position, velocity and state based on the aircraft kinematics.
where ẋ denotes the aircraft's speed in the X-axis direction, ẏ its speed in the Y-axis direction, ψ its heading (movement) angle, and V its flight speed;
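The planar kinematics implied by this variable list can be integrated with a simple Euler step. The equations ẋ = V cos ψ, ẏ = V sin ψ, ψ̇ = a/V and the unit time step are our assumptions, consistent with the variables named above but not spelled out in this text:

```python
import math

def step_kinematics(x, y, psi, v, a, dt=1.0):
    """One Euler step of planar point-mass kinematics:
    x' = V cos(psi), y' = V sin(psi), psi' = a / V,
    with the selected lateral acceleration a as the control input."""
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
    psi += (a / v) * dt
    return x, y, psi
```

With zero lateral acceleration the aircraft flies straight along its current heading; a nonzero acceleration turns the heading at the rate a/V.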
Step 7, judge whether the trajectory planning task has finished according to the latest position from step 6; if the task-finishing condition is met, save the trajectory and add it to the training data set, otherwise iteratively execute steps 4 to 7.
Example 3
A hypersonic aircraft trajectory planning system based on reinforcement learning, the system comprising the following units:
a trajectory planning scene construction unit, used for constructing a trajectory planning scene in which planning is performed according to the acquired flight state of the aircraft and the no-fly-zone constraints;
a learning environment construction unit, used for constructing a state space, designing an action set, and designing a reward form;
a DQN network construction unit, used for establishing a DQN network, wherein the input variable of the DQN network is the constructed state space, the output variable is the action space, and the prediction result is the evaluation value of each element in the action set;
a first execution unit, used for determining element evaluation values, namely inputting the current state into the DQN network and outputting the evaluation value of each element of the action set in the current state;
a second execution unit, used for updating element evaluation values, namely screening out the actions infeasible in the current state based on the trajectory prediction result and modifying the evaluation values of those infeasible action elements;
a third execution unit, used for inputting trajectory planning control parameters, namely selecting the action element with the highest evaluation value as the input of the aircraft kinematics equations and updating the position, velocity, and state based on the aircraft kinematics;
and a judging iteration unit, used for task judgment, namely judging whether the trajectory planning task has ended according to the latest position of the aircraft; if the task-ending condition is reached, the trajectory is saved and added to the training data set, otherwise the steps of the first through third execution units are executed iteratively.
The invention discloses a hypersonic aircraft trajectory planning method based on reinforcement learning that rapidly evaluates the value of each element in the action set with a DQN network and eliminates actions that would enter a no-fly zone by modifying their evaluation values, thereby achieving rapid trajectory planning while avoiding no-fly zones and providing beneficial support for the safe flight of hypersonic aircraft.
Aiming at the shortcomings of existing methods, the invention provides a hypersonic aircraft trajectory planning method based on reinforcement learning. On the basis of a model of the trajectory planning scene, convolution-pooling layers and fully connected layers evaluate the value of the current actions, and with action screening and modification of infeasible-action values, the safety and efficiency of trajectory planning are improved and real-time trajectory planning is realized.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.
Claims (7)
1. A hypersonic aircraft trajectory planning method based on reinforcement learning, the method comprising:
Step S1, constructing a trajectory planning scene, wherein planning is performed in the trajectory planning scene according to the acquired flight state of the aircraft and the no-fly-zone constraints;
Step S2, constructing a state space, designing an action set, and designing a reward form;
Step S3, establishing a DQN network, wherein the input variable of the DQN network is the state space constructed in step S2, the output variable is the action space obtained in step S2, and the prediction result is the evaluation value of each element in the step S2 action set;
Step S4, determining element evaluation values, namely inputting the current state into the DQN network and outputting the evaluation value of each element of the action set in the current state;
Step S5, updating element evaluation values, namely screening out the actions infeasible in the current state based on the trajectory prediction result and modifying the evaluation values of those infeasible action elements;
Step S6, inputting trajectory planning control parameters, namely selecting the action element with the highest evaluation value as the input of the aircraft kinematics equations and updating the position, velocity, and state based on the aircraft kinematics;
Step S7, task judgment, namely judging whether the trajectory planning task has ended according to the latest position from step S6; if the task-ending condition is met, saving the trajectory and adding it to the training data set, otherwise iteratively executing steps S4 to S7;
wherein the construction of the state space comprises taking a screenshot of the combat scene and binarizing the safe areas and no-fly zones, the binarized picture forming one part of the state;
the action set uniformly takes a finite number of acceleration values within the lateral-acceleration range of the aircraft's maneuvering capability;
and the reward form gives a positive reward for entering the target area, a negative reward for entering a no-fly zone, and penalties, as negative rewards, for flight time and overload changes.
2. The hypersonic aircraft trajectory planning method based on reinforcement learning as claimed in claim 1, wherein the aircraft flight status includes a start point position and an end point position, and the no-fly zone constraint includes a no-fly zone position and a no-fly zone radius.
3. The hypersonic aircraft trajectory planning method based on reinforcement learning according to claim 1, wherein the DQN network consists of convolution layers, pooling layers and fully connected layers; the inputs of the DQN network are a fixed-size three-dimensional picture tensor and the vector of the target's position relative to the aircraft; and the network structure extracts features from the inputs and reduces their dimensionality.
4. The hypersonic aircraft trajectory planning method based on reinforcement learning according to claim 1, wherein the infeasible actions are determined by classifying the no-fly zones according to their distance and direction relative to the aircraft and judging whether normal flight would enter a no-fly zone;
if normal flight would enter a no-fly zone, the no-fly zone most likely to be entered and the no-fly zones bordering it are selected, the range of accelerations under which the aircraft would enter the selected no-fly zones is calculated, that range is compared with the elements of the action set, the actions within the range are screened out, and their evaluation values are reduced.
5. The hypersonic aircraft trajectory planning method based on reinforcement learning according to claim 1, wherein the evaluation value of an infeasible action is set to be infinitely small.
6. A hypersonic aircraft trajectory planning system based on reinforcement learning, characterized in that it comprises the following units:
a trajectory planning scene construction unit, used for constructing a trajectory planning scene in which planning is performed according to the acquired flight state of the aircraft and the no-fly-zone constraints;
a learning environment construction unit, used for constructing a state space, designing an action set, and designing a reward form, wherein
the construction of the state space comprises taking a screenshot of the combat scene and binarizing the safe areas and no-fly zones, the binarized picture forming one part of the state;
the action set uniformly takes a finite number of acceleration values within the lateral-acceleration range of the aircraft's maneuvering capability;
and the reward form gives a positive reward for entering the target area, a negative reward for entering a no-fly zone, and penalties, as negative rewards, for flight time and overload changes;
a DQN network construction unit, used for establishing a DQN network, wherein the input variable of the DQN network is the constructed state space, the output variable is the action space, and the prediction result is the evaluation value of each element in the action set;
a first execution unit, used for determining element evaluation values, namely inputting the current state into the DQN network and outputting the evaluation value of each element of the action set in the current state;
a second execution unit, used for updating element evaluation values, namely screening out the actions infeasible in the current state based on the trajectory prediction result and modifying the evaluation values of those infeasible action elements;
a third execution unit, used for inputting trajectory planning control parameters, namely selecting the action element with the highest evaluation value as the input of the aircraft kinematics equations and updating the position, velocity, and state based on the aircraft kinematics;
and a judging iteration unit, used for task judgment, namely judging whether the trajectory planning task has ended according to the latest position of the aircraft; if the task-ending condition is reached, the trajectory is saved and added to the training data set, otherwise the steps of the first through third execution units are executed iteratively.
7. An information storage device, characterized in that it stores a program implementing the hypersonic aircraft trajectory planning method based on reinforcement learning according to any one of claims 1 to 5.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211400265.1A CN115903887B (en) | 2022-11-09 | 2022-11-09 | A hypersonic vehicle trajectory planning method based on reinforcement learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115903887A CN115903887A (en) | 2023-04-04 |
| CN115903887B true CN115903887B (en) | 2025-02-14 |
Family
ID=86472085
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211400265.1A Active CN115903887B (en) | 2022-11-09 | 2022-11-09 | A hypersonic vehicle trajectory planning method based on reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115903887B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116430900B (en) * | 2023-05-04 | 2023-12-05 | 四川大学 | Game track planning method of hypersonic warhead based on deep reinforcement learning |
| CN119148521A (en) * | 2024-10-18 | 2024-12-17 | 南京理工大学 | Ultrasonic aircraft dive section track planning method based on deep reinforcement learning |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017004626A1 (en) * | 2015-07-01 | 2017-01-05 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for providing reinforcement learning in a deep learning system |
| US12416925B2 (en) * | 2020-04-30 | 2025-09-16 | Rakuten Group, Inc. | Learning device, information processing device, and learned control model |
| CN113031642B (en) * | 2021-05-24 | 2021-08-10 | 北京航空航天大学 | Method and system for trajectory planning of hypersonic vehicle with dynamic no-fly zone constraints |
| CN115291625B (en) * | 2022-07-15 | 2025-11-04 | 同济大学 | Multi-Agent Hierarchical Reinforcement Learning-Based Multi-UAV Air Combat Decision-Making Method |
Non-Patent Citations (1)
| Title |
|---|
| Research on an intelligent UAV route planning method based on DQN deep reinforcement learning; Li Yanru et al.; Electronic Technology & Software Engineering; 2022-09-15; pp. 5-8 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115903887A (en) | 2023-04-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN115903887B (en) | A hypersonic vehicle trajectory planning method based on reinforcement learning | |
| CN111142522A (en) | An Agent Control Method for Hierarchical Reinforcement Learning | |
| CN112325897A (en) | Path planning method based on heuristic deep reinforcement learning | |
| CN113723615A (en) | Training method and device of deep reinforcement learning model based on hyper-parametric optimization | |
| CN114162146A (en) | Driving strategy model training method and automatic driving control method | |
| CN113614743B (en) | Method and device for controlling a robot | |
| CN113985870B (en) | A path planning method based on meta-reinforcement learning | |
| CN112149691B (en) | Neural network search method and device for binocular vision matching | |
| CN113743603A (en) | Control method, control device, storage medium and electronic equipment | |
| CN118810796A (en) | Intelligent car decision-making method based on driving intention and deep reinforcement learning | |
| JP2023541264A (en) | Automated machine learning method and device | |
| JP2020042496A (en) | Electronic control unit, neural network update system | |
| CN113537396B (en) | A feature fusion method and target detection network based on gating mechanism | |
| CN118709520A (en) | Safety-critical scenario generation system and method for autonomous driving decision-making algorithms | |
| CN117975190A (en) | Imitation learning mixed sample processing method and device based on visual pre-training model | |
| CN114104005B (en) | Decision-making method, device, device and readable storage medium for automatic driving equipment | |
| CN119397418B (en) | Object classification method and device based on distributed multi-agent reinforcement learning | |
| Xie et al. | A distributed multi-agent formation control method based on deep Q learning | |
| CN113435571A (en) | Deep network training method and system for realizing multitask parallel | |
| CN114565815B (en) | Video intelligent fusion method and system based on three-dimensional model | |
| CN117332264A (en) | A training method for a driving trajectory prediction model and a driving trajectory prediction method | |
| CN112990246B (en) | Method and device for building an isolated tree model | |
| CN114611669B (en) | Intelligent decision-making method for chess deduction based on double experience pool DDPG network | |
| CN121069797B (en) | A cooperative control method for unmanned vehicles based on gated hypergraph neural networks | |
| CN113807503A (en) | Autonomous decision-making method and system, device and terminal suitable for smart cars |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |