
CN113628458B - Traffic signal lamp optimization method based on group intelligent reinforcement learning - Google Patents

Traffic signal lamp optimization method based on group intelligent reinforcement learning

Info

Publication number
CN113628458B
CN113628458B (application CN202110914300.0A)
Authority
CN
China
Prior art keywords
network
state
traffic
agents
road
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110914300.0A
Other languages
Chinese (zh)
Other versions
CN113628458A (en)
Inventor
刘双侨
王茂帆
郑皎凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifang Intelligent Technology Co ltd
Original Assignee
Sichuan Yifang Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifang Intelligent Technology Co ltd filed Critical Sichuan Yifang Intelligent Technology Co ltd
Priority to CN202110914300.0A priority Critical patent/CN113628458B/en
Publication of CN113628458A publication Critical patent/CN113628458A/en
Application granted granted Critical
Publication of CN113628458B publication Critical patent/CN113628458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G1/00 - Traffic control systems for road vehicles
    • G08G1/07 - Controlling traffic signals
    • G08G1/081 - Plural intersections under common control
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Traffic Control Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a traffic signal lamp optimization method based on group intelligent reinforcement learning, which comprises the following steps: S1, jointly forming Actor-Critic_global; S2, initializing the parameters of n agents; S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network; S4, inputting S into each agent's Actor network based on the current parameters of the n agents; and so on. In a multi-intersection environment, a model is designed for controlling the traffic lights using the Actor-Critic algorithm framework together with centralized learning and distributed execution among the agents, so that the convergence rate of the algorithm is greatly improved. The invention improves the traffic state and lays a foundation for later applications of group intelligent reinforcement learning to traffic signal control.

Description

Traffic signal lamp optimization method based on group intelligent reinforcement learning
Technical Field
The invention belongs to the field of artificial intelligence (reinforcement learning), and particularly relates to a traffic signal lamp optimization method based on group intelligence reinforcement learning.
Background
Description of the general State of the Art
In the traffic scheduling process, traffic lights are the key to controlling traffic. Traditional traffic signal lights are static: the duration and switching sequence of the phases cannot change dynamically. As traffic complexity increases, such fixed-time lights often become counterproductive. Therefore, a reinforcement learning decision process is added to signal light control: environment feedback is acquired dynamically through road detection devices, so that the state and reward in the decision model remain dynamic and adapt to the environment feedback. Through cooperation and gaming among group intelligence agents, an appropriate decision is made. In recent years, with the deepening of research on group intelligence and game theory, group intelligence has been used in traffic decision-making. Group intelligent information interaction is transmitted through the traffic topology network, and this instant information interaction gives an agent a forecasting effect on upcoming traffic flow, so that an appropriate decision can be taken in advance to relieve traffic congestion. The three key elements of group intelligent reinforcement learning are the states, behaviors and rewards; how they are formulated needs to be determined through continuous traffic simulation close to the real state.
Prior art one closest to the creation of the present invention
Technical contents of prior art one
Reinforcement learning for a single agent is relatively mature: using a distributed framework, an agent is placed at each road intersection and can independently schedule and control its signal lights. Because each agent has high independence and its own resources, a certain efficiency improvement is obtained. Deep reinforcement learning followed, combining reinforcement learning with the perception capability of deep learning.
The defects of the first prior art
Single-agent reinforcement learning has poor coordination because of its distributed structure; information is closed off and effective cooperation cannot form. When an emergency occurs and a single agent stops working, the work of the whole system may stall or even crash. Q-learning is suited to processing discrete states; when deployed in the current traffic environment, even a single intersection can produce tens of thousands of states, while the capacity of a Q table is limited and cannot enumerate them, so Q-learning is not suitable for the traffic environment.
The second prior art closest to the creation of the present invention
Technical contents of the second prior art
Group intelligent reinforcement learning is performed to minimize vehicle travel time or the number of stops at multiple intersections, as in the literature. In a conventional multi-intersection environment, coordination can be achieved by setting the time offsets between the green-light starts of all intersections in the road network. There are also optimization methods, as in the literature, that minimize vehicle travel time and/or the number of stops at multiple intersections instead of optimizing the offset or maximum pressure, which aim to maximize the throughput of the network and thus minimize travel time. Many of these approaches still rely on simplified traffic conditions constructed from static environments or assumptions, and do not guarantee improvement in actual operation.
The defects of the second prior art
As the number of agents grows, the computational cost of centralized training becomes too large. During testing, each agent acts independently, and its adaptation to the dynamic environment needs to be coordinated with the surrounding agents.
Disclosure of Invention
Aiming at the defects of existing methods that optimize traffic organization with centralized reinforcement learning, distributed reinforcement learning agents are used to control multiple intersections and interact with each other. Decentralized communication is more practical and scales better than centralized decision-making, but its model convergence is often unstable and slow.
The purpose of the invention is realized by the following technical scheme:
the traffic signal lamp optimization method based on group intelligent reinforcement learning comprises the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relation; wherein S is the joint state, S_1, S_2, …, S_n are the states of the agents at the current moment, S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment, O_1, O_2, …, O_n are the observed values of the n agents, A_1, A_2, …, A_n are the behaviors of the agents, R_1, R_2, …, R_n are the rewards of the n agents, Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents, Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the agent numbers;
S2, initializing the parameters of the n agents;
the parameters of an agent include the state S, the behavior A, and TD_error;
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, based on the current parameters of the n agents, inputting S into each agent's Actor network; each Actor network selects the behavior A of its agent, the environment gives the corresponding return R according to the state and behavior of the agent and the defined return function, and the state transfers to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents reach the training target preset for the Actor-Critic_global network, obtaining a fully trained traffic signal lamp optimization model;
and S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme.
Preferably, the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the vehicle queue length queue of the roads at the current traffic light intersection;
each index is weighted by a corresponding factor, which is favorable for the convergence of the training result: factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3
where factor_1 = len(green_list); green_list means all traffic lights in the environment, and len(green_list) is the number of all traffic lights in the environment;
factor_1 takes the number of green light phases among the phases; factor_2 and factor_3 take integer values according to the test results.
As a preferred mode, the timing data corresponding to the current phase needs a certain discretization processing, which facilitates later convergence; the specific discretization is as follows:
[discretization formula shown as an image in the original]
preferably, the setting of the behavior a in the step S2 includes:
Acquire action a, where action a represents the phase to which the traffic light is to be changed in the next state; its length is the number of independent traffic light phases, and the action space A can completely represent each phase using One-hot coding.
Preferably, the Actor-Critic network training performed by each of the n agents in step S4 includes the following steps:
A1, initializing the state S, the action A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; since the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; act_prob is then log-transformed as follows, which helps reach convergence faster:
log_prob=log(act_prob)
A3: computing the benefit-oriented loss value exp_v from the TD_error transmitted from the Critic network and the log_prob calculated in step A2, as follows:
exp_v=reduce_mean(log_prob*td_error)
where reduce_mean denotes taking the mean value in the neural network.
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2; the agent performs action a, gets the corresponding reward feedback from the environment, and the agent state switches to state S_next;
A5: training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v;
A6: transmitting the current state S and the state S_next into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A7: using the reward value R obtained from the environment and the V and V_next obtained in step A6, calculating TD_error with the following formula:
TD_error = R + GAMMA * V_next - V
GAMMA: denotes the discount factor in reinforcement learning; the larger GAMMA is, the smaller the attenuation, which means the agent's learning process focuses more on long-term returns; conversely, a smaller GAMMA gives greater attenuation, which means the agent is more concerned with short-term rewards.
A8: the TD_error obtained in step A7 is back-propagated to the Critic network to update the parameters and weights of the agent's Critic network.
A9: transmitting the behavior a and the state S_next of step A4, together with the TD_error acquired in step A7, to the Actor network, and training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v.
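To make steps A1 to A9 concrete, the following is a minimal single-agent sketch in Python (PyTorch). It is offered only as an illustration under stated assumptions: the state dimension, layer sizes, learning rates and the env_step stub are invented for the example and are not specified by the method above; only the TD_error and exp_v computations follow the formulas just given.

import torch
import torch.nn as nn

# Minimal single-agent Actor-Critic update illustrating steps A1-A9.
# STATE_DIM, N_PHASES, layer sizes, learning rates and env_step() are assumptions
# made only for this sketch.
STATE_DIM, N_PHASES, GAMMA = 4, 6, 0.9

actor = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                      nn.Linear(32, N_PHASES), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

def env_step(state, action):
    """Placeholder environment: returns (reward R, next state S_next)."""
    return torch.rand(1).item(), torch.rand(STATE_DIM)

def train_step(state):
    # A2: the Actor outputs act_prob, the probability distribution over behaviors
    act_prob = actor(state)
    # A4: extract the behavior a with the maximum probability and execute it
    a = torch.argmax(act_prob).item()
    reward, next_state = env_step(state, a)

    # A6/A7: Critic values of S and S_next, then TD_error = R + GAMMA * V_next - V
    v, v_next = critic(state), critic(next_state)
    td_error = reward + GAMMA * v_next.detach() - v

    # A8: update the Critic by minimising the squared TD_error
    critic_opt.zero_grad()
    td_error.pow(2).mean().backward()
    critic_opt.step()

    # A3/A5/A9: exp_v = reduce_mean(log_prob * td_error); maximise exp_v,
    # i.e. perform gradient descent on -exp_v
    log_prob = torch.log(act_prob[a] + 1e-8)
    exp_v = (log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    (-exp_v).backward()
    actor_opt.step()
    return next_state

state = torch.rand(STATE_DIM)
for _ in range(3):
    state = train_step(state)

In an actual deployment each of the n agents would run this local update while the global Actor-Critic_global network is updated in step S7; that aggregation step is not shown in the sketch.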
Preferably, the obtaining of the corresponding reward R in step S4 includes the following steps:
R=RNCR(t)+CR*0.3
(1) Road network open traffic rate
The road network smooth traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T, describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used for evaluating the traffic management effect;
RNCR(t) = Σ_{i=1..n} (k_i × l_i) / Σ_{i=1..n} l_i
where RNCR(t) represents the road network open traffic rate in the time period T (T can be 5 min or 3 min), n is the number of road segments contained in the road network, l_i is the length of the i-th road segment, and k_i is a binary function: when the traffic state level of segment i belongs to the acceptable traffic state, k_i = 1, otherwise k_i = 0; when the average speed meanSpeed of a road segment is greater than or equal to 20 km/h, the segment is in an acceptable traffic state; when meanSpeed is less than 20 km/h, the state is unacceptable; the value range of RNCR(t) is [0,1], and the larger the value, the better the road network state, and conversely, the worse the road network state.
(2) Congestion mileage duty ratio
The congestion mileage occupation ratio is the proportion of the length of a congested road section to the length of the whole road network, and describes the operation state of the whole road network:
CR = jamLengthInMetersSum / Σ_{i=1..n} l_i
where CR represents the road network congestion mileage ratio, jamLengthInMetersSum represents the total congested mileage, and l_i is the length of road segment i.
The invention has the beneficial effects that:
under the environment of multiple intersections, a model is designed by controlling traffic lights, an algorithm framework of Actor-Critic is used, meanwhile, a centralized learning and distributed execution method among intelligent agents is used, and the advantages of centralized learning and distributed learning are combined, so that the convergence speed of the algorithm is greatly improved. The invention improves the traffic state and lays a foundation for the application of traffic signal control of later-stage group intelligent reinforcement learning.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of the embodiment.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following descriptions.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments will be described clearly and completely with reference to the accompanying drawings; it is obvious that the described embodiments are some, but not all, embodiments of the present invention. Thus, the following detailed description of the embodiments of the present invention, as presented in the figures, is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention are within the scope of the present invention.
Examples
The traffic signal lamp optimization method based on group intelligent reinforcement learning comprises the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relation; wherein S is the joint state of the n agents at the current moment, S_1, S_2, …, S_n are the states of the agents at the current moment, S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment, O_1, O_2, …, O_n are the observed values corresponding to the n agents, A_1, A_2, …, A_n are the behaviors of the agents, R_1, R_2, …, R_n are the rewards corresponding to the n agents, Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents, Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the agent numbers;
S2, initializing the parameters of the n agents;
the parameters of an agent include the state S, the behavior A, and TD_error;
TD_error measures, each time the agent finishes a behavior A, the difference between the reward feedback obtained from the environment for this behavior and the reward feedback brought by the previous action selection, and is used to judge whether the action selection performed by the Actor network is more reasonable and effective; the role of the Actor network is similar to that of a performer, selecting actions based on a policy, while the Critic network uses TD_error to evaluate whether the action selection performed by the Actor network is more effective.
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, based on the current parameters of the n agents, inputting S into each agent's Actor network; each Actor network selects the behavior A of its agent, the environment gives the corresponding return R according to the state and behavior of the agent and the defined return function, and the state transfers to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents reach the training target preset for the Actor-Critic_global network (the training target is that the road network open traffic rate and the congestion mileage ratio indices reach a good state, or that the training model reaches a convergence state), obtaining a fully trained traffic signal lamp optimization model;
and S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme.
In a preferred embodiment, the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the vehicle queue length queue of the roads converging at the current traffic light intersection;
each index is weighted by a corresponding factor (weight), which is favorable for the convergence of the training result: factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3
where factor_1 = len(green_list); green_list means all traffic lights in the environment, and len(green_list) is the number of all traffic lights in the environment.
factor_1 takes the number of green light phases among the phases; factor_2 and factor_3 take integer values according to the test results: factor_2 = [factor_1 ÷ 3], factor_3 = [factor_1 × 0.7 + factor_2 ÷ 2], where [ ] is the rounding symbol.
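As an illustration, the state calculation above can be written in a few lines of Python. The factor formulas follow the text; the 12-light environment in the example and the discretize_duration helper are assumptions (the exact discretization rule is given only as an image in the original), so this is a sketch rather than the patented implementation.

import math

def discretize_duration(duration):
    """Hypothetical discretization: bucket the timing into 5-second bins (assumption;
    the original gives the exact rule only as an image)."""
    return duration // 5

def compute_state(id_phase, duration, queue, green_list):
    """S = idPhase*factor_1 + duration*factor_2 + queue*factor_3 (formula above)."""
    factor_1 = len(green_list)                            # number of traffic lights in the environment
    factor_2 = math.floor(factor_1 / 3)                   # [factor_1 / 3]
    factor_3 = math.floor(factor_1 * 0.7 + factor_2 / 2)  # [factor_1 * 0.7 + factor_2 / 2]
    return id_phase * factor_1 + discretize_duration(duration) * factor_2 + queue * factor_3

# Example: phase 0, 36 s timing, 14 queued vehicles, 12 traffic lights in the environment
print(compute_state(0, 36, 14, green_list=list(range(12))))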
In a preferred embodiment, the timing data corresponding to the current phase needs a certain discretization processing, which facilitates later convergence; the specific discretization is as follows:
[discretization formula shown as an image in the original]
In a preferred embodiment, the setting of the behavior A in step S2 includes:
acquiring action a, where action a represents the phase to which the traffic light is to be changed in the next state; its length is the number of independent traffic light phases, and the action space A can completely represent each phase using One-hot coding (for example, [1,0,0,0,0] indicates that the traffic light has 5 groups of phases and the current one-hot code indicates the 0th group of phases, counting from 0).
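A tiny illustration of the One-hot action coding; the five-phase example matches the text above.

def one_hot_action(phase_index, n_phases):
    """Encode the target phase (action a) as a one-hot vector over the action space A."""
    a = [0] * n_phases
    a[phase_index] = 1
    return a

print(one_hot_action(0, 5))  # [1, 0, 0, 0, 0]: switch to phase group 0 of 5 (counting from 0)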
In a preferred embodiment, the Actor-Critic network training performed by each of the n agents in step S4 includes the following steps:
A1, initializing the state S, the action A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; since the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; act_prob is then log-transformed as follows, which helps reach convergence faster:
log_prob=log(act_prob)
A3: computing the benefit-oriented loss value exp_v from the TD_error transmitted from the Critic network and the log_prob calculated in step A2, as follows:
exp_v=reduce_mean(log_prob*td_error)
where reduce_mean denotes taking the mean value in the neural network.
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2; the agent performs action a, gets the corresponding reward feedback from the environment, and the agent state switches to state S_next;
A5: training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v;
A6: transmitting the current state S and the state S_next into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A7: using the reward value R obtained from the environment and the V and V_next obtained in step A6, calculating TD_error with the following formula:
TD_error = R + GAMMA * V_next - V
A8: the TD_error obtained in step A7 is back-propagated to the Critic network to update the parameters and weights of the agent's Critic network.
In a preferred embodiment, the obtaining of the corresponding reward R in step S4 includes the following steps:
R=RNCR(t)+CR*0.3
(1) Road network open traffic rate
The road network smooth traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T, describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used for evaluating the traffic management effect;
RNCR(t) = Σ_{i=1..n} (k_i × l_i) / Σ_{i=1..n} l_i
where RNCR(t) represents the road network open traffic rate in the time period T (T can be 5 min or 3 min), n is the number of road segments contained in the road network, l_i is the length of the i-th road segment, and k_i is a binary function: when the traffic state level of segment i belongs to the acceptable traffic state, k_i = 1, otherwise k_i = 0; when the average speed meanSpeed of a road segment is greater than or equal to 20 km/h, the segment is in an acceptable traffic state; when meanSpeed is less than 20 km/h, the state is unacceptable; the value range of RNCR(t) is [0,1], and the larger the value, the better the road network state, and conversely, the worse the road network state.
(2) Congestion mileage duty ratio
The congestion mileage occupancy is the proportion of the length of a congested road section to the length of the whole road network, and describes the overall operation state of the road network:
CR = jamLengthInMetersSum / Σ_{i=1..n} l_i
where CR represents the road network congestion mileage ratio, jamLengthInMetersSum represents the total congested mileage, and l_i is the length of road segment i. A smaller CR indicates a better traffic state.
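The reward R = RNCR(t) + CR * 0.3 can be computed directly from per-segment data. The sketch below assumes the segment lengths, mean speeds and total congested length are read from the traffic simulation; the three-segment network in the example is purely illustrative.

def reward(segments, jam_length_sum):
    """segments: list of (length_m, mean_speed_kmh); jam_length_sum: congested length in metres."""
    total_len = sum(length for length, _ in segments)
    # RNCR(t): share of mileage whose mean speed is >= 20 km/h (acceptable state, k_i = 1)
    rncr = sum(length for length, speed in segments if speed >= 20) / total_len
    # CR: congestion mileage ratio
    cr = jam_length_sum / total_len
    return rncr + cr * 0.3, rncr, cr

# Illustrative network of three segments (length in metres, mean speed in km/h)
r, rncr, cr = reward([(500, 35), (300, 12), (200, 28)], jam_length_sum=250)
print(rncr, cr, r)  # 0.7, 0.25, 0.775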
Experiments show that this setting of the return function helps distinguish the degree of optimization, so that the model has better discrimination capability.
In a preferred embodiment, the original timing of traffic light a is as follows:
0: <phase duration="36" state="rrrGGgrrrGGg"/>
1: <phase duration="4" state="rrryygrrryyg"/>
2: <phase duration="6" state="rrrrrGrrrrrG"/>
3: <phase duration="4" state="rrrrryrrrrry"/>
4: <phase duration="36" state="GGgrrrGGgrrr"/>
5: <phase duration="4" state="yyyrrryyyrrr"/>
the timing shows that the traffic lights have 6 phases in total, and the phase corresponding time duration is 36S,4S,6S,4S,36S,4S.
The state string represents the road connections controlled by the traffic light in each phase. For example, as shown in Fig. 1, the figure shows the state of traffic light a in phase 0.
r and R indicate that the road connection controlled by the traffic signal lamp is in a red light state, g and G indicate a green light state, and y and Y indicate a yellow light state. The distinction between capital and lowercase letters also indicates pass priority, with capital letters having higher priority than lowercase letters.
state = "rrrGGgrrrGGg" means: in this phase, road connections (1)(2)(3) and (7)(8)(9) are in the red (r) light state, and road connections (4)(5)(6) and (10)(11)(12) are in the green (G/g) light state.
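The mapping from a phase state string to per-connection light states can be made explicit with a short Python snippet (the connection numbering follows the example above).

COLOURS = {"r": "red", "R": "red (priority)", "g": "green", "G": "green (priority)",
           "y": "yellow", "Y": "yellow (priority)"}

def decode_phase(state_string):
    """Map each controlled road connection (numbered from 1) to its light state."""
    return {i + 1: COLOURS[c] for i, c in enumerate(state_string)}

print(decode_phase("rrrGGgrrrGGg"))  # connections 1-3 and 7-9 red; 4-6 and 10-12 green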
After optimization by the invention, the corresponding timing of traffic light a is:
0: <phase duration="27" state="rrrGGgrrrGGg"/>
1: <phase duration="2" state="rrryygrrryyg"/>
2: <phase duration="4" state="rrrrrGrrrrrG"/>
3: <phase duration="7" state="rrrrryrrrrry"/>
4: <phase duration="43" state="GGgrrrGGgrrr"/>
5: <phase duration="6" state="yyyrrryyyrrr"/>
The main optimization is reflected in the duration (timing) corresponding to each phase.
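It can also be noted that the optimization essentially preserves the total cycle length: 36 + 4 + 6 + 4 + 36 + 4 = 90 s before optimization against 27 + 2 + 4 + 7 + 43 + 6 = 89 s afterwards, so the improvement comes from redistributing time among the phases (notably lengthening phase 4 and shortening phase 0) rather than from lengthening the cycle.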
While preferred embodiments of the present invention have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention; the invention is not limited to the specific details of these embodiments, and is intended to cover all modifications, equivalents and improvements falling within its spirit and scope.

Claims (5)

1. The traffic signal lamp optimization method based on group intelligent reinforcement learning is characterized by comprising the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relationship; S_1, S_2, …, S_n are the states of the agents at the current moment, S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment, O_1, O_2, …, O_n are the observed values corresponding to the n agents, A_1, A_2, …, A_n are the behaviors corresponding to the agents, R_1, R_2, …, R_n are the rewards corresponding to the n agents, Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents, Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the numbers of the agents;
S2, initializing the parameters of the n agents; the parameters of an agent include the state S, the behavior A and TD_error; S is the joint state of the n agents at the current moment; TD_error measures, each time the agent finishes a behavior A, the difference between the reward feedback obtained from the environment and the reward feedback brought by the last action selection, and is used to judge whether the action selection performed by the Actor network is more reasonable and effective;
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, based on the current parameters of the n agents, inputting S into each agent's Actor network; each Actor network selects the behavior A of its agent, the environment gives the corresponding return R according to the state and behavior of the agent and the defined return function, and the state transfers to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents reach the training target preset for the Actor-Critic_global network, obtaining a fully trained traffic signal lamp optimization model;
S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme;
the Actor-Critic network training performed by each of the n agents in step S4 includes the following steps:
A1, initializing the state S, the action A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; since the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; act_prob is then log-transformed as follows, which is beneficial to faster convergence:
log_prob = log(act_prob)
A3: computing the benefit-oriented loss value exp_v from the TD_error transmitted from the Critic network and the log_prob calculated in step A2, as follows:
exp_v = reduce_mean(log_prob * td_error)
where reduce_mean denotes taking the mean value in the neural network;
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2;
A5: transmitting the current state S and the state S_next obtained in step A4 into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A6: using the reward value R obtained from the environment and the V and V_next obtained in step A5, calculating TD_error with the following formula:
TD_error = R + GAMMA * V_next - V
GAMMA: denotes the discount factor in reinforcement learning;
A7: the TD_error obtained in step A6 is back-propagated to the Critic network to update the parameters and weights of the agent's Critic network;
A8: transmitting the behavior a of step A4, the state S_next, and the TD_error obtained in step A6 to the Actor network, and training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v.
2. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 1, wherein the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the vehicle queue length queue of the roads at the current traffic light intersection;
each index is weighted by a corresponding factor, which is favorable for the convergence of the training result: factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3
where factor_1 = len(green_list); green_list means all traffic lights in the environment, and len(green_list) is the number of all traffic lights in the environment;
factor_1 takes the number of green light phases among the phases; factor_2 and factor_3 take integer values according to the test results.
3. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 2, wherein the timing data corresponding to the current phase needs a certain discretization processing, which facilitates later convergence; the specific discretization is as follows:
[discretization formula shown as an image in the original]
4. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 1, wherein the setting of the behavior A in step S2 includes:
acquiring action a, where action a represents the phase to which the traffic light is to be changed in the next state; its length is the number of independent traffic light phases, and the action space A can completely represent each phase using One-hot coding.
5. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 1, wherein obtaining the corresponding reward R in step S4 includes the following steps:
R = RNCR(t) + CR * 0.3
(1) Road network open traffic rate
The road network open traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T; it describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used to evaluate the traffic management effect;
RNCR(t) = Σ_{i=1..n} (k_i × l_i) / Σ_{i=1..n} l_i
where RNCR(t) represents the road network open traffic rate in the time period T; n is the number of road segments contained in the road network, l_i is the length of the i-th road segment, and k_i is a binary function: when the traffic state level of segment i belongs to the acceptable traffic state, k_i = 1, otherwise k_i = 0; the value range of RNCR(t) is [0,1];
(2) Congestion mileage ratio
The congestion mileage ratio is the proportion of the length of congested road sections to the length of the whole road network, and describes the operation state of the whole road network:
CR = jamLengthInMetersSum / Σ_{i=1..n} l_i
where CR represents the road network congestion mileage ratio, jamLengthInMetersSum represents the total congested mileage, and l_i is the length of road segment i.
CN202110914300.0A 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning Active CN113628458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110914300.0A CN113628458B (en) 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110914300.0A CN113628458B (en) 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning

Publications (2)

Publication Number Publication Date
CN113628458A CN113628458A (en) 2021-11-09
CN113628458B true CN113628458B (en) 2022-10-04

Family

ID=78384203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110914300.0A Active CN113628458B (en) 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113628458B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112325897A (en) * 2020-11-19 2021-02-05 东北大学 Path Planning Method Based on Heuristic Deep Reinforcement Learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112863206A (en) * 2021-01-07 2021-05-28 北京大学 Traffic signal lamp control method and system based on reinforcement learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667339B (en) * 2009-09-23 2011-11-09 北京交通大学 Modeling system and method based on measured traffic service level of urban road area
US9037519B2 (en) * 2012-10-18 2015-05-19 Enjoyor Company Limited Urban traffic state detection based on support vector machine and multilayer perceptron
US10187098B1 (en) * 2017-06-30 2019-01-22 At&T Intellectual Property I, L.P. Facilitation of passive intermodulation cancelation via machine learning
US12307376B2 (en) * 2018-06-06 2025-05-20 Deepmind Technologies Limited Training spectral inference neural networks using bilevel optimization
CN109472984A (en) * 2018-12-27 2019-03-15 苏州科技大学 Signal light control method, system and storage medium based on deep reinforcement learning
CN111243299B (en) * 2020-01-20 2020-12-15 浙江工业大学 A Single Intersection Signal Control Method Based on 3DQN_PSER Algorithm
CN112201060B (en) * 2020-09-27 2022-05-20 航天科工广信智能技术有限公司 Actor-Critic-based single-intersection traffic signal control method
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy
CN112949933B (en) * 2021-03-23 2022-08-02 成都信息工程大学 A traffic organization scheme optimization method based on multi-agent reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112325897A (en) * 2020-11-19 2021-02-05 东北大学 Path Planning Method Based on Heuristic Deep Reinforcement Learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112863206A (en) * 2021-01-07 2021-05-28 北京大学 Traffic signal lamp control method and system based on reinforcement learning

Also Published As

Publication number Publication date
CN113628458A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
Xu et al. Hierarchically and cooperatively learning traffic signal control
Liang et al. Deep reinforcement learning for traffic light control in vehicular networks
Wu et al. Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks
Zang et al. Metalight: Value-based meta-reinforcement learning for traffic signal control
CN110060475B (en) A collaborative control method for multi-intersection signal lights based on deep reinforcement learning
Liang et al. A deep q learning network for traffic lights’ cycle control in vehicular networks
CN115310775B (en) Multi-agent reinforcement learning rolling scheduling method, device, equipment and storage medium
CN107665230A Training method and device for the users' behavior model of Intelligent housing
CN103744733B (en) Method for calling and configuring imaging satellite resources
García-Galán et al. Rules discovery in fuzzy classifier systems with PSO for scheduling in grid computational infrastructures
Li et al. A Bayesian optimization algorithm for the nurse scheduling problem
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
Daeichian et al. Fuzzy Q-learning-based multi-agent system for intelligent traffic control by a game theory approach
CN117872763B (en) Multi-unmanned aerial vehicle road network traffic flow monitoring path optimization method
CN113891401A (en) Heterogeneous network slice scheduling method based on deep reinforcement learning
Ikidid et al. A Fuzzy Logic Supported Multi-Agent System For Urban Traffic And Priority Link Control.
Du et al. Felight: Fairness-aware traffic signal control via sample-efficient reinforcement learning
Borges et al. Traffic light control using hierarchical reinforcement learning and options framework
Zeng et al. Halight: Hierarchical deep reinforcement learning for cooperative arterial traffic signal control with cycle strategy
CN113628458B (en) Traffic signal lamp optimization method based on group intelligent reinforcement learning
CN112884148A (en) Hybrid reinforcement learning training method and device embedded with multi-step rules and storage medium
Zhu et al. Multi-Task Multi-Agent Reinforcement Learning With Task-Entity Transformers and Value Decomposition Training
CN113628442B (en) A traffic organization scheme optimization method based on multi-signal reinforcement learning
Lu et al. A multi-agent adaptive traffic signal control system using swarm intelligence and neuro-fuzzy reinforcement learning
Yu et al. Minimize pressure difference traffic signal control based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant