
CN113628458B - Traffic signal lamp optimization method based on group intelligent reinforcement learning - Google Patents

Traffic signal lamp optimization method based on group intelligent reinforcement learning

Info

Publication number
CN113628458B
CN113628458B (application CN202110914300.0A)
Authority
CN
China
Prior art keywords
network
state
traffic
agents
road
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110914300.0A
Other languages
Chinese (zh)
Other versions
CN113628458A (en)
Inventor
刘双侨
王茂帆
郑皎凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifang Intelligent Technology Co ltd
Original Assignee
Sichuan Yifang Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifang Intelligent Technology Co ltd filed Critical Sichuan Yifang Intelligent Technology Co ltd
Priority to CN202110914300.0A priority Critical patent/CN113628458B/en
Publication of CN113628458A publication Critical patent/CN113628458A/en
Application granted granted Critical
Publication of CN113628458B publication Critical patent/CN113628458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G1/00 - Traffic control systems for road vehicles
    • G08G1/07 - Controlling traffic signals
    • G08G1/081 - Plural intersections under common control
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Traffic Control Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a traffic signal lamp optimization method based on group intelligent reinforcement learning, which comprises the following steps: S1, jointly forming Actor-Critic_global; S2, initializing the parameters of n agents; S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network; S4, inputting S into each agent's Actor network based on the current parameters of the n agents; and so on. In a multi-intersection environment, a model is designed for controlling the traffic lights using the Actor-Critic algorithm framework together with centralized learning and distributed execution among the agents, so that the convergence rate of the algorithm is greatly improved. The invention improves the traffic state and lays a foundation for later applications of group intelligent reinforcement learning to traffic signal control.

Description

Traffic signal lamp optimization method based on group intelligent reinforcement learning
Technical Field
The invention belongs to the field of artificial intelligence (reinforcement learning), and particularly relates to a traffic signal lamp optimization method based on group intelligence reinforcement learning.
Background
Description of the general State of the Art
In the traffic scheduling process, traffic lights are the key to controlling traffic. Traditional traffic signal lights are static: the duration and switching sequence of the phases cannot change dynamically. As traffic complexity increases, such fixed-time lights often become counterproductive. Therefore, a reinforcement learning decision process is added to signal light control: environment feedback is acquired dynamically through road detection devices, so that the state and reward in the decision model remain dynamic and adapt to the environment feedback. Through cooperation and gaming among group intelligence agents, an appropriate decision is made. In recent years, with the deepening of research on group intelligence and game theory, group intelligence has been used in traffic decision-making. Group intelligent information interaction is transmitted through the traffic topology network, and this instant information interaction gives an agent a forecasting effect on upcoming traffic flow, so that an appropriate decision can be taken in advance to relieve traffic congestion. The three key elements of group intelligent reinforcement learning are the states, behaviors and rewards; how they are formulated needs to be determined through continuous traffic simulation close to the real state.
Prior art one closest to the creation of the present invention
Technical contents of prior art one
Reinforcement learning for a single agent is relatively mature: using a distributed framework, an agent is placed at each road intersection and can independently schedule and control its signal lights. Because each agent has high independence and its own resources, a certain efficiency improvement is obtained. Deep reinforcement learning followed, combining reinforcement learning with the perception capability of deep learning.
The defects of the first prior art
Single-agent reinforcement learning has poor coordination because of its distributed structure; information is closed off and effective cooperation cannot form. When an emergency occurs and a single agent stops working, the work of the whole system may stall or even crash. Q-learning is suited to processing discrete states; when deployed in the current traffic environment, even a single intersection can produce tens of thousands of states, while the capacity of a Q table is limited and cannot enumerate them, so Q-learning is not suitable for the traffic environment.
The second prior art closest to the creation of the present invention
Technical contents of the second prior art
Group intelligent reinforcement learning is performed to minimize vehicle travel time or the number of stops at multiple intersections, as in the literature. In a conventional multi-intersection environment, coordination can be achieved by setting the time offsets between the green-light starts of all intersections in the road network. There are also optimization methods, as in the literature, that minimize vehicle travel time and/or the number of stops at multiple intersections instead of optimizing the offset or maximum pressure, which aim to maximize the throughput of the network and thus minimize travel time. Many of these approaches still rely on simplified traffic conditions constructed from static environments or assumptions, and do not guarantee improvement in actual operation.
The defects of the second prior art
As the number of agents grows, the computational cost of centralized training becomes too large. During testing, each agent acts independently, and its adaptation to the dynamic environment needs to be coordinated with the surrounding agents.
Disclosure of Invention
Aiming at the defects of existing methods that optimize traffic organization with centralized reinforcement learning, distributed reinforcement learning agents are used to control multiple intersections and interact with each other. Decentralized communication is more practical and scales better than centralized decision-making, but its model convergence is often unstable and slow.
The purpose of the invention is realized by the following technical scheme:
the traffic signal lamp optimization method based on group intelligent reinforcement learning comprises the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relation; wherein S is the joint state, S_1, S_2, …, S_n are the states of the agents at the current moment, S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment, O_1, O_2, …, O_n are the observed values of the n agents, A_1, A_2, …, A_n are the behaviors of the agents, R_1, R_2, …, R_n are the rewards of the n agents, Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents, Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the agent numbers;
S2, initializing the parameters of the n agents;
the parameters of an agent include the state S, the behavior A, and TD_error;
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, based on the current parameters of the n agents, inputting S into each agent's Actor network; each Actor network selects the behavior A of its agent, the environment gives the corresponding return R according to the state and behavior of the agent and the defined return function, and the state transfers to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents reach the training target preset for the Actor-Critic_global network, obtaining a fully trained traffic signal lamp optimization model;
and S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme.
Preferably, the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the vehicle queue length queue of the roads at the current traffic light intersection;
each index is weighted by a corresponding factor, which is favorable for the convergence of the training result: factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3
where factor_1 = len(green_list); green_list means all traffic lights in the environment, and len(green_list) is the number of all traffic lights in the environment;
factor_1 takes the number of green light phases among the phases; factor_2 and factor_3 take integer values according to the test results.
As a preferred mode, the timing data corresponding to the current phase needs a certain discretization processing, which facilitates later convergence; the specific discretization is as follows:
[discretization formula shown as an image in the original]
preferably, the setting of the behavior a in the step S2 includes:
Acquire action a, where action a represents the phase to which the traffic light is to be changed in the next state; its length is the number of independent traffic light phases, and the action space A can completely represent each phase using One-hot coding.
Preferably, the Actor-Critic network training performed by each of the n agents in step S4 includes the following steps:
A1, initializing the state S, the action A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; since the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; act_prob is then log-transformed as follows, which helps reach convergence faster:
log_prob=log(act_prob)
A3: computing the benefit-oriented loss value exp_v from the TD_error transmitted from the Critic network and the log_prob calculated in step A2, as follows:
exp_v=reduce_mean(log_prob*td_error)
where reduce_mean denotes taking the mean value in the neural network.
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2; the agent performs action a, gets the corresponding reward feedback from the environment, and the agent state switches to state S_next;
A5: training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v;
A6: transmitting the current state S and the state S_next into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A7: using the reward value R obtained from the environment and the V and V_next obtained in step A6, calculating TD_error with the following formula:
TD_error = R + GAMMA * V_next - V
GAMMA: denotes the discount factor in reinforcement learning; the larger GAMMA is, the smaller the attenuation, which means the agent's learning process focuses more on long-term returns; conversely, a smaller GAMMA gives greater attenuation, which means the agent is more concerned with short-term rewards.
A8: the TD_error obtained in step A7 is back-propagated to the Critic network to update the parameters and weights of the agent's Critic network.
A9: transmitting the behavior a and the state S_next of step A4, together with the TD_error acquired in step A7, to the Actor network, and training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v.
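To make steps A1 to A9 concrete, the following is a minimal single-agent sketch in Python (PyTorch). It is offered only as an illustration under stated assumptions: the state dimension, layer sizes, learning rates and the env_step stub are invented for the example and are not specified by the method above; only the TD_error and exp_v computations follow the formulas just given.

import torch
import torch.nn as nn

# Minimal single-agent Actor-Critic update illustrating steps A1-A9.
# STATE_DIM, N_PHASES, layer sizes, learning rates and env_step() are assumptions
# made only for this sketch.
STATE_DIM, N_PHASES, GAMMA = 4, 6, 0.9

actor = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                      nn.Linear(32, N_PHASES), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

def env_step(state, action):
    """Placeholder environment: returns (reward R, next state S_next)."""
    return torch.rand(1).item(), torch.rand(STATE_DIM)

def train_step(state):
    # A2: the Actor outputs act_prob, the probability distribution over behaviors
    act_prob = actor(state)
    # A4: extract the behavior a with the maximum probability and execute it
    a = torch.argmax(act_prob).item()
    reward, next_state = env_step(state, a)

    # A6/A7: Critic values of S and S_next, then TD_error = R + GAMMA * V_next - V
    v, v_next = critic(state), critic(next_state)
    td_error = reward + GAMMA * v_next.detach() - v

    # A8: update the Critic by minimising the squared TD_error
    critic_opt.zero_grad()
    td_error.pow(2).mean().backward()
    critic_opt.step()

    # A3/A5/A9: exp_v = reduce_mean(log_prob * td_error); maximise exp_v,
    # i.e. perform gradient descent on -exp_v
    log_prob = torch.log(act_prob[a] + 1e-8)
    exp_v = (log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    (-exp_v).backward()
    actor_opt.step()
    return next_state

state = torch.rand(STATE_DIM)
for _ in range(3):
    state = train_step(state)

In an actual deployment each of the n agents would run this local update while the global Actor-Critic_global network is updated in step S7; that aggregation step is not shown in the sketch.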
Preferably, the obtaining of the corresponding reward R in step S4 includes the following steps:
R=RNCR(t)+CR*0.3
(1) Road network open traffic rate
The road network smooth traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T, describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used for evaluating the traffic management effect;
RNCR(t) = Σ_{i=1..n} (k_i × l_i) / Σ_{i=1..n} l_i
where RNCR(t) represents the road network open traffic rate in the time period T (T can be 5 min or 3 min), n is the number of road segments contained in the road network, l_i is the length of the i-th road segment, and k_i is a binary function: when the traffic state level of segment i belongs to the acceptable traffic state, k_i = 1, otherwise k_i = 0; when the average speed meanSpeed of a road segment is greater than or equal to 20 km/h, the segment is in an acceptable traffic state; when meanSpeed is less than 20 km/h, the state is unacceptable; the value range of RNCR(t) is [0,1], and the larger the value, the better the road network state, and conversely, the worse the road network state.
(2) Congestion mileage duty ratio
The congestion mileage occupation ratio is the proportion of the length of a congested road section to the length of the whole road network, and describes the operation state of the whole road network:
CR = jamLengthInMetersSum / Σ_{i=1..n} l_i
where CR represents the road network congestion mileage ratio, jamLengthInMetersSum represents the total congested mileage, and l_i is the length of road segment i.
The invention has the beneficial effects that:
under the environment of multiple intersections, a model is designed by controlling traffic lights, an algorithm framework of Actor-Critic is used, meanwhile, a centralized learning and distributed execution method among intelligent agents is used, and the advantages of centralized learning and distributed learning are combined, so that the convergence speed of the algorithm is greatly improved. The invention improves the traffic state and lays a foundation for the application of traffic signal control of later-stage group intelligent reinforcement learning.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of the embodiment.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following descriptions.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments will be described clearly and completely with reference to the accompanying drawings; it is obvious that the described embodiments are some, but not all, embodiments of the present invention. Thus, the following detailed description of the embodiments of the present invention, as presented in the figures, is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention are within the scope of the present invention.
Examples
The traffic signal lamp optimization method based on group intelligent reinforcement learning comprises the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relation; wherein S is the joint state of the n agents at the current moment, S_1, S_2, …, S_n are the states of the agents at the current moment, S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment, O_1, O_2, …, O_n are the observed values corresponding to the n agents, A_1, A_2, …, A_n are the behaviors of the agents, R_1, R_2, …, R_n are the rewards corresponding to the n agents, Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents, Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the agent numbers;
S2, initializing the parameters of the n agents;
the parameters of an agent include the state S, the behavior A, and TD_error;
TD_error measures, each time the agent finishes a behavior A, the difference between the reward feedback obtained from the environment for this behavior and the reward feedback brought by the previous action selection, and is used to judge whether the action selection performed by the Actor network is more reasonable and effective; the role of the Actor network is similar to that of a performer, selecting actions based on a policy, while the Critic network uses TD_error to evaluate whether the action selection performed by the Actor network is more effective.
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, based on the current parameters of the n agents, inputting S into each agent's Actor network; each Actor network selects the behavior A of its agent, the environment gives the corresponding return R according to the state and behavior of the agent and the defined return function, and the state transfers to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents reach the training target preset for the Actor-Critic_global network (the training target is that the road network open traffic rate and the congestion mileage ratio indices reach a good state, or that the training model reaches a convergence state), obtaining a fully trained traffic signal lamp optimization model;
and S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme.
In a preferred embodiment, the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the vehicle queue length queue of the roads converging at the current traffic light intersection;
each index is weighted by a corresponding factor (weight), which is favorable for the convergence of the training result: factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3
where factor_1 = len(green_list); green_list means all traffic lights in the environment, and len(green_list) is the number of all traffic lights in the environment.
factor_1 takes the number of green light phases among the phases; factor_2 and factor_3 take integer values according to the test results: factor_2 = [factor_1 ÷ 3], factor_3 = [factor_1 × 0.7 + factor_2 ÷ 2], where [ ] is the rounding symbol.
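As an illustration, the state calculation above can be written in a few lines of Python. The factor formulas follow the text; the 12-light environment in the example and the discretize_duration helper are assumptions (the exact discretization rule is given only as an image in the original), so this is a sketch rather than the patented implementation.

import math

def discretize_duration(duration):
    """Hypothetical discretization: bucket the timing into 5-second bins (assumption;
    the original gives the exact rule only as an image)."""
    return duration // 5

def compute_state(id_phase, duration, queue, green_list):
    """S = idPhase*factor_1 + duration*factor_2 + queue*factor_3 (formula above)."""
    factor_1 = len(green_list)                            # number of traffic lights in the environment
    factor_2 = math.floor(factor_1 / 3)                   # [factor_1 / 3]
    factor_3 = math.floor(factor_1 * 0.7 + factor_2 / 2)  # [factor_1 * 0.7 + factor_2 / 2]
    return id_phase * factor_1 + discretize_duration(duration) * factor_2 + queue * factor_3

# Example: phase 0, 36 s timing, 14 queued vehicles, 12 traffic lights in the environment
print(compute_state(0, 36, 14, green_list=list(range(12))))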
In a preferred embodiment, the timing data corresponding to the current phase needs a certain discretization processing, which facilitates later convergence; the specific discretization is as follows:
[discretization formula shown as an image in the original]
In a preferred embodiment, the setting of the behavior A in step S2 includes:
acquiring action a, where action a represents the phase to which the traffic light is to be changed in the next state; its length is the number of independent traffic light phases, and the action space A can completely represent each phase using One-hot coding (for example, [1,0,0,0,0] indicates that the traffic light has 5 groups of phases and the current one-hot code indicates the 0th group of phases, counting from 0).
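A tiny illustration of the One-hot action coding; the five-phase example matches the text above.

def one_hot_action(phase_index, n_phases):
    """Encode the target phase (action a) as a one-hot vector over the action space A."""
    a = [0] * n_phases
    a[phase_index] = 1
    return a

print(one_hot_action(0, 5))  # [1, 0, 0, 0, 0]: switch to phase group 0 of 5 (counting from 0)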
In a preferred embodiment, the Actor-Critic network training performed by each of the n agents in step S4 includes the following steps:
A1, initializing the state S, the action A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; since the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; act_prob is then log-transformed as follows, which helps reach convergence faster:
log_prob=log(act_prob)
A3: computing the benefit-oriented loss value exp_v from the TD_error transmitted from the Critic network and the log_prob calculated in step A2, as follows:
exp_v=reduce_mean(log_prob*td_error)
where reduce_mean denotes taking the mean value in the neural network.
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2; the agent performs action a, gets the corresponding reward feedback from the environment, and the agent state switches to state S_next;
A5: training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v;
A6: transmitting the current state S and the state S_next into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A7: using the reward value R obtained from the environment and the V and V_next obtained in step A6, calculating TD_error with the following formula:
TD_error = R + GAMMA * V_next - V
A8: the TD_error obtained in step A7 is back-propagated to the Critic network to update the parameters and weights of the agent's Critic network.
In a preferred embodiment, the obtaining of the corresponding reward R in step S4 includes the following steps:
R=RNCR(t)+CR*0.3
(1) Road network open traffic rate
The road network smooth traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T, describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used for evaluating the traffic management effect;
RNCR(t) = Σ_{i=1..n} (k_i × l_i) / Σ_{i=1..n} l_i
where RNCR(t) represents the road network open traffic rate in the time period T (T can be 5 min or 3 min), n is the number of road segments contained in the road network, l_i is the length of the i-th road segment, and k_i is a binary function: when the traffic state level of segment i belongs to the acceptable traffic state, k_i = 1, otherwise k_i = 0; when the average speed meanSpeed of a road segment is greater than or equal to 20 km/h, the segment is in an acceptable traffic state; when meanSpeed is less than 20 km/h, the state is unacceptable; the value range of RNCR(t) is [0,1], and the larger the value, the better the road network state, and conversely, the worse the road network state.
(2) Congestion mileage duty ratio
The congestion mileage occupancy is the proportion of the length of a congested road section to the length of the whole road network, and describes the overall operation state of the road network:
CR = jamLengthInMetersSum / Σ_{i=1..n} l_i
where CR represents the road network congestion mileage ratio, jamLengthInMetersSum represents the total congested mileage, and l_i is the length of road segment i. A smaller CR indicates a better traffic state.
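The reward R = RNCR(t) + CR * 0.3 can be computed directly from per-segment data. The sketch below assumes the segment lengths, mean speeds and total congested length are read from the traffic simulation; the three-segment network in the example is purely illustrative.

def reward(segments, jam_length_sum):
    """segments: list of (length_m, mean_speed_kmh); jam_length_sum: congested length in metres."""
    total_len = sum(length for length, _ in segments)
    # RNCR(t): share of mileage whose mean speed is >= 20 km/h (acceptable state, k_i = 1)
    rncr = sum(length for length, speed in segments if speed >= 20) / total_len
    # CR: congestion mileage ratio
    cr = jam_length_sum / total_len
    return rncr + cr * 0.3, rncr, cr

# Illustrative network of three segments (length in metres, mean speed in km/h)
r, rncr, cr = reward([(500, 35), (300, 12), (200, 28)], jam_length_sum=250)
print(rncr, cr, r)  # 0.7, 0.25, 0.775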
Experiments show that this setting of the return function helps distinguish the degree of optimization, so that the model has better discrimination capability.
In a preferred embodiment, the original timing of traffic light a is as follows:
0: <phase duration="36" state="rrrGGgrrrGGg"/>
1: <phase duration="4" state="rrryygrrryyg"/>
2: <phase duration="6" state="rrrrrGrrrrrG"/>
3: <phase duration="4" state="rrrrryrrrrry"/>
4: <phase duration="36" state="GGgrrrGGgrrr"/>
5: <phase duration="4" state="yyyrrryyyrrr"/>
the timing shows that the traffic lights have 6 phases in total, and the phase corresponding time duration is 36S,4S,6S,4S,36S,4S.
The state string represents the road connections controlled by the traffic light in each phase. For example, as shown in Fig. 1, the figure shows the state of traffic light a in phase 0.
r and R indicate that the road connection controlled by the traffic signal lamp is in a red light state, g and G indicate a green light state, and y and Y indicate a yellow light state. The distinction between capital and lowercase letters also indicates pass priority, with capital letters having higher priority than lowercase letters.
state = "rrrGGgrrrGGg" means: in this phase, road connections (1)(2)(3) and (7)(8)(9) are in the red (r) light state, and road connections (4)(5)(6) and (10)(11)(12) are in the green (G/g) light state.
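The mapping from a phase state string to per-connection light states can be made explicit with a short Python snippet (the connection numbering follows the example above).

COLOURS = {"r": "red", "R": "red (priority)", "g": "green", "G": "green (priority)",
           "y": "yellow", "Y": "yellow (priority)"}

def decode_phase(state_string):
    """Map each controlled road connection (numbered from 1) to its light state."""
    return {i + 1: COLOURS[c] for i, c in enumerate(state_string)}

print(decode_phase("rrrGGgrrrGGg"))  # connections 1-3 and 7-9 red; 4-6 and 10-12 green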
After optimization by the invention, the corresponding timing of traffic light a is:
0: <phase duration="27" state="rrrGGgrrrGGg"/>
1: <phase duration="2" state="rrryygrrryyg"/>
2: <phase duration="4" state="rrrrrGrrrrrG"/>
3: <phase duration="7" state="rrrrryrrrrry"/>
4: <phase duration="43" state="GGgrrrGGgrrr"/>
5: <phase duration="6" state="yyyrrryyyrrr"/>
The main optimization is reflected in the duration (timing) corresponding to each phase.
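It can also be noted that the optimization essentially preserves the total cycle length: 36 + 4 + 6 + 4 + 36 + 4 = 90 s before optimization against 27 + 2 + 4 + 7 + 43 + 6 = 89 s afterwards, so the improvement comes from redistributing time among the phases (notably lengthening phase 4 and shortening phase 0) rather than from lengthening the cycle.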
While preferred embodiments of the present invention have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention; the invention is not limited to the specific details of these embodiments, and is intended to cover all modifications, equivalents and improvements falling within its spirit and scope.

Claims (5)

1. The traffic signal lamp optimization method based on group intelligent reinforcement learning is characterized by comprising the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relationship; S_1, S_2, …, S_n are the states of the agents at the current moment, S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment, O_1, O_2, …, O_n are the observed values corresponding to the n agents, A_1, A_2, …, A_n are the behaviors corresponding to the agents, R_1, R_2, …, R_n are the rewards corresponding to the n agents, Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents, Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the numbers of the agents;
S2, initializing the parameters of the n agents; the parameters of an agent include the state S, the behavior A and TD_error; S is the joint state of the n agents at the current moment; TD_error measures, each time the agent finishes a behavior A, the difference between the reward feedback obtained from the environment and the reward feedback brought by the last action selection, and is used to judge whether the action selection performed by the Actor network is more reasonable and effective;
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, based on the current parameters of the n agents, inputting S into each agent's Actor network; each Actor network selects the behavior A of its agent, the environment gives the corresponding return R according to the state and behavior of the agent and the defined return function, and the state transfers to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents reach the training target preset for the Actor-Critic_global network, obtaining a fully trained traffic signal lamp optimization model;
S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme;
the Actor-Critic network training performed by each of the n agents in step S4 includes the following steps:
A1, initializing the state S, the action A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; since the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; act_prob is then log-transformed as follows, which is beneficial to faster convergence:
log_prob = log(act_prob)
A3: computing the benefit-oriented loss value exp_v from the TD_error transmitted from the Critic network and the log_prob calculated in step A2, as follows:
exp_v = reduce_mean(log_prob * td_error)
where reduce_mean denotes taking the mean value in the neural network;
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2;
A5: transmitting the current state S and the state S_next obtained in step A4 into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A6: using the reward value R obtained from the environment and the V and V_next obtained in step A5, calculating TD_error with the following formula:
TD_error = R + GAMMA * V_next - V
GAMMA: denotes the discount factor in reinforcement learning;
A7: the TD_error obtained in step A6 is back-propagated to the Critic network to update the parameters and weights of the agent's Critic network;
A8: transmitting the behavior a of step A4, the state S_next, and the TD_error obtained in step A6 to the Actor network, and training and updating the parameters and weights of the agent's Actor network by using gradient descent to maximize the benefit-oriented loss value exp_v.
2. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 1, wherein the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the vehicle queue length queue of the roads at the current traffic light intersection;
each index is weighted by a corresponding factor, which is favorable for the convergence of the training result: factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3
where factor_1 = len(green_list); green_list means all traffic lights in the environment, and len(green_list) is the number of all traffic lights in the environment;
factor_1 takes the number of green light phases among the phases; factor_2 and factor_3 take integer values according to the test results.
3. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 2, wherein the timing data corresponding to the current phase needs a certain discretization processing, which facilitates later convergence; the specific discretization is as follows:
[discretization formula shown as an image in the original]
4. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 1, wherein the setting of the behavior A in step S2 includes:
acquiring action a, where action a represents the phase to which the traffic light is to be changed in the next state; its length is the number of independent traffic light phases, and the action space A can completely represent each phase using One-hot coding.
5. The traffic signal lamp optimization method based on group intelligent reinforcement learning as claimed in claim 1, wherein obtaining the corresponding reward R in step S4 includes the following steps:
R = RNCR(t) + CR * 0.3
(1) Road network open traffic rate
The road network open traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T; it describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used to evaluate the traffic management effect;
RNCR(t) = Σ_{i=1..n} (k_i × l_i) / Σ_{i=1..n} l_i
where RNCR(t) represents the road network open traffic rate in the time period T; n is the number of road segments contained in the road network, l_i is the length of the i-th road segment, and k_i is a binary function: when the traffic state level of segment i belongs to the acceptable traffic state, k_i = 1, otherwise k_i = 0; the value range of RNCR(t) is [0,1];
(2) Congestion mileage ratio
The congestion mileage ratio is the proportion of the length of congested road sections to the length of the whole road network, and describes the operation state of the whole road network:
CR = jamLengthInMetersSum / Σ_{i=1..n} l_i
where CR represents the road network congestion mileage ratio, jamLengthInMetersSum represents the total congested mileage, and l_i is the length of road segment i.
CN202110914300.0A 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning Active CN113628458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110914300.0A CN113628458B (en) 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110914300.0A CN113628458B (en) 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning

Publications (2)

Publication Number Publication Date
CN113628458A CN113628458A (en) 2021-11-09
CN113628458B true CN113628458B (en) 2022-10-04

Family

ID=78384203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110914300.0A Active CN113628458B (en) 2021-08-10 2021-08-10 Traffic signal lamp optimization method based on group intelligent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113628458B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112325897A (en) * 2020-11-19 2021-02-05 东北大学 Path Planning Method Based on Heuristic Deep Reinforcement Learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112863206A (en) * 2021-01-07 2021-05-28 北京大学 Traffic signal lamp control method and system based on reinforcement learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667339B (en) * 2009-09-23 2011-11-09 北京交通大学 Modeling system and method based on measured traffic service level of urban road area
US9037519B2 (en) * 2012-10-18 2015-05-19 Enjoyor Company Limited Urban traffic state detection based on support vector machine and multilayer perceptron
US10187098B1 (en) * 2017-06-30 2019-01-22 At&T Intellectual Property I, L.P. Facilitation of passive intermodulation cancelation via machine learning
US12307376B2 (en) * 2018-06-06 2025-05-20 Deepmind Technologies Limited Training spectral inference neural networks using bilevel optimization
CN109472984A (en) * 2018-12-27 2019-03-15 苏州科技大学 Signal light control method, system and storage medium based on deep reinforcement learning
CN111243299B (en) * 2020-01-20 2020-12-15 浙江工业大学 A Single Intersection Signal Control Method Based on 3DQN_PSER Algorithm
CN112201060B (en) * 2020-09-27 2022-05-20 航天科工广信智能技术有限公司 Actor-Critic-based single-intersection traffic signal control method
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy
CN112949933B (en) * 2021-03-23 2022-08-02 成都信息工程大学 A traffic organization scheme optimization method based on multi-agent reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112325897A (en) * 2020-11-19 2021-02-05 东北大学 Path Planning Method Based on Heuristic Deep Reinforcement Learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112863206A (en) * 2021-01-07 2021-05-28 北京大学 Traffic signal lamp control method and system based on reinforcement learning

Also Published As

Publication number Publication date
CN113628458A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
Xu et al. Hierarchically and cooperatively learning traffic signal control
Liang et al. Deep reinforcement learning for traffic light control in vehicular networks
Wu et al. Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks
Zang et al. Metalight: Value-based meta-reinforcement learning for traffic signal control
CN110060475B (en) A collaborative control method for multi-intersection signal lights based on deep reinforcement learning
Liang et al. A deep q learning network for traffic lights’ cycle control in vehicular networks
CN115310775B (en) Multi-agent reinforcement learning rolling scheduling method, device, equipment and storage medium
CN107665230A Training method and device for the users' behavior model of Intelligent housing
CN103744733B (en) Method for calling and configuring imaging satellite resources
García-Galán et al. Rules discovery in fuzzy classifier systems with PSO for scheduling in grid computational infrastructures
Li et al. A Bayesian optimization algorithm for the nurse scheduling problem
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
Daeichian et al. Fuzzy Q-learning-based multi-agent system for intelligent traffic control by a game theory approach
CN117872763B (en) Multi-unmanned aerial vehicle road network traffic flow monitoring path optimization method
CN113891401A (en) Heterogeneous network slice scheduling method based on deep reinforcement learning
Ikidid et al. A Fuzzy Logic Supported Multi-Agent System For Urban Traffic And Priority Link Control.
Du et al. Felight: Fairness-aware traffic signal control via sample-efficient reinforcement learning
Borges et al. Traffic light control using hierarchical reinforcement learning and options framework
Zeng et al. Halight: Hierarchical deep reinforcement learning for cooperative arterial traffic signal control with cycle strategy
CN113628458B (en) Traffic signal lamp optimization method based on group intelligent reinforcement learning
CN112884148A (en) Hybrid reinforcement learning training method and device embedded with multi-step rules and storage medium
Zhu et al. Multi-Task Multi-Agent Reinforcement Learning With Task-Entity Transformers and Value Decomposition Training
CN113628442B (en) A traffic organization scheme optimization method based on multi-signal reinforcement learning
Lu et al. A multi-agent adaptive traffic signal control system using swarm intelligence and neuro-fuzzy reinforcement learning
Yu et al. Minimize pressure difference traffic signal control based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant