CN113628458B - Traffic signal lamp optimization method based on group intelligent reinforcement learning - Google Patents
- Publication number
- CN113628458B (Application No. CN202110914300.0A)
- Authority
- CN
- China
- Prior art keywords
- network
- state
- traffic
- agents
- road
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/07—Controlling traffic signals
- G08G1/081—Plural intersections under common control
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a traffic signal lamp optimization method based on group intelligent reinforcement learning, which comprises the following steps: S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relation, the global Actor and Critic networks together forming Actor-Critic_global; S2, initializing the parameters of the n agents; S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network; S4, inputting S into the respective Actor networks based on the parameters of the current n agents; and so on. In a multi-intersection environment, a model is designed for controlling traffic lights using the Actor-Critic algorithm framework together with centralized learning and distributed execution among the agents, which greatly improves the convergence speed of the algorithm. The invention improves the traffic state and lays a foundation for later applications of group intelligent reinforcement learning to traffic signal control.
Description
Technical Field
The invention belongs to the field of artificial intelligence (reinforcement learning), and particularly relates to a traffic signal lamp optimization method based on group intelligence reinforcement learning.
Background
Description of the general State of the Art
In the traffic scheduling process, traffic lights are the key to controlling traffic. Traditional traffic signal lights are static: the duration and switching of the signal phases cannot change dynamically, and as traffic complexity increases such lights often become counterproductive. Therefore, a reinforcement learning decision process is added to signal light control: environmental feedback is acquired dynamically through road detection devices, so that the state and reward in the decision model are dynamic and change appropriately with the environmental feedback. Through cooperation and gaming among group intelligence, an appropriate decision-making method is formed. In recent years, with deepening research on group intelligence and game theory, group intelligence has been used in traffic decision-making. Group intelligent information interaction is transmitted through the traffic topology network, and this instant information interaction gives the agents a forecasting effect on upcoming traffic flow, so that suitable decisions can be taken in advance to relieve traffic congestion. The three key elements in group intelligent reinforcement learning are the states, behaviors and rewards; how they are formulated needs to be determined through continuous traffic simulation close to the real state.
Prior art one closest to the creation of the present invention
Technical contents of prior art one
The reinforcement learning of a single agent has matured: a distributed framework is adopted in which an agent is deployed at each road intersection, and the signal lights can be scheduled and controlled independently. Because the independence and resource occupancy of the agents are high, a certain efficiency improvement is obtained. Deep reinforcement learning then followed, a technology that combines reinforcement learning with the perception capability of deep learning.
The defects of the first prior art
Due to its distributed structure, single-agent reinforcement learning has poor coordination; information is closed off and effective cooperation cannot be formed. When an emergency occurs and a single agent stops working, the work of the whole system may stall or even crash. Q-learning is suited to handling discrete states; when deployed in the current traffic environment, even a single intersection produces thousands of states, while the capacity of a Q table is limited and cannot enumerate tens of thousands of states, so it is not suitable for the traffic environment.
The second prior art closest to the creation of the present invention
Technical contents of the second prior art
Group intelligent reinforcement learning is performed to minimize vehicle travel time or the number of stops at multiple intersections, as in the literature. In a conventional multi-intersection environment, coordination can be achieved by setting the time offsets between the green-light starts of all intersections in the road network. There are also optimization methods in the literature that, instead of optimizing the offset or maximum pressure, minimize the travel time of vehicles and/or the number of stops at multiple intersections, aiming to maximize the throughput of the network and thus minimize travel time. Many of these approaches still rely on simplified traffic conditions constructed from static environments or assumptions and do not guarantee improvement in actual operation.
The defects of the second prior art
As the number of agents grows, the computational cost of centralized training becomes too large. During testing, each agent acts independently, yet its changes in a dynamic environment need to be coordinated with the surrounding agents.
Disclosure of Invention
Aiming at the defects of existing methods that optimize traffic organization with centralized reinforcement learning, distributed reinforcement learning agents are used to control multiple intersections and interact with each other. Decentralized communication is more practical and offers scalability that centralized decision-making lacks, but its model convergence and speed are often very unstable.
The purpose of the invention is realized by the following technical scheme:
the traffic signal lamp optimization method based on group intelligent reinforcement learning comprises the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relation; wherein S is the joint state; S_1, S_2, …, S_n are the states of the agents at the current moment; S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment; O_1, O_2, …, O_n are the observed values of the n agents; A_1, A_2, …, A_n are the behaviors of the agents; R_1, R_2, …, R_n are the rewards corresponding to the n agents; Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents; Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the agent numbers;
S2, initializing the parameters of the n agents;
the parameters of the agent include the state S, the behavior A and TD_error;
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, inputting S into the respective Actor networks based on the parameters of the current n agents; each Actor network selects the corresponding behavior A of its agent, so that the environment gives the corresponding return R according to the state and behavior of the agent and the determined return function, and the state transitions to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents achieve the training target preset by the Actor-Critic_global network, obtaining a well-trained traffic signal lamp optimization model;
and S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme.
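By way of illustration only, the overall S1 to S9 procedure can be sketched as the following training loop; the class and method names (env, choose_action, critic_learn, actor_learn, update_from) are assumptions introduced for this sketch and are not part of the claimed method.

```python
def train(env, agents, global_ac, n_rounds):
    """agents: the n local Actor-Critic agents of S1-S3; global_ac: Actor-Critic_global."""
    for episode in range(n_rounds):                         # S8: repeat until the set number of rounds
        states = env.reset()                                # joint state S of the n agents
        done = False
        while not done:
            actions = [ag.choose_action(s)                  # S4: each Actor selects behavior A
                       for ag, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)  # environment returns R and S_next
            for ag, s, a, r, s_next in zip(agents, states, actions, rewards, next_states):
                td_error = ag.critic_learn(s, a, r, s_next) # S5: Critic computes TD_error
                ag.actor_learn(s, a, td_error)              # S6: update local Actor-Critic weights
            global_ac.update_from(agents)                   # S7: update Actor-Critic_global weights
            states = next_states
    return global_ac                                        # S9: trained model used to optimize the timing
```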
Preferably, the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the queue length of vehicles on the roads at the current traffic-light intersection;
each index is weighted by a corresponding factor, which is favorable for the convergence of the training result; factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3;
wherein factor_1 = len(green_list), where green_list denotes all traffic lights in the environment and len(green_list) denotes the number of all traffic lights in the environment;
factor_1 thus takes the number of green-light phases; factor_2 and factor_3 take integer values according to the test results.
As a preferred mode, the timing data corresponding to the current phase needs a certain discretization, which facilitates later convergence; the specific discretization is as follows:
Preferably, the setting of the behavior A in step S2 includes:
an action a is acquired, where a represents the phase to which the traffic light will change in the next state; the length of the action encoding is the number of phases of the individual traffic light, and the action space A can completely represent each phase using one-hot coding.
Preferably, the Actor-Critic network training respectively performed by the n agents in step S4 includes the following steps:
A1, initializing the state S, the action matrix A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; because the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; the act_prob probability distribution is subjected to the following logarithmic transformation, which helps reach convergence faster:
log_prob=log(act_prob)
A3: calculating the TD_error transmitted from the Critic network and the log_prob calculated in step A2 as follows to obtain the benefit-oriented loss value exp_v:
exp_v=reduce_mean(log_prob*td_error)
wherein reduce_mean denotes taking the mean in the neural network.
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2; the agent performs behavior a and receives the corresponding reward feedback from the environment, and the agent state switches to state S_next.
A5: training and updating the parameters and weights of the Actor network of the agent by using gradient descent to maximize the benefit-oriented loss value exp_v;
A6: transmitting the current state S and the state S_next into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A7: calculating TD_error using the reward value R obtained from the environment and the V and V_next obtained from step A6; the calculation formula is as follows:
TD_error = R + GAMMA * V_next - V
GAMMA denotes the discount factor in reinforcement learning; the larger GAMMA is, the smaller the attenuation in the reinforcement learning process, meaning that the learning of the agent focuses more on long-term returns. Conversely, a smaller GAMMA gives greater attenuation, meaning that the agent is more concerned with short-term rewards.
A8: back-propagating the TD_error obtained in step A7 to the Critic network to update the parameters and weights of the Critic network of the agent.
A9: transmitting the behavior a and state S_next obtained in step A4, together with the TD_error acquired in step A7, into the Actor network, and training and updating the parameters and weights of the Actor network of the agent by using gradient descent to maximize the benefit-oriented loss value exp_v.
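A minimal numerical sketch of the quantities in steps A2 to A9 is given below, assuming the Actor output act_prob and the Critic values V and V_next are already available as plain arrays; the network layers and the gradient-descent update itself are omitted, and the value of GAMMA is an assumption not fixed by the description.

```python
import numpy as np

GAMMA = 0.9  # discount factor; assumed value, the description does not fix it

def actor_critic_quantities(act_prob, V, V_next, R):
    """act_prob: Actor output, probability distribution over behaviors under the current S (A2)."""
    log_prob = np.log(act_prob)            # A2: logarithmic transformation of act_prob
    a = int(np.argmax(act_prob))           # A4: extract the behavior with the maximum probability
    td_error = R + GAMMA * V_next - V      # A7: TD_error = R + GAMMA * V_next - V
    exp_v = np.mean(log_prob * td_error)   # A3: benefit-oriented loss exp_v = reduce_mean(log_prob * td_error)
    return a, td_error, exp_v              # exp_v is maximized when updating the Actor (A5/A9)

# example with a 5-phase action space
a, td, exp_v = actor_critic_quantities(np.array([0.1, 0.4, 0.2, 0.2, 0.1]), V=1.2, V_next=1.5, R=0.8)
```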
Preferably, the obtaining of the corresponding reward R in step S4 includes the following steps:
R=RNCR(t)+CR*0.3
(1) Road network smooth traffic rate
The road network smooth traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T, describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used for evaluating the traffic management effect;
wherein RNCR(t) denotes the road network smooth traffic rate in time period T (T may be taken as 5 min or 3 min), n is the number of road segments contained in the road network, l_ij is the length of the i-th road segment, and k_i is a binary function: k_i = 1 when the traffic state level of segment i belongs to the acceptable traffic state, otherwise k_i = 0. A road segment is in an acceptable traffic state when its average speed meanSpeed is greater than or equal to 20 km/h, and in an unacceptable state when meanSpeed is less than 20 km/h. The value range of RNCR(t) is [0,1]; the larger the value, the better the road network state, and conversely the worse the road network state.
(2) Congestion mileage ratio
The congestion mileage occupation ratio is the proportion of the length of a congested road section to the length of the whole road network, and describes the operation state of the whole road network:
wherein CR denotes the road network congestion mileage ratio, jamLengthInMetersSum denotes the total length of congested road sections in meters, and l_ij is the length of road segment l_ij.
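A sketch of this return computation is given below, assuming the road network is available as a list of segments with lengths and mean speeds; the variable jam_length_sum stands in for jamLengthInMetersSum, and the 20 km/h threshold and the R = RNCR(t) + CR * 0.3 form follow the description above.

```python
def reward(segments, jam_length_sum):
    """segments: list of (length_in_meters, mean_speed_kmh) tuples for all road segments."""
    total_len = sum(length for length, _ in segments)
    smooth_len = sum(length for length, speed in segments if speed >= 20.0)  # segments with k_i = 1
    rncr = smooth_len / total_len          # road network smooth traffic rate, in [0, 1]
    cr = jam_length_sum / total_len        # congestion mileage ratio
    return rncr + cr * 0.3                 # R = RNCR(t) + CR * 0.3 as written above

# example: three segments, 150 m of congested length in total
R = reward([(500.0, 35.0), (300.0, 12.0), (400.0, 25.0)], jam_length_sum=150.0)
```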
The invention has the beneficial effects that:
under the environment of multiple intersections, a model is designed by controlling traffic lights, an algorithm framework of Actor-Critic is used, meanwhile, a centralized learning and distributed execution method among intelligent agents is used, and the advantages of centralized learning and distributed learning are combined, so that the convergence speed of the algorithm is greatly improved. The invention improves the traffic state and lays a foundation for the application of traffic signal control of later-stage group intelligent reinforcement learning.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of the embodiment.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following descriptions.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the embodiments; it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention fall within the scope of the present invention. Thus, the following detailed description of the embodiments of the present invention, as presented in the figures, is not intended to limit the scope of the claimed invention, but is merely representative of selected embodiments of the invention.
Examples
The traffic signal lamp optimization method based on group intelligent reinforcement learning comprises the following steps:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relation; wherein S is the state corresponding to the current moment of the n agents; S_1, S_2, …, S_n are the states of the agents at the current moment; S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment; O_1, O_2, …, O_n are the observed values of the n agents; A_1, A_2, …, A_n are the behaviors of the agents; R_1, R_2, …, R_n are the rewards corresponding to the n agents; Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents; Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the agent numbers;
S2, initializing the parameters of the n agents;
the parameters of the agent include the state S, the behavior A and TD_error;
TD_error is used for measuring, each time the agent completes the behavior A, the difference between the reward feedback obtained from the environment and the reward feedback brought by the previous action selection, and for measuring whether the action selection performed by the Actor network is more reasonable and effective; the Actor network plays a role similar to a performer, selecting actions based on a policy, while the Critic network uses TD_error to evaluate whether the action selection made by the Actor network is more effective.
S3, initializing the Actor-Critic networks corresponding to the n agents and the global Actor-Critic_global network;
S4, inputting S into the respective Actor networks based on the parameters of the current n agents; each Actor network selects the corresponding behavior A of its agent, so that the environment gives the corresponding return R according to the state and behavior of the agent and the determined return function, and the state transitions to the next state S_next;
S5, taking the S, A and S_next obtained in step S4 as the input of the Critic network and calculating TD_error;
S6, updating the parameters and weights of the local Actor-Critic networks;
S7, updating the parameters and weights of the global Actor-Critic_global network;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents achieve the training target preset by the Actor-Critic_global network (the training target being that the road network smooth traffic rate and the congestion mileage ratio indexes reach a better state, or that the training model reaches a convergence state), obtaining a well-trained traffic signal lamp optimization model;
and S9, optimizing the current traffic light scheme through the traffic light optimization model to obtain the optimized traffic light scheme.
In a preferred embodiment, the setting of the state S in step S2 includes: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the queue length of vehicles on the roads converging at the current traffic-light intersection;
each index is weighted by a corresponding factor (weight), which is favorable for the convergence of the training result; factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is as follows:
S = idPhase * factor_1 + duration * factor_2 + queue * factor_3;
wherein factor_1 = len(green_list), where green_list denotes all traffic lights in the environment and len(green_list) denotes the number of all traffic lights in the environment;
factor_1 thus takes the number of green-light phases; factor_2 and factor_3 take integer values according to the test results: factor_2 = [factor_1 ÷ 3], factor_3 = [factor_1 × 0.7 + factor_2 ÷ 2], where [ ] denotes rounding to an integer.
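Purely as an illustration, the state value and factor weights described above can be computed as in the following sketch; the exact rounding convention for [ ] and the example inputs are assumptions not fixed by the description.

```python
def state_value(id_phase, duration, queue, green_list):
    """green_list: the list referred to above; len(green_list) gives factor_1."""
    factor1 = len(green_list)                        # factor_1 = len(green_list)
    factor2 = round(factor1 / 3)                     # factor_2 = [factor_1 / 3]
    factor3 = round(factor1 * 0.7 + factor2 / 2)     # factor_3 = [factor_1 * 0.7 + factor_2 / 2]
    return id_phase * factor1 + duration * factor2 + queue * factor3

# e.g. 6 green phases, current phase index 2, duration 27 s, queue of 14 vehicles
s = state_value(2, 27, 14, green_list=["g0", "g1", "g2", "g3", "g4", "g5"])
```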
In a preferred embodiment, the timing data corresponding to the current phase needs a certain discretization, which facilitates later convergence; the specific discretization is as follows:
in a preferred embodiment, the setting of the behavior a in the step S2 includes:
An action a is acquired, where a represents the phase to which the traffic light will change in the next state; the length of the action encoding is the number of phases of the individual traffic light, and the action space A can completely represent each phase using one-hot coding (e.g., [1,0,0,0,0] indicates that the traffic light has 5 groups of phases and the current one-hot code indicates the 0th group of phases, with numbering starting from 0).
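A minimal sketch of this one-hot action encoding, with an assumed phase count for the example:

```python
def one_hot_action(phase_index, n_phases):
    a = [0] * n_phases     # action vector whose length equals the number of phases
    a[phase_index] = 1     # mark the phase the traffic light will change to
    return a

one_hot_action(0, 5)   # -> [1, 0, 0, 0, 0], i.e. the 0th group of 5 phases
```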
In a preferred embodiment, the Actor-Critic network training respectively performed by the n agents in step S4 includes the following steps:
A1, initializing the state S, the action A and TD_error;
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; because the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; the act_prob probability distribution is subjected to the following logarithmic transformation, so that convergence can be achieved more quickly:
log_prob=log(act_prob)
A3: calculating the TD_error transmitted from the Critic network and the log_prob calculated in step A2 as follows to obtain the benefit-oriented loss value exp_v:
exp_v=reduce_mean(log_prob*td_error)
wherein reduce_mean denotes taking the mean in the neural network.
A4: the Actor extracts the behavior a with the maximum probability based on the act_prob calculated in step A2; the agent performs behavior a and receives the corresponding reward feedback from the environment, and the agent state switches to state S_next.
A5: training and updating the parameters and weights of the Actor network of the agent by using gradient descent to maximize the benefit-oriented loss value exp_v;
A6: transmitting the current state S and the state S_next into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A7: calculating TD_error using the reward value R obtained from the environment and the V and V_next obtained from step A6; the calculation formula is as follows:
TD_error = R + GAMMA * V_next - V
A8: back-propagating the TD_error obtained in step A7 to the Critic network to update the parameters and weights of the Critic network of the agent.
In a preferred embodiment, the obtaining of the corresponding reward R in step S4 includes the following steps:
R=RNCR(t)+CR*0.3
(1) Road network smooth traffic rate
The road network smooth traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network within a certain time period T, describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used for evaluating the traffic management effect;
wherein RNCR(t) denotes the road network smooth traffic rate in time period T (T may be taken as 5 min or 3 min), n is the number of road segments contained in the road network, l_ij is the length of the i-th road segment, and k_i is a binary function: k_i = 1 when the traffic state level of segment i belongs to the acceptable traffic state, otherwise k_i = 0. A road segment is in an acceptable traffic state when its average speed meanSpeed is greater than or equal to 20 km/h, and in an unacceptable state when meanSpeed is less than 20 km/h. The value range of RNCR(t) is [0,1]; the larger the value, the better the road network state, and conversely the worse the road network state.
(2) Congestion mileage ratio
The congestion mileage occupancy is the proportion of the length of a congested road section to the length of the whole road network, and describes the overall operation state of the road network:
wherein CR denotes the road network congestion mileage ratio, jamLengthInMetersSum denotes the total length of congested road sections in meters, and l_ij is the length of road segment l_ij. A smaller CR indicates a better traffic state.
Experiments show that this setting of the return function helps distinguish the degree of optimization, so that the model has better discrimination capability.
In a preferred embodiment, the original timing of traffic light A is as follows:
0: <phase duration="36" state="rrrGGgrrrGGg"/>
1: <phase duration="4" state="rrryygrrryyg"/>
2: <phase duration="6" state="rrrrrGrrrrrG"/>
3: <phase duration="4" state="rrrrryrrrrry"/>
4: <phase duration="36" state="GGgrrrGGgrrr"/>
5: <phase duration="4" state="yyyrrryyyrrr"/>
the timing shows that the traffic lights have 6 phases in total, and the phase corresponding time duration is 36S,4S,6S,4S,36S,4S.
The state represents the road-connection state controlled by the traffic light in each phase. For example, Fig. 1 shows the state of traffic light A in phase 0.
R and r indicate that the road connection controlled by the traffic signal light is in the red-light state, G and g indicate the green-light state, and Y and y indicate the yellow-light state. The distinction between upper and lower case also indicates passing priority: upper-case letters have higher priority than lower-case letters.
state="rrrGGgrrrGGg" means that, in this phase, road connections (1)(2)(3)(7)(8)(9) are in the red-light state and road connections (4)(5)(6)(10)(11)(12) are in the green-light state.
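As a reading aid only, the following sketch decodes such a phase state string into the road connections of each colour; the function name is an assumption introduced here, and the numbering from 1 follows the convention used above.

```python
def connections_by_colour(state):
    """state: phase state string; one character per controlled road connection."""
    groups = {"red": [], "green": [], "yellow": []}
    colour = {"r": "red", "g": "green", "y": "yellow"}
    for i, ch in enumerate(state, start=1):          # connections are numbered from 1
        groups[colour[ch.lower()]].append(i)         # upper case only marks higher passing priority
    return groups

connections_by_colour("rrrGGgrrrGGg")
# -> {'red': [1, 2, 3, 7, 8, 9], 'green': [4, 5, 6, 10, 11, 12], 'yellow': []}
```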
After optimization by the present invention, the timing of traffic light A is correspondingly optimized to:
0: <phase duration="27" state="rrrGGgrrrGGg"/>
1: <phase duration="2" state="rrryygrrryyg"/>
2: <phase duration="4" state="rrrrrGrrrrrG"/>
3: <phase duration="7" state="rrrrryrrrrry"/>
4: <phase duration="43" state="GGgrrrGGgrrr"/>
5: <phase duration="6" state="yyyrrryyyrrr"/>
The main optimization is embodied in the optimization of the duration (timing) corresponding to each phase.
While preferred embodiments of the present invention have been described, additional variations and modifications to those embodiments may occur to those skilled in the art once they learn of the basic inventive concept. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the invention. The invention is not limited by the specific details of the embodiments described above, and is intended to cover all modifications, equivalents and improvements falling within the spirit and scope of the invention.
Claims (5)
1. The traffic signal lamp optimization method based on group intelligent reinforcement learning is characterized by comprising the following steps of:
S1, dividing the current traffic signal lamp timing scheme of the area to be optimized into n agents with a complete cooperative relationship; wherein S_1, S_2, …, S_n are the states of the agents at the current moment; S_1_next, S_2_next, …, S_n_next are the states of the agents at the next moment; O_1, O_2, …, O_n are the observed values of the n agents; A_1, A_2, …, A_n are the behaviors of the agents; R_1, R_2, …, R_n are the rewards corresponding to the n agents; Actor_1, Actor_2, …, Actor_n are the Actor local networks constructed for the n agents; Critic_1, Critic_2, …, Critic_n are the Critic local networks corresponding to the Actor local networks of the n agents, together forming Actor-Critic_1, Actor-Critic_2, …, Actor-Critic_n; Actor_global is the global Actor network and Critic_global is the global Critic network, together forming Actor-Critic_global; subscripts 1, 2, …, n are the agent numbers;
S2, initializing the parameters of the n agents;
wherein S is the state corresponding to the current moment of the n agents,
and TD_error is used for measuring, each time the agent completes the behavior A, the difference between the reward feedback obtained from the environment and the reward feedback brought by the previous action selection, and for measuring whether the action selection performed by the Actor network is more reasonable and effective;
S4, inputting S into the respective Actor networks based on the parameters of the current n agents; each Actor network selects the behavior A of its agent, so that the environment gives the corresponding return R according to the state and behavior of the agent and the determined return function, and the state transitions to the next state S_next;
S8, repeating steps S4 to S7 until the set number of rounds is reached or the agents achieve the training target preset by the Actor-Critic_global network, obtaining a well-trained traffic signal lamp optimization model;
s9, optimizing the current traffic light scheme through the traffic light optimization model to obtain an optimized traffic light scheme;
the Actor-Critic network training respectively carried out by the n agents in step S4 comprises the following steps:
A2, transmitting S, A and TD_error into the Actor network and outputting act_prob; because the Actor network selects actions based on a probability distribution, act_prob is the probability distribution over all behavior selections under the current S; the act_prob probability distribution is subjected to the following logarithmic transformation, which is beneficial to faster convergence: log_prob = log(act_prob);
A3, calculating the TD_error transmitted from the Critic network and the log_prob calculated in step A2 as follows to obtain the benefit-oriented loss value exp_v: exp_v = reduce_mean(log_prob * td_error);
wherein reduce_mean denotes taking the mean in the neural network;
A5, transmitting the current state S and the state S_next obtained in step A4 into the Critic network, obtaining the current state value V and the next state value V_next respectively;
A6, calculating TD_error using the reward value R obtained from the environment and the V and V_next obtained from step A5; the calculation formula is: TD_error = R + GAMMA * V_next - V;
GAMMA: the discount factor in reinforcement learning;
A7, back-propagating the TD_error obtained in step A6 to the Critic network for updating the parameters and weights of the Critic network of the agent.
2. The traffic signal lamp optimization method based on group intelligent reinforcement learning according to claim 1, wherein the setting of the state S in step S2 comprises: the state S is obtained by comprehensively calculating three values: the current phase serial number idPhase, the timing duration corresponding to the current phase, and the queue length of vehicles on the roads at the current traffic-light intersection;
each index is weighted by a corresponding factor, which is favorable for the convergence of the training result; factor_1 is the idPhase weight, factor_2 is the duration weight, and factor_3 is the queue weight; the specific state space value formula is: S = idPhase * factor_1 + duration * factor_2 + queue * factor_3;
wherein factor_1 = len(green_list); green_list denotes all traffic lights in the environment, and len(green_list) denotes the number of all traffic lights in the environment;
3. The traffic signal lamp optimization method based on group intelligent reinforcement learning according to claim 2, wherein the timing data corresponding to the current phase needs a certain discretization, which facilitates later convergence; the specific discretization is as follows:
4. The traffic signal lamp optimization method based on group intelligent reinforcement learning according to claim 1, wherein the setting of the behavior A in step S2 comprises:
5. The traffic signal lamp optimization method based on group intelligent reinforcement learning according to claim 1, wherein the obtaining of the corresponding reward R in step S4 comprises the following steps: R = RNCR(t) + CR * 0.3;
(1) Road network smooth traffic rate
The road network smooth traffic rate is defined as the ratio of the road mileage of the road network in a good traffic state to the mileage of all the road sections in the road network in a certain time period T, describes the overall smooth traffic degree of the road network, is a measure of the overall traffic quality of the road network, and can be used for evaluating the traffic management effect;
wherein n is the number of road segments included in the road network, l_ij is the length of the i-th road segment, and k_i is a binary function: when the traffic state level of segment i belongs to the acceptable traffic state, k_i = 1, otherwise k_i = 0; the value range of RNCR(t) is [0,1];
(2) Congestion mileage ratio
The congestion mileage occupation ratio is the proportion of the length of a congested road section to the length of the whole road network, and describes the operation state of the whole road network:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110914300.0A CN113628458B (en) | 2021-08-10 | 2021-08-10 | Traffic signal lamp optimization method based on group intelligent reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113628458A CN113628458A (en) | 2021-11-09 |
CN113628458B (en) | 2022-10-04
Family
ID=78384203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110914300.0A Active CN113628458B (en) | 2021-08-10 | 2021-08-10 | Traffic signal lamp optimization method based on group intelligent reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113628458B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112325897A (en) * | 2020-11-19 | 2021-02-05 | 东北大学 | Path Planning Method Based on Heuristic Deep Reinforcement Learning |
CN112700664A (en) * | 2020-12-19 | 2021-04-23 | 北京工业大学 | Traffic signal timing optimization method based on deep reinforcement learning |
CN112863206A (en) * | 2021-01-07 | 2021-05-28 | 北京大学 | Traffic signal lamp control method and system based on reinforcement learning |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101667339B (en) * | 2009-09-23 | 2011-11-09 | 北京交通大学 | Modeling system and method based on measured traffic service level of urban road area |
US9037519B2 (en) * | 2012-10-18 | 2015-05-19 | Enjoyor Company Limited | Urban traffic state detection based on support vector machine and multilayer perceptron |
US10187098B1 (en) * | 2017-06-30 | 2019-01-22 | At&T Intellectual Property I, L.P. | Facilitation of passive intermodulation cancelation via machine learning |
US12307376B2 (en) * | 2018-06-06 | 2025-05-20 | Deepmind Technologies Limited | Training spectral inference neural networks using bilevel optimization |
CN109472984A (en) * | 2018-12-27 | 2019-03-15 | 苏州科技大学 | Signal light control method, system and storage medium based on deep reinforcement learning |
CN111243299B (en) * | 2020-01-20 | 2020-12-15 | 浙江工业大学 | A Single Intersection Signal Control Method Based on 3DQN_PSER Algorithm |
CN112201060B (en) * | 2020-09-27 | 2022-05-20 | 航天科工广信智能技术有限公司 | Actor-Critic-based single-intersection traffic signal control method |
CN112632858A (en) * | 2020-12-23 | 2021-04-09 | 浙江工业大学 | Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm |
CN112700663A (en) * | 2020-12-23 | 2021-04-23 | 大连理工大学 | Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy |
CN112949933B (en) * | 2021-03-23 | 2022-08-02 | 成都信息工程大学 | A traffic organization scheme optimization method based on multi-agent reinforcement learning |
- 2021-08-10: Application CN202110914300.0A filed (CN); patent CN113628458B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN113628458A (en) | 2021-11-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |