CN118629234A - Coordinated control method, device and readable storage medium for traffic lights at multiple intersections
- Publication number
- CN118629234A (application CN202410643565.5A)
- Authority
- CN
- China
- Prior art keywords
- traffic
- function
- traffic flow
- agent
- intersection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0125—Traffic data processing
- G08G1/07—Controlling traffic signals
- G08G1/08—Controlling traffic signals according to detected number or speed of vehicles
- G08G1/081—Plural intersections under common control
- G08G1/083—Controlling the allocation of time between phases of a cycle
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Description
Technical Field
The present invention relates to a coordinated control method and device for traffic lights at multiple intersections, and a readable storage medium, and belongs to the technical field of intelligent traffic light control.
Background Art
Transportation is a key driver of economic and social development and one manifestation of a city's economic competitiveness. The development of transportation infrastructure has brought many conveniences to people's lives, but it has also created a series of problems. The most common problem in urban transportation is congestion, which inconveniences travel, increases driving risk, reduces logistics efficiency, wastes energy, causes economic losses to society as a whole, and puts enormous pressure on the environment.
Traffic lights adjust the flow on different roads by controlling the signal phases for the different directions of an intersection, thereby diverting traffic, reducing congestion, and shortening the waiting time of vehicles at intersections. Traffic light control methods generally fall into three types: fixed-time control, actuated (sensor-based) control, and adaptive control. Fixed-time control updates the intersection signal phases on a predetermined schedule, while actuated control uses sensors to obtain the current traffic flow state and applies predetermined rules to set the intersection's green phases. Neither method takes long-term traffic conditions into account; both are unable to optimize signal control in real time and lack the ability to dynamically alleviate congestion.
With the development of urbanization and artificial intelligence, data-driven control methods play an increasingly important role in intelligent transportation systems, and many researchers have therefore focused on adaptive approaches to the traffic light control problem. In this context, reinforcement learning methods have been widely applied: traffic light control is in fact a sequential decision problem that can be modeled as a Markov decision process and solved with reinforcement learning algorithms. The goal of traffic light control is to minimize the waiting time of vehicles across the entire traffic network by adjusting the signal phases at intersections. For single-intersection control, Genders et al. were the first to apply deep reinforcement learning to traffic light control, using a convolutional neural network to approximate the Q-value function.
However, in real life traffic networks are interconnected: controlling the signal at one intersection inevitably affects the traffic conditions at its upstream and downstream intersections, triggering chain reactions at surrounding intersections. Many studies have therefore explored multi-agent reinforcement learning algorithms for the coordinated control of traffic lights at multiple intersections, seeking a balance between centralized and decentralized models that reduces congestion and minimizes vehicle waiting time over the whole road network. For example, Van der Pool et al. proposed a multi-intersection traffic light control mechanism based on deep reinforcement learning, defined a new reward function, and used transfer planning on a smaller set of intersections, linking the learned results to a larger set of intersections through the max-plus coordination algorithm, thereby achieving efficient coordinated control among multiple intersections. As another example, Casas applied the deep deterministic policy gradient (DDPG) algorithm to traffic light control: compared with value-based methods such as Deep Q-Network (DQN), DDPG optimizes the agent's policy directly via policy gradients and uses a deterministic policy to output concrete action values, reducing the variance of action selection and accelerating optimization. As a further example, Tian, Tan, Feng et al. optimized the global learning model based on the idea of sub-region division, proposing the Coder algorithm, which aggregates the sub-region outputs effectively and improves the coordination among different intersections. To further exploit the spatiotemporal dependencies of traffic data across intersections, many scholars have proposed traffic light control based on graph reinforcement learning. Nishi et al. proposed a reinforcement learning algorithm based on graph convolutional neural networks that models multiple traffic intersections as a network graph. Devailly et al. proposed an inductive graph reinforcement learning algorithm combining graph convolutional networks with independent DQNs. Wang et al. proposed a spatiotemporal multi-agent reinforcement learning framework that uses graph neural networks to extract spatial structure, recurrent neural networks to model system dynamics, and decentralized DQNs with an attention mechanism to achieve coordinated control of traffic lights at multiple intersections.
Although the works above address multi-intersection traffic light control through cooperative reinforcement learning, the influence of the mutually coupled agents on the global road network has not been explicitly considered. Modeling the relationships among multiple agents, and extracting the relationships among the input traffic flow data through an interpretable network, is therefore of significant research interest.
Summary of the Invention
The present invention provides a coordinated control method and device for traffic lights at multiple intersections, and a readable storage medium, aiming to solve at least one of the technical problems in the prior art.
The technical solution of the present invention relates to a coordinated control method for traffic lights at multiple intersections in a dynamic scene, wherein each lane at each traffic intersection is provided with a road sensor for transmitting vehicle information to a traffic control center. The method according to the present invention comprises the following steps:
S100: acquire traffic flow information through the road sensors and input it into a traffic light control model based on a multi-agent reinforcement learning algorithm, to obtain and store the state, action, and reward function of each agent;
S200: according to the traffic flow information of each control interval, use an interpretable influence mechanism based on an efficient link neural network to compute the importance coefficients of the input data on the different traffic networks and the weighted-aggregated traffic flow latent variables;
S300: use biased ReLU neural networks to approximate the value function and policy function in the actor-critic reinforcement learning algorithm, so as to construct a piecewise-linear actor-critic framework;
S400: adopt centralized training with distributed execution, in which the actor of each agent obtains locally observed traffic flow information via step S100 and is trained to obtain its own policy function, while the centralized critic is trained to obtain a joint value function from the weighted-aggregated traffic flow latent variables obtained in step S200;
S500: from the trained policy functions and joint value function, obtain the optimal solution of the multi-agent reinforcement learning algorithm and apply it to the coordinated control of traffic lights at multiple intersections.
Further, in step S100,
the traffic flow information includes the number of vehicles n_j on lane j, the vehicle density on lane j, and the waiting time W_i(k) of intersection i; at time k, these are defined as follows:
where u_j and d_j denote the upstream and downstream nodes of lane j, respectively; the inflow to lane j and the flow leaving lane j toward the next intersection v are the measured traffic flows; the adjacent intersections of the downstream node d_j are taken into account; τ denotes the inter-vehicle spacing coefficient; L(e_j) denotes the length of lane j; and the vehicle waiting time is taken over the lane whose upstream node is u_j and whose downstream node is v_i.
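As a rough illustration of these per-lane statistics (the patent's exact formulas appear only as images in the filing, so the density and waiting-time computations below are plausible stand-ins, not the patented definitions):

```python
# Hypothetical sketch of the per-lane traffic statistics in step S100.
# lane_density approximates occupancy from the vehicle count n_j, the
# inter-vehicle spacing coefficient tau, and the lane length L(e_j);
# intersection_waiting_time sums accumulated waiting over incoming lanes.

def lane_density(n_j: int, tau: float, lane_length: float) -> float:
    """Approximate occupancy of lane j: n_j vehicles, each claiming a
    spacing of tau meters, over lane length L(e_j); capped at 1.0."""
    return min(1.0, n_j * tau / lane_length)

def intersection_waiting_time(halted_seconds_per_lane: dict) -> float:
    """W_i(k): total accumulated waiting time over all incoming lanes
    of intersection i at step k."""
    return sum(halted_seconds_per_lane.values())

# Example with assumed numbers:
rho = lane_density(n_j=12, tau=7.5, lane_length=150.0)
W = intersection_waiting_time({"north": 40.0, "south": 25.0,
                               "east": 0.0, "west": 10.0})
```

In a deployment these inputs would come from the road sensors described above rather than hand-entered values.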
Further, in step S100, based on a partially observable Markov decision process, at time step k the local observation o_i of each agent i, which together form the joint state S = o_1 × o_2 × … × o_N, is expressed as follows:
where the observation of each agent i comprises the vehicle queue lengths of the roads adjacent to traffic intersection v_i; ρ_i = a_i, the signal phase of the current intersection, represents the agent's action and is a four-bit one-hot code whose active bit encodes the current green phase; and the road density of the current intersection is included as well. Each traffic intersection is provided with a separate controller that supplies the signal phase;
the reward function of the agent is expressed as follows:
where κ_1, κ_2, κ_3 denote the hyperparameters of the reward function for different traffic intersection conditions, and ΔQ_i(k) denotes the change of the vehicle queue between adjacent time steps;
and the global reward over the entire traffic network is set to the linear sum of the individual agent rewards r_i, i.e. r(k) = r_1(k) + r_2(k) + … + r_N(k),
where N denotes the number of agents.
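The reward structure above can be sketched as follows. The global reward is the stated linear sum over agents; the per-agent form (a weighted penalty on queue length, waiting time, and queue growth, with hyperparameters κ_1, κ_2, κ_3) is only a plausible stand-in, since the patented formula is not reproduced in this text:

```python
# Sketch of the reward in step S100. global_reward follows the text
# (r(k) = sum_i r_i(k)); agent_reward is a HYPOTHETICAL per-agent form,
# not the patent's exact (undisclosed) formula.

def agent_reward(queue_len, wait_time, delta_queue,
                 k1=0.5, k2=0.3, k3=0.2):
    # Weighted penalty on queue length, waiting time, and queue change.
    return -(k1 * queue_len + k2 * wait_time + k3 * delta_queue)

def global_reward(agent_rewards):
    # r(k) = r_1(k) + ... + r_N(k)
    return sum(agent_rewards)

r = global_reward([agent_reward(4, 30.0, 1), agent_reward(0, 0.0, -2)])
```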
Further, in step S200,
all hidden-layer nodes of the efficient link neural network are connected to the output layer, and the hidden-layer neurons of the efficient link neural network comprise source nodes D and intermediate nodes C;
the importance coefficients of the input data on the different traffic networks are expressed as follows:
where the variance σ_m serves as the importance coefficient of the input data on the different traffic networks; VAR(·) denotes the variance of the m-th output component corresponding to the input variable, with f(·) the output function of the efficient link neural network; the network input at time step k is obtained from the traffic flow information; α_{p,s} are the weight coefficients of the efficient link neural network, and z_{p,s} is the output of intermediate node C;
and the weighted-aggregated traffic flow latent variable is expressed as follows:
where H_{i,k} is the latent variable output by the efficient link neural network for the i-th intersection at time step k.
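A minimal sketch of this variance-based influence mechanism follows. The perturbation-based variance estimate is an assumption made for illustration; the patent computes σ_m from the network's internal weights α_{p,s} and node outputs z_{p,s}, which are not reproduced here. The normalized coefficients then weight the per-intersection latent variables H_{i,k}:

```python
import numpy as np

def importance_coefficients(f, x, eps=1e-2, n_samples=256, seed=0):
    """Estimate, for each input dimension, the variance it induces in
    the scalar output of f under small perturbations (stand-in for the
    patent's analytic sigma_m from alpha_{p,s} and z_{p,s})."""
    rng = np.random.default_rng(seed)
    sigmas = []
    for m in range(len(x)):
        xs = np.tile(x, (n_samples, 1))
        xs[:, m] += rng.normal(0.0, eps, size=n_samples)
        outs = np.array([f(row) for row in xs])
        sigmas.append(outs.var())
    return np.array(sigmas)

def aggregate_latents(H, sigma):
    """Weighted aggregation: sum_i w_i * H_i with w = sigma / sum(sigma),
    H of shape (n_intersections, latent_dim)."""
    w = sigma / sigma.sum()
    return (w[:, None] * H).sum(axis=0)
```

An input the network is more sensitive to receives a larger coefficient and thus a larger share of the aggregated latent.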
Further, in step S300,
the neuron z(x) of the biased ReLU neural network is expressed as follows:
where q_i denotes the number of partition biases along the i-th dimension of the input data, the biased ReLU network using different bias parameters β_{i,q_i} in different dimensions;
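A sketch of a biased ReLU feature map, assuming the common piecewise-linear form in which each input dimension i is passed through ReLU units shifted by a grid of biases β_{i,q} (the exact parameterization in the patent may differ):

```python
import numpy as np

def brelu_features(x, biases):
    """Biased ReLU feature map: for each input dimension i, emit
    max(0, x_i - beta) for every bias beta in that dimension's grid.
    `biases` is a list of 1-D bias grids, one per input dimension.
    The features are piecewise linear in x, which is what makes the
    actor-critic approximators in step S300 piecewise linear."""
    feats = []
    for i, grid in enumerate(biases):
        feats.append(np.maximum(0.0, x[i] - np.asarray(grid)))
    return np.concatenate(feats)

def brelu_neuron(x, biases, weights):
    """A BReLU neuron z(x): a linear combination of the features."""
    return float(weights @ brelu_features(x, biases))
```

With different bias grids per dimension, the input space is partitioned into cells on which the neuron is affine, matching the piecewise-linear framework described above.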
Further, in step S300,
the optimal solution π* of the reinforcement learning algorithm is expressed as follows:
where the minimization is taken over actions u ∈ π(x); x denotes the state in reinforcement learning, w denotes random noise with probability distribution P(·|x,u), π is the policy function, g(x,u,w) is the per-step cost, and γ is the discount factor; the optimal value function and the state-transition function f(x,u,w) of the reinforcement learning problem complete the expression.
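Given the quantities defined above, the missing expression is presumably the standard Bellman optimality condition for discounted cost; a hedged reconstruction (the filing shows the formula only as an image, and here the optimal value function is written as J*) is:

```latex
% Bellman optimality form consistent with the surrounding definitions:
% state x, action u \in \pi(x), noise w \sim P(\cdot\,|\,x,u),
% per-step cost g, discount \gamma, transition f, optimal value J^*.
J^*(x) = \min_{u \in \pi(x)} \; \mathbb{E}_{w}\!\left[\, g(x,u,w) + \gamma\, J^*\big(f(x,u,w)\big) \,\right],
\qquad
\pi^*(x) \in \arg\min_{u \in \pi(x)} \; \mathbb{E}_{w}\!\left[\, g(x,u,w) + \gamma\, J^*\big(f(x,u,w)\big) \,\right].
```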
Further, in step S400,
by summing the loss functions of all critics, a joint value function L(φ) is obtained; the critic's objective is to minimize this joint value function, expressed as follows:
where the advantage function appears in the loss, N_b denotes the training batch size, b′ denotes any sampled sequence after sample b, γ is the discount factor, λ is the regularization parameter, and W denotes the weight coefficients of the neural network that estimates the joint value function;
each decentralized actor updates its policy function π_{i,θ} according to the locally observed state variables of the traffic network, using a clip function to limit the ratio of change between the new and old policies, L_i(θ):
where ϵ denotes the truncation coefficient of the clip function, applied with the encoded variables of the local observations as input.
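The clipped ratio described here matches the PPO-style surrogate objective. A minimal sketch for one sample (the exact loss in the filing is shown only as an image, so the standard clipped surrogate below is an assumed form, not a copy):

```python
import numpy as np

def clipped_surrogate(new_logp, old_logp, advantage, epsilon=0.2):
    """PPO-style clipped objective for one sample:
    ratio = pi_new(a|o) / pi_old(a|o), computed from log-probabilities;
    the clip keeps the policy update within [1 - eps, 1 + eps] of the
    old policy, bounding how far each actor can move per update."""
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return float(np.minimum(ratio * advantage, clipped * advantage))
```

In the centralized-training scheme of step S400, each decentralized actor would maximize the mean of this quantity over its local observations, with the advantage supplied by the centralized critic.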
Further, in step S500, the average waiting time AVE and the traffic flow stability STA are used to quantitatively evaluate the state of the traffic network, expressed as follows:
where T_s denotes the test time and E_i(t) denotes the average vehicle waiting time of agent i during the test time T_s.
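These two metrics can be sketched as follows. Averaging E_i(t) over agents and the test horizon is implied by the text; the exact STA formula is not recoverable from this extraction, so the across-time standard deviation of the network-wide waiting time is used here purely as a stand-in:

```python
import numpy as np

# Sketch of the evaluation metrics in step S500. AVE follows the text
# (mean of E_i(t) over agents and the test horizon T_s); STA below is a
# HYPOTHETICAL stand-in, since the patented formula is not reproduced.

def evaluate(E):
    """E: array of shape (N_agents, T_s) with per-agent average waits."""
    E = np.asarray(E, dtype=float)
    ave = E.mean()              # average waiting time over agents and time
    sta = E.mean(axis=0).std()  # stand-in stability: variation over time
    return ave, sta

ave, sta = evaluate([[10.0, 12.0], [8.0, 10.0]])
```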
The technical solution of the present invention also relates to a computer-readable storage medium storing program instructions which, when executed by a processor, implement the above method.
The technical solution of the present invention also relates to a coordinated control device for traffic lights at multiple intersections, the device comprising a computer apparatus that includes the above computer-readable storage medium.
The beneficial effects of the present invention are as follows:
The present invention provides an interpretable influence mechanism that reveals the spatiotemporal dependencies of the input traffic flow data, and proposes a multi-agent reinforcement learning algorithm based on this mechanism. In the multi-intersection traffic light scenario, the proposed algorithm, on the one hand, extracts the spatiotemporal correlations of traffic features through the interpretable influence mechanism and models the influence of the different input traffic data on the output; on the other hand, it achieves coordinated control of traffic lights at multiple intersections through the multi-agent reinforcement learning algorithm, alleviating intersection congestion and effectively improving the efficiency and safety of road traffic flow.
Brief Description of the Drawings
FIG. 1 is a basic flow chart of the method according to the present invention.
FIG. 2 is a schematic structural diagram of the efficient link neural network used in the method according to the present invention.
FIG. 3 is a schematic structural diagram of the biased ReLU neural network used in the method according to the present invention.
FIG. 4 is a flow chart of the multi-agent reinforcement learning algorithm according to the method of the present invention.
FIG. 5 is a schematic structural diagram of the simulated irregular traffic network used in the method according to the present invention.
FIG. 6 compares the intersection waiting times of the method according to the present invention and several existing methods.
FIG. 7 compares the average reward per episode of the method according to the present invention and existing state-of-the-art algorithms.
FIG. 8 compares the per-episode traffic flow stability of the road network achieved by the method according to the present invention and existing state-of-the-art algorithms.
Detailed Description
The concept, specific structure, and technical effects of the present invention are described clearly and completely below in conjunction with the embodiments and drawings, so that the purpose, scheme, and effects of the present invention can be fully understood.
It should be noted that, unless otherwise specified, when a feature is described as being "fixed" or "connected" to another feature, it may be directly fixed or connected to that feature, or indirectly fixed or connected through an intermediate feature. The singular forms "a", "said", and "the" used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the art. The terms used in this specification are intended only to describe specific embodiments, not to limit the invention. The term "and/or" used herein includes any combination of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various elements, these elements are not limited by those terms, which serve only to distinguish elements of the same type from one another. For example, without departing from the scope of this disclosure, a first element could be called a second element, and similarly a second element could be called a first element. The use of any and all examples or exemplary language ("for example", "such as", etc.) provided herein is intended only to better illustrate embodiments of the present invention and, unless otherwise claimed, does not limit the scope of the present invention.
Referring to FIG. 1 to FIG. 4, in some embodiments, the coordinated control method for traffic lights at multiple intersections according to the present invention comprises at least the following steps:
S100: acquire traffic flow information through the road sensors and input it into a traffic light control model based on a multi-agent reinforcement learning algorithm, to obtain and store the state, action, and reward function of each agent;
S200: according to the traffic flow information of each control interval, use an interpretable influence mechanism based on an efficient link neural network to compute the importance coefficients of the input data on the different traffic networks and the weighted-aggregated traffic flow latent variables;
S300: use biased ReLU neural networks to approximate the value function and policy function in the actor-critic reinforcement learning algorithm, so as to construct a piecewise-linear actor-critic framework;
S400: adopt centralized training with distributed execution, in which the actor of each agent obtains locally observed traffic flow information via step S100 and is trained to obtain its own policy function, while the centralized critic is trained to obtain a joint value function from the weighted-aggregated traffic flow latent variables obtained in step S200;
S500: from the trained policy functions and joint value function, obtain the optimal solution of the multi-agent reinforcement learning algorithm and apply it to the coordinated control of traffic lights at multiple intersections.
The present invention proposes a coordinated control method for traffic lights at multiple intersections based on an interpretable influence mechanism. It is a large-scale coordinated signal control method for irregular road networks that can better establish the relationships among different intersections and achieve coordinated signal control over the global traffic network.
The method of the present invention can coordinately control all traffic lights spanning multiple traffic networks, effectively alleviating intersection congestion and improving the efficiency and safety of road traffic flow. The traffic network may be an irregular network, a regular grid network (e.g., an m×n grid), and the like; each network may contain multiple traffic intersections, which may be three-way junctions and/or crossroads, and each intersection may have multiple two-way and/or one-way lanes in different directions. The lane length between two adjacent intersections, the speed limit of each lane, and the number of external input lanes can be obtained through the traffic control center. Furthermore, the signal timing of each intersection, including the yellow time and the minimum and maximum green times, can be obtained through the traffic control center.
Specific implementation of step S100
In the method of the present invention, when a vehicle enters the coverage of the intelligent transportation system, the road sensors collect statistics on the vehicle information and transmit them to the traffic control center; the variables of the state, action, and reward function in the reinforcement learning problem are defined and stored in a replay memory.
Specifically, the traffic flow information of each traffic network at time k is obtained through the road sensors on every lane. The present invention models the traffic network as a graph, in which each intersection is a node and the road segment connecting adjacent intersections is an edge. The information includes the number of vehicles n_j on lane j, the vehicle density on lane j, and the waiting time W_i(k) of intersection i, defined as follows:
where u_j and d_j denote the upstream and downstream nodes of lane j, respectively; the inflow to lane j and the flow leaving lane j toward the next intersection v are the measured traffic flows; the adjacent intersections of the downstream node d_j are taken into account; τ denotes the inter-vehicle spacing coefficient; and L(e_j) denotes the length of lane j, i.e., the lane length between two adjacent intersections. The vehicle waiting time is taken over the lane whose upstream node is u_j and whose downstream node is v_i.
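The graph model described above (intersections as nodes, connecting road segments as edges) can be sketched with a plain adjacency structure; the node names and lane lengths below are illustrative only:

```python
# Minimal sketch of the road-network graph in step S100: intersections
# are nodes, connecting segments are directed edges carrying the lane
# length L(e_j). All identifiers and lengths are illustrative.

road_graph = {
    # edge (u_j, d_j): lane length L(e_j) in meters
    ("v1", "v2"): 150.0,
    ("v2", "v1"): 150.0,
    ("v2", "v3"): 220.0,
    ("v3", "v2"): 220.0,
}

def neighbors(node):
    """Adjacent intersections of `node` (used when routing outflow
    from a lane's downstream node d_j to the next intersection v)."""
    return sorted({d for (u, d) in road_graph if u == node})
```

An irregular network simply corresponds to an irregular edge set; nothing in this representation assumes a grid.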
In the traffic light control model based on the multi-agent reinforcement learning algorithm of the present invention, the multi-intersection traffic light control problem is formulated as a partially observable Markov decision problem, and the state, action, and reward function of each agent are given.
In some application embodiments, in order to better represent the inbound and outbound traffic flow on the roads and the vehicle queues at the intersections, the present invention defines the observation of each agent i as the vehicle queue lengths of the roads adjacent to traffic intersection v_i (i.e., the downstream node v_i of lane j), the signal phase ρ of the current intersection, and the road density. At time step k, the local observation o_i(k) of each agent i can be defined accordingly,
where o_i(k) represents agent i's part of the joint state S = o_1 × o_2 × … × o_N. Each traffic intersection has a separate controller that supplies the signal phase; the phase ρ represents the agent's action and is a four-bit one-hot code whose active bit encodes the current green phase. At each time step k, the controller of intersection v_i selects a discrete action, and the joint action space grows exponentially with the number of agents. The present invention considers only the feasible phase configurations in the action set and uses four-stage green-phase control.
In some embodiments, the present invention proposes a new reward function. Considering the characteristics of large-scale road networks and the optimization objective of traffic-light control, the reward function of each agent is defined as:

where κ_1, κ_2, κ_3 are hyperparameters of the reward function for different intersection conditions, and ΔQ_i(k) is the change in the vehicle queue between adjacent time steps.

The global reward over the entire road network is set to the linear sum r(k) of the individual agent rewards r_i, which is expressed as follows:

r(k) = Σ_{i=1}^{N} r_i(k)

where N is the number of agents, i.e., the number of intersections.
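The global reward is the plain sum of the agent rewards. The per-agent reward below is only a hypothetical weighted penalty built from κ_1, κ_2, κ_3 and ΔQ_i(k); the patent's exact per-agent formula is present only as an omitted equation, so this functional form is an illustrative assumption.

```python
def agent_reward(delta_queue, wait_time, density, k1, k2, k3):
    """Hypothetical per-agent reward r_i(k): a weighted penalty on the queue
    change Delta Q_i(k), waiting time, and road density, using the three
    hyper-parameters kappa_1..kappa_3. The exact combination is an assumption."""
    return -(k1 * delta_queue + k2 * wait_time + k3 * density)

def global_reward(agent_rewards):
    """Global reward r(k): linear sum of the individual agent rewards r_i(k)."""
    return sum(agent_rewards)
```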
Specific implementation of step S200
The traffic-light control model is a nonlinear system, and processing the complex spatiotemporal traffic-flow data it involves is highly challenging. The present invention therefore proposes an interpretable influence mechanism based on an efficient hinging hyperplanes neural network (EHHNN) to extract the spatiotemporal dependencies of the traffic-flow data and thereby enhance the coordinated control of the lights at different intersections. From the road-network traffic-flow data of each control interval in the memory bank, the interpretable influence mechanism is used to compute the importance coefficients of the inputs from different parts of the road network and to derive the weighted-aggregated traffic-flow latent variables.

First, the traffic-flow data obtained in step S100 are fed through a linear transformation layer to form the EHHNN input at step k. The EHHNN can be regarded as a directed acyclic graph whose structure is shown in Figure 2: all hidden-layer nodes are connected to the output layer, and the hidden layer contains two types of neurons, source nodes D and intermediate nodes C.
In the EHHNN, the output of a source node D can be written as:

z_{1,s} = max{0, x_m − β_{m,q_m}}

where m indexes the dimensions of the input data, and β_{m,q_m} is the q_m-th bias parameter on the input variable x_m.
In the hidden layers of the EHHNN, the output z_{p,s} of an intermediate node C is obtained by taking the minimum over neurons generated in the preceding layers:

where J_{p,s} is defined as the set containing the neurons generated by the preceding layers, with |J_{p,s}| = p for the intermediate nodes of the p-th layer.
Finally, the output of the EHHNN is the weighted sum of all neurons in the hidden layer:

where α_{1,s} and α_{p,s} are the weight coefficients of the EHHNN, α_0 is a bias constant, and n_1 and n_p denote the numbers of neurons in the first and the p-th layers, respectively.
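The three EHHNN equations above (hinge source neurons, min-type intermediate neurons, weighted-sum output) can be sketched as a single forward pass. Restricting the network to one intermediate layer formed from pairwise minima is a simplifying assumption; the patent allows deeper combinations.

```python
import itertools
import numpy as np

def ehhnn_forward(x, biases, alpha0, alpha1, alpha2):
    """Minimal EHHNN forward pass.

    Source neurons:       z_{1,s} = max(0, x_m - beta_{m,q})
    Intermediate neurons: min over subsets of previously generated neurons
                          (here: all pairs, an illustrative choice)
    Output:               alpha_0 plus the weighted sum of all hidden neurons
    """
    # one hinge neuron per (input dimension, bias) pair
    z1 = [max(0.0, x[m] - b) for m, bs in enumerate(biases) for b in bs]
    # second-layer neurons: minimum over every pair of source neurons
    z2 = [min(a, b) for a, b in itertools.combinations(z1, 2)]
    return alpha0 + float(np.dot(alpha1, z1)) + float(np.dot(alpha2, z2))

# two inputs, one bias per dimension, hand-picked weights
y = ehhnn_forward([1.0, 2.0], [[0.5], [1.0]], 0.0, [1.0, 1.0], [2.0])
```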
The present invention employs two-factor analysis of variance (ANOVA) to determine the main effects of the different traffic-flow inputs as individual factors, as well as the interaction effects of factor pairs on intersection congestion. This property of the EHHNN provides insight into how the different variables contribute to the overall prediction and supports a deeper understanding of the latent relationships in the data, so the hidden nodes affecting each output component of the ANOVA decomposition can be identified:

where VAR(·) denotes the variance of the m-th output component associated with a given input variable, and this variance σ_m serves as the importance coefficient of the corresponding input from the road network; f(·) is the output function of the EHHNN. The larger σ_m is, the greater the influence of the corresponding input variable on the predicted output.
The EHHNN-based influence mechanism of the present invention achieves accurate prediction on the one hand and, on the other hand, obtains through the ANOVA decomposition the influence coefficients of the different inputs on the output, from which an aggregated embedding that weights the traffic-flow information from the other lanes of the road network is derived. The weighted-aggregated traffic-flow latent variable is expressed as follows:

where H_{i,k} is the latent variable obtained by the EHHNN for the i-th intersection at step k.
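A minimal sketch of the variance-based importance weighting and the weighted aggregation of the latent variables H_{i,k}. Normalizing the σ_m to sum to one and aggregating by a plain weighted sum are assumptions consistent with, but not verbatim from, the omitted formulas.

```python
import numpy as np

def importance_coefficients(component_outputs):
    """sigma_m = VAR of the m-th ANOVA output component over the data set;
    a larger variance means the corresponding input matters more.
    Rows = output components, columns = samples. Normalisation is assumed."""
    sigma = np.var(component_outputs, axis=1)
    return sigma / sigma.sum()

def aggregate_latents(latents, weights):
    """Weighted-aggregated traffic-flow latent variable: importance-weighted
    sum of the EHHNN latent variables H_{i,k} (rows of `latents`)."""
    return np.tensordot(weights, np.asarray(latents, dtype=float), axes=1)
```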
Specific implementation of step S300
The actor-critic algorithm is a policy-learning method in reinforcement learning. The present invention designs a new piecewise-linear neural network, the biased ReLU neural network, to approximate the joint value function and the policy functions π_{i,θ} of the actor-critic method.

See Figure 3 for the network structure of the biased ReLU neural network: the input data first pass through the source nodes D (source neurons), then through the intermediate nodes C (intermediate neurons), and finally reach the output layer. The biased ReLU network of the present invention is a piecewise-linear neural network similar to an ordinary ReLU network, except that the biased ReLU uses different bias parameters in different input dimensions.
Specifically, the bias parameter set in each dimension is determined from the distribution of the input data:

where v and η are the variance and expectation of the input data, respectively.
A neuron z(x) of the biased ReLU neural network can be expressed as:

where q_i denotes the number of partition biases in the i-th dimension of the input data.
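The bias construction and the resulting neurons can be sketched as follows. Placing the q biases of each dimension uniformly within one standard deviation of the mean is an assumption; the patent states only that the biases are determined from the variance v and expectation η of the input data.

```python
import numpy as np

def brelu_biases(x, q):
    """Per-dimension bias sets for the biased ReLU network, derived from the
    empirical mean (eta) and variance (v) of the input data (rows = samples).
    The uniform spacing over [eta - sqrt(v), eta + sqrt(v)] is an assumption."""
    x = np.asarray(x, dtype=float)
    eta, v = x.mean(axis=0), x.var(axis=0)
    return [np.linspace(e - np.sqrt(s), e + np.sqrt(s), q)
            for e, s in zip(eta, v)]

def brelu_neurons(x_row, biases):
    """Biased-ReLU neurons: max(0, x_i - beta_{i,q}) for every input
    dimension i and every bias in that dimension's set."""
    return [max(0.0, x_row[i] - b) for i, bs in enumerate(biases) for b in bs]
```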
For a reinforcement learning problem, the optimal solution π* can be expressed as:

u ∈ π(x)

where x is the state, w is random noise with probability distribution P(·|x,u), π is the policy function, g(x,u,w) is the per-step cost, γ is the discount factor, V* is the optimal value function, and f(x,u,w) is the state-transition function.
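Under the symbols just defined, the optimal solution can be written out in the standard Bellman form (the original equation appears only as an image in the source, so this reconstruction is a best-effort reading):

```latex
\pi^{*}(x) \;=\; \arg\min_{u \in \pi(x)} \;
\mathbb{E}_{w \sim P(\cdot \mid x, u)}
\Big[\, g(x, u, w) \;+\; \gamma\, V^{*}\!\big( f(x, u, w) \big) \,\Big]
```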
According to conclusions in the prior art, maximizing or minimizing a piecewise-linear value function yields a piecewise-linear policy solution. By using biased ReLU networks to approximate both the policy function and the value function of the actor-critic method, the present invention constructs a piecewise-linear actor-critic framework.
Specific implementation of step S400
See Figure 4 for the flowchart of the multi-agent reinforcement learning algorithm of the present invention. Unlike the independent proximal policy optimization (IPPO) method of the prior art, the multi-agent actor-critic algorithm proposed here uses centralized training with distributed execution: each agent's actor is trained on the traffic-flow information obtained from its local observation (o_i(k) in step S100) to learn its own policy function, while the centralized critic is trained on the weighted-aggregated traffic-flow latent variables obtained in step S200 to learn a joint value function.

Specifically, all critics are updated together to minimize the loss of the joint value function. The advantage function is a truncated estimate based on generalized advantage estimation from the prior art; N_b denotes the training batch size, b′ an arbitrary sample index after sample b, γ the discount factor, λ the regularization parameter, and W the weight coefficients of the neural network that estimates the joint value function. Summing the loss functions of all critics gives the joint loss L(φ), which is expressed as follows:
where L(φ) is the joint value-function loss to be minimized, and the weighted-aggregated traffic-flow latent variable obtained in step S200 serves as the critic input. Each decentralized actor updates its policy function π_{i,θ} from the locally observed state variables of the road network and, following the proximal policy optimization method, uses the clip function to limit the ratio between the new and old policies in its objective L_i(θ):

where ϵ denotes the clipping coefficient of the clip function, and the encoded variable of the local observation serves as the actor input.
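The clipping step for a single sample can be sketched as below; this is the standard PPO clipped surrogate, which matches the description above, with the full objective L_i(θ) (omitted equation) assumed to average this quantity over a batch.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample: the probability ratio
    pi_theta(u|o) / pi_theta_old(u|o) is clipped to [1-eps, 1+eps] so a
    single update cannot move the policy too far from the old one."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```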
Specific implementation of step S500
Steps S100 to S400 are repeated, and the policy obtained at each time step k (the reinforcement-learning optimal solution π* from step S300) is applied to the traffic lights until each episode of multi-intersection coordinated control is complete. Note that an episode refers to a complete sequence starting from the initial state of the road network, i.e., time step k = 0, until a preset terminal state is reached; in the present invention the terminal state is a simulated network time of 2500 s.

Specifically, the signal phase at each intersection is controlled, the traffic-flow information at each time step k is collected through the sensors on the traffic lanes, and the average waiting time AVE and the traffic-flow stability STA are used to quantitatively evaluate the state of the road network:

where T_s denotes the test duration and E_i(t) denotes the average vehicle waiting time of agent i within the test duration T_s (see neuron E in Figure 2).
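A sketch of the two evaluation metrics. The patent's exact formulas for AVE and STA are given only as omitted equations, so computing AVE as the overall mean of E_i(t) and STA as the standard deviation of the network-wide waiting time over the test horizon are assumptions.

```python
import numpy as np

def evaluate_network(E, Ts):
    """Evaluate the road network over the test horizon Ts.

    E  : (N agents x T steps) array of per-step average waiting times E_i(t)
    AVE: overall average waiting time
    STA: fluctuation (std. dev.) of the network-wide waiting time over time
    """
    E = np.asarray(E, dtype=float)[:, :Ts]
    ave = E.mean()
    sta = E.mean(axis=0).std()
    return ave, sta
```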
A specific embodiment is described here. The multi-intersection traffic-light coordinated control method of the present invention is applied to two different road networks for simulation verification; the simulation platform is Simulation of Urban Mobility (SUMO), and the simulation scenarios for the multi-intersection traffic lights are designed as follows.

For example, the road network is set as a 5×5 traffic grid in which every intersection has three bidirectional lanes in each of the four directions (north, south, east, and west); the lane length between adjacent intersections is 100 m, the speed limit of every lane is 13.89 m/s, and 930 simulated vehicles enter the network in each episode.

As another example, an irregular road network is set up with 8 intersections, comprising 2 three-way junctions and 6 four-way crossroads. The lane lengths between adjacent intersections range from 75 m to 150 m, the speed limit of every lane is 13.89 m/s, there are 10 external input lanes, and 250 simulated vehicles enter the network in each episode.

In addition, to ensure intersection safety and valid signal output, the yellow time of each intersection light is set to 2 s, and the minimum and maximum green times are 5 s and 50 s, respectively. By calling the TraCI interface, the traffic-flow information of each part of the road network at time k can be obtained from the sensors on every lane of the simulation platform.
Specifically, using Python on the SUMO simulation platform, the phase of each intersection's lights is controlled and the traffic-flow information at each time step k is collected through the road sensors; see Figures 5 to 8. Experiments were carried out on the simulated irregular road network of Figure 5. Figure 6 compares the intersection waiting time of the present invention with fixed-time control, the IPPO algorithm (independent proximal policy optimization), the IDQN algorithm (independent deep Q-network), the MADDPG algorithm (multi-agent deep deterministic policy gradient), and the DGN algorithm (graph convolutional reinforcement learning). Figure 7 compares the average reward per episode of the present invention with these state-of-the-art algorithms, and Figure 8 compares the per-episode traffic-flow stability of the road network. The proposed algorithm has better control performance, achieves lower network delay and smoother traffic flow, and effectively relieves congestion at multiple intersections. In addition, two ablation experiments were completed: compared with the BR-EHH-waiting-time-reward experiment, the new reward function proposed here obtains better control performance; compared with the ReLU-EHH experiment, the actor-critic framework based on the biased ReLU neural network used here has better control performance.
The above are only preferred embodiments of the present invention, which is not limited to the implementations described. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention, provided it achieves the technical effect of the present invention by the same means, shall fall within the scope of protection of the present invention. Various modifications and variations of the technical solutions and/or implementations are possible within that scope of protection.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410643565.5A CN118629234B (en) | 2024-05-23 | 2024-05-23 | Coordinated control method, device and readable storage medium for traffic lights at multiple intersections |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118629234A true CN118629234A (en) | 2024-09-10 |
CN118629234B CN118629234B (en) | 2024-12-13 |
Family
ID=92598756
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119252047A (en) * | 2024-10-31 | 2025-01-03 | 北京航空航天大学 | Traffic signal control method and system based on safety reinforcement learning |
CN119360639A (en) * | 2024-12-30 | 2025-01-24 | 江西科技师范大学 | An event-triggered intelligent traffic control method, system, device and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020026277A1 (en) * | 2000-05-10 | 2002-02-28 | Boris Kerner | Method for traffic situation determination on the basis of reporting vehicle data for a traffic network with traffic-controlled network nodes |
US6366219B1 (en) * | 1997-05-20 | 2002-04-02 | Bouchaib Hoummady | Method and device for managing road traffic using a video camera as data source |
CN113963555A (en) * | 2021-10-12 | 2022-01-21 | 南京航空航天大学 | A deep reinforcement learning traffic signal control method combined with state prediction |
CN115083174A (en) * | 2022-06-07 | 2022-09-20 | 杭州电子科技大学 | Traffic signal lamp control method based on cooperative multi-agent reinforcement learning |
CN115273502A (en) * | 2022-07-28 | 2022-11-01 | 西安电子科技大学 | Traffic signal cooperative control method |
CN117746654A (en) * | 2023-12-25 | 2024-03-22 | 南京信息工程大学 | Public traffic priority traffic signal cooperative control method based on multi-agent deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
LI Chungui; ZHOU Jianhe; SUN Ziguang; WANG Meng; ZHANG Zengfang: "Traffic signal control based on multi-agent team reinforcement learning", Journal of Guangxi University of Technology, no. 02, 15 June 2011 (2011-06-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||