CN108364467B

CN108364467B - A Road Condition Information Prediction Method Based on Improved Decision Tree Algorithm

Info

Publication number: CN108364467B
Application number: CN201810144289.2A
Authority: CN
Inventors: 何泾沙; 侯立夫; 廖志钢; 黄辉祥
Original assignee: Beijing University of Technology
Current assignee: Gong Weihua
Priority date: 2018-02-12
Filing date: 2018-02-12
Publication date: 2020-08-07
Anticipated expiration: 2038-02-12
Also published as: CN108364467A

Abstract

The invention discloses a road condition information prediction method based on an improved decision tree algorithm, comprising: determining the attributes used to analyze road connectivity from the influencing factors of road connectivity; collecting road data; preprocessing the data; calculating the information entropy; calculating the attribute entropy of each attribute; calculating the correlation function value of each attribute based on a correlation function; calculating the weight value of each attribute from its correlation function value; calculating the information gain of each attribute from the information entropy, the attribute entropy of each attribute and the weight value of each attribute; building a decision tree according to the magnitude of the information gain of each attribute; and predicting road conditions according to the decision tree. The invention constructs the decision tree by combining the correlation function value of each attribute with the information entropy to obtain attribute weight values, which overcomes the tendency of the traditional ID3 algorithm to select attributes with more possible values as high-weight attributes. The constructed decision tree is used to predict the degree of road congestion improvement in the next period.

Description

A Road Condition Information Prediction Method Based on Improved Decision Tree Algorithm

Technical Field

The invention relates to the technical field of decision tree algorithms and road traffic flow models, and in particular to a road condition information prediction method based on an improved decision tree algorithm.

Background

With the advance of global urbanization, the number of motor vehicles in cities increases year by year. By the end of June 2016, the number of motor vehicles in Beijing had reached 5.44 million, ranking first in China. For megacities such as Beijing, Tokyo and New York, this figure will continue to rise over time. The huge number of motor vehicles not only causes traffic congestion but also brings air pollution, energy waste and other problems, which hinder urban development and lower living standards. Among 200 large cities worldwide (population greater than 800,000), Beijing ranks 15th in traffic congestion. Congestion not only adds travel time but also raises fuel consumption, which greatly increases operating costs in transportation, logistics and other industries. In 2011, delays caused by traffic jams in Australia's six largest cities cost logistics companies 13.7 billion US dollars. In China, transportation fuel accounts for 46% of a logistics company's total operating costs. In summary, alleviating congestion and improving road traffic conditions can reduce travel costs, reduce the economic losses caused by congestion, and reduce exhaust emissions. The causes of traffic congestion are diverse. Under existing traffic conditions, maximizing the use of road resources can effectively alleviate congestion, and research on road information helps to evaluate, intuitively and efficiently, the traffic volume a road can carry and the effectiveness of route-finding algorithms.

1. Decision tree algorithm and improvement

The decision tree is a commonly used classification and regression method that infers classification rules, represented as a decision tree, from a set of unordered and irregular cases. The decision tree classification algorithm works top-down and recursively: attribute values are compared at the internal nodes of the tree, the branch to follow from each node is chosen according to the attribute value, and the classification result is obtained when a leaf node is reached. Each internal node of the tree tests one attribute, each branch corresponds to an outcome of that test, and the outcomes are tested further at the nodes of the next level. Therefore, each path from the root to a leaf corresponds to one decision rule; the closer a node is to the root, the higher the weight of its attribute, and the whole decision tree corresponds to a set of rule expressions.

The decision tree classification algorithm consists of two steps: tree generation and tree pruning. The generation algorithm takes a set of class-labeled samples as input and constructs a binary or multiway decision tree. In a binary tree, an internal node generally represents a logical test, and the edges of the tree are the outcomes of that test. In a multiway tree, an internal node is an attribute of the sample set, the edges are all the values of that attribute, the number of attribute values determines the number of edges, and the leaf nodes are class labels. The tree is constructed by a top-down recursive method. The algorithm is as follows:

Algorithm Generate_decision_tree

Input: training samples samples, represented by discrete-valued attributes; the set of candidate attributes attribute_list.

Output: a decision tree generated from the given samples.

(1) Create node N;

(2) If the samples all belong to the same class C, return N as a leaf node labeled with class C, and terminate;

(3) If attribute_list is empty, return N as a leaf node labeled with the most common class in samples, and terminate;

(4) Select the attribute h_attribute with the highest information gain in attribute_list;

(5) Label node N with h_attribute;

(6) For each known value Si of h_attribute, grow a branch from node N for the condition h_attribute = Si;

(7) Let Si be the set of samples in samples for which h_attribute = Si. If Si is empty, attach a leaf labeled with the most common class in samples; otherwise attach the node returned by Generate_decision_tree(Si, attribute_list, h_attribute).
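The steps of Generate_decision_tree can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the nested-dict tree representation, the `domains` argument (the known values of each attribute) and the helper names are assumptions introduced here, and the third argument of the recursive call in step (7) is interpreted as removing h_attribute from the candidate list.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(samples, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(s[attr] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def generate_decision_tree(samples, labels, attribute_list, domains):
    """Return a class label (leaf) or {"attribute": name, "branches": {value: subtree}}."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:        # step (2): all samples share one class
        return labels[0]
    if not attribute_list:           # step (3): no attributes left, majority vote
        return majority
    # step (4): pick the attribute with the highest information gain
    best = max(attribute_list, key=lambda a: information_gain(samples, labels, a))
    remaining = [a for a in attribute_list if a != best]
    node = {"attribute": best, "branches": {}}   # step (5)
    for value in domains[best]:      # step (6): one branch per known value
        pairs = [(s, l) for s, l in zip(samples, labels) if s[best] == value]
        if not pairs:                # step (7): empty subset becomes a majority leaf
            node["branches"][value] = majority
        else:
            sub_samples = [s for s, _ in pairs]
            sub_labels = [l for _, l in pairs]
            node["branches"][value] = generate_decision_tree(
                sub_samples, sub_labels, remaining, domains)
    return node


# Hypothetical toy data: two discrete attributes and a congestion label.
samples = [
    {"lanes": "many", "bus_stop": "yes"},
    {"lanes": "many", "bus_stop": "no"},
    {"lanes": "few", "bus_stop": "yes"},
    {"lanes": "few", "bus_stop": "no"},
]
labels = ["smooth", "smooth", "congested", "congested"]
domains = {"lanes": ["many", "few"], "bus_stop": ["yes", "no"]}
tree = generate_decision_tree(samples, labels, ["lanes", "bus_stop"], domains)
print(tree)
```

In this toy data the label is fully determined by `lanes`, so that attribute is chosen at the root and both branches end in leaves.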

The key to decision tree generation is how to choose good logical tests or attributes. Choosing the attributes that yield an optimal decision tree is an NP-hard problem, so attribute selection can only rely on heuristic strategies. Attribute selection depends on impurity measures computed on the sample subsets, including information gain, information gain ratio, weight of evidence, minimum description length, and others.

Since real-world data is generally imperfect, the measure should be chosen according to the problem and the characteristics of the attribute fields. In addition, when some attributes contain missing values, inaccurate data, noise or errors, a decision tree can reduce the effect of noise through pre-pruning and post-pruning. The basic construction method does not consider noise, so the generated tree fits the training samples exactly. When noise is present, an exact fit leads to overfitting: fitting the training data perfectly reduces the model's ability to classify real-world data. Pruning is therefore another part of decision tree construction, and it also simplifies the tree and makes it easier to understand.

Building on the decision tree idea, Quinlan proposed the ID3 algorithm, which constructs a decision tree using entropy and information gain as its criteria. The algorithm uses information entropy as heuristic knowledge to select a suitable classification attribute and divide the sample set into several subsets. The nodes of the tree are constructed level by level by selecting the attribute with the highest information gain as the preferred splitting condition for the sample set. In this way, the information needed to classify the resulting training subsets is minimized, and the degree of class mixing in the generated subsets is reduced to a minimum. To find the best way to classify the samples, the number of questions asked while building the tree should be kept as small as possible, that is, the depth of the tree should be reduced; the information gain function is one way to obtain such a balanced partition.

Let S be the training sample set, containing samples of n classes denoted C_1, C_2, …, C_n. The entropy of S is

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where p_i is the probability of class C_i. If the n classes of training samples in S are regarded as n different messages, the entropy of S is the average number of bits needed to encode one message, and |S| × Entropy(S) is the number of bits needed to encode S, where |S| denotes the number of samples in S.

Suppose attribute A divides S into m subsets. The entropy, or expected information, of the subsets produced by A can be expressed as

$$\mathrm{Entropy}(S, A) = \sum_{i=1}^{m} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$$

where S_i is the i-th subset of the sample set divided according to attribute A, and |S| and |S_i| are the numbers of samples in S and S_i, respectively. Information gain measures the expected reduction in entropy, so the information gain obtained by dividing S by attribute A is

Gain(S,A) = Entropy(S) - Entropy(S,A)

Gain(S,A) is the expected reduction in entropy that results from knowing the value of attribute A. By definition, the larger the information gain, the larger the reduction in entropy and the purer the resulting nodes; a larger Gain(S,A) therefore means that testing attribute A provides more information for classification. Accordingly, the attribute with the larger information gain is preferred as the branching attribute of the decision tree.
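As a worked illustration, consider a hypothetical training set (the numbers are assumed for exposition and do not come from the patent) of 8 road segments, 4 labeled congested and 4 labeled smooth, and an attribute A that splits it into two subsets of 4 samples with class counts (3, 1) and (1, 3). Using Shannon entropy with base-2 logarithms:

```latex
\begin{aligned}
\mathrm{Entropy}(S) &= -\tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} = 1 \\
\mathrm{Entropy}(S_1) = \mathrm{Entropy}(S_2) &= -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811 \\
\mathrm{Entropy}(S, A) &= \tfrac{4}{8}(0.811) + \tfrac{4}{8}(0.811) \approx 0.811 \\
\mathrm{Gain}(S, A) &= 1 - 0.811 \approx 0.189
\end{aligned}
```

An attribute that separated the two classes perfectly would instead give Entropy(S,A) = 0 and Gain(S,A) = 1 bit.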

Building on existing decision tree algorithms, ID3 uses information entropy as the primary criterion for selecting node attributes, which improves the relevance of the sample classification. The ID3 algorithm is as follows:

Algorithm ID3

Input: training samples, sample attributes with discrete values, the set of candidate attributes

Output: a decision tree

(1) Initialize the decision tree T and create a node N; if the sample set contains only one class, return N as the root node of the tree. Q denotes the full attribute set.

(2) If all leaf nodes (X', Q') in T satisfy that X' belongs to a single class or Q' is empty, then the algorithm stops;

(3) Else, pick any leaf node (X', Q') that does not satisfy the condition in (2);

(4) For each attribute A in Q', compute the information gain gain(A, X');

(5) Select the attribute B with the highest information gain as the test attribute of node (X', Q');

(6) For each value b_i of B, grow a branch from node (X', Q') representing the test outcome B = b_i; find the subset X_i' of X' whose B value equals b_i, and generate the corresponding leaf node (X_i', Q' - {B});

(7) Go to step (2).

ID3 is a greedy algorithm that builds the decision tree with a top-down, divide-and-conquer recursive method. The recursion terminates when all samples in a node belong to the same class. If no attribute remains to divide the current sample set, the node is forced to become a leaf by majority vote and is labeled with the class that has the most samples.

The advantages of ID3 are thus its simplicity and its strong ability to process sample sets, but it is biased toward attributes with many distinct values, and the attribute with the most values is not necessarily the best. In other words, for practical problems in which many factors must be considered, the traditional ID3 algorithm is deficient in ranking attributes, because attributes with more possible values tend to receive higher information gain.
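The bias can be seen directly from the gain formula. As an illustrative limiting case (an assumption made here for exposition, not an example from the patent), suppose attribute A takes a different value on every sample, like a record identifier, so that each subset S_i contains exactly one sample:

```latex
\begin{aligned}
|S_i| = 1 \;\Rightarrow\; \mathrm{Entropy}(S_i) &= 0 \quad \text{for every } i \\
\mathrm{Entropy}(S, A) &= \sum_{i=1}^{|S|} \frac{1}{|S|}\cdot 0 = 0 \\
\mathrm{Gain}(S, A) &= \mathrm{Entropy}(S) - 0 = \mathrm{Entropy}(S)
\end{aligned}
```

Such an attribute attains the largest possible gain even though it says nothing useful about unseen samples, which is why a correction to the plain gain criterion is needed.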

Therefore, information entropy should not be the only consideration when building a decision tree. Zhang et al. used improved ID3 algorithms to address part of this problem, for example by analyzing the sensitivity of attributes to obtain more reasonable attribute weights, or by using a joint density function based on information entropy to re-evaluate and merge decision tree nodes. Of these two approaches, the sensitivity calculation derives and trains a neural network from the input attribute values, which greatly increases the complexity of the algorithm and makes it inefficient; the analysis based on the joint density function applies only to discrete data and is not necessarily compatible with the data of the present invention.

Summary of the Invention

In view of the deficiencies described above, the present invention provides a road condition information prediction method based on an improved decision tree algorithm.

To achieve this object, the present invention provides a road condition information prediction method based on an improved decision tree algorithm, comprising:

Step 1. Determine the influencing factors of road connectivity. The influencing factors include lane length R_l, number of lanes R_s, lane width R_d, number of traffic lights R_x, average continuous driving length R_l*, number of intersections on the road section L_n, number of connected road sections L_r, whether the road section contains a bus stop B_s, and lane change index L_e;

Step 2. Based on the influencing factors of step 1, determine the attributes used to analyze road connectivity. The attributes include road capacity R_m, lane width R_d, expected waiting time T_s, average number of turns on the road L_g, average continuous driving length R_l*, whether the road section contains a bus stop B_s, and lane change index L_e;

Step 3. Collect road data;

Step 4. Data preprocessing: remove data that does not meet the requirements and repair data;

Step 5. Calculate the information entropy E(S) based on the collected road data;

$$E(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where S is the training sample set of road data, containing samples of n classes denoted C_1, C_2, …, C_n, and p_i is the probability of class C_i;

Step 6. Calculate the attribute entropy E(S,A) of each attribute based on the collected road data;

$$E(S, A) = \sum_{i} \frac{|S_i|}{|S|}\,E(S_i)$$

where S_i is the i-th subset of the training samples divided according to attribute A, the sum runs over all such subsets, and |S| and |S_i| are the numbers of samples in S and S_i, respectively;

Step 7. Calculate the correlation function value CF(A) of each attribute based on the correlation function;

where X_i(m-1) and X_im are specific values of the parameter X_ij, the subscript j indexes each case of attribute A, i indexes each value taken by the data, and n is the total amount of data;

Step 8. Calculate the weight value Wg(A) of each attribute based on its correlation function value CF(A);

where m is the number of attributes, and CF(1), CF(2), …, CF(m) are the correlation function values of the individual attributes;

Step 9. Calculate the information gain Gain'(S,A) of each attribute based on the information entropy E(S), the attribute entropy E(S,A) of each attribute, and the weight value Wg(A) of each attribute;

Gain'(S,A) = (E(S) - E(S,A)) * Wg(A)

Step 10. Sort the attributes by information gain to build a decision tree, and predict road conditions according to the decision tree.

As a further improvement of the present invention, in step 2, the road capacity R_m is:

where α_l is the lane attenuation coefficient.

As a further improvement of the present invention, in step 2, the expected waiting time T_s is:

where T_p is the ratio of the green time of each traffic direction to the total signal cycle time.

As a further improvement of the present invention, in step 2, the average number of turns on the road section L_g is:

As a further improvement of the present invention, in step 4, removing data that does not meet the requirements includes:

removing data whose traffic flow is greater than the maximum limit;

removing data whose road speed is greater than the maximum limit;

removing data whose traffic flow or road speed is negative or empty;

removing data whose road speed is zero but whose traffic flow is not zero;

removing data whose traffic flow is zero but whose road speed is not zero.

As a further improvement of the present invention, in step 4, data repair includes: repairing with data from an adjacent road or an adjacent time.
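The preprocessing rules of step 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (a time-ordered list of `(flow, speed)` pairs for one road), the function names and the limit values are hypothetical, and repair is implemented only in its adjacent-time form by copying the nearest valid record in time.

```python
def is_valid(flow, speed, max_flow, max_speed):
    """Apply the five removal rules to one (flow, speed) observation."""
    if flow is None or speed is None:          # empty values
        return False
    if flow < 0 or speed < 0:                  # negative values
        return False
    if flow > max_flow or speed > max_speed:   # above the maximum limits
        return False
    if (speed == 0) != (flow == 0):            # zero speed with flow, or zero flow with speed
        return False
    return True


def clean_series(records, max_flow, max_speed):
    """Replace each invalid record with the nearest valid record in time.

    Returns a list of the same length; an entry is None only when the
    series contains no valid record at all.
    """
    valid = [is_valid(f, s, max_flow, max_speed) for f, s in records]
    cleaned = []
    for i, rec in enumerate(records):
        if valid[i]:
            cleaned.append(rec)
            continue
        replacement = None
        # Search outward, preferring the earlier neighbour at equal distance.
        for offset in range(1, len(records)):
            for j in (i - offset, i + offset):
                if 0 <= j < len(records) and valid[j]:
                    replacement = records[j]
                    break
            if replacement is not None:
                break
        cleaned.append(replacement)
    return cleaned


# Hypothetical observations: (vehicles per hour, km/h).
records = [(120, 40.0), (-5, 38.0), (0, 25.0), (130, 42.0), (9999, 41.0)]
cleaned = clean_series(records, max_flow=2000, max_speed=120)
print(cleaned)
```

Repair from an adjacent road would follow the same pattern, with the replacement taken from the record of a neighbouring road at the same time index.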

As a further improvement of the present invention, the method further includes, between step 4 and step 5:

Data classification: constructing road connectivity decision trees for congested and free-flowing conditions according to the average traffic density at the inflow end.

As a further improvement of the present invention, in step 10, the attribute with the largest information gain is used as the root node of the decision tree, the attribute with the second largest information gain is used as the second-level node, and so on, until the decision tree is constructed.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention uses a correlation function to evaluate attribute weights more reasonably. The decision tree is built from attribute weight values obtained by combining the correlation function value of each attribute with the information entropy. This not only overcomes the tendency of the traditional ID3 algorithm to select attributes with more possible values as high-weight attributes, but also takes the correlation between attributes into account, so that the weights of the attributes are reflected more comprehensively.

Description of the Drawings

FIG. 1 is a flowchart of the road condition information prediction method based on an improved decision tree algorithm disclosed in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the decision tree for off-peak hours disclosed in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the decision tree for peak hours disclosed in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the accuracy of the influence of single factors on connectivity during off-peak hours disclosed in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the accuracy of the two algorithms in judging connectivity during off-peak hours disclosed in an embodiment of the present invention;

FIG. 6 is a schematic diagram of road prediction accuracy under different road conditions during off-peak hours disclosed in an embodiment of the present invention;

FIG. 7 is a schematic diagram of the accuracy of the influence of single factors on connectivity during peak hours disclosed in an embodiment of the present invention;

FIG. 8 is a schematic diagram of the accuracy of the two algorithms in judging connectivity during peak hours disclosed in an embodiment of the present invention;

FIG. 9 is a schematic diagram of road prediction accuracy under different road conditions during peak hours disclosed in an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

The present invention is described in further detail below with reference to the accompanying drawings:

The present invention provides a road condition information prediction method based on an improved decision tree algorithm, which predicts road conditions using an improved ID3 decision tree algorithm. The invention uses a correlation function to evaluate attribute weights more reasonably. The decision tree is built from attribute weight values obtained by combining the correlation function value of each attribute with the information entropy. This not only overcomes the tendency of the traditional ID3 algorithm to select attributes with more possible values as high-weight attributes, but also takes the correlation between attributes into account, so that the weights of the attributes are reflected more comprehensively. The definition of information gain is adjusted accordingly, based on the correlation function values of the attributes.

First, when attributes are analyzed with the correlation function, the correlation function value of each attribute is computed. Let A be an attribute of data set P, and let B be the class attribute of data set P. The relationship between attribute A and the class attribute B is as follows:

For this relationship, the correlation function is defined as follows:

In this formula, X_i(m-1) and X_im can be regarded as specific values of the general parameter X_ij, which denotes the count of a particular kind of data; the subscript j indexes each possible case of attribute A, i indexes each possible value of the data, and n denotes the total amount of data.

Next, the correlation function values of the attributes are normalized. Assuming there are m attributes whose correlation function values are CF(1), CF(2), …, CF(m), the corresponding attribute weight value can be expressed as:

where 0 < A ≤ m. The weighted information gain of attribute A can then be expressed as:

Gain'(S,A) = (Entropy(S) - Entropy(S,A)) * Wg(A)

When the decision tree is constructed, Gain'(S,A) replaces Gain(S,A) as the new criterion for selecting splitting attributes. This avoids preferring attributes with many values when nodes are selected, and it preserves the fidelity of the attribute weights, which reduces the gap between the information gain and the actual situation and ensures that more influential attributes are selected first while the tree is built.
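The weighting step can be sketched as follows. The text says only that the correlation function values are normalized, so dividing each CF value by the sum of all CF values is an assumption made here, as are the function names and the numeric values, which are hypothetical and not taken from the patent. The gains and CF values are treated as precomputed inputs.

```python
def attribute_weights(cf_values):
    """Normalize correlation function values so that the weights sum to 1.

    Assumption: Wg(A) = CF(A) / (CF(1) + ... + CF(m)).
    """
    total = sum(cf_values.values())
    if total <= 0:
        raise ValueError("correlation function values must have a positive sum")
    return {attr: cf / total for attr, cf in cf_values.items()}


def weighted_gains(gains, cf_values):
    """Gain'(S, A) = Gain(S, A) * Wg(A) for every attribute A."""
    weights = attribute_weights(cf_values)
    return {attr: gains[attr] * weights[attr] for attr in gains}


# Hypothetical plain gains and correlation function values for three attributes.
gains = {"R_m": 0.30, "T_s": 0.20, "B_s": 0.25}
cf_values = {"R_m": 2.0, "T_s": 6.0, "B_s": 2.0}

adjusted = weighted_gains(gains, cf_values)
plain_order = sorted(gains, key=gains.get, reverse=True)
weighted_order = sorted(adjusted, key=adjusted.get, reverse=True)
print(plain_order)     # ranking by Gain
print(weighted_order)  # ranking by Gain'
```

With these made-up numbers the plain gain ranks R_m first, while the weighted gain ranks T_s first, which shows how the weight can change the attribute chosen for the root.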

Specifically:

As shown in FIG. 1, the present invention provides a road condition information prediction method based on an improved decision tree algorithm, comprising:

Step 1. Determine the influencing factors of road connectivity:

To analyze road connectivity quantitatively with a decision tree algorithm, the relevant influencing factors and environmental variables must first be identified. Many factors affect road connectivity, ranging from the original design intent and construction standards of the road to its geographical location and social environment, real-time weather, the dominant direction of traffic flow, driver habits, whether pedestrians obey traffic rules, and whether sudden traffic accidents occur. The present invention focuses on road information: an analysis based on road information meets the scientific requirement that data be objective and free of extraneous factors, and it also makes data collection easier, which ensures the objectivity and validity of the experimental results. The influencing factors include lane length R_l, number of lanes R_s, lane width R_d, number of traffic lights R_x, traffic light duration R_t, average continuous driving length R_l*, number of intersections on the road section L_n, number of connected road sections L_r, whether the road section contains a bus stop B_s, and lane change index L_e. Among them:

1)、车道长度R_l 1), Lane length R _l

车道长度定义了道路承载力，同样也是道路连通力的基本属性之一。车道长度越大，在拥堵时段等候通行的最大车辆数越多。考虑到启动时的车速与间距，等候车辆越多，放行时通过交叉口的车流占总等候车辆的比例就越小，转化为对堵塞道路的缓解能力便相对较弱。由此可见车道长度对道路连通力的影响是十分显著的。Lane length defines the road bearing capacity and is also one of the basic properties of road connectivity. The longer the lane length, the higher the maximum number of vehicles waiting to pass during the congestion period. Considering the speed and distance at the time of starting, the more vehicles waiting, the smaller the proportion of the traffic passing through the intersection to the total waiting vehicles when releasing, which translates into a relatively weak ability to alleviate road congestion. It can be seen that the influence of lane length on road connectivity is very significant.

2) Number of lanes R_s

The number of lanes affects connectivity in much the same way as lane length. As basic attributes of a road section, the two together determine the road's capacity and thereby indirectly affect the strength of its connectivity.

3) Lane width R_d

Because urban road planning and road connectivity goals do not fully coincide, lane widths differ between road sections even within adjacent blocks, which leads to significant differences in traffic speed between road sections in a local area. These speed differences directly affect driving conditions such as starting from rest, normal driving, and passing through intersections. According to the "Code for Design of Urban Roads", the ideal lane width for continuous traffic flow is 3.6 meters or more; when lanes are narrower than this, the average traffic speed on the road section drops noticeably during peak hours.

4) Number of traffic lights R_x

Urban roads carry not only motor vehicles but also pedestrians. Accordingly, traffic lights control traffic flow at intersections, and traffic lights on non-intersection stretches regulate pedestrians crossing the traffic flow. The resulting delays to traffic flow are very common, so the influence of the number of traffic lights on road connectivity is included in this study. Note that traffic lights at intersections reflect the choice of connecting routes and can also serve as reference points for the start and end of a lane, whereas pedestrian traffic lights within a road section obstruct connectivity to some degree. The value recorded here counts only the lights within the road section and excludes the signals at the intersections at either end.

5) Traffic light duration R_t

The phase sequence of traffic signals and the passing time given to each direction vary, and they determine how long a single stream of traffic may pass. Switching signal states frequently makes the start-up phase take up a larger share of the time available for crossing the intersection, which slows the dissipation of traffic flow. The relationship between this attribute and road connectivity is hard to see from traffic light duration alone. In the algorithm implementation stage, the present invention therefore combines the number and duration of traffic lights and uses the expected non-waiting time on a single road as the influencing factor for these two attributes.

6) Average continuous driving length R_l*

A moving vehicle inevitably slows down or stops briefly when it reaches an intersection, which obstructs traffic flow. To account for sustained driving, the average continuous driving length is therefore chosen as one of the factors influencing road connectivity.

7) Number of intersections on the road section L_n

On neighborhood roads, traffic volume is low and there is no congestion outside peak hours, so some intersections have no traffic signals. Some road sections consequently have more intersections than traffic lights. In that case the number of traffic lights alone does not adequately describe the conditions under which traffic enters and leaves the road, so the number of intersections on the road section is introduced as one of the factors that make up the section's connectivity conditions.

8) Number of connected road sections L_r

The number of intersections does more than complement the number of traffic lights; it is also one of the parameters describing how well the current road is connected to other road sections. To refine this degree of connection further, this study introduces the number of connected road sections as an attribute. Intersections in China and abroad are generally cross-shaped or T-shaped, with 3 or 2 connected road sections respectively; if construction work or a temporary restriction prohibits driving in one direction, the value decreases accordingly. The larger this value, the more choices traffic on the current road section has at its outflow end, and the stronger the corresponding road connectivity.

9) Whether the road section contains a bus stop B_s

Besides private cars and taxis, buses are a factor in urban road traffic that cannot be ignored. Because buses are large and must pull in at stops, they often change speed while driving and cause brief local congestion when stopping. These conditions make other vehicles on the road merge, accelerate, and decelerate more often, which in turn slows the dissipation of traffic flow.

10) Lane change index L_e

While driving, drivers inevitably change lanes for various reasons: to turn at the next intersection, because the direction assigned to the current lane by the road markings differs from the intended destination, to overtake, or to avoid a traffic accident. From a research standpoint, the present invention cannot reliably measure drivers' subjective willingness to change lanes or sudden road incidents. It therefore takes the perspective of objective road design and statistically analyzes the lane-changing behavior induced by traffic guidance signs, that is, it determines whether the preset turning direction of the lane a vehicle occupies after entering the road section agrees with the turning direction given by the road markings at the end of the section. For road sections with two or more lanes, if the number of lanes R_s is odd, by default the right-hand [lane-count expression not reproduced] lanes are straight-through lanes and the left-hand [lane-count expression not reproduced] lanes are left-turn lanes; if R_s is even, the right-hand [lane-count expression not reproduced] lanes are straight-through lanes and the left-hand lanes are left-turn lanes. In the field survey, the absolute value of the actual difference between the two is taken as the lane change index of the road section.

Evaluating and quantifying road connectivity is clearly not easy: there are many influencing factors, and they may interact. Having proposed the influencing factors above, it is therefore necessary to consolidate them into an attribute set that the improved ID3 algorithm can work with.

Step 2. Based on the influencing factors of step 1, determine the attributes used to analyze road connectivity. The attributes are road capacity R_m, lane width R_d, expected waiting time T_s, average number of turns on the road section L_g, average continuous driving length R_l*, whether the road section contains a bus stop B_s, and lane change index L_e, where:

From the perspective of the road itself, lane width can serve as a separate criterion independent of lane length and number of lanes, but lane length or number of lanes considered alone is too weak an indicator. This study therefore combines the two and incorporates a lane attenuation coefficient α_l; together the three form the attribute road capacity R_m, which is used for the comprehensive analysis of road connectivity. Road capacity R_m is defined as follows:

Next, the information about traffic signals is processed. The permitted directions and durations differ from signal to signal, so a uniform timing method or an evaluation based on phase sequence would make it hard to balance the data set internally and would undermine the validity of the analysis. The attribute expected waiting time is used instead to reflect, from the driver's point of view, the combined influence of traffic light duration and number on road connectivity. Expected waiting time is the expected value of the time a vehicle on the target road section spends stopped because of traffic signals. For each traffic light on the road section, the share of the total cycle time given to each direction is a fixed value T_p, which is determined by the traffic light duration R_t, so the expected waiting time is

Summing these values over all traffic lights then gives the expected waiting time T_s of the road section, defined as follows:

Similarly, the number of intersections and the number of connected road sections are each too narrow when considered alone, whereas the difference between the two, the average number of turns at an intersection, indicates well whether the outflow end of the current road offers a variety of choices. The average number of turns on the road section L_g is defined as follows:

In summary, this study analyzes the factors affecting road connectivity in terms of four main feature groups: characteristics of the road itself, the distribution and waiting time of traffic signals, attributes of adjacent intersections, and driving conditions. In the subsequent experiments, all attributes are consolidated as follows: the road feature attributes are road capacity R_m and lane width R_d; the traffic signal feature attribute is expected waiting time T_s; the adjacent intersection feature attribute is the average number of turns on the road section L_g; and the driving condition feature attributes are average continuous driving length R_l*, whether the road section contains a bus stop B_s, and lane change index L_e.

Step 3. Collect road data.

Once the factors influencing road connectivity are established, the present invention collects road data and selects the data that meet the experimental conditions for the subsequent experiments. Static information such as road length and number of lanes is gathered through field surveys, literature review, and map surveying. To examine road connectivity, this study collects and compares road speed data over specified unit time periods; from the change in speed on a road section, combined with a speed-flow model, it calculates the change in road flow and thereby qualitatively derives the strength of connectivity.

In recent years, Beijing's road traffic information has been digitized rapidly. Since the Beijing traffic information public release system was established, intersection monitoring and road traffic data collection have both entered everyday use, and common map tools draw on them to display real-time road information. Here, road traffic speeds in southeastern Beijing were collected every 15 minutes during an afternoon period (14:00 to 15:30) and an evening period (17:00 to 18:30) as raw data. Example data are shown in Table 1:

Table 1

Step 4. Data preprocessing.

Because the traffic volume flowing in per unit time differs from one road section to another, an effective screening mechanism must be established in the data preprocessing stage. Traffic data preprocessing has been studied extensively in China and abroad. Building on that work, this study summarizes common problems in the collected road traffic data that affect experimental results and proposes a preprocessing method for road connectivity data, consisting mainly of two stages: data validity analysis and data supplementation.

1) Eliminating unqualified data:

While collecting road condition information, one inevitably encounters sudden road accidents and large-scale traffic jams that push traffic flow and vehicle speed beyond reasonable thresholds. Road data recorded under such conditions contradict the traffic flow speed-density-flow model used in this study, and it is hard to extract from such extreme data any information useful for deriving general rules. The present invention therefore formulates the following data elimination rules to exclude erroneous data and to make the data more useful for mining temporal and spatial associations.
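The elimination rules themselves are enumerated in claim 5. A minimal Python sketch of them as a record filter might look as follows. The names `MAX_FLOW`, `MAX_SPEED`, and `is_valid_record` and the two threshold values are assumptions made for illustration, since the source does not state the limits, and rejecting a record when either field is empty or negative is one reading of the claim.

```python
from typing import Optional

# Hypothetical plausibility limits; the source does not give the thresholds.
MAX_FLOW = 2000.0   # assumed upper bound on traffic flow
MAX_SPEED = 120.0   # assumed upper bound on road speed, km/h


def is_valid_record(flow: Optional[float], speed: Optional[float]) -> bool:
    """Return True if a (flow, speed) observation passes every elimination rule."""
    if flow is None or speed is None:   # empty values
        return False
    if flow < 0 or speed < 0:           # negative values
        return False
    if flow > MAX_FLOW:                 # flow above the maximum limit
        return False
    if speed > MAX_SPEED:               # speed above the maximum limit
        return False
    if speed == 0 and flow != 0:        # zero speed but non-zero flow
        return False
    if flow == 0 and speed != 0:        # zero flow but non-zero speed
        return False
    return True


records = [(800.0, 45.0), (None, 30.0), (-5.0, 20.0), (2500.0, 40.0),
           (500.0, 150.0), (300.0, 0.0), (0.0, 35.0), (0.0, 0.0)]
valid = [r for r in records if is_valid_record(*r)]
```

Only the first and last records survive here; each of the others violates exactly one rule.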

2) Data repair:

For data that have been eliminated and for road condition information that could not be collected accurately, the data missing at sporadic time points must be supplemented and adjusted. After the initial screening, this study repairs the data using data from adjacent roads or adjacent time points. Where data for a location are missing for part of a period, the gap is filled with the average of the road information at the outflow end or inflow end at adjacent time points, and the data are then smoothed. This approach may be used only when the adjacent road is strongly correlated with the road whose data are missing.
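As an illustration of the repair step, the sketch below fills each missing value in a single time series with the mean of its nearest known neighbors in time. It is a simplification under stated assumptions: the function name `repair_series` is hypothetical, missing values are represented as `None`, the series is assumed to come from one road (or from a strongly correlated adjacent road), and the subsequent smoothing step is not shown.

```python
from typing import List, Optional


def repair_series(speeds: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing value with the mean of its nearest known neighbors in time."""
    repaired = list(speeds)
    for i, value in enumerate(speeds):
        if value is not None:
            continue
        # Nearest known observations before and after position i (original data only).
        prev_val = next((v for v in reversed(speeds[:i]) if v is not None), None)
        next_val = next((v for v in speeds[i + 1:] if v is not None), None)
        known = [v for v in (prev_val, next_val) if v is not None]
        if known:
            repaired[i] = sum(known) / len(known)
    return repaired


example = repair_series([40.0, None, 30.0, None])
```

A gap at the end of the series has only one neighbor, so it simply takes that neighbor's value; a series with no known values is returned unchanged.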

The data processed in these two stages, together with attribute parameters derived by calculation, can then be used to generate the decision tree. A sample of the data is shown in Table 2:

Table 2

In Table 2, a value of 1 for attribute B_s means that the road section has a bus stop, and 0 means that it does not. A road condition improvement value of 0 means that the traffic flow density on the road section did not change; the absolute value indicates the span of the change in road conditions, a positive value indicates a change for the better, and a negative value indicates a change for the worse. Because the statistics span many time periods and the changes on different road sections are varied and complex, the most frequently occurring case is taken as the result for a road section when determining its traffic density improvement value.

To simplify this attribute information, attributes with fine-grained numeric values are divided into ranges according to the distribution of their values, as the decision tree algorithm requires. The training set comprises 3090 samples in total, from all 103 roads over 3 days and 10 time periods, of which 2394 are valid. Based on these, the four attributes road capacity R_m, expected waiting time T_s, average number of turns on the road section L_g, and average continuous driving length R_l* are divided as shown in Table 3:

Table 3
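The range division can be sketched in Python as a simple binning function. The boundaries 1812 and 2834 for road capacity R_m are the ones given in step 7; the names `RM_BOUNDS`, `RM_LABELS`, and `discretize` are hypothetical, and the boundaries for the other three attributes are not reproduced in this text, so they are not shown.

```python
from bisect import bisect_left
from typing import List

# Upper bounds of the "small" and "medium" ranges for road capacity R_m (from step 7).
RM_BOUNDS = [1812, 2834]
RM_LABELS = ["small", "medium", "large"]


def discretize(value: float, bounds: List[float], labels: List[str]) -> str:
    """Map a numeric value to a range label; each bound belongs to the lower range."""
    return labels[bisect_left(bounds, value)]


capacity_class = discretize(2000, RM_BOUNDS, RM_LABELS)
```

Using `bisect_left` places a value equal to a boundary in the lower range, which matches the inequalities R_m ≤ 1812 and 1812 < R_m ≤ 2834.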

Step 5. Data classification: construct road connectivity decision trees for congested and free-flowing conditions according to the average traffic density at the inflow end.

For the consolidated attributes affecting road connectivity, the information entropy and the attribute weight are calculated to obtain the improved information gain. The decision tree is then built according to the influence of each attribute as determined from these values. In the collected data, the evening period (17:00 to 18:30) is a peak period and the afternoon period (14:00 to 15:30) is an off-peak period, and the results for the two differ significantly. Two road connectivity decision trees are therefore built, one for peak and one for off-peak hours, to reflect road information under the different traffic pressures of the two periods.
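A small Python sketch of this grouping is given below. It assigns samples to the off-peak or peak group purely by collection time, using the two windows stated in the text; the names `period_of` and `split_by_period` are hypothetical, and a split that also takes inflow-end traffic density into account, as the step title suggests, would need additional data that is not shown here.

```python
from datetime import time
from typing import Any, Dict, List, Optional, Tuple

OFF_PEAK = (time(14, 0), time(15, 30))  # afternoon window from the text
PEAK = (time(17, 0), time(18, 30))      # evening window from the text


def period_of(t: time) -> Optional[str]:
    """Return 'off_peak' or 'peak' for a collection time, or None outside both windows."""
    if OFF_PEAK[0] <= t <= OFF_PEAK[1]:
        return "off_peak"
    if PEAK[0] <= t <= PEAK[1]:
        return "peak"
    return None


def split_by_period(samples: List[Tuple[time, Any]]) -> Dict[str, List[Any]]:
    """Group (time, record) pairs into the two training sets, one per decision tree."""
    groups: Dict[str, List[Any]] = {"off_peak": [], "peak": []}
    for timestamp, record in samples:
        period = period_of(timestamp)
        if period is not None:
            groups[period].append(record)
    return groups


groups = split_by_period([(time(14, 15), "a"), (time(17, 45), "b"), (time(16, 0), "c")])
```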

Step 6. Calculate the information entropy E(S) based on the collected road data.

In the formula, S is the training sample set of road data, which contains samples of n classes denoted C₁, C₂, ..., C_n, and p_i is the probability of class C_i, where:

The 2394 valid samples are divided into two groups, afternoon and evening, and the information entropy of the attributes is calculated for each group separately. The split is not merely by time; it also reflects the average driving speed over all samples and can be understood as a split by congestion level. In the afternoon period the average road load is low and the average congestion at the inflow end is below 30%, whereas in the evening period traffic density is markedly higher and the average speed is lower than in the afternoon. Of the valid samples, 1347 belong to the afternoon period and 1047 to the evening peak period.

First, the information entropy is calculated for the average-number-of-turns attribute in the afternoon period. Of the 1347 training samples, 727 show an improvement in road traffic density and 620 show none. The formula then gives the information entropy:
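The entropy formula is not reproduced in this text. Assuming it is the standard Shannon entropy used by ID3, E(S) = -Σ p_i log2(p_i), with p_i estimated as the share of samples in class C_i, the afternoon value can be checked with a short Python sketch (the function name `entropy` is a hypothetical helper):

```python
import math
from typing import Sequence


def entropy(class_counts: Sequence[int]) -> float:
    """Shannon entropy in bits of a class distribution given as sample counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


# Afternoon period: 727 samples with improved density, 620 without.
e_afternoon = entropy([727, 620])  # approximately 0.9954
```

The value is close to 1 bit because the two classes are nearly balanced.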

Step 7. Calculate the attribute entropy E(S, A) of each attribute based on the collected road data.

Take road capacity R_m as an example:

The attribute entropy is calculated for the road capacity attribute R_m. R_m divides the sample set S into three parts, R_m ≤ 1812, 1812 < R_m ≤ 2834, and R_m > 2834, that is, small, medium, and large capacity. Let S_v denote the set of samples whose attribute value is v. Then |S_small| = 234, |S_medium| = 687, and |S_large| = 426. In S_small, the two class values account for 101 and 133 samples respectively, so the entropy of S_small is:

Similarly, the entropies of S_medium and S_large are calculated to be 0.752 and 0.821 respectively, so the expected information from partitioning S by attribute R_m is:
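The two formulas referred to above are not reproduced in this text. Assuming the usual ID3 form, in which the attribute entropy is the size-weighted average of the subset entropies, E(S, A) = Σ (|S_i| / |S|) E(S_i), consistent with the symbol definitions in claim 1, the numbers for R_m can be worked through in Python. The helper `entropy` and the variable names are hypothetical; the entropies 0.752 and 0.821 are taken from the text as given.

```python
import math
from typing import Sequence


def entropy(class_counts: Sequence[int]) -> float:
    """Shannon entropy in bits of a class distribution given as sample counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


# Subset sizes for the three capacity ranges of R_m in the afternoon data.
sizes = {"small": 234, "medium": 687, "large": 426}

# S_small is computed from its class counts; the other two values come from the text.
e_small = entropy([101, 133])  # approximately 0.9865
subset_entropy = {"small": e_small, "medium": 0.752, "large": 0.821}

total = sum(sizes.values())  # 1347 afternoon samples
e_s_rm = sum(sizes[v] / total * subset_entropy[v] for v in sizes)  # approximately 0.8146
```

Under these assumptions the unweighted gain for R_m is about 0.9954 - 0.8146 = 0.1808 before the attribute weight is applied.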

Similarly, the information gain of the remaining attributes can be obtained in the same way.

Step 8. Calculate the correlation function value CF(A) of each attribute based on the correlation function.

In the formula, X_{i,m-1} and X_{i,m} are specific values of the parameter X_ij, the subscript j indexes the cases of attribute A, i indexes the values taken by the data, and n is the total amount of data, where:

To prevent the distribution of attribute values from biasing the analysis, the weight of each attribute is obtained by calculating its weight function and the corresponding information entropy weight. Taking attribute R_m as an example, the composition of its data is shown in Table 4:

Table 4. Single-factor results for road capacity in the sample data

Similarly, the correlation function values of the remaining six attributes can be calculated, as shown in Table 5:

Table 5. Information entropy weight values of the other attributes

Step 9. Calculate the weight value Wg(A) of each attribute based on the correlation function value CF(A) of each attribute.

Step 10. Calculate the information gain Gain′(S,A) of each attribute based on the information entropy E(S), the attribute entropy E(S,A) of each attribute, and the weight value Wg(A) of each attribute:

Gain′(S,A) = (E(S) - E(S,A)) * Wg(A)
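The weighted gain can be illustrated with a short Python sketch. The weight formula of step 9 is not reproduced in this text; the sketch assumes that Wg(A) is the attribute's correlation function value divided by the sum of the values over all attributes, which is a common normalization and is labeled here as an assumption. The CF values below are made up for illustration, and the names `weights_from_cf` and `weighted_gain` are hypothetical.

```python
from typing import Dict


def weights_from_cf(cf: Dict[str, float]) -> Dict[str, float]:
    """Assumed normalization: Wg(A) = CF(A) / sum of CF over all attributes."""
    total = sum(cf.values())
    return {attr: value / total for attr, value in cf.items()}


def weighted_gain(e_s: float, e_s_a: float, wg: float) -> float:
    """Gain'(S, A) = (E(S) - E(S, A)) * Wg(A), as stated in step 10."""
    return (e_s - e_s_a) * wg


# Made-up correlation function values, for illustration only.
cf = {"R_m": 0.30, "T_s": 0.50, "L_g": 0.20}
wg = weights_from_cf(cf)

# E(S) and E(S, R_m) rounded from the afternoon example in steps 6 and 7.
gain_rm = weighted_gain(0.9954, 0.8146, wg["R_m"])
```

Because the weight multiplies the ordinary information gain, an attribute with many possible values does not rank highly unless its correlation function value is also high.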

Step 11. Rank the attributes by information gain to build the decision tree, and predict road conditions with the decision tree. The decision tree is built as follows:

The attribute with the largest information gain becomes the root node of the decision tree, the attribute with the second largest information gain becomes the second-level node, and so on until the decision tree is complete.
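The construction rule can be sketched in Python as follows. The sketch follows the level-by-level ordering described above, in which the k-th ranked attribute is used at the k-th level, rather than re-ranking attributes within each subset as classical ID3 does. The gain values, attribute range labels, class labels, and the names `build_tree` and `predict` are all hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Sample = Tuple[Dict[str, str], str]  # (discretized attribute values, class label)


def build_tree(samples: List[Sample], ranked_attrs: List[str]) -> Any:
    """Split on ranked_attrs[0] at this level, ranked_attrs[1] one level down, and so on."""
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain, and return a leaf label.
    if len(set(labels)) == 1 or not ranked_attrs:
        return majority
    attr, rest = ranked_attrs[0], ranked_attrs[1:]
    branches: Dict[str, Any] = {}
    for value in sorted({x[attr] for x, _ in samples}):
        subset = [(x, y) for x, y in samples if x[attr] == value]
        branches[value] = build_tree(subset, rest)
    return {"attr": attr, "branches": branches, "default": majority}


def predict(tree: Any, x: Dict[str, str]) -> str:
    """Walk the tree; fall back to the node's majority label for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(x[tree["attr"]], tree["default"])
    return tree


gains = {"L_g": 0.12, "T_s": 0.09, "R_m": 0.05}  # hypothetical weighted gains
ranked = sorted(gains, key=gains.get, reverse=True)
samples = [
    ({"L_g": "high", "T_s": "short", "R_m": "small"}, "improved"),
    ({"L_g": "high", "T_s": "long", "R_m": "large"}, "not improved"),
    ({"L_g": "low", "T_s": "short", "R_m": "medium"}, "not improved"),
    ({"L_g": "low", "T_s": "long", "R_m": "small"}, "not improved"),
]
tree = build_tree(samples, ranked)
```

With these toy samples the root splits on L_g and only the "high" branch needs a second-level split on T_s, since the "low" branch is already pure.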

For off-peak hours, the attribute-weighted ID3 decision tree algorithm and the entropy calculations above yield the decision tree shown in Figure 2. The tree shows that in the afternoon period the attribute with the largest weighted information gain is the average number of turns on the road section, so it is taken as the root node of the whole tree. For each resulting partition of the samples, the attribute with the next largest information gain, expected waiting time, is then selected as the second-level node, and so on until all samples are classified.

For the peak period, by contrast, the collected data yield the decision tree shown in Figure 3.

Using these results, road conditions were predicted for roads under different road conditions; the prediction results are shown in Figures 4 to 9.

The above are only preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A road condition information prediction method based on an improved decision tree algorithm, characterized by comprising:

Step 1. Determine the influencing factors of road connectivity, the influencing factors including lane length R_l, number of lanes R_s, lane width R_d, number of traffic lights R_x, average continuous driving length R_l*, number of intersections on the road section L_n, number of connected road sections L_r, whether the road section contains a bus stop B_s, and lane change index L_e;

Step 2. Based on the influencing factors of step 1, determine the attributes used to analyze road connectivity, the attributes including road capacity R_m, lane width R_d, expected waiting time T_s, average number of turns on the road section L_g, average continuous driving length R_l*, whether the road section contains a bus stop B_s, and lane change index L_e;

Step 3. Collect road data;

Step 4. Data preprocessing: Eliminate unsatisfactory data and repair data;

Step 5: Calculate the information entropy E(S) based on the collected road data;

In the formula: S is the training sample set of road data, which contains samples of n classes denoted C₁, C₂, ..., C_n; p_i is the probability of class C_i;

Step 6: Calculate the attribute entropy E(S, A) of each attribute based on the collected road data;

Where S_i is the i-th subset of the training samples as partitioned by attribute A, |S| and |S_i| are the numbers of samples in S and S_i respectively, and m is the number of subsets into which the training samples are partitioned; attribute A is one of the attributes for analyzing road connectivity determined from the influencing factors of step 1;

Step 7. Calculate the correlation function value CF(A) of each attribute based on the correlation function;

In the formula: X_{i,m-1} and X_{i,m} are specific values of the parameter X_ij, the subscript j indexes the cases of attribute A, i indexes the values taken by the data, n is the total amount of data, m denotes the specific attribute selected in A, and X_ij is the value of each attribute;

Step 8. Calculate the weight value of each attribute based on the correlation function value CF(A) of each attribute;

In the formula: m is the number of attributes, CF(1), CF(2)...CF(m) are the correlation function values of each attribute respectively;

Step 9. Calculate the information gain Gain'(S, A) of each attribute based on the information entropy E(S), the attribute entropy E(S, A) of each attribute, and the weight value Wg(A) of each attribute;

Gain'(S,A)=(E(S)-E(S,A))*Wg(A)

Step 10: Sort and construct a decision tree according to the size of each attribute information gain, and predict road conditions according to the decision tree.

2. The road condition information prediction method based on an improved decision tree algorithm as claimed in claim 1, characterized in that, in step 2, the road capacity R_m is:

where α_l is the lane attenuation coefficient,

is the lane length of the i-th lane.

3. The road condition information prediction method based on an improved decision tree algorithm as claimed in claim 1, characterized in that, in step 2, the expected waiting time T_s is:

In the formula, T _p is the ratio of the time of each traffic direction to the total cycle time.

4. The road condition information prediction method based on an improved decision tree algorithm as claimed in claim 1, characterized in that, in step 2, the average number of turns on the road section L_g is:

5. The road condition information prediction method based on an improved decision tree algorithm as claimed in claim 1, characterized in that, in step 4, eliminating unqualified data comprises:

Excluding data in which the traffic flow is greater than the maximum limit value;

Excluding data in which the road speed is greater than the maximum limit value;

Excluding data with negative or empty traffic flow and road speed;

Excluding data in which the road speed is zero but the traffic flow is not zero;

Excluding data in which the traffic flow is zero but the road speed is not zero.

6. The road condition information prediction method based on an improved decision tree algorithm as claimed in claim 1, characterized in that, in step 4, the data repair comprises: repairing with data from an adjacent road or an adjacent time point.

7. The road condition information prediction method based on an improved decision tree algorithm as claimed in claim 1, characterized in that, between step 4 and step 5, the method further comprises:

Data classification: constructing road connectivity decision trees for congested and free-flowing conditions according to the average traffic density at the inflow end.

8. The road condition information prediction method based on an improved decision tree algorithm as claimed in claim 1, characterized in that, in step 10, the attribute with the largest information gain is used as the root node of the decision tree, the attribute with the second largest information gain is used as the second-level node of the decision tree, and so on until the decision tree is constructed.