CN110544373B - A method of truck warning information extraction and risk identification based on Beidou Internet of Vehicles

Info

Publication number: CN110544373B
Application number: CN201910773932.2A
Authority: CN
Inventors: 杨小宝; 郑留洋; 高自友; 毕军; 闫学东
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2019-08-21
Filing date: 2019-08-21
Publication date: 2020-11-03
Anticipated expiration: 2039-08-21
Also published as: CN110544373A

Abstract

The invention relates to a method for truck warning information extraction and risk identification based on the Beidou Internet of Vehicles, comprising: step 1, obtaining raw data related to vehicle warnings through an Internet of Vehicles on-board terminal equipped with the Beidou positioning system; step 2, preprocessing the raw data; step 3, extracting the key variables of the truck warning information; step 4, clustering vehicle safety risks; step 5, discriminant analysis; and step 6, risk identification. For a given vehicle or driver, the invention yields the warning frequency per unit driving mileage and the warning frequency per unit driving time, and then applies the discriminant function obtained from historical Internet of Vehicles data to identify or predict the risk level of that vehicle or driver.

Description

A method of truck warning information extraction and risk identification based on Beidou Internet of Vehicles

Technical Field

The invention belongs to the field of traffic safety applications and is applied in the transportation industry. Specifically, it is a method for truck warning information extraction and risk identification based on the Beidou Internet of Vehicles.

Background

On April 12, 2019, the Ministry of Transport released the 2018 Statistical Bulletin on the Development of the Transportation Industry. According to the bulletin, China's road freight volume in 2018 was 39.569 billion tons, up 7.3%, and road freight turnover was 7,124.921 billion ton-kilometers, up 6.7%. Road freight volume and turnover have continued to rise in recent years, but road traffic safety problems have become increasingly prominent, and traffic accidents remain a major hazard to the development of China's transportation sector. According to the Traffic Management Bureau of the Ministry of Public Security, China had 194 million automobiles at the end of 2016, of which 13.5177 million were freight vehicles (trucks), or 7.0% of the total. In 2016 there were 50,400 road traffic accidents nationwide in which a truck was at fault, accounting for 30.5% of all at-fault automobile accidents, a share far higher than the share of trucks in the vehicle fleet. Truck traffic safety is therefore particularly important, and research on how to reduce truck accidents is urgently needed.

Traditional road traffic safety research is mostly based on traffic accident data: it builds models of accident frequency or accident severity to reveal the characteristics of traffic accidents and their key influencing factors in terms of people, vehicles, roads, and the environment. With the rapid development of the Internet and computer technology, traffic data collection has become more diverse and intelligent, and applications of Internet of Vehicles technology to traffic safety have developed accordingly. Most existing Internet of Vehicles traffic safety research concerns models, methods, devices, and systems for vehicle warning. For example, patent CN109615879A is a vehicle speed anomaly warning model based on the Internet of Vehicles; patent CN109584630A is a vehicle lane-change warning device and warning method based on the Internet of Vehicles; CN109147279A is a fatigue driving monitoring and warning method and system based on the Internet of Vehicles; and patents CN105869439B, CN108986544A, CN109559559A, and CN109584631A are anti-collision warning methods and systems based on the Internet of Vehicles. However, the existing literature rarely addresses how to use the warning information from truck on-board devices, collected through the Internet of Vehicles, to improve road traffic safety.

Internet of Vehicles technology has broad prospects in transportation. The Ministry of Transport requires intercity freight vehicles to be networked so that truck operation can be supervised and monitored in real time through the Internet of Vehicles, thereby improving road traffic safety. Freight vehicles in many provinces and cities currently use the Beidou Internet of Vehicles system, and trucks are required to install the corresponding safety warning devices. Using Beidou positioning and navigation, sensing, control, and related technologies, the Beidou Internet of Vehicles system builds a network with vehicles as nodes and monitors and stores, in real time, each vehicle's trajectory, speed, time, mileage, and the alarm and warning information raised while driving. The vehicle warning information mainly includes speeding warnings, fatigue driving warnings, collision warnings, and rollover warnings. However, techniques for extracting and using this warning information to improve road traffic safety are still lacking. Warnings mostly occur before a vehicle commits a violation and indicate that some potential risk already exists; their purpose is to prompt drivers to correct their driving behavior and to reduce traffic accidents and violations. Traffic accidents and violations are confirmed dangerous behaviors: their probability of occurrence is relatively low, and whether one occurs is highly random. Warnings, by contrast, are potentially dangerous behaviors: they occur more often, produce more data, and reflect a driver's driving risk more comprehensively and in greater depth. Accordingly, the invention proposes a method for truck warning information extraction and risk identification based on the Beidou Internet of Vehicles. The method obtains raw data related to vehicle warnings through Internet of Vehicles technology, processes the data, and extracts the key variables of the warning information. On that basis it applies clustering and discriminant analysis to the risk level of vehicles/drivers in order to identify high-risk vehicles/drivers accurately, which can encourage drivers to develop good driving habits and improve road traffic safety.

Summary of the Invention

In view of the deficiencies of the prior art, the invention proposes a method for truck warning information extraction and risk identification based on the Beidou Internet of Vehicles.

To achieve the above purpose, the invention adopts the following technical scheme:

A method for truck warning information extraction and risk identification based on the Beidou Internet of Vehicles, comprising the following steps:

Step 1: obtain raw data related to vehicle warnings through an Internet of Vehicles on-board terminal equipped with the Beidou positioning system. The raw data includes mileage information, safety warning information, and status data; the status data includes the vehicle ID, ACC status, upload time, and so on.

Step 2: preprocess the raw data

The raw data is preprocessed and filtered using Python. Before the raw data is analyzed, it must be cleaned and organized to improve its quality. Data cleaning includes filling missing values and identifying outliers and redundant data. Missing values appear mainly as absent attribute values; outliers appear mainly as a single attribute whose value is too large or too small. Taking into account the characteristics of the data stored by the on-board terminal, the raw data is preprocessed as follows:

Step 2.1, handling missing values: in Python, import the os and numpy modules, define the required functions, and run the main function to process the text files stored by the on-board terminal, deleting text files that lack attribute values so that the attributes are complete.

Step 2.2, handling outliers: outliers include abnormal mileage, abnormal warning status, and abnormal upload time.

1) Abnormal mileage: first, traverse all data files and compute each vehicle's daily mileage; second, plot the cumulative distribution of daily mileage and determine the thresholds for excessively large and excessively small values; finally, remove trip records whose daily mileage is too large or too small.

2) Abnormal warning status: compile statistics on the warning duration of each warning bit, compute the duration of each single warning at that bit, and delete warning states that are obviously erroneous.

3) Abnormal upload time: compute the time difference between adjacent upload points and remove record points where the "upload time" does not change between adjacent points or where the difference is less than zero.

Step 2.3, handling redundant data: traverse all of a vehicle's records for the day and delete repeated uploads and days with too little data. Specifically, compare the upload records and delete duplicates, repeating until all data files have been traversed; for days with little data, compute the vehicle's trip duration for the day and delete trip records shorter than 15 min.
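The cleaning rules in steps 2.1 to 2.3 can be illustrated with a minimal sketch. It is not the patent's implementation: the record layout (a dict with the hypothetical keys `vehicle_id`, `upload_time`, `mileage`, `warning_bits`) and the function name `clean_day_records` are assumptions made for illustration, and incomplete records are dropped one by one rather than whole text files being deleted.

```python
from datetime import datetime, timedelta


def clean_day_records(records, min_trip=timedelta(minutes=15)):
    """Clean one vehicle-day of upload records (hypothetical record layout).

    Each record is a dict with the assumed keys 'vehicle_id', 'upload_time'
    (datetime), 'mileage' and 'warning_bits' (a string such as "0100").
    Returns the cleaned list, or [] when the whole day is discarded.
    """
    required = ("vehicle_id", "upload_time", "mileage", "warning_bits")

    # Step 2.1 (simplified): drop records with a missing attribute value.
    complete = [r for r in records if all(r.get(k) is not None for k in required)]

    cleaned, seen, last_time = [], set(), None
    for r in complete:
        key = tuple(r[k] for k in required)
        if key in seen:
            continue  # Step 2.3: repeated upload of the same record
        seen.add(key)
        if last_time is not None and r["upload_time"] <= last_time:
            continue  # Step 2.2 (3): upload time unchanged or going backwards
        cleaned.append(r)
        last_time = r["upload_time"]

    # Step 2.3: discard days whose recorded trip lasts less than 15 minutes.
    if not cleaned or cleaned[-1]["upload_time"] - cleaned[0]["upload_time"] < min_trip:
        return []
    return cleaned
```

The mileage and warning-status checks of step 2.2 are left out here because their thresholds depend on the whole fleet's data rather than on a single day.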

Step 3: extract the key variables of the vehicle warning information

From each vehicle's historical trip data, two key variables of the warning information are extracted: the warning frequency per unit driving mileage and the warning frequency per unit driving time. First, count each vehicle's total warning frequency at the specific warning bits over a period of T days, where T is a positive integer; second, compute each vehicle's total mileage over the T days; third, compute each vehicle's total driving time over the T days; then compute each vehicle's warning frequency per unit driving mileage and per unit driving time. The vehicle ID serves as the unique identifier, and the trip records of the same vehicle ID in different periods are aggregated. The specific steps are:

Step 3.1: count each vehicle's warning frequency at the specific warning bits over T days

Working from the vehicle's trip warning records and using the vehicle ID as the unique identifier, first count the warning frequency of each warning bit for each vehicle within one day; then sum over the selected warning bits to obtain the vehicle's total warning frequency at those bits for that day; finally, sum the daily values for the same vehicle ID over the T days to obtain each vehicle's total warning frequency at the specific warning bits over the period.

Step 3.2: compute each vehicle's total mileage over T days

Driving mileage is the recorded change in the vehicle's odometer reading and reflects the distance travelled. Using the vehicle ID as the unique identifier, the mileage of the same vehicle ID is summed to obtain each vehicle's total mileage over the T days.

Step 3.3: compute each vehicle's total driving time over T days

Driving time excludes time lost to waiting or delays. It is extracted as follows: first compute the vehicle's total trip time, then compute the stopped time; the difference between the two is the driving time. Then, using the vehicle ID as the unique identifier, sum the driving time of the same vehicle ID over the T days to obtain each vehicle's total driving time.

Step 3.4: divide each vehicle's total warning frequency at the specific warning bits over T days (step 3.1) by its total mileage over T days (step 3.2) to obtain the warning frequency per unit driving mileage; divide the same total warning frequency (step 3.1) by the vehicle's total driving time over T days (step 3.3) to obtain the warning frequency per unit driving time.
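As an illustration of steps 3.1 to 3.4, the sketch below aggregates per-day totals by vehicle ID and then forms the two ratios. The input layout (one tuple per vehicle-day), the units (kilometres and hours), and the function name `warning_rates` are assumptions, not part of the patent.

```python
from collections import defaultdict


def warning_rates(daily_rows):
    """Aggregate vehicle-day rows into per-vehicle warning rates.

    daily_rows: iterable of (vehicle_id, warnings, mileage_km, travel_hours),
    one row per vehicle per day (assumed layout).
    Returns {vehicle_id: (warnings per km, warnings per hour)}.
    """
    totals = defaultdict(lambda: [0, 0.0, 0.0])  # warnings, mileage, time
    for vehicle_id, warnings, mileage_km, travel_hours in daily_rows:
        t = totals[vehicle_id]
        t[0] += warnings
        t[1] += mileage_km
        t[2] += travel_hours

    rates = {}
    for vehicle_id, (tw, tm, tt) in totals.items():
        if tm > 0 and tt > 0:  # skip vehicles with no usable exposure
            rates[vehicle_id] = (tw / tm, tw / tt)
    return rates
```

Summing first and dividing once, rather than averaging daily ratios, keeps short trips from dominating a vehicle's rate.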

Step 4: cluster vehicle safety risks

The warning frequency per unit driving mileage and the warning frequency per unit driving time are used as the clustering features for dividing vehicles into risk levels. The two-dimensional data is clustered with the AGNES hierarchical clustering algorithm.

Specifically:

1) Determine the input sample set O = {(WFM₁, WFT₁), (WFM₂, WFT₂), ..., (WFM_n, WFT_n)} and the number of clusters Z, where WFM_i and WFT_i are the warning frequency per unit driving mileage and the warning frequency per unit driving time of vehicle i, i = 1, 2, …, n, and n is the number of samples, that is, the total number of vehicles or drivers.

2) Using a bottom-up clustering strategy, treat each object O_i in the sample set O as its own cluster Φ_i. Compute the distance between every pair of clusters Φ_c and Φ_h (c ≠ h), compare the distances, and merge the two closest clusters Φ_h and Φ_c into a new cluster Φ_v = Φ_h ∪ Φ_c, where c, v, and h are positive integers no greater than n.

3) Cluster distance function

The proximity of two clusters is determined jointly by both clusters. The average distance is used to compute the degree of aggregation between any two clusters, which expresses their similarity:

$$d_{avg}(\Phi_h, \Phi_c) = \frac{1}{|\Phi_h|\,|\Phi_c|} \sum_{G \in \Phi_h} \sum_{Q \in \Phi_c} \mathrm{dist}(G, Q)$$

G = (WFM_g, WFT_g), Q = (WFM_q, WFT_q) (6)

where Φ_h and Φ_c are clusters; |Φ_h| and |Φ_c| are the numbers of elements in Φ_h and Φ_c; G and Q are samples in Φ_h and Φ_c respectively; WFM_g and WFT_g are the warning frequency per unit driving mileage and per unit driving time of vehicle g; WFM_q and WFT_q are the corresponding values for vehicle q; and dist(G, Q) is the Euclidean distance between samples G and Q.

4) Compare the average distances between clusters computed in 3), merge the two closest clusters according to the merging rule, and repeat, updating the cluster partition after each merge.

5) Termination condition

Given the preset number of clusters Z, clustering stops once the number of clusters equals Z, yielding Z risk levels.
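The five clustering steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration of agglomerative clustering with average linkage, written under the assumption that each sample is a `(WFM, WFT)` tuple; the function names are hypothetical, and a production system would more likely use an existing clustering library.

```python
from math import hypot


def average_linkage(cluster_h, cluster_c):
    """Mean Euclidean distance over all cross-cluster pairs of 2-D samples."""
    total = sum(hypot(g[0] - q[0], g[1] - q[1]) for g in cluster_h for q in cluster_c)
    return total / (len(cluster_h) * len(cluster_c))


def agnes(samples, z):
    """Merge the two closest clusters repeatedly until z clusters remain."""
    clusters = [[s] for s in samples]  # every sample starts as its own cluster
    while len(clusters) > z:
        best = None  # (distance, index h, index c) of the closest pair so far
        for h in range(len(clusters)):
            for c in range(h + 1, len(clusters)):
                d = average_linkage(clusters[h], clusters[c])
                if best is None or d < best[0]:
                    best = (d, h, c)
        _, h, c = best
        clusters[h] = clusters[h] + clusters[c]  # merge c into h
        del clusters[c]
    return clusters
```

For example, `agnes(samples, 3)` returns three lists of samples, which can then be ordered by their mean warning frequencies and labelled as risk levels.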

Step 5: discriminant analysis

Step 5.1: determine the class variable and the discriminant variables

Vehicles are assigned risk levels according to the clustering result of step 4: level 1, level 2, …, level Z. The warning frequency per unit driving mileage and the warning frequency per unit driving time serve as the discriminant variables of the discriminant analysis, and the vehicle's risk level serves as its class variable.

Step 5.2: establish the discriminant function. Determine the quantitative relationship between the class variable and the discriminant variables from the sample data, and apply the Fisher criterion to establish the Fisher discriminant function.

Step 6: risk identification

New samples are classified using the established Fisher discriminant function:

1) Compute the center of each class of sample points in Y space.

2) For a new sample, compute its Fisher discriminant function value Y₀, construct the distance function W(Y₀) between Y₀ and each class center, and compute the distance between Y₀ and each class center.

3) Use the distance discrimination method to determine the class to which the sample belongs.
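The three classification steps can be sketched as a nearest-center rule. The sketch makes a simplifying assumption that is not in the patent: it measures plain Euclidean distance in Y space instead of a covariance-weighted distance. The function names and the data layout (a dict from class label to a list of score tuples) are hypothetical.

```python
from math import sqrt


def class_centers(scores_by_class):
    """Mean discriminant score per class.

    scores_by_class: {label: [score tuple, ...]}, each tuple a point in Y space.
    """
    centers = {}
    for label, ys in scores_by_class.items():
        dim = len(ys[0])
        centers[label] = tuple(sum(y[k] for y in ys) / len(ys) for k in range(dim))
    return centers


def classify(y0, centers):
    """Assign y0 to the class whose center is nearest (Euclidean, assumed)."""
    def distance(label):
        return sqrt(sum((a - b) ** 2 for a, b in zip(y0, centers[label])))
    return min(centers, key=distance)
```

With centers computed once from the historical sample, classifying a new vehicle only requires evaluating its discriminant score and one distance per risk level.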

Building on the above scheme, in step 2.2 the threshold for excessively small mileage is determined from the cumulative distribution of daily mileage, as the value below which less than 2% of the overall sample falls; the threshold for excessively large mileage is determined by the 3σ rule from basic statistics.
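One way to read the two thresholds is sketched below. The exact quantile convention and the use of mean plus three population standard deviations as the upper bound are assumptions, since the text only names the 2% share and the 3σ rule; the function names are hypothetical.

```python
from statistics import mean, pstdev


def mileage_bounds(daily_km, low_share=0.02, k=3.0):
    """Return (lower, upper) thresholds for daily mileage.

    lower: the value at the 2% point of the sorted sample (assumed convention).
    upper: mean + k * standard deviation, i.e. the 3-sigma rule for k = 3.
    """
    ordered = sorted(daily_km)
    lower = ordered[int(low_share * len(ordered))]
    upper = mean(daily_km) + k * pstdev(daily_km)
    return lower, upper


def filter_mileage(daily_km):
    """Keep only daily mileage values inside the two thresholds."""
    lower, upper = mileage_bounds(daily_km)
    return [x for x in daily_km if lower <= x <= upper]
```

Because a single extreme value inflates both the mean and the standard deviation, the 3σ bound is fairly permissive; that is a property of the rule itself, not of this sketch.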

Building on the above scheme, in step 3.2 each vehicle's total mileage over the T days is computed as:

$$M_i = \sum_{j=1}^{T} m_{ij}$$

where M_i is the total mileage of vehicle i over the T days and m_ij is the mileage of vehicle i on day j, i = 1, 2, …, n; j = 1, 2, …, T.

Building on the above scheme, in step 3.3 the total trip time is the duration over which the on-board device records, that is, the difference between the time of the day's first record and the time of its last record. The stopped time is found from the mileage: wherever the mileage stays constant over consecutive records the vehicle is stationary, so the stopped time is obtained by locating the runs of records with unchanged mileage and summing their durations.
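The rule above (driving time equals total trip time minus stopped time) can be sketched as follows, assuming each day's records are available as time-ordered `(upload_time, mileage)` pairs. The function name and the input layout are hypothetical.

```python
from datetime import datetime, timedelta


def daily_travel_time(points):
    """Driving time for one vehicle-day.

    points: list of (upload_time, mileage) pairs sorted by upload_time.
    An interval between consecutive records counts as stopped when the
    mileage reading does not change across it.
    """
    total = points[-1][0] - points[0][0]  # first record to last record
    stopped = timedelta(0)
    for (t_prev, m_prev), (t_next, m_next) in zip(points, points[1:]):
        if m_next == m_prev:
            stopped += t_next - t_prev
    return total - stopped
```

The result is a `timedelta`; dividing it by `timedelta(hours=1)` gives hours for use in the per-unit-time warning frequency.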

Building on the above scheme, in step 5.2 the Fisher discriminant function is:

y = b₁x₁ + b₂x₂ + … + b_p x_p (8)

where x_α are the discriminant variables, b_α are the discriminant coefficients, α = 1, 2, …, p, and y is the sample's coordinate along one dimension of the low-dimensional Y space.
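To make the discriminant coefficients concrete, the sketch below computes them for the simplest case: two classes and p = 2 discriminant variables. It uses the standard two-class Fisher direction, the inverse of the within-class scatter matrix applied to the difference of the class means. Restricting to two classes is an assumption made for brevity (the patent works with Z risk levels), and the function names are hypothetical.

```python
def fisher_coefficients(class_a, class_b):
    """Two-class Fisher direction (b1, b2) for 2-D samples.

    Solves S_w * b = (mean_a - mean_b), where S_w is the pooled
    within-class scatter matrix. Raises ZeroDivisionError if S_w is singular.
    """
    def mean2(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    ma, mb = mean2(class_a), mean2(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sxx, sxy, syy = (sa[i] + sb[i] for i in range(3))
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Closed-form inverse of the symmetric 2x2 scatter matrix.
    b1 = (syy * dx - sxy * dy) / det
    b2 = (-sxy * dx + sxx * dy) / det
    return b1, b2


def discriminant_score(b, x):
    """Evaluate y = b1*x1 + b2*x2 for one sample x = (x1, x2)."""
    return b[0] * x[0] + b[1] * x[1]
```

The coefficients are defined only up to a positive scale factor; what matters for classification is the ordering of the scores along the projected axis.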

Building on the above scheme, in step 6 the distance function W(Y₀) is:

$$W(Y_0) = (Y_0 - \bar{Y})^{\mathsf{T}} \, \Sigma^{-1} \, (\bar{Y}_e - \bar{Y}_f)$$

where $\bar{Y}_e$ and $\bar{Y}_f$ are the center points of the class-e and class-f samples in Y space, $\bar{Y}$ is the center point of all class-e and class-f samples in Y space, and $\Sigma^{-1}$ is the inverse of the covariance matrix of classes e and f; e = 1, 2, …, Z, f = 1, 2, …, Z, e ≠ f.

When W(Y₀) > 0, the new sample point belongs to class e.
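The sign rule above can be illustrated for a one-dimensional Y space. The sketch assumes the usual linear form of the two-class distance discriminant, W(Y₀) = (Y₀ − Ȳ) Σ⁻¹ (Ȳ_e − Ȳ_f), with Ȳ taken as the midpoint of the two class centers and Σ estimated as the pooled variance. Both choices, and the function name `w_score`, are assumptions for illustration rather than details confirmed by the source.

```python
def w_score(y0, ys_e, ys_f):
    """Two-class distance discriminant in a 1-D Y space (assumed form).

    ys_e, ys_f: discriminant scores of the historical samples in class e and f.
    A positive result assigns y0 to class e, a negative one to class f.
    """
    n_e, n_f = len(ys_e), len(ys_f)
    mean_e = sum(ys_e) / n_e
    mean_f = sum(ys_f) / n_f
    midpoint = (mean_e + mean_f) / 2  # assumed reading of the joint center
    pooled_var = (
        sum((y - mean_e) ** 2 for y in ys_e) + sum((y - mean_f) ** 2 for y in ys_f)
    ) / (n_e + n_f - 2)
    return (y0 - midpoint) * (mean_e - mean_f) / pooled_var
```

With Z risk levels, the pairwise function would be evaluated for each pair (e, f), and a sample assigned to class e when it wins against every other class.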

Brief Description of the Drawings

The invention has the following drawings:

Figure 1 is a flow chart of the invention;

Figure 2 is a flow chart of data outlier screening;

Figure 3 shows the clustering result for vehicle risk;

Figure 4 shows the spatial distribution of sample points under the discriminant function.

Detailed Description

The invention is described in further detail below with reference to Figures 1-4.

Step 1: data acquisition.

The vehicle ID, upload time, vehicle safety warning status, and mileage records are obtained through the Internet of Vehicles on-board terminal equipped with the Beidou positioning system.

Step 2: data preprocessing

Step 2.1: traverse all data files and delete trip records that do not meet the requirements: delete text files lacking attribute values so that the attributes are complete, and delete records uploaded repeatedly by the on-board terminal.

Step 2.2: traverse all data files and compute each vehicle's daily mileage, the cumulative daily warning frequency at each warning bit, and the time difference between the day's start and end points. From the cumulative distribution of mileage over all vehicles, determine the thresholds for excessively large and excessively small mileage, and remove records whose daily mileage is too large or too small (the lower threshold is set at the value below which less than 2% of the overall sample of daily mileage falls; the upper threshold is defined by the 3σ rule from basic statistics). In addition, remove trip records whose start-to-end time difference for the day is less than 0.

Step 3: extract the key variables of truck trips

Step 3.1: count each vehicle's warning frequency in period T. Taking the vehicle's daily trip log as the unit, warning frequency is counted separately by warning type (speeding warning, fatigue driving warning, collision warning, and so on) to obtain the daily warning frequency at each warning bit; the selected warning bits are then summed as required to give the day's total warning frequency at the selected bits. Suppose the Internet of Vehicles data records, for each vehicle in one day, S pieces of warning information of length L, where L is the total number of warning bits, that is, the total number of warning types the on-board device can raise. The warning information can then be expressed as:

$$A = (a_{sl})_{S \times L}$$

where a_sl is the entry for warning bit l in record s, s = 1, 2, …, S, l = 1, 2, …, L.

The warning frequency a_l of a vehicle at the l-th warning bit in one day is then:

$$a_l = \sum_{s=1}^{S} a_{sl}$$

which is the warning frequency at bit l in the vehicle's trip records for that day. The warning frequencies of the selected warning bits are then summed to obtain each vehicle's total warning frequency at the selected bits for that day. If the first three warning bits are selected, the vehicle's total warning frequency at the selected bits in one day is

$$a_1 + a_2 + a_3$$

Finally, the daily warning frequency (at the selected warning bits) of the same vehicle ID is summed over the T days to obtain the vehicle's total warning frequency (at the selected warning bits) over the T days.
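The per-bit counting in step 3.1 can be sketched as follows. The sketch assumes each of the S daily records is a string of L characters, `'1'` where the warning bit is set and `'0'` otherwise, and that every record with a set bit counts as one warning; the source does not spell out either convention, and the function names are hypothetical.

```python
def daily_bit_counts(records):
    """Count, for each warning bit, how many of the day's records have it set.

    records: list of S equal-length strings of '0'/'1' (assumed encoding).
    Returns a list of L counts, one per warning bit.
    """
    length = len(records[0])
    return [sum(1 for r in records if r[bit] == "1") for bit in range(length)]


def selected_total(counts, selected_bits):
    """Sum the daily counts over the chosen warning bits (0-based indices)."""
    return sum(counts[bit] for bit in selected_bits)
```

If a warning that persists across several consecutive records should count only once, the counting rule would instead track 0-to-1 transitions per bit.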

Step 3.2: compute each vehicle's total mileage in period T. Driving mileage is the recorded change in the vehicle's odometer reading and reflects the distance travelled. Specifically, using the vehicle ID as the unique identifier, the daily mileage of the same vehicle ID is summed to obtain each vehicle's total mileage in period T. The daily mileage is computed from the vehicle's "mileage" attribute field as the difference between the mileage reading at the last record of the day and the reading at the first record of the day.

Step 3.3: compute each vehicle's total driving time over T days. Driving time excludes time lost to waiting or delays. It is extracted as follows: first compute the vehicle's total trip time, then compute the stopped time; the difference between the two is the driving time. 1) Total trip time: the duration over which the on-board device records, that is, the difference between the time of the day's first record and the time of its last record. 2) Stopped time: wherever the mileage stays constant over consecutive records the vehicle is stationary, so the stopped time is obtained by locating the runs of records with unchanged mileage and summing their durations. 3) Daily driving time: the total trip time minus the stopped time. 4) Total driving time over T days: as with total mileage, the vehicle ID is used as the key and the daily driving times of the same vehicle ID are summed.

Step 3.4: warning frequency per unit driving mileage and per unit driving time. The warning frequency per unit driving mileage (WFM) and the warning frequency per unit driving time (WFT) are computed from each vehicle's total mileage, total driving time, and total warning frequency at the selected warning bits:

WFM = TW/TM (3)

WFT = TW/TT (4)

where WFM (Warning Frequency per Mileage) is the warning frequency per unit driving mileage, WFT (Warning Frequency per Travel Time) is the warning frequency per unit driving time, TW (Total Warning Frequency) is the total number of warnings, TM (Total Mileage) is the total mileage, and TT (Total Travel Time) is the total driving time.
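As a worked example of formulas (3) and (4), consider a hypothetical vehicle with 120 warnings at the selected bits, 6,000 km of total mileage, and 100 h of total driving time over the period. The numbers and the units (kilometres and hours) are illustrative assumptions, not data from the patent.

```latex
\begin{aligned}
\mathrm{WFM} &= \frac{TW}{TM} = \frac{120}{6000\ \mathrm{km}} = 0.02\ \text{warnings per km},\\
\mathrm{WFT} &= \frac{TW}{TT} = \frac{120}{100\ \mathrm{h}} = 1.2\ \text{warnings per hour}.
\end{aligned}
```

The pair (0.02, 1.2) would then be this vehicle's sample point in the two-dimensional clustering of step 4.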

步骤4，车辆安全风险的聚类，基于AGNES(AGglomerative NESting)层次聚类算法对样本车辆划分预警等级。具体有：1)确定输入样本集O＝{(WFM₁，WFT₁),(WFM₂，WFT₂),...,(WFM_n，WFT_n)}以及聚类数目Z，其中(WFM_i，WFT_i)(i＝1,2,…,n)分别代表是车辆i单位行驶里程下的预警频次和单位行驶时间下的预警频次。2)采用自底向上的聚类策略，以样本集O中每个对象O_i作为一个样本簇Φ_i，计算任意两个样本簇Φ_c和Φ_h之间的距离(c≠h)比较各个距离，寻找距离最近的两个样本簇Φ_h、Φ_c作为新的样本簇的集合，Φ_v＝Φ_h∪Φ_c。3)聚类簇距离度量函数。其中两个簇之间的邻近度大小，由两个簇共同决定，本次采用平均距离(又称average-linkage法)计算任意两个样本簇之间的聚集度，用来表示两个样本簇相似度度量方式。Step 4: Clustering of vehicle safety risks, based on the AGNES (AGglomerative NESting) hierarchical clustering algorithm, the sample vehicles are classified into early warning levels. Specifically: 1) Determine the input sample set O={(WFM ₁ , WFT ₁ ), (WFM ₂ , WFT ₂ ),...,(WFM _n , WFT _n )} and the number of clusters Z, where (WFM _i , WFT _i ) (i=1, 2, . . . , n) respectively represent the warning frequency under the unit driving mileage of vehicle i and the warning frequency under the unit driving time. 2) Using the bottom-up clustering strategy, take each object O _i in the sample set O as a sample cluster Φ _i , calculate the distance (c≠h) between any two sample clusters Φ _c and Φ _h , and compare each Distance, find the two sample clusters Φ _h and Φ _c with the closest distance as a new set of sample clusters, Φ _v =Φ _h ∪Φ _c . 3) Clustering cluster distance metric function. The size of the proximity between the two clusters is jointly determined by the two clusters. This time, the average distance (also known as the average-linkage method) is used to calculate the aggregation degree between any two sample clusters, which is used to represent the two sample clusters. similarity measure.

In the formula, Φ_h and Φ_c each denote a sample cluster; |Φ_h| and |Φ_c| denote the number of elements in Φ_h and Φ_c; G and Q denote a sample in Φ_h and in Φ_c, respectively; and dist(G, Q) denotes the Euclidean distance between samples G and Q.
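The average-linkage distance described in item 3) is conventionally written as shown below. The symbol d_avg is assumed notation, and the coordinates of G and Q follow the sample definition G = (WFM_g, WFT_g), Q = (WFM_q, WFT_q) given in the claims.

```latex
d_{\mathrm{avg}}(\Phi_h,\Phi_c)
  = \frac{1}{|\Phi_h|\,|\Phi_c|}
    \sum_{G\in\Phi_h}\sum_{Q\in\Phi_c}\operatorname{dist}(G,Q),
\qquad
\operatorname{dist}(G,Q)
  = \sqrt{(WFM_g-WFM_q)^2+(WFT_g-WFT_q)^2}
```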

4) Compare the average distances between the sample clusters calculated in 3), merge the two closest clusters according to the merging rule, and repeat: each merge forms a new cluster and the cluster partition is recomputed. For example, if the distance between clusters Φ₁ and Φ₂ is the smallest among all pairs of distinct clusters, Φ₁ and Φ₂ are merged into a new cluster. 5) Termination check. When the number of clusters equals the preset value Z, clustering stops, yielding Z risk levels.
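Items 1) to 5) can be sketched as a naive agglomerative loop. The code below is an illustration under stated assumptions rather than the patented implementation: the function names and the sample (WFM, WFT) pairs are hypothetical, and the pairwise search is deliberately simple, so it suits small sample sets only.

```python
import math


def average_linkage(cluster_a, cluster_b):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    total = sum(math.dist(g, q) for g in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agnes(samples, z):
    """Merge the two closest clusters until only z clusters remain."""
    if not 1 <= z <= len(samples):
        raise ValueError("z must be between 1 and the number of samples")
    clusters = [[tuple(s)] for s in samples]  # every sample starts as its own cluster
    while len(clusters) > z:
        best = None
        for h in range(len(clusters)):
            for c in range(h + 1, len(clusters)):
                d = average_linkage(clusters[h], clusters[c])
                if best is None or d < best[0]:
                    best = (d, h, c)
        _, h, c = best
        clusters[h] = clusters[h] + clusters[c]  # merged cluster replaces cluster h
        del clusters[c]                          # c > h, so index h is unaffected
    return clusters


# Hypothetical (WFM, WFT) pairs, for illustration only.
demo = [(0.1, 0.5), (0.2, 0.6), (1.0, 3.0), (1.1, 3.2), (2.5, 7.0), (2.6, 7.4)]
for cluster in agnes(demo, 3):
    print(cluster)
```

The loop stops as soon as the cluster count reaches Z, which mirrors the termination check in item 5).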

Step 5: discriminant analysis

Step 5.1: mean test and test of homogeneity of covariance matrices. For discriminant analysis to work well, the means of each discriminant variable must differ significantly across the class populations; otherwise the probability of a wrong discriminant result is high. A test of the population means is therefore usually carried out first, that is, a test of whether the between-group differences of the discriminant variables across the class populations are significant.
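One common way to run such a mean test on a single discriminant variable is a one-way ANOVA F statistic. The sketch below illustrates that choice under stated assumptions: the text does not name the test statistic it uses, and the function name and sample values are hypothetical.

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA for a single discriminant variable.

    `groups` is a list of lists, one list of observed values per risk class.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and more samples than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("within-group variation is zero; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical WFT values grouped by risk class, for illustration only.
wft_by_class = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(one_way_f(wft_by_class))
```

A large F value indicates that the class means differ by much more than the spread within each class, which is the situation the mean test looks for.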

The basic idea of Fisher discriminant analysis is to project first and then discriminate; the projection is the key step. Following the principle of maximizing between-class scatter and minimizing within-class scatter, the high-dimensional data points are projected onto a lower-dimensional space so that the classes are separated as far as possible. Sample points in the p-dimensional X space are projected into an r-dimensional Y space (r ≤ p). The Fisher discriminant function has the following form:

Here the coefficient b_α is called the discriminant coefficient. It expresses the influence of each input variable on the discriminant function and can be determined by the principle of maximizing the between-group deviation and minimizing the within-group deviation. y is one dimension of the sample in the low-dimensional Y space.

By transforming the coordinate system of the original data, sample points in the high-dimensional space are mapped into the low-dimensional space, and this coordinate transformation separates the sample points of the different populations as far as possible. To discriminate, first calculate the center of each class of sample points in Y space. For a new sample, calculate its Fisher discriminant function value Y₀ and the distance in Y space between Y₀ and each class center, then determine its class by the distance discriminant method (Mahalanobis distance), constructing the distance function W(Y₀) between Y₀ and the class centers,

where the formula involves the center points of the class-e and class-f samples in Y space, the center point of all class-e and class-f samples in Y space, and ∑^-1, the inverse of the covariance matrix of classes e and f.
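A standard two-class distance discriminant that matches these definitions can be written as follows. The notation for the class centers and the pooled center is assumed here, and taking the pooled center as the midpoint of the two class centers is the usual textbook convention rather than something the text states.

```latex
W(Y_0) = \left(Y_0 - \bar{Y}\right)^{\top} \Sigma^{-1}
         \left(\bar{Y}^{(e)} - \bar{Y}^{(f)}\right),
\qquad
\bar{Y} = \tfrac{1}{2}\left(\bar{Y}^{(e)} + \bar{Y}^{(f)}\right)
```

Under this form, W(Y₀) > 0 means that Y₀ is closer to the class-e center than to the class-f center in Mahalanobis distance, which agrees with the decision rule stated in claim 6.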

Step 5.2: determine the discriminant factors. According to the clustering result of Step 4, the vehicle risk levels are level 1, level 2, …, level Z. The two key variables of the warning information (the warning frequency per unit of driving mileage and the warning frequency per unit of driving time) are selected as the discriminant factors, or discriminant variables, and the vehicle risk level is used as the class variable of the discriminant analysis.

Step 5.3: establish the discriminant function. The risk level determined in Step 4 is used as the class variable of the discriminant analysis, and the warning frequency per unit of driving mileage and per unit of driving time are used as the discriminant variables. The quantitative relationship between the class variable and the discriminant variables is determined from the existing sample data, and the discriminant criterion is established using the Fisher discriminant function;

In Step 5.3, the Fisher discriminant analysis described above can be carried out with the discriminant analysis procedure in SPSS to obtain the discriminant function.

Step 5.4: check the model results. How well the model explains the data is judged from the frequency percentages of the confusion matrix, or from the distribution and position of the sample points in the Fisher discriminant function space.
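The confusion-matrix check of Step 5.4 can be sketched as a row-normalised table that compares the clustered risk level of each vehicle with the level assigned by the discriminant function. The function name and the label lists below are hypothetical and serve only as an illustration.

```python
def confusion_percentages(actual, predicted, classes):
    """Row-normalised confusion matrix, in percent.

    Row = actual class, column = predicted class.
    """
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    table = {}
    for a in classes:
        row_total = sum(counts[a].values())
        table[a] = {p: (100.0 * counts[a][p] / row_total if row_total else 0.0)
                    for p in classes}
    return table


# Hypothetical clustered (actual) and discriminant (predicted) risk levels.
actual = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
predicted = [1, 1, 1, 2, 2, 2, 2, 3, 3, 2]
table = confusion_percentages(actual, predicted, classes=[1, 2, 3])
print(table[1][1], table[2][2])
```

The diagonal entries give the share of each class that the discriminant function reproduces correctly; values near 100 on the diagonal indicate a good fit.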

Step 6: risk identification. The discriminant function is used to determine and predict the unknown class of new data. For each sample point in the new data (a vehicle or driver), the risk is identified on the basis of the discriminant function.

(1) Calculate the center of each class of sample points in Y space; (2) for a new sample, calculate its Fisher discriminant function value and its distance in Y space from each class center; (3) determine its class by distance discrimination.

Given the warning frequency per unit of driving mileage and per unit of driving time of a particular vehicle or driver, the present invention can apply the discriminant function obtained from historical data to identify or predict that vehicle's or driver's risk.

1 Introduction to the case data

The data used in the present invention are Internet of Vehicles data from September 2017 to October 2017 provided by an enterprise. They contain dynamic data items (warning flags, driving mileage and upload time) and static data items such as the vehicle ID (terminal number). The raw data were first preprocessed and screened: of the 11,139 raw records, 10,039 valid records remained after screening and processing. Records of the same vehicle ID on different days were then merged, with each ID treated as one sample, giving 862 vehicle samples for vehicle/driver risk assessment. The format of the resulting data is shown in Table 1:

The variables are defined as follows:

Column 1, ID, is the vehicle code, which uniquely identifies a vehicle. Different vehicles have different IDs, and each ID corresponds to exactly one vehicle.

Column 2, WFM, is the vehicle's warning frequency per unit of driving mileage (per 10 km).

Column 3, WFT, is the vehicle's warning frequency per unit of driving time (per hour).

Table 1 Basic format of the organized sample data (excerpt)

2 Cluster analysis of the case data

The data are arranged in the format shown in Table 1, and clustering is performed on the second column (warning frequency per unit of driving mileage) and the third column (warning frequency per unit of driving time). The data of Table 1 are imported first and the preset Z value (Z = 3) is entered. The distances between sample points are calculated with formulas (5) and (6), and clustering by the average distance yields the clustering result. Part of the result is shown in Table 2. The first three columns have the same meaning as in Table 1; the last column, RL (Risk Level), is the clustering result and marks the risk level of the vehicle.

Table 2 Clustering results of the sample data

Figure 3 shows the result of the cluster analysis of the sample data. As the figure shows, the sample points fall into three classes. The larger the WFM and WFT values, the higher the vehicle's warning frequency and hence the higher its risk level. When the samples are clustered into three classes on WFM and WFT, the risk levels can therefore be defined as 1 (safe), 2 (moderate) and 3 (dangerous). The sample points marked with circles have small WFM and WFT values, so this class has the lowest risk level and is in the safe state. The diamonds mark sample points with large WFM and WFT values; they have the highest risk level and are in the dangerous state. The triangles mark sample points whose risk level is moderate.

3 Establishing the discriminant function from the clustering results

The vehicle risk level determined in the previous step is used as the class variable of the discriminant analysis, and the warning frequency per unit of driving mileage and per unit of driving time are used as the discriminant variables. The quantitative relationship between the class variable and the discriminant variables is determined from the existing data, and the discriminant function is established using the Fisher discriminant criterion.

(1) Determine the discriminant factors

The two key variables of the warning information, namely the warning frequency per unit of driving mileage and the warning frequency per unit of driving time, are selected as the discriminant factors (discriminant variables).

(2) Establish the discriminant function

A discriminant function is established to identify and judge risky vehicles, based on each vehicle's driving risk indicators and risk warning level. With the vehicle risk level as the class variable and the warning frequency per unit of driving mileage and per unit of driving time as the discriminant variables, a Fisher discriminant function is built to analyze the risk intensity of each vehicle, thereby achieving vehicle risk assessment. Here the risk discriminant function is established by the Fisher discriminant method to determine the risk level.

1. The established discriminant function

Table 3 Discriminant function coefficients

The discriminant function obtained is:

In the formula: x₁ is WFM (the warning frequency per unit of driving mileage); x₂ is WFT (the warning frequency per unit of driving time).

2. Calculate the class center positions

The class center positions of the three classes of sample points are calculated as follows:

Table 4 Function values at the class centers

3. Test of discriminating ability

To test whether the projection of the Fisher discriminant functions separates the classes well, and to judge which discriminant function contributes more to explaining the discriminant result, the two eigenvalues, the percentage of variance explained and the cumulative percentage of variance explained need to be calculated.
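For canonical discriminant functions, the share of variance attributed to each function is its eigenvalue divided by the sum of all eigenvalues. The sketch below shows that arithmetic only; the eigenvalues are hypothetical placeholders rather than the values of Table 5, and `variance_shares` is an illustrative name.

```python
def variance_shares(eigenvalues):
    """Percentage of variance carried by each discriminant function,
    plus the running cumulative percentage."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    shares = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for s in shares:
        running += s
        cumulative.append(running)
    return shares, cumulative


# Hypothetical eigenvalues; the second is negligible next to the first.
shares, cumulative = variance_shares([24.0, 0.001])
print([round(s, 1) for s in shares])
```

When one eigenvalue dominates in this way, the remaining functions add almost nothing and a single discriminant function is enough.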

Table 5 Summary of calculation results

The first discriminant function explains 100% of the variance and the second explains 0%, so the second discriminant function can be omitted. The final discriminant function is:

y = -1.019 - 2.619x₁ + 5.426x₂ (10)

A new sample point is substituted into the Fisher discriminant function, its distance from each class center is calculated, and its class is determined by distance discrimination.

Suppose data acquisition yields a new sample (travel behavior data of a new vehicle or driver, or the most recent month of travel behavior data of an existing sample) with the following values: the warning frequency per unit of driving mileage, WFM, is 0.735 warnings per 10 km, and the warning frequency per unit of driving time, WFT, is 3.101 warnings per hour.

Take the new sample (0.735, 3.101) as an example. Substituting it into discriminant function (10) gives 13.882, that is, the new sample maps to the value 13.882 in the one-dimensional sample space Y. Because the final discriminant function is the single function of formula (10), function 1 of Table 4 gives the values to which the class centers map in the one-dimensional space Y: -0.700, 10.893 and 27.826. The distances between the mapped new sample point and each class center are then calculated for the distance discrimination of formula (8). Because there is only a single discriminant function, they can be computed directly as signed differences: 14.582, 2.988 and -13.946. The smallest in absolute value is the second, so the mapped new sample point is closest to the center of the second class, and the new sample is assigned to the second risk class, that is, its risk level is moderate.
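The worked example can be reproduced with the short sketch below. It hard-codes the coefficients of formula (10) and the three class-center values quoted in the text; the function names are hypothetical, and using the absolute difference as the distance assumes a single discriminant dimension with the same within-class variance for every class.

```python
def discriminant_score(wfm, wft):
    """Formula (10): y = -1.019 - 2.619*x1 + 5.426*x2."""
    return -1.019 - 2.619 * wfm + 5.426 * wft


# Class centers on the first discriminant function, as quoted from Table 4.
CLASS_CENTERS = {1: -0.700, 2: 10.893, 3: 27.826}


def classify(wfm, wft):
    """Return (score, nearest risk class) using absolute distance in 1-D Y space."""
    y0 = discriminant_score(wfm, wft)
    nearest = min(CLASS_CENTERS, key=lambda k: abs(y0 - CLASS_CENTERS[k]))
    return y0, nearest


score, risk_class = classify(0.735, 3.101)
print(round(score, 3), risk_class)
```

Comparing absolute differences rather than signed ones makes the nearest-center rule explicit, since a negative difference only indicates that the sample lies below that class center.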

(4) Analysis of results

The distribution and position of the sample points in the Fisher discriminant function space are shown in Figure 4. The points of each class are fairly concentrated, so the discrimination works well.

Content not described in detail in this specification belongs to the prior art known to those skilled in the art.

Claims

1. A method for extracting truck early-warning information and identifying risk based on the Beidou Internet of Vehicles, characterized in that it comprises the following steps:

step 1, obtaining original data related to vehicle early warning through an Internet of Vehicles vehicle-mounted terminal equipped with a Beidou positioning system, wherein the original data comprise mileage information, safety early-warning information and state data, and the state data comprise the vehicle ID, the ACC state and the upload time;

step 2, preprocessing the original data;

preprocessing and screening the original data using Python; before the original data are analyzed they must be cleaned and organized to improve data quality; data cleaning comprises filling missing values and identifying abnormal values and redundant data; in view of how the vehicle-mounted terminal stores data, the original data are preprocessed in the following specific steps:

step 2.1, handling missing values: programming in Python, importing the os module and the numpy module, defining the required functions, executing the main function, operating on the text files stored by the vehicle-mounted terminal, and deleting text files that lack attribute values, thereby ensuring attribute completeness;

step 2.2, handling abnormal values: the abnormal values comprise abnormal mileage, abnormal early-warning states and abnormal upload times;

1) mileage anomaly handling: first, traversing all data files and calculating each vehicle's driving mileage for the day; second, plotting the cumulative distribution of daily driving mileage and determining the excessively large and excessively small values; finally, removing travel records whose daily driving mileage is excessively large or excessively small;

2) early-warning state anomaly handling: counting the warning duration of each early-warning bit, obtaining the duration of a single warning for that bit, and deleting early-warning states that are obviously erroneous;

3) upload time anomaly handling: calculating the time difference between adjacent upload points, and removing record points whose adjacent upload time difference is zero or negative;

step 2.3, handling redundant data: traversing all records of a vehicle's trips on the same day, and deleting repeatedly uploaded records and days with too few records, specifically: comparing the uploaded records and deleting repeated uploads, repeating until all data files have been traversed; and, for days with too few records, calculating the vehicle's travel time for the day and deleting travel records shorter than 15 min;

step 3, extracting key variables of the vehicle early warning information;

according to the historical travel data of the vehicles, extracting two key variables from the vehicle driving early-warning information: the warning frequency per unit of driving mileage and the warning frequency per unit of driving time; first, counting the total number of warnings of the specific early-warning bits of each vehicle over a period of T days, where T is a positive integer; second, counting the total driving mileage of each vehicle over the T days; third, counting the total driving time of each vehicle over the T days; then calculating each vehicle's warning frequency per unit of driving mileage and per unit of driving time; the vehicle ID is used as the unique identifier to aggregate the travel records of the same vehicle over different periods; the specific steps are as follows:

step 3.1, counting the number of warnings of the specific early-warning bits of each vehicle over the period of T days;

taking the vehicles' travel warning records as the object and the vehicle ID as the unique identifier, first counting the number of warnings of each early-warning bit of each vehicle in one day, then summing over the selected specific early-warning bits to obtain each vehicle's total number of warnings for those bits in one day, and finally summing the daily totals of the same vehicle ID over the period T to obtain each vehicle's total number of warnings for the specific early-warning bits over the period T;

step 3.2, counting the total driving mileage of each vehicle within the time period T days;

the driving mileage records the change in the mileage shown on the vehicle's instrument panel and reflects the distance driven; with the vehicle ID as the unique identifier, the driving mileage of the same vehicle ID is accumulated to obtain each vehicle's total driving mileage over the period of T days;

step 3.3, counting the total driving time of each vehicle in the period T days;

the vehicle driving time excludes time lost to waiting or delay, and is extracted as follows: first calculating the vehicle's total trip time, then calculating the parking time; the difference between the total trip time and the parking time is the vehicle's driving time; then, with the vehicle ID as the unique identifier, accumulating the driving time of the same vehicle ID over the period T to obtain each vehicle's total driving time over the period T;

step 3.4, dividing each vehicle's total number of warnings for the specific early-warning bits over the T days obtained in step 3.1 by its total driving mileage over the T days obtained in step 3.2 to obtain the warning frequency per unit of driving mileage; and dividing the same total number of warnings by the total driving time over the T days obtained in step 3.3 to obtain the warning frequency per unit of driving time;

step 4, clustering the vehicle safety risks;

taking the early warning frequency of the unit driving mileage of the vehicle and the early warning frequency of the unit driving time of the vehicle as clustering objects, and dividing risk grades; clustering the two-dimensional data based on an AGNES hierarchical clustering algorithm;

the method specifically comprises the following steps:

1) determining an input sample set O = {(WFM₁, WFT₁), (WFM₂, WFT₂), ..., (WFM_n, WFT_n)} and the number of clusters Z, where WFM_i and WFT_i respectively represent the warning frequency per unit of driving mileage and the warning frequency per unit of driving time of vehicle i, i = 1, 2, …, n, and n is the number of samples, namely the total number of vehicles or drivers;

2) adopting a bottom-up clustering strategy, taking each object O_i in the sample set O as a sample cluster Φ_i, calculating the distance between any two sample clusters Φ_c and Φ_h, where c ≠ h, comparing the distances, and finding the two closest sample clusters Φ_h and Φ_c to form a new sample cluster Φ_v = Φ_h ∪ Φ_c, where c, v and h are positive integers not exceeding n;

3) clustering cluster distance metric function;

the proximity degree between the two clusters is determined by the two clusters together, the average distance is adopted to calculate the aggregation degree between any two sample clusters, and the aggregation degree is used for representing the similarity degree of the two sample clusters;

G＝(WFM_g,WFT_g),Q＝(WFM_q,WFT_q) (6)

in the formula: Φ_h and Φ_c each denote a sample cluster; |Φ_h| and |Φ_c| denote the number of elements in Φ_h and Φ_c respectively; G and Q denote a sample in Φ_h and in Φ_c respectively; WFM_g and WFT_g denote the warning frequency per unit of driving mileage and per unit of driving time of vehicle g; WFM_q and WFT_q denote the warning frequency per unit of driving mileage and per unit of driving time of vehicle q; and dist(G, Q) denotes the Euclidean distance between the two samples G and Q;

4) comparing the average distance between each sample cluster obtained by calculation in the step 3), merging two clusters with the closest distance based on a clustering merging principle, continuously updating and merging to form a new cluster, and performing cluster division again;

5) judging a termination condition;

according to the set number of clusters Z, when the number of clusters equals Z, no further clustering is needed and clustering terminates, yielding Z risk levels;

step 5, discriminant analysis;

step 5.1, determining a category variable and a discrimination variable;

carrying out risk grade division according to the risk degree of the vehicle, dividing the risk grade into 1 grade, 2 grade and … Z grade according to the clustering result in the step 4, taking the early warning frequency of the unit driving mileage of the vehicle and the early warning frequency of the unit driving time of the vehicle as discrimination variables of the discrimination analysis, and taking the risk grade of the vehicle as a category variable of the discrimination analysis;

step 5.2, establishing a discrimination function, namely determining the quantitative relation between the category variable and the discrimination variable according to the sample data, and establishing the Fisher discrimination function by using a Fisher discrimination criterion;

step 6, risk identification;

classifying and distinguishing the new samples according to the established Fisher distinguishing function;

(1) calculating the center of each class of sample points in Y space; (2) for a new sample, calculating its Fisher discriminant function value Y₀, constructing the distance function W(Y₀) between Y₀ and each class center, and calculating the distance between Y₀ and each class center; (3) determining the class to which the sample belongs by the distance discriminant method.

2. The method for extracting early warning information and identifying risks of trucks based on the Beidou Internet of vehicles as claimed in claim 1, wherein in step 2.2 the excessively small mileage values are determined from the cumulative distribution of daily driving mileage, at less than 2% of the total sample; and the excessively large mileage values are determined by the 3σ rule of basic statistics.

3. The method for extracting early warning information and identifying risks of trucks based on the Beidou Internet of vehicles as claimed in claim 1, wherein in step 3.2 the total driving mileage of each truck over the period T is calculated as:

$$M_i = \sum_{j=1}^{T} m_{ij}$$

where M_i is the total driving mileage of vehicle i over the period of T days; m_ij is the driving mileage of vehicle i on day j; i = 1, 2, …, n; and j = 1, 2, …, T.

4. The method for extracting early warning information and identifying risks of trucks based on the Beidou Internet of vehicles as claimed in claim 1, wherein in step 3.3 the total trip time of the vehicle is the time difference between the start of the vehicle's records for the day and the end of its records for the day; and the parking time is obtained by finding runs of consecutive records in which the mileage does not change and summing the duration of those runs.

5. The method for extracting early warning information and identifying risks of trucks based on the Beidou Internet of vehicles as claimed in claim 1, wherein in step 5.2, the Fisher discriminant function is as follows:

y = b₁x₁ + b₂x₂ + … + b_px_p (8)

in the formula, x₁, x₂, ..., x_p are the discriminant variables; b_α is the discriminant coefficient, α = 1, 2, …, p; and y is one dimension of the sample in the low-dimensional Y space.

6. The method for extracting early warning information and identifying risks of trucks based on Beidou Internet of vehicles according to claim 5, wherein in step 6 the distance function W(Y₀) is as follows:

in the formula, the quantities involved are the center points of the class-e and class-f samples in Y space, the center point of all class-e and class-f samples in Y space, and ∑^-1, the inverse of the covariance matrix of classes e and f, where e = 1, 2, …, Z, f = 1, 2, …, Z, and e ≠ f;

when W(Y₀) > 0, the new sample point belongs to class e.