CN111553811A

CN111553811A - A method for identifying leakage areas in water supply network based on iterative machine learning

Info

Publication number: CN111553811A
Application number: CN202010369142.0A
Authority: CN
Inventors: 冯新; 陈京钰
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2020-05-02
Filing date: 2020-05-02
Publication date: 2020-08-18
Anticipated expiration: 2040-05-02
Also published as: CN111553811B

Abstract

A method for identifying leakage areas in a water supply network based on iterative machine learning, belonging to the technical field of water supply network leak detection. In each iteration, one of the previously identified leakage regions is selected and split into two clusters by k-means clustering; every combination type of leaking nodes is used as a class label of a random forest classifier model; leakage coefficients are then randomly assigned to the nodes of the leakage regions according to these combination types to generate leakage samples, and the generated samples are used to train the classifier model. Feature selection is performed during training to reduce the features required for model training. The leakage feature is input into the trained classifier model, which outputs the identified leakage regions and the number of leaking nodes in each; these steps are repeated until the final identification accuracy falls below 95%, at which point the iteration ends. The method applies a single-label classifier to the identification of multi-leak regions in a water supply network for the first time, is simple to operate, and gives good identification results.

Description

A method for identifying leakage areas in water supply network based on iterative machine learning

Technical Field

The invention relates to a method for identifying leakage areas in a water supply network based on iterative machine learning, and belongs to the technical field of water supply network leak detection.

Background

Water supply networks are important public infrastructure and play an important role in economic development and daily life. Because of aging pipes and poor design, these networks leak to varying degrees. Developed countries lose about 15% of their treated water from distribution systems each year, while developing countries lose 35% and in some cases as much as 60%. Accurately locating leaks is therefore one of the key problems in controlling the leakage rate of a water supply network.

With the arrival of the big-data era, using data mining to locate leaks in water supply networks has become an active research topic. Leak localization methods based on hydraulic models fall into two categories: (1) approximately locating the leaking node by exploiting the similarity of leakage features among neighboring nodes; and (2) dividing the network into regions in advance according to leakage features, and then identifying the leaking region. Direct node localization first generates leakage samples of the network. Because some nodes have similar leakage features, the identification accuracy of the model drops, and two remedies are commonly used: merging similar nodes in advance to reduce the number of indistinguishable nodes, or adding an extended-period mode, that is, adding time steps to increase the amount of leakage information. Direct region localization clusters similar nodes in advance to reduce the number of similar nodes, and uses the leakage regions as classifier labels to identify the leaking region.

An analysis of previous methods shows that clustering nodes with similar leakage features can improve the identification accuracy of the classifier. However, when clustering is applied to a water supply network, the number of clusters is set directly by the researcher without a theoretical basis, and only a single leaking node is considered during localization. In real networks, multiple leaks often occur at the same time. To address this, the present invention adopts an iterative machine learning method that combines k-means clustering (k = 2) with a random forest classifier, which provides a theoretical basis for the number of clusters and also offers a solution for identifying multiple leakage regions. For simultaneous leaks, all combination types of leaking nodes in the identified regions are analyzed in each iteration, and each combination type is used as a class label of the random forest classifier. The trained random forest classifier serves as the leakage region identification model, locating the leakage regions and identifying the number of leaks in each.

Summary of the Invention

The purpose of the present invention is to provide a method for identifying leakage areas in a water supply network based on iterative machine learning, so as to resolve the uncertainty in the number of clusters when clustering is applied to a water supply network and to provide a solution for identifying regions with multiple leaks.

The technical scheme adopted by the present invention is as follows. A hydraulic model of the water supply network is established according to the principle of flow and pressure balance. Assume that l nodes leak simultaneously and that the leakage feature (the difference between the sensor values before and after the leak) is ΔS_l. The same leakage coefficient C is applied to each node of an already identified leakage region to generate a leakage change matrix, and k-means clustering splits the region into two clusters based on this matrix. Leakage events are then generated randomly by hydraulic model simulation, and the simulation results serve as training samples for the classifier. A feature selection model is trained on the samples: the importance of each feature is computed by Mean Decrease Accuracy (MDA), and the features are ranked by importance and screened. Each leakage combination is used as a class label of the classifier model, and a random forest classifier is trained on the selected features. If the final identification accuracy of the model (Acc, the product of the identification accuracies of the trained models over the iterations so far) is greater than 95%, the leakage feature ΔS_l is input into the trained random forest classifier and the next iteration is performed; if the final identification accuracy is less than 95%, the iteration stops. The identified leakage regions and the number of leaking nodes in each region are output.

The iterative machine learning-based method for identifying leakage areas in a water supply network is carried out through the following steps:

(1) l nodes leak simultaneously, and the leakage feature is ΔS_l;

(2) The leakage regions identified in iteration (β-1) are $Z_1^{\beta-1}, Z_2^{\beta-1}, \ldots, Z_w^{\beta-1}$, and region $Z_i^{\beta-1}$ contains $l_i^{\beta-1}$ leaking nodes, i = 1, 2, ..., w;

(3) One leakage region $Z_s^{\beta-1}$, containing $l_s^{\beta-1}$ leaking nodes, is selected from the w leakage regions; the same leakage coefficient C is applied to each node in $Z_s^{\beta-1}$ to generate the leakage change matrix of that region;

(4) Based on the leakage change matrix, k-means clustering splits the selected leakage region $Z_s^{\beta-1}$ into two clusters, region $Z_{s1}^{\beta}$ and region $Z_{s2}^{\beta}$; the remaining (w-1) unclustered regions and the number of leaking nodes they contain are unchanged, so iteration β has (w+1) regions in total;

(5) The combination types of leaking nodes for iteration β are generated. For the (w-1) unclustered regions, the number of leaking nodes inside each region is unchanged. For the two clustered regions $Z_{s1}^{\beta}$ and $Z_{s2}^{\beta}$, which together contain the $l_s^{\beta-1}$ leaking nodes of the selected region, there are $l_s^{\beta-1}+1$ possible combinations: 0 leaking nodes in $Z_{s1}^{\beta}$ and $l_s^{\beta-1}$ leaking nodes in $Z_{s2}^{\beta}$; 1 leaking node in $Z_{s1}^{\beta}$ and $l_s^{\beta-1}-1$ leaking nodes in $Z_{s2}^{\beta}$; ...; $l_s^{\beta-1}$ leaking nodes in $Z_{s1}^{\beta}$ and 0 leaking nodes in $Z_{s2}^{\beta}$. This iteration therefore has $l_s^{\beta-1}+1$ different labels;

(6) The leakage samples for iteration β are generated. According to the combination type, $a$ nodes are selected at random from clustered region $Z_{s1}^{\beta}$ and $l_s^{\beta-1}-a$ nodes from clustered region $Z_{s2}^{\beta}$, with $0 \le a \le l_s^{\beta-1}$; for each of the (w-1) unclustered regions $Z_i^{\beta-1}$, $l_i^{\beta-1}$ nodes are selected at random. This yields l simultaneously leaking nodes, recorded as one leakage sample. A total of ε different leakage samples is generated. The set of leakage samples is denoted T, and the total number of features is N_PQ, from N_P pressure sensors and N_Q flow sensors, where N_PQ = N_P + N_Q;

(7) The hydraulic simulation results of the leakage events are used as training samples for the classifier. MDA is used to compute the importance of each feature of the classifier model; the features are then ranked by importance and removed in order from least to most important, subject to the rule that the accuracy of the trained model must not decrease. Computing MDA consists of two parts: training the random forest classifier, and computing the mean decrease in accuracy of each feature;

The random forest classifier is trained as follows:

(a) For iteration β, the regions identified in iteration (β-1) are $Z_1^{\beta-1}, Z_2^{\beta-1}, \ldots, Z_w^{\beta-1}$, and identified leakage region $Z_i^{\beta-1}$ contains $l_i^{\beta-1}$ leaking nodes, i = 1, 2, ..., w, with

$$\sum_{i=1}^{w} l_i^{\beta-1} = l,$$

where l is the total number of leaking nodes. Based on the leakage change matrix of the selected leakage region $Z_s^{\beta-1}$, that region is clustered into two parts, region $Z_{s1}^{\beta}$ and region $Z_{s2}^{\beta}$, so the leakage regions of iteration β are the (w-1) unclustered regions together with $Z_{s1}^{\beta}$ and $Z_{s2}^{\beta}$. Regions $Z_{s1}^{\beta}$ and $Z_{s2}^{\beta}$, which together contain $l_s^{\beta-1}$ leaking nodes, admit $l_s^{\beta-1}+1$ leakage combinations; the number of leaking nodes in the other (w-1) unclustered regions is unchanged, so this iteration has $l_s^{\beta-1}+1$ random forest classifier labels. The combination types of leaking nodes are: 0 leaking nodes in $Z_{s1}^{\beta}$ and $l_s^{\beta-1}$ in $Z_{s2}^{\beta}$; 1 leaking node in $Z_{s1}^{\beta}$ and $l_s^{\beta-1}-1$ in $Z_{s2}^{\beta}$; ...; $l_s^{\beta-1}$ leaking nodes in $Z_{s1}^{\beta}$ and 0 in $Z_{s2}^{\beta}$. Then $a$ nodes are selected at random from region $Z_{s1}^{\beta}$, $l_s^{\beta-1}-a$ nodes from region $Z_{s2}^{\beta}$, and $l_i^{\beta-1}$ nodes from each unclustered region $Z_i^{\beta-1}$, where $0 \le a \le l_s^{\beta-1}$ and $i \ne s$. This yields l simultaneously leaking nodes, recorded as one leakage sample. Assume there are ε different leakage samples; the set of leakage samples is denoted T, and the total number of features is N_PQ, from N_P pressure sensors and N_Q flow sensors, N_PQ = N_P + N_Q;

(b) Training subsets T_tr1, T_tr2, ..., T_trM are created by sampling with replacement from the leakage samples, where M is the number of classification trees. Each training subset T_tri is drawn at random from the leakage sample set T and contains ε samples, the same size as T, so a training subset contains duplicate samples. The samples in T that are not drawn for a given tree are called out-of-bag (OOB) samples and are used to evaluate the accuracy of that tree. Because the accuracy of a random forest classifier increases with the number of classification trees and converges to a constant, the default value M = 500 is chosen;

(c) For each node of the classification tree grown on training subset T_tri, 1 ≤ i ≤ 500, m features are selected at random from the N_PQ features, 1 ≤ m ≤ N_PQ; the default value of m is $\sqrt{N_{PQ}}$. These features are used to compute the Gini index of each feature. Given a training subset T_tri and continuous features N_D, D = 1, 2, ..., m, let T_tri contain f classes of samples, h = 1, 2, ..., f, with |f_h| samples in class h. Let feature N_D take r distinct values; these values are sorted in ascending order and denoted R = {R^1, R^2, ..., R^r}. A split point t divides T_tri on feature N_D into two subsets $T_{tri}^{-}$ and $T_{tri}^{+}$, where $T_{tri}^{-}$ contains the samples whose value is not greater than t and $T_{tri}^{+}$ contains the samples whose value is greater than t. For adjacent values R^e and R^(e+1), every t in [R^e, R^(e+1)) gives the same split, so there are (r-1) candidate split points. The Gini index of training subset T_tri at split point t of feature N_D is

$$\mathrm{Gini}(T_{tri}, N_D, t) = \frac{|T_{tri}^{-}|}{|T_{tri}|}\,\mathrm{Gini}(T_{tri}^{-}) + \frac{|T_{tri}^{+}|}{|T_{tri}|}\,\mathrm{Gini}(T_{tri}^{+}), \qquad \mathrm{Gini}(T) = \sum_{h=1}^{f} p_h (1 - p_h) = 1 - \sum_{h=1}^{f} p_h^{2},$$

where $p_h$ is the probability that a sample belongs to class h, and $\mathrm{Gini}(T)$ is the probability that a randomly selected sample is misclassified. The larger the Gini index, the more likely a sample is to be misclassified. The split point t with the smallest Gini index, and the corresponding feature, are chosen for the split; each branch is then built by repeating the above process;

(d) Each classification tree contributes to the identification accuracy on training subset T_tri, and the identification accuracy of the trained model is the average identification accuracy of the 500 classification trees.
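The split search described in sub-step (c) can be sketched as follows. This is a minimal illustration for a single continuous feature, not the patent's implementation: the function names `gini` and `best_split` are hypothetical, and candidate thresholds are taken as midpoints between adjacent distinct values, which is one common convention and an assumption here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_h p_h^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (t, weighted_gini) for the best threshold on one feature.

    Candidate thresholds are midpoints between adjacent distinct values
    (an assumption), which gives r - 1 candidates for r distinct values.
    Returns None when the feature has a single distinct value.
    """
    distinct = sorted(set(values))
    n = len(values)
    best = None
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

In a full tree builder this search would be run over each of the m randomly chosen features at a node, keeping the feature and threshold with the smallest weighted Gini index.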

MDA is computed as follows:

For each classification tree T_tri of the random forest classifier model, 1 ≤ i ≤ 500, the OOB error is computed on the OOB data and denoted OOB_b1. The values of feature F in the out-of-bag samples, 1 ≤ F ≤ N_PQ, are then randomly permuted and the out-of-bag error is computed again, denoted OOB_b2. With 500 trees in the forest, the importance of feature F is expressed as:

$$\mathrm{MDA}(F) = \frac{1}{500} \sum_{b=1}^{500} \left( OOB_{b2} - OOB_{b1} \right)$$
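The permutation step behind MDA can be sketched in a few lines. This is a hedged illustration only: each tree is represented by a hypothetical triple `(predict, oob_rows, oob_labels)`, where `predict` is any callable that maps a feature row to a class label, and the names `oob_error` and `mean_decrease_accuracy` are not taken from the source.

```python
import random


def oob_error(predict, rows, labels):
    """Fraction of rows that `predict` classifies incorrectly."""
    wrong = sum(1 for row, label in zip(rows, labels) if predict(row) != label)
    return wrong / len(rows)


def mean_decrease_accuracy(trees, feature, rng):
    """Average rise in OOB error after shuffling one feature column.

    trees: list of (predict, oob_rows, oob_labels), one triple per tree,
           with each row given as a list of feature values.
    feature: index of the feature whose importance is measured.
    rng: a random.Random instance used for the shuffle.
    """
    total = 0.0
    for predict, rows, labels in trees:
        before = oob_error(predict, rows, labels)      # OOB_b1
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:feature] + [value] + row[feature + 1:]
                    for row, value in zip(rows, column)]
        after = oob_error(predict, shuffled, labels)   # OOB_b2
        total += after - before
    return total / len(trees)
```

A feature that no tree relies on produces an importance of exactly zero, because shuffling it cannot change any prediction; such features are the first candidates for removal.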

(8) The combination types of leaking nodes generated in step (5) are used as the class labels of the classifier model, and the random forest classifier is trained on the selected features. The final identification accuracy of iteration β, Acc^β, is the product of the identification accuracies of the trained models over the iterations so far:

$$Acc^{\beta} = \prod_{j=1}^{\beta} acc^{j},$$

where $acc^{j}$ is the identification accuracy of the model trained in iteration j. If Acc^β is greater than 95%, the leakage feature is input into the trained random forest classifier and the next iteration is performed; if the final identification accuracy is less than 95%, the iteration stops. The identified leakage regions and the number of leaking nodes in each leakage region are output.
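Steps (5) and (6) above, enumerating the class labels and drawing the leaking nodes for one training sample, can be sketched as follows. The function names and the representation of regions as lists of node ids are assumptions made for illustration; the hydraulic simulation that turns the chosen nodes into sensor features is not shown.

```python
import random


def combination_labels(n_leaks):
    """All ways to split n_leaks leaking nodes between the two new clusters.

    Returns n_leaks + 1 pairs (a, b) with a + b == n_leaks.
    """
    return [(a, n_leaks - a) for a in range(n_leaks + 1)]


def draw_leak_nodes(cluster1, cluster2, label, other_regions, rng):
    """Pick the leaking nodes for one training sample.

    cluster1, cluster2: lists of node ids in the two clustered regions.
    label: (a, b), the number of leaking nodes placed in each cluster.
    other_regions: list of (node_ids, leak_count) for unclustered regions.
    rng: a random.Random instance.
    """
    a, b = label
    nodes = rng.sample(cluster1, a) + rng.sample(cluster2, b)
    for node_ids, leak_count in other_regions:
        nodes += rng.sample(node_ids, leak_count)
    return nodes
```

For two simultaneous leaks inside the selected region, `combination_labels(2)` gives the three labels (0, 2), (1, 1) and (2, 0). Note that `rng.sample` raises `ValueError` if a label asks for more nodes than a cluster contains, so a caller would need to skip labels that a small cluster cannot satisfy.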

The present invention has the following beneficial effects:

In this iterative machine learning-based method for identifying leakage areas in a water supply network, each iteration selects one of the previously identified leakage regions and splits it into two clusters by k-means clustering; all combination types of leaking nodes are used as labels of the random forest classifier model; leakage coefficients are then randomly assigned to the nodes of the leakage regions according to these combination types to generate leakage samples, and the generated samples are used to train the classifier model. Feature selection during training reduces the features required for model training. The leakage feature is input into the trained classifier model, which outputs the identified leakage regions and the number of leaking nodes they contain; these steps are repeated until the final identification accuracy falls below 95%, at which point the iteration ends and the identified leakage regions and their leaking-node counts are output. For single-node leaks, two-way clustering resolves the uncertainty in the number of clusters that arises when clustering is applied to leak detection in water supply networks, provides a theoretical basis for the number of clusters, and reduces the number of trial computations required. Combining clustering with a single-label classifier for the identification and localization of multi-leak regions makes it possible to identify the leakage regions and the number of leaking nodes in each.

Description of Drawings

Figure 1 is a flow chart of the method of the present invention.

Figure 2 is the hydraulic model of the water supply network.

Figure 3 shows the locations of the pressure and flow measurement points.

Figure 4 shows the training process of the random forest classifier.

Figure 5 shows the identification process for the leakage region containing leaking node 101.

Figure 6 shows the region identified for a single-node leak: (a) the leakage region determined by the traditional method; (b) the leakage region determined by the iterative machine learning method.

Figure 7 shows the identification process for the region containing leaking nodes 65 and 281.

Figure 8 shows the identification process for the region containing leaking nodes 132 and 406.

Figure 9 shows the regions identified for two simultaneous leaks: (a) the leakage region containing nodes 65 and 281; (b) the leakage regions containing nodes 132 and 406.

Detailed Description

The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings.

As shown in Figure 1, a hydraulic model of the water supply network is established according to the principle of flow and pressure balance. Assume that l nodes leak simultaneously and that the leakage feature (the difference between the sensor values before and after the leak) is ΔS_l. The same leakage coefficient C is applied to each node of an already identified leakage region to generate a leakage change matrix, and k-means clustering splits the region into two clusters based on this matrix. Leakage events are then generated randomly by hydraulic model simulation, and the simulation results serve as training samples for the classifier. A feature selection model is trained on the samples, and the features are ranked by importance and screened. Each leakage combination is used as a class label of the classifier model, and a random forest classifier is trained on the selected features. If the final identification accuracy of the model (Acc, the product of the identification accuracies of the trained models over the iterations so far) is greater than 95%, the leakage feature ΔS_l is input into the trained random forest classifier, the leakage regions and the number of leaking nodes in each region are output, and the next iteration is performed; if the final identification accuracy is less than 95%, the iteration stops.

Example

Step 1: A hydraulic model of the water supply network is established according to the principle of flow and pressure balance; the present invention uses EPANET to build the model. As shown in Figure 2, the network consists of one reservoir (1), one tank (2), 375 nodes (3) and 469 pipe sections (4). As shown in Figure 3, the measurement points comprise 3 flow meters (Q1-Q3) and 18 pressure measurement points (P1-P18). The sensor measurements are assumed to be corrupted by zero-mean Gaussian error of uniform amplitude, equal to 1.5% of the mean residual of the sensor samples. The base water demand is 148 L/s, the maximum demand is 162.8 L/s, and the minimum demand is 118.4 L/s. A leak that is too small causes no change in the sensor readings, while a leak that is too large is noticed by residents first, so the present invention selects leakage coefficients between 0.5 and 1.2.

Step 2: The present invention assumes two leakage types in the water supply network: a single leaking node, and two nodes leaking simultaneously.

Step 3: When a leak alarm occurs in the water supply network, the initial iteration starts from the entire network.

Step 4: The leakage regions are clustered, features are selected, and the multi-level random forest classifier model is trained; the random forest training model is shown in Figure 4.
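The random forest training in step 4 relies on sampling with replacement, with the samples not drawn for a tree kept aside as its out-of-bag (OOB) set. A minimal sketch of that draw is given below; the function name `bootstrap_split` is hypothetical, and the tree fitting itself is omitted.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement for one tree.

    Returns (in_bag, out_of_bag): the drawn indices, duplicates included,
    and the indices that were never drawn.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag
```

Repeating this draw once per tree gives each tree its own training subset and its own OOB set. For large sample sets roughly a third of the samples end up out of bag for any given tree, since the chance of a sample never being drawn approaches 1/e.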

Single-node leak

When a 4 L/s leak occurs at node 101, only one node of the leakage region identified in the previous iteration is selected at a time to generate each leakage sample.

One of the already identified leakage regions is selected, and a leakage coefficient of C = 0.8 is applied to the nodes in that region to generate the leakage change matrix. K-means is then used to cluster the leakage region into two parts.
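The two-way clustering step can be sketched with a plain Lloyd's-algorithm k-means for k = 2. This is an illustration under stated assumptions rather than the patent's code: each row of the leakage change matrix is assumed to hold the sensor response of one node, the function name `two_means` is hypothetical, and the deterministic choice of starting centers is a simplification of the usual random initialization.

```python
def two_means(rows, max_iter=100):
    """Split rows (one feature vector per node) into two clusters.

    Returns a list of 0/1 cluster assignments, one per row.
    """
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def mean(points):
        return [sum(col) / len(points) for col in zip(*points)]

    # Deterministic start (an assumption): the first row and the row
    # farthest from it serve as the two initial centers.
    c0 = list(rows[0])
    c1 = list(max(rows, key=lambda r: dist2(r, c0)))
    assign = None
    for _ in range(max_iter):
        new_assign = [0 if dist2(r, c0) <= dist2(r, c1) else 1 for r in rows]
        if new_assign == assign:
            break  # assignments are stable, so the centers are too
        assign = new_assign
        group0 = [r for r, a in zip(rows, assign) if a == 0]
        group1 = [r for r, a in zip(rows, assign) if a == 1]
        if group0:
            c0 = mean(group0)
        if group1:
            c1 = mean(group1)
    return assign
```

Nodes assigned 0 and nodes assigned 1 form the two candidate sub-regions passed to the next iteration.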

In each iteration, 30 leakage samples are generated for every node of the leakage region identified in the previous iteration, with leakage coefficients drawn at random from the leakage coefficient range. In the first iteration the region is the whole network of 375 nodes, which gives 11250 independent leakage samples. As the iterations proceed, the candidate leakage region shrinks and the number of leakage samples used for model training decreases.
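The sample counts above follow directly from drawing 30 coefficients per node. A short sketch is given below; the source says only that the coefficients are random within the range 0.5 to 1.2, so the uniform distribution used here is an assumption, as is the function name `leak_coefficients`.

```python
import random


def leak_coefficients(node_ids, rng, per_node=30, c_min=0.5, c_max=1.2):
    """Return (node, coefficient) pairs, per_node draws for each node.

    Coefficients are drawn uniformly from [c_min, c_max] (an assumption).
    """
    return [(node, rng.uniform(c_min, c_max))
            for node in node_ids
            for _ in range(per_node)]
```

With 375 nodes this yields 375 × 30 = 11250 pairs, and a region of 50 nodes yields 1500; each pair would then be run through the hydraulic model to produce one training sample.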

Feature selection reduces the number of features needed to identify the leak in each iteration. For example, iteration 2 in Figure 5 requires a maximum of 9 sensors, and the iterations as a whole require 10 different sensors; both numbers are smaller than the total number of sensors.
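The pruning rule described for feature selection, removing features from least to most important as long as accuracy does not decrease, can be sketched as follows. The source does not say whether pruning stops at the first feature whose removal lowers accuracy; this sketch keeps such a feature and continues with the next one, which is an assumption. The names `prune_features` and `evaluate` are hypothetical, and `evaluate` stands for retraining the classifier on a feature subset and returning its accuracy.

```python
def prune_features(ranked, evaluate):
    """Drop features in order while model accuracy does not fall.

    ranked: feature names ordered from least to most important.
    evaluate: callable mapping a list of feature names to an accuracy.
    """
    kept = list(ranked)
    best = evaluate(kept)
    for feature in ranked:
        if len(kept) == 1:
            break  # always keep at least one feature
        trial = [f for f in kept if f != feature]
        score = evaluate(trial)
        if score >= best:
            kept, best = trial, score
    return kept
```

With sensor names such as P1-P18 and Q1-Q3, the returned list is the reduced sensor set used to train the classifier for that iteration.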

When a 4 L/s leak occurs at node 101, the iteration proceeds as shown in Figure 5. The identification accuracies of the first four sub-iterations are 99.99%, 99.82%, 99.25% and 97.35%, so the final identification accuracy after the fourth iteration is 99.99% × 99.82% × 99.25% × 97.35% = 96.44%. The fifth sub-iteration has an accuracy of 96.67%, which gives a final identification accuracy of 96.44% × 96.67% = 93.22% < 95%, so the iteration terminates. As shown in Figure 6(b), the region (Z1) containing leaking node 101 (5) is determined. The total number of iterations is 4.
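The stopping rule can be checked against the accuracies reported for node 101. The sketch below is illustrative: the function name `run_until_threshold` is hypothetical, and because the source does not say what happens when the product equals exactly 95%, the sketch treats that case as passing.

```python
def run_until_threshold(accuracies, threshold=0.95):
    """Apply the product stopping rule to per-iteration accuracies.

    Returns (iterations_kept, final_accuracy), where final_accuracy is
    the product over the iterations that were kept.
    """
    final = 1.0
    kept = 0
    for acc in accuracies:
        if final * acc < threshold:
            break  # this iteration would push the product below the limit
        final *= acc
        kept += 1
    return kept, final


# Per-iteration accuracies reported for the 4 L/s leak at node 101.
single_leak = [0.9999, 0.9982, 0.9925, 0.9735, 0.9667]
kept, final = run_until_threshold(single_leak)
print(kept, round(final * 100, 2))  # prints: 4 96.44
```

Because every factor is at most 1, the product can only fall as iterations are added, so the 95% threshold bounds how far a region can be subdivided.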

Compared with the present method, the traditional method (in which the total number of clusters is specified in advance and the final number is determined by trial computation) needs noticeably more trial computations, and the present invention analyzes the leakage region more specifically. With the traditional method, taking an identification accuracy of 95% as the criterion for the trained model, the total number of identifiable regions is 18. If the traditional method starts from 1 and increases the number of regions by 1 for each clustering trial, the total number of trial computations is 17. For the iterative method, by contrast, Table 2 shows that the number of iterations is 4, fewer than the number of trial computations of the traditional algorithm. Although the traditional method also identifies the leakage region well, the leakage region (Z2) it identifies as containing leaking node 101 (5), shown in Figure 6(a), is larger than the leakage region (Z1) identified by the present method. The proposed method therefore not only reduces the total number of trial computations but also narrows the identifiable region by eliminating leak-free regions in each iteration.

Two nodes leaking simultaneously

One two-node leakage combination occurs at nodes 65 and 281, with a leakage flow of 3.6 L/s at node 65 and 2.9 L/s at node 281; another leakage combination occurs at nodes 132 and 406, with a leakage flow of 3.0 L/s at node 132 and 4.7 L/s at node 406.

In each iteration, one of the already identified leakage regions is selected, and a leakage coefficient of C = 0.8 is applied to the nodes in that region to generate the leakage change matrix. K-means clustering is then used to cluster the leakage region into two parts.

Each leaking-node combination yields 30 leakage coefficient combinations within the leakage coefficient range. As before, the first iteration randomly generates 11250 independent leakage samples. The sample size of each iteration decreases as the number of nodes in the leakage region decreases. For example, if the leakage region identified in the previous iteration has 375 nodes, 11250 leakage samples are generated; in proportion, if the leakage region identified in the previous iteration has 50 nodes, 1500 leakage samples are generated. The identification process of the iterative method for two simultaneously leaking nodes is shown in Figures 7 and 8, which show different identification results.

对于两个节点的同时泄漏，在每次迭代中对两个同时泄漏的模型进行训练时，特征选择也可以减少不必要的特征，但随着迭代次数的增加，相对于单节点泄漏的效果降低。如附图7所示，迭代1需要5个传感器，迭代4需要13个传感器，如附图8所示，迭代1需要5个传感器，迭代4需要12个传感器。对于整个迭代，附图7显示整个迭代需要13个不同的传感器，而附图8显示整个迭代需要17个不同的传感器。距离较近的相邻的泄漏组合所需的传感器总数小于与距离较远的泄漏组合所需的传感器总数。尽管所选传感器的数目都小于总传感器的数目，但与单个节点泄漏相比，效果有所下降。For simultaneous leakage of two nodes, feature selection can also reduce unnecessary features when training two models with simultaneous leakage in each iteration, but as the number of iterations increases, the effect relative to single-node leakage decreases . As shown in Figure 7, iteration 1 requires 5 sensors and iteration 4 requires 13 sensors, as shown in Figure 8, iteration 1 requires 5 sensors, and iteration 4 requires 12 sensors. For the entire iteration, Figure 7 shows that 13 different sensors are required for the entire iteration, while Figure 8 shows that 17 different sensors are required for the entire iteration. The total number of sensors required for the combination of closely spaced adjacent leaks is less than the total number of sensors required for the combination with the more distant leaks. Although the number of selected sensors is all smaller than the total number of sensors, the effect is reduced compared to single node leakage.

如附图7所示，当节点65和281发生泄漏时，前四次各子迭代的识别准确率分别为99.41％、98.83％、98.57％、99.76％，得到的最终识别准确率为99.41％×98.83％×98.57％×99.76％＝96.61％。对于第五个子迭代，精度为97.22％，最终识别准确率为96.61％×97.22％＝93.92％<95％，迭代停止；确定了泄漏区域和泄漏区域内的泄漏数量。As shown in Figure 7, when the nodes 65 and 281 leak, the recognition accuracy rates of the first four sub-iterations are 99.41%, 98.83%, 98.57%, and 99.76%, respectively, and the final recognition accuracy is 99.41%× 98.83%×98.57%×99.76%=96.61%. For the fifth sub-iteration, the accuracy was 97.22%, the final recognition accuracy was 96.61% × 97.22% = 93.92% < 95%, the iteration was stopped; the leak area and the number of leaks within the leak area were determined.

As shown in Figure 8, when nodes 132(9) and 406(8) leak, the accuracies of the first seven sub-iterations are 99.41%, 99.52%, 99.64%, 99.78%, 98.88%, 99.36% and 99.67%, giving a final identification accuracy of 96.32%. The eighth sub-iteration has an identification accuracy of 97.12%, so the final identification accuracy becomes 96.32% × 97.12% = 93.55% < 95% and the iteration stops; the leakage areas and the number of leaks within each leakage area are thereby determined.
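The stopping rule used in both examples multiplies the per-iteration accuracies and halts once the running product falls below 95%. The following Python sketch reproduces that arithmetic; the function name and return shape are hypothetical, and the accuracy lists are the values quoted for Figures 7 and 8.

```python
def run_until_below_threshold(accuracies, threshold=0.95):
    """Return (stopping iteration, running products).

    The running product after each sub-iteration is the 'final
    identification accuracy'; iteration stops at the first sub-iteration
    whose running product is below the threshold. If the threshold is
    never crossed, the stopping iteration is None.
    """
    cumulative = 1.0
    history = []
    for step, accuracy in enumerate(accuracies, start=1):
        cumulative *= accuracy
        history.append(cumulative)
        if cumulative < threshold:
            return step, history
    return None, history


# Per-iteration accuracies quoted for Figure 7 and Figure 8.
figure_7 = [0.9941, 0.9883, 0.9857, 0.9976, 0.9722]
figure_8 = [0.9941, 0.9952, 0.9964, 0.9978, 0.9888, 0.9936, 0.9967, 0.9712]

stop_7, history_7 = run_until_below_threshold(figure_7)
stop_8, history_8 = run_until_below_threshold(figure_8)
```

Because the product is carried at full precision here, the last digit can differ slightly from figures obtained by multiplying already rounded percentages.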

As shown in Figure 9(a), when nodes 65(6) and 281(7) leak, the leakage area identified by the invention is Z3; as shown in Figure 9(b), when nodes 132(9) and 406(8) leak, the leakage areas identified by the invention are Z4 and Z5. Nodes 65(6) and 281(7) are very close to each other, so these two simultaneously leaking nodes are difficult to separate. Nodes 132(9) and 406(8) are farther apart, so their leakage features are more distinct than those of nodes 65(6) and 281(7) and easier to separate, which is also why nodes 132(9) and 406(8) go through more iterations.

The invention proposes a method for identifying leakage areas of a water supply network based on iterative machine learning. In each iteration, k-means is first used to cluster one of the already identified leakage areas into two parts. All combination types of leaking nodes for the iteration are then determined, and each combination type serves as one class label of the random forest classifier. Next, leakage coefficients are randomly added to nodes of the leakage areas according to the combination types to generate leak samples, and the generated leak samples are used to train the classifier model. If the trained model meets the iteration criterion, the leakage feature is input and the next iteration begins; otherwise the iteration ends, and the leakage areas and the number of leaking nodes contained in each leakage area are output. The method was applied to a water supply network and its performance was evaluated. The results show that the method identifies the leaking areas and the number of leaking nodes they contain well, and improves the efficiency and accuracy of leak detection.
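The labelling and sample-generation steps summarised above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the node lists and the coefficient range are hypothetical, and the labels are read as the possible ways of distributing the leaking nodes of the split area over its two sub-areas. The hydraulic simulation that turns each sample into sensor features is not shown.

```python
import random


def split_labels(n_leaks):
    """All ways to distribute n_leaks leaking nodes over two sub-areas.

    Each tuple (k, n_leaks - k) is treated as one class label.
    """
    return [(k, n_leaks - k) for k in range(n_leaks + 1)]


def generate_samples(area_a, area_b, n_leaks, per_label, coeff_range, seed=0):
    """Generate leak samples as (node -> leakage coefficient, label index).

    area_a and area_b are lists of node identifiers for the two sub-areas
    produced by clustering; coeff_range is an assumed (low, high) range.
    """
    rng = random.Random(seed)
    samples = []
    for label, (n_a, n_b) in enumerate(split_labels(n_leaks)):
        # Skip combinations that need more nodes than a sub-area contains.
        if n_a > len(area_a) or n_b > len(area_b):
            continue
        for _ in range(per_label):
            nodes = rng.sample(area_a, n_a) + rng.sample(area_b, n_b)
            coefficients = {node: rng.uniform(*coeff_range) for node in nodes}
            samples.append((coefficients, label))
    return samples


area_a = [1, 2, 3]
area_b = [4, 5, 6, 7]
samples = generate_samples(area_a, area_b, n_leaks=2, per_label=5,
                           coeff_range=(0.1, 0.5))
```

With two simultaneous leaks the split yields three labels, so a single-label classifier can be trained even though several nodes leak at once.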

Claims

1. A method for identifying leakage areas of a water supply network based on iterative machine learning, characterized in that the method comprises the following steps:

(1) l nodes leak simultaneously, and the leakage feature is ΔS_l;

(2) The leakage areas identified in the (β-1)th iteration are […], and each leakage area contains […] leaking nodes, i = 1, 2, ..., w;

(3) Select one leakage area […] from the w leakage areas; the leakage area […] contains […] leaking nodes;

(4) According to the leakage change matrix […], use k-means clustering to divide the leakage area […] into two clusters, namely area […] and area […];

(5) Generate the combination types of leaking nodes for the βth iteration; for the (w-1) unclustered areas, the number of leaking nodes inside each area remains unchanged. For the area […] containing […] leaking nodes, which is split into area […] and area […], there are […] leak combinations in total, namely: 0 leaking nodes in area […] and […] leaking nodes in area […]; 1 leaking node in area […] and […] leaking nodes in area […]; ...; […] leaking nodes in area […] and 0 leaking nodes in area […]. This iteration therefore has a total of […] different labels;

(6) Generate the leak samples of the βth iteration: randomly choose […] nodes from area […] and […] nodes from area […], and, for the (w-1) unclustered areas, choose […] nodes from each area […], so that l simultaneously leaking nodes are produced and recorded as one leak sample. A total of ε different leak samples are generated; the set of leak samples is denoted T, and the total number of features is N_PQ, comprising N_P pressure sensors and N_Q flow sensors, where N_PQ = N_P + N_Q;

(7) Take the hydraulic simulation results of the leakage events as the training samples of the classifier; use MDA (mean decrease in accuracy) to compute the importance of the classifier model's features, sort the features by importance, and delete features in order from least important to most important as long as the accuracy of the trained model does not decrease. The MDA computation consists of two parts: training the random forest classifier and computing the mean decrease in accuracy of each feature;

The training process of a random forest classifier is as follows:

(a) The leakage areas identified in the (β-1)th iteration are […], and each leakage area contains […] leaking nodes, i = 1, 2, ..., w; define […], where l is the total number of leaking nodes. Then, according to the leakage matrix […] of leakage area […], cluster leakage area […] into two parts, area […] and area […], so that the combined leakage areas are […]. For the area […] containing […] leaking nodes, which is split into area […] and area […], the leak combinations range from 0 leaking nodes in area […] and […] leaking nodes in area […], through 1 leaking node in area […] and […] leaking nodes in area […], up to […] leaking nodes in area […] and 0 leaking nodes in area […]. Randomly choose […] nodes from area […], […] nodes from area […] and […] nodes from area […], where […]; l simultaneously leaking nodes are thus generated and recorded as one leak sample. Assuming that there are ε different leak samples, the set of leak samples is denoted T and the total number of features is N_PQ, comprising N_P pressure sensors and N_Q flow sensors, N_PQ = N_P + N_Q;

(b) Create the training subsets T_tr1, T_tr2, …, T_trM by repeated sampling with replacement from the leak samples, where M is the number of classification trees. Each training subset T_tri is drawn at random from the leak samples T and contains ε samples, the same size as the full set of leak samples, so a training subset T_tri contains duplicate samples. The leak samples in T that are not drawn are called the out-of-bag (OOB) samples and are used to evaluate the accuracy of each tree. Because the accuracy of the random forest classifier increases with the number of classification trees and tends to a constant, the default value is M = 500;

(c) For each node of the training subset T_tri, 1 ≤ i ≤ 500, randomly select m sub-features from the N_PQ features to create a classification tree, 1 ≤ m ≤ N_PQ; the default value of m is […] and […], where […] is the set of samples containing values not greater than t, and p_f denotes the probability that a sample belongs to class f;

(d) Each classification tree contributes to the identification accuracy on its training subset T_tri, so the average identification accuracy of the 500 classification trees is the identification accuracy of the trained model;

The calculation process of MDA is as follows:

(8) If the final identification accuracy […] is greater than 95%, input the leakage feature into the trained random forest classifier and proceed to the next iteration; if the final identification accuracy is less than 95%, stop the iteration and output the identified leakage areas and the number of leaking nodes contained in each leakage area.
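Step (7) relies on bootstrap training subsets, out-of-bag (OOB) samples and MDA feature importance. The Python sketch below illustrates how these three ideas fit together, using the common permutation formulation of MDA, which is an assumption here and not necessarily the exact formula of the claim. A nearest-centroid rule stands in for each classification tree to keep the example short, and all function names and the toy data are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; undrawn indices are OOB."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob


def fit_centroids(X, y, indices):
    """Stand-in for one classification tree: per-class mean feature vector."""
    sums, counts = {}, {}
    for i in indices:
        total = sums.setdefault(y[i], [0.0] * len(X[i]))
        for j, value in enumerate(X[i]):
            total[j] += value
        counts[y[i]] = counts.get(y[i], 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))


def accuracy(centroids, rows, labels):
    return sum(predict(centroids, r) == t for r, t in zip(rows, labels)) / len(rows)


def mean_decrease_accuracy(X, y, n_models=50, seed=0):
    """Average drop in OOB accuracy when one feature column is permuted."""
    rng = random.Random(seed)
    n_features = len(X[0])
    totals = [0.0] * n_features
    used = 0
    for _ in range(n_models):
        in_bag, oob = bootstrap_split(len(X), rng)
        if not oob:
            continue  # no OOB samples to evaluate this model on
        model = fit_centroids(X, y, in_bag)
        rows = [list(X[i]) for i in oob]
        labels = [y[i] for i in oob]
        base = accuracy(model, rows, labels)
        for j in range(n_features):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            totals[j] += base - accuracy(model, permuted, labels)
        used += 1
    if used == 0:
        raise ValueError("no model had out-of-bag samples")
    return [t / used for t in totals]


# Toy data: feature 0 separates the two classes, feature 1 is noise.
data_rng = random.Random(1)
X = ([[data_rng.random(), data_rng.random()] for _ in range(20)]
     + [[10.0 + data_rng.random(), data_rng.random()] for _ in range(20)])
y = [0] * 20 + [1] * 20
importances = mean_decrease_accuracy(X, y)
```

In this toy setting the informative feature receives a clearly larger importance than the noise feature, which is the ordering that would let features be removed from least to most important.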