CN111553811A - A method for identifying leakage areas in water supply network based on iterative machine learning - Google Patents

Info

Publication number
CN111553811A
Authority
CN
China
Prior art keywords
nodes
leakage
leaking
leak
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010369142.0A
Other languages
Chinese (zh)
Other versions
CN111553811B (en
Inventor
冯新
陈京钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202010369142.0A
Publication of CN111553811A
Application granted
Publication of CN111553811B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Examining Or Testing Airtightness (AREA)

Abstract

A method for identifying leakage areas in a water supply network based on iterative machine learning, belonging to the technical field of water supply network leak detection. In each iteration, one of the already identified leakage regions is selected and split into two clusters by k-means clustering. Every combination type of leaking nodes is used as a class label of a random forest classifier. Leakage coefficients are then added at random to nodes of the leakage regions according to these combination types to generate leakage samples, which are used to train the classifier. Feature selection is applied during training to reduce the number of features needed to train the model. The leakage feature is input into the trained classifier, which outputs the identified leakage regions and the number of leaking nodes each contains. These steps are repeated until the final identification accuracy falls below 95%, at which point the iteration ends. The method applies a single-label classifier to the identification of multi-leak regions in a water supply network for the first time; it is simple to operate and identifies leakage regions well.

Description

A method for identifying leakage areas in a water supply network based on iterative machine learning

Technical field

The invention relates to a method for identifying leakage areas in a water supply network based on iterative machine learning, and belongs to the technical field of water supply network leak detection.

Background

The water supply network is an important piece of public infrastructure and plays a major role in economic development and daily life. Because of aging pipes and poor design, water supply networks leak to varying degrees. Developed countries lose about 15% of their clean water from distribution systems every year, while developing countries lose 35%, and in some cases as much as 60%. Accurately locating leaks is therefore one of the key problems in controlling the leakage rate of a water supply network.

With the arrival of the big-data era, using data mining to locate leaks in water supply networks has become an active research topic. Leak localization methods based on hydraulic models fall into two categories: (1) locating the leaking node approximately, by exploiting the similarity of leakage features among adjacent nodes; (2) dividing the network into regions in advance according to leakage features and then identifying the leaking region. Direct node localization first generates leakage samples for the network. Because some nodes have similar leakage features, the identification accuracy of the model drops, and two remedies are commonly used: merging similar nodes beforehand to reduce the number of indistinguishable nodes, or adding a time-delay mode, that is, adding time steps to increase the amount of leakage information. Direct region localization clusters similar nodes beforehand to reduce their number and uses the leakage regions as classifier labels to identify the leaking region.

An analysis of earlier methods shows that clustering nodes with similar leakage features can improve the identification accuracy of a classifier. However, when clustering is applied to a water supply network, the number of clusters is set directly by the researcher without a theoretical basis, and only a single leaking node is considered during localization. In real water supply networks, multiple simultaneous leaks are common. To address this, the invention uses an iterative machine learning method that combines k-means clustering (k=2) with a random forest classifier. This gives a theoretical basis for the number of clusters and also offers a way to identify multiple leakage regions. For simultaneous leaks, every combination type of leaking nodes in the identified regions is analyzed in each iteration, and each combination type is used as a class label of the random forest classifier. The trained random forest classifier serves as the leakage region identification model: it locates the leakage regions and identifies the number of leaks in each.

Summary of the invention

The purpose of the invention is to provide a method for identifying leakage areas in a water supply network based on iterative machine learning that resolves the uncertainty in the number of clusters when clustering is applied to a water supply network and offers a way to identify regions with multiple leaks.

The technical scheme adopted by the invention is as follows. A hydraulic model of the water supply network is built from the principles of flow and pressure balance. Assume that l nodes leak simultaneously and that the leakage feature (the difference between the sensor values before and after the leak) is $\Delta S_l$. The same leakage coefficient C is added to each node of an already identified leakage region to generate a leakage change matrix, and k-means clustering splits the region into two clusters based on that matrix. Leak events are then generated at random through hydraulic simulation, and the simulation results serve as training samples for the classifier. The samples are used to train a feature selection model: the importance of each feature is computed with Mean Decrease Accuracy (MDA), and the features are ranked by importance and filtered. Each leakage combination is used as a class label of the classifier, and a random forest classifier is trained on the selected features. If the final identification accuracy of the model (Acc, the product of the training-model identification accuracies of all iterations so far) is greater than 95%, the leakage feature $\Delta S_l$ is input into the trained random forest classifier and the next iteration begins; if it is less than 95%, the iteration stops. The identified leakage regions and the number of leaking nodes in each are output.

The method for identifying leakage areas in a water supply network based on iterative machine learning is carried out through the following steps:

(1) l nodes leak simultaneously, and the leakage feature is $\Delta S_l$;

(2) The leakage regions identified in the (β-1)-th iteration are $A_1^{\beta-1}, A_2^{\beta-1}, \ldots, A_w^{\beta-1}$, and each leakage region $A_i^{\beta-1}$ contains $l_i^{\beta-1}$ leaking nodes, i=1,2,...,w;

(3) Select one leakage region $A_j^{\beta-1}$ from the w leakage regions; it contains $l_j^{\beta-1}$ leaking nodes. Add the same leakage coefficient C to each node in $A_j^{\beta-1}$ to generate the leakage change matrix $\Delta S_j^{\beta-1}$;

(4) Based on the leakage change matrix $\Delta S_j^{\beta-1}$ of the selected region, use k-means clustering to split leakage region $A_j^{\beta-1}$ into two clusters, region $A_{j,1}^{\beta}$ and region $A_{j,2}^{\beta}$. The remaining (w-1) unclustered regions and the number of leaking nodes they contain are unchanged, so the β-th iteration has (w+1) regions in total;
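The two-way split in step (4) can be illustrated with a small, dependency-free k-means for k=2. This is only a sketch under stated assumptions: each row of the leakage change matrix is taken to be the sensor-change vector obtained when coefficient C is applied at one node, the farthest-point initialization is an assumption made to keep the example deterministic, and the matrix values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(rows, max_iter=100):
    """Split rows into two clusters with Lloyd's k-means (k=2).

    rows[i] is assumed to be the sensor-change vector of node i.
    The farthest-point initialization is an assumption, not taken
    from the patent text.
    """
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    first = max(rows, key=lambda r: sq_dist(r, mean))
    second = max(rows, key=lambda r: sq_dist(r, first))
    centers = [list(first), list(second)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each node goes to its nearest center.
        new_labels = [
            0 if sq_dist(r, centers[0]) <= sq_dist(r, centers[1]) else 1
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical leakage change matrix: 4 nodes x 2 sensors.
delta_s = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = two_means(delta_s)
```

Nodes that share a label form one of the two sub-regions carried into the next iteration.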

(5) Generate the combination types of leaking nodes for the β-th iteration. For the (w-1) unclustered regions, the number of leaking nodes inside each region stays the same. For regions $A_{j,1}^{\beta}$ and $A_{j,2}^{\beta}$, which together contain $l_j^{\beta-1}$ leaking nodes, there are $(l_j^{\beta-1}+1)$ leakage combinations: 0 leaking nodes in $A_{j,1}^{\beta}$ and $l_j^{\beta-1}$ leaking nodes in $A_{j,2}^{\beta}$; 1 leaking node in $A_{j,1}^{\beta}$ and $(l_j^{\beta-1}-1)$ leaking nodes in $A_{j,2}^{\beta}$; and so on, up to $l_j^{\beta-1}$ leaking nodes in $A_{j,1}^{\beta}$ and 0 leaking nodes in $A_{j,2}^{\beta}$. This iteration therefore has $(l_j^{\beta-1}+1)$ different labels;
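The label set in step (5) can be enumerated directly. The sketch below lists every way of dividing the leaking nodes of the split region between its two sub-regions; the function name is hypothetical.

```python
def combination_labels(n_leaks):
    """Return every (leaks in sub-region 1, leaks in sub-region 2) pair.

    n_leaks is the number of leaking nodes in the region being split,
    so the result always has n_leaks + 1 entries.
    """
    return [(k, n_leaks - k) for k in range(n_leaks + 1)]


labels_two_leaks = combination_labels(2)
```

For a region holding two leaks, `combination_labels(2)` gives the three labels (0, 2), (1, 1) and (2, 0).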

(6) Generate the leakage samples for the β-th iteration. Randomly select $l_{j,1}^{\beta}$ nodes from region $A_{j,1}^{\beta}$ and $l_{j,2}^{\beta}$ nodes from region $A_{j,2}^{\beta}$, where $l_{j,1}^{\beta}+l_{j,2}^{\beta}=l_j^{\beta-1}$; for the (w-1) unclustered regions, select $l_i^{\beta-1}$ nodes from each region $A_i^{\beta-1}$, $i\neq j$. This yields l simultaneously leaking nodes, which are recorded as one leakage sample. A total of ε different leakage samples are generated. The set of leakage samples is denoted T, and the total number of features is $N_{PQ}$, from $N_P$ pressure sensors and $N_Q$ flow sensors, where $N_{PQ}=N_P+N_Q$;
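The node selection in step (6) can be sketched as below. The region contents and node IDs are hypothetical, drawing each coefficient uniformly from the embodiment's 0.5 to 1.2 range is an assumption, and the hydraulic simulation that would turn each draw into sensor features is left out.

```python
import random


def draw_leak_sample(regions, leak_counts, rng, c_range=(0.5, 1.2)):
    """Pick leak_counts[i] distinct nodes from regions[i].

    Returns a list of (node, leakage coefficient) pairs. The uniform
    draw over c_range is an assumption; the source only says the
    coefficients are added at random.
    """
    sample = []
    for nodes, count in zip(regions, leak_counts):
        for node in rng.sample(nodes, count):
            sample.append((node, rng.uniform(*c_range)))
    return sample


rng = random.Random(0)
regions = [[1, 2, 3, 4], [5, 6, 7], [8, 9]]  # hypothetical node IDs
leak_counts = [1, 1, 0]                      # one combination type
sample = draw_leak_sample(regions, leak_counts, rng)
```

Each call produces one leakage sample; repeating it with different draws builds up the sample set T.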

(7) Use the hydraulic simulation results of the leak events as training samples for the classifier. Compute the importance of each classifier feature with MDA, rank the features by importance, and remove features in order from least to most important, provided that the accuracy of the trained model does not decrease. The MDA computation consists of two parts: training the random forest classifier and computing the mean decrease in accuracy of each feature;

The random forest classifier is trained as follows:

(a) For the β-th iteration, the regions identified in the (β-1)-th iteration are $\{A_1^{\beta-1}, A_2^{\beta-1}, \ldots, A_w^{\beta-1}\}$, and each identified leakage region $A_i^{\beta-1}$ contains $l_i^{\beta-1}$ leaking nodes, i=1,2,...,w. Define $\sum_{i=1}^{w} l_i^{\beta-1}=l$, where l is the total number of leaking nodes. Then, based on the leakage matrix $\Delta S_j^{\beta-1}$ of leakage region $A_j^{\beta-1}$, cluster $A_j^{\beta-1}$ into two parts, region $A_{j,1}^{\beta}$ and region $A_{j,2}^{\beta}$, so that the set of leakage regions becomes $\{A_1^{\beta-1},\ldots,A_{j-1}^{\beta-1},A_{j,1}^{\beta},A_{j,2}^{\beta},A_{j+1}^{\beta-1},\ldots,A_w^{\beta-1}\}$. Regions $A_{j,1}^{\beta}$ and $A_{j,2}^{\beta}$ together contain $l_j^{\beta-1}$ leaking nodes and have $(l_j^{\beta-1}+1)$ leakage combinations, while the number of leaking nodes in the other (w-1) unclustered regions is unchanged, so this iteration has $(l_j^{\beta-1}+1)$ random forest classifier labels. The combination types are: 0 leaking nodes in region $A_{j,1}^{\beta}$ and $l_j^{\beta-1}$ leaking nodes in region $A_{j,2}^{\beta}$; 1 leaking node in region $A_{j,1}^{\beta}$ and $(l_j^{\beta-1}-1)$ leaking nodes in region $A_{j,2}^{\beta}$; and so on, up to $l_j^{\beta-1}$ leaking nodes in region $A_{j,1}^{\beta}$ and 0 leaking nodes in region $A_{j,2}^{\beta}$. Randomly select $l_{j,1}^{\beta}$ nodes from region $A_{j,1}^{\beta}$, $l_{j,2}^{\beta}$ nodes from region $A_{j,2}^{\beta}$, and $l_i^{\beta-1}$ nodes from each unclustered region $A_i^{\beta-1}$, where $l_{j,1}^{\beta}+l_{j,2}^{\beta}+\sum_{i\neq j} l_i^{\beta-1}=l$. This yields l simultaneously leaking nodes, which are recorded as one leakage sample. Assume there are ε different leakage samples; the set of leakage samples is denoted T, and the total number of features is $N_{PQ}$, from $N_P$ pressure sensors and $N_Q$ flow sensors, with $N_{PQ}=N_P+N_Q$;

(b) Create the training subsets $T_{tr1}, T_{tr2}, \ldots, T_{trM}$ by sampling with replacement from the leakage samples, where M is the number of classification trees. The training subset $T_{tri}$ of each tree is drawn at random from the leakage sample set T and contains ε samples, the same size as T, so it contains duplicate samples. The samples in T that are not drawn are called out-of-bag (OOB) samples and are used to evaluate the accuracy of each tree. Because the accuracy of a random forest classifier increases with the number of classification trees and levels off at a constant, the default value M=500 is chosen;
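The bootstrap draw in step (b) can be sketched with sample indices alone. The function name is hypothetical; it draws as many indices as there are samples, with replacement, and reports the indices that were never drawn as the out-of-bag set.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, oob)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Indices never drawn form the out-of-bag set for this tree.
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob


rng = random.Random(0)
in_bag, oob = bootstrap_split(10, rng)
```

Repeating the draw M times, once per tree, gives each tree its own training subset and its own OOB samples.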

(c) For each node of the tree grown on training subset $T_{tri}$, 1≤i≤500, randomly select m of the $N_{PQ}$ features to build the classification tree, $1\le m\le N_{PQ}$; the default value is $m=\sqrt{N_{PQ}}$. The Gini index of each selected feature is then computed. Given a training subset $T_{tri}$ and a continuous feature $N_D$, D=1,2,...,m, suppose $T_{tri}$ contains samples of f classes, $h=1,2,\ldots,f$, with $|f_h|$ samples in class h, and feature $N_D$ takes r distinct values. Sort these values in ascending order and label them $R=\{R_1,R_2,\ldots,R_r\}$. A split point t divides the samples on $N_D$ into two subsets $T_t^{-}$ and $T_t^{+}$, where $T_t^{-}$ contains the samples whose value is not greater than t and $T_t^{+}$ contains the samples whose value is greater than t. For adjacent values $R_e$ and $R_{e+1}$, every t in $[R_e, R_{e+1})$ gives the same split, so there are (r-1) candidate split points. The Gini index of training subset $T_{tri}$ for feature $N_D$ at split point t is

$$\mathrm{Gini}(T_{tri}, N_D, t)=\frac{|T_t^{-}|}{|T_{tri}|}\,\mathrm{Gini}(T_t^{-})+\frac{|T_t^{+}|}{|T_{tri}|}\,\mathrm{Gini}(T_t^{+})$$

where $p_h$ is the probability that a sample belongs to class h, and

$$\mathrm{Gini}(T)=\sum_{h=1}^{f}p_h(1-p_h)=1-\sum_{h=1}^{f}p_h^{2}$$

is the probability that a randomly selected sample is misclassified. The larger the Gini index, the more likely a sample is to be misclassified. The split point t and feature with the smallest Gini index are selected for the split, and the same procedure is repeated to build each branch;
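The split search on one continuous feature can be sketched as follows, assuming the standard CART form of the Gini index. Using the midpoint between adjacent distinct values as the candidate split point is an assumption; any value between the two neighbours gives the same partition.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (t, weighted Gini) for the best split of one feature.

    Candidates are the r-1 midpoints between adjacent distinct values.
    Returns None when the feature has fewer than two distinct values.
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical feature values and class labels.
split = best_split([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])
```

Here the best split point is 2.5 with a weighted Gini index of 0, because it separates the two classes exactly.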

(d) Every classification tree contributes to the identification accuracy on training subset $T_{tri}$; the identification accuracy of the trained model is the average identification accuracy of the 500 classification trees, $acc=\frac{1}{500}\sum_{i=1}^{500}acc_i$, where $acc_i$ is the identification accuracy of the i-th tree.

The MDA is computed as follows:

For each classification tree $T_{tri}$ of the random forest classifier, 1≤i≤500, compute the OOB error on the OOB data, denoted $OOB_{b1}$. Randomly permute the values of feature F in the OOB samples, $1\le F\le N_{PQ}$, and compute the OOB error again, denoted $OOB_{b2}$. With 500 trees in the forest, the importance of feature F is:

$$MDA(F)=\frac{1}{500}\sum_{i=1}^{500}\left(OOB_{b2}^{(i)}-OOB_{b1}^{(i)}\right)$$
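The permutation step behind MDA can be sketched as below, assuming the usual mean-over-trees form of the importance. Each tree is represented by a hypothetical `predict` callable together with its own OOB rows and labels; these stand in for trained classification trees and are not part of the source.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows whose predicted label differs from the true one."""
    wrong = sum(1 for r, y in zip(rows, labels) if predict(r) != y)
    return wrong / len(rows)


def mean_decrease_accuracy(trees, feature, rng):
    """Average rise in OOB error after permuting one feature column.

    trees is a list of (predict, oob_rows, oob_labels) triples.
    """
    total = 0.0
    for predict, rows, labels in trees:
        base = error_rate(predict, rows, labels)  # OOB error before shuffling
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [
            r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)
        ]
        total += error_rate(predict, permuted, labels) - base
    return total / len(trees)


# Hypothetical one-tree "forest" that only looks at feature 0.
oob_rows = [[0.0, 9.0], [1.0, 3.0], [0.0, 5.0], [1.0, 7.0]]
oob_labels = [0, 1, 0, 1]
trees = [(lambda r: int(r[0] > 0.5), oob_rows, oob_labels)]
rng = random.Random(0)
importance_unused = mean_decrease_accuracy(trees, 1, rng)
importance_used = mean_decrease_accuracy(trees, 0, rng)
```

A feature the trees never use keeps an importance of 0, which is why such features are the first candidates for removal.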

(8) Use the $(l_j^{\beta-1}+1)$ combination types of leaking nodes as the class labels of the classifier, and train the random forest classifier on the selected features. If the final identification accuracy of the β-th iteration, $Acc_\beta$, that is, the product of the training-model identification accuracies of all iterations so far, $Acc_\beta=\prod_{b=1}^{\beta}acc_b$, is greater than 95%, input the leakage feature into the trained random forest classifier and proceed to the next iteration; if the final identification accuracy is less than 95%, stop the iteration. Output the identified leakage regions and the number of leaking nodes each contains.
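The stopping rule in step (8) multiplies the per-iteration training accuracies and stops once the product drops below the threshold. A minimal sketch follows; the accuracy values are hypothetical, and treating a product of exactly 95% as passing is an assumption, since the text only covers "greater than" and "less than".

```python
def iterations_to_keep(per_iteration_acc, threshold=0.95):
    """Count the iterations whose cumulative accuracy stays at or above threshold."""
    kept = 0
    acc = 1.0
    for a in per_iteration_acc:
        acc *= a  # cumulative product of training accuracies
        if acc < threshold:
            break  # this iteration's result is not used
        kept += 1
    return kept


# Hypothetical per-iteration training accuracies.
n_kept = iterations_to_keep([0.99, 0.98, 0.97])
```

With these values the product is about 0.970 after two iterations and about 0.941 after three, so only the first two iterations are kept.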

The invention has the following beneficial effects:

In this iterative machine learning method for identifying leakage areas in a water supply network, each iteration selects one of the already identified leakage regions and splits it into two clusters by k-means clustering. All combination types of leaking nodes are used as labels of the random forest classifier; leakage coefficients are then added at random to nodes of the leakage regions according to the combination types to generate leakage samples, which are used to train the classifier. Feature selection during training reduces the number of features needed to train the model. The leakage feature is input into the trained classifier, which outputs the identified leakage regions and the number of leaking nodes they contain. These steps are repeated until the final identification accuracy falls below 95%, at which point the iteration ends and the identified leakage regions and their numbers of leaking nodes are output. For a single-node leak, two-cluster splitting removes the uncertainty in the number of clusters that arises when clustering is applied to leak detection in a water supply network, gives a theoretical basis for the number of clusters, and reduces the number of trial computations required. Combining clustering with a single-label classifier makes it possible to locate leakage regions with multiple leaks and to identify the number of leaking nodes each contains.

Description of the drawings

Figure 1 is a flow chart of the method of the invention.

Figure 2 is the hydraulic model of the water supply network.

Figure 3 shows the locations of the pressure and flow measuring points.

Figure 4 shows the training process of the random forest classifier.

Figure 5 shows the identification process for the leakage region containing leaking node 101.

Figure 6 shows the identified region for a single-node leak: (a) the leakage region determined by the traditional method; (b) the leakage region determined by the iterative machine learning method.

Figure 7 shows the identification process for the regions containing leaking nodes 65 and 281.

Figure 8 shows the identification process for the regions containing leaking nodes 132 and 406.

Figure 9 shows the identified regions for two simultaneous leaks: (a) the leakage regions containing nodes 65 and 281; (b) the leakage regions containing nodes 132 and 406.

Detailed description

The technical solution of the invention is described clearly and completely below with reference to the accompanying drawings.

As shown in Figure 1, a hydraulic model of the water supply network is built from the principles of flow and pressure balance. Assume that l nodes leak simultaneously and that the leakage feature (the difference between the sensor values before and after the leak) is $\Delta S_l$. The same leakage coefficient C is added to each node of an already identified leakage region to generate a leakage change matrix, and k-means clustering splits the region into two clusters based on that matrix. Leak events are then generated at random through hydraulic simulation, and the simulation results serve as training samples for the classifier. The samples are used to train a feature selection model, and the features are ranked by importance and filtered. Each leakage combination is used as a class label of the classifier, and a random forest classifier is trained on the selected features. If the final identification accuracy of the model (Acc, the product of the training-model identification accuracies of all iterations so far) is greater than 95%, the leakage feature $\Delta S_l$ is input into the trained random forest classifier, the leakage regions and the number of leaking nodes in each are output, and the next iteration begins; if the final identification accuracy is less than 95%, the iteration stops.

Example

Step 1. Build a hydraulic model of the water supply network from the principles of flow and pressure balance; the invention uses EPANET to build the model. As shown in Figure 2, the network consists of one reservoir (1), one storage tank (2), 375 nodes (3) and 469 pipe sections (4). As shown in Figure 3, the measuring points consist of 3 flow meters (Q1-Q3) and 18 pressure measuring points (P1-P18). The sensor measurements are assumed to be corrupted by uniform zero-mean Gaussian error with an amplitude of 1.5% of the mean residual of the sensor samples. The base water demand is 148 L/s, the maximum demand is 162.8 L/s, and the minimum demand is 118.4 L/s. A leak that is too small does not change the sensor readings, while one that is too large is noticed by residents first, so the invention selects leakage coefficients between 0.5 and 1.2.

Step 2. The present invention assumes two leak types in the water supply network: a single leaking node, and two nodes leaking simultaneously.

Step 3. When a leak alarm is raised in the water supply network, the initial iteration starts from the entire network.

Step 4. The leakage area is partitioned by clustering, features are selected, and the multi-level random forest classifier model is trained. The random forest training model is shown in Figure 4.

Single-node leak

When a 4 L/s leak occurs at node 101, each leak sample is generated by selecting a single node from the leakage area identified in the previous iteration.

One of the identified leakage areas is selected, and a leakage coefficient of C = 0.8 is added at random to the nodes in that area to generate the leakage change matrix. k-means then clusters the leakage area into two parts.
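The clustering step can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the leakage change matrix has one row per candidate node and one column per sensor, and the six rows of toy values are invented for the example.

```python
import random

def kmeans_two(rows, iters=50, seed=0):
    """Split rows (equal-length lists of floats) into two clusters; returns a 0/1 label per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, 2)]  # two rows as initial centers
    labels = None
    for _ in range(iters):
        # Assignment step: each row goes to the nearest center (squared Euclidean distance).
        new_labels = []
        for r in rows:
            d = [sum((a - b) ** 2 for a, b in zip(r, c)) for c in centers]
            new_labels.append(0 if d[0] <= d[1] else 1)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy leakage change matrix: sensor changes for six hypothetical candidate nodes.
change_matrix = [
    [0.10, 0.20], [0.15, 0.22], [0.12, 0.18],
    [2.00, 2.10], [2.20, 1.90], [1.90, 2.00],
]
print(kmeans_two(change_matrix))
```

Nodes whose simulated leaks move the sensors in a similar way end up in the same sub-area, which is what lets the next classifier distinguish the two sub-areas.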

In each iteration, 30 leak samples are generated at random within the leakage coefficient range for every node of the leakage area identified in the previous iteration. In the first iteration the area is the whole network of 375 nodes, which gives 11,250 independent leak samples. As the iterations proceed, the candidate leakage area shrinks and the number of leak samples used for model training decreases.
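The sample count can be checked with a short sketch. The uniform distribution over the coefficient range and the node IDs 1 to 375 are assumptions made here for illustration; the text only says the samples are drawn at random within the range. Each (node, coefficient) pair would then be run through the hydraulic model to obtain sensor readings, which is not shown.

```python
import random

def generate_single_leak_samples(nodes, samples_per_node=30, c_range=(0.5, 1.2), seed=0):
    """Return (node, leakage coefficient) pairs, samples_per_node per candidate node."""
    rng = random.Random(seed)
    return [
        (node, rng.uniform(*c_range))  # assumed uniform draw within the coefficient range
        for node in nodes
        for _ in range(samples_per_node)
    ]

whole_network = list(range(1, 376))  # hypothetical node IDs 1..375
samples = generate_single_leak_samples(whole_network)
print(len(samples))  # 11250
```

With a candidate area of 50 nodes the same function yields 1,500 samples, so the training set shrinks in proportion to the area.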

Feature selection reduces the features needed to identify the leak at each iteration. In Figure 5, for example, iteration 2 needs at most 9 sensors and the whole run needs 10 distinct sensors; both are fewer than the 21 sensors installed.
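One way to read the pruning rule (remove features from least to most important as long as model accuracy does not decrease) is sketched below. The stopping rule at the first harmful removal, the sensor names, and the toy accuracy function are assumptions for illustration; in practice `accuracy_of` would retrain the random forest on the given features.

```python
def prune_features(ranked_features, accuracy_of):
    """ranked_features: least important first. accuracy_of: callable(list of features) -> float."""
    kept = list(ranked_features)
    best = accuracy_of(kept)
    for feature in ranked_features:
        if len(kept) == 1:
            break  # always keep at least one feature
        trial = [f for f in kept if f != feature]
        acc = accuracy_of(trial)
        if acc < best:
            break  # removing this feature hurts accuracy, so stop pruning
        kept, best = trial, acc
    return kept

def toy_accuracy(features):
    """Hypothetical stand-in for retraining: only three sensors carry information."""
    useful = {"P3", "P7", "Q1"}
    return len(useful & set(features)) / 3

ranking = ["P12", "P5", "Q2", "P7", "Q1", "P3"]  # least important first (made-up ranking)
print(prune_features(ranking, toy_accuracy))  # ['P7', 'Q1', 'P3']
```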

When a 4 L/s leak occurs at node 101, the iteration proceeds as shown in Figure 5. The recognition accuracies of the first four sub-iterations are 99.99%, 99.82%, 99.25%, and 97.35%, so the final recognition accuracy at the fourth iteration is 99.99% × 99.82% × 99.25% × 97.35% = 96.44%. The fifth sub-iteration has an accuracy of 96.67%, which gives a final recognition accuracy of 96.44% × 96.67% = 93.22% < 95%, so the iteration terminates. As shown in Figure 6(b), the area (Z1) containing leaking node 101 (5) is determined. The total number of iterations is 4.
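The stopping rule can be reproduced with a few lines of Python using the accuracies quoted above. The function name is hypothetical, and treating a cumulative accuracy of exactly 95% as acceptable is an assumption, since the text only covers the greater-than and less-than cases.

```python
def accepted_iterations(accuracies, threshold=0.95):
    """Return the cumulative accuracy after each accepted iteration."""
    cumulative = 1.0
    history = []
    for acc in accuracies:
        if cumulative * acc < threshold:
            break  # this iteration would push the product below the threshold
        cumulative *= acc
        history.append(cumulative)
    return history

# Sub-iteration accuracies for the node 101 case.
history = accepted_iterations([0.9999, 0.9982, 0.9925, 0.9735, 0.9667])
print(len(history), round(history[-1] * 100, 2))  # 4 96.44
```

Because the product can only shrink, each extra iteration trades a smaller leakage area for lower overall confidence, and the 95% threshold fixes where that trade stops.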

In the traditional method, the total number of clusters is specified in advance and the final number is settled by trial calculation. It needs markedly more trial calculations than the present method, and the present invention analyzes the leakage area more specifically. With the traditional method, taking a recognition accuracy of 95% as the training criterion, the total number of identifiable areas is 18. If the traditional method starts from 1 and adds one area per clustering trial, the total number of calculations is 17. For the iterative method, Table 2 shows that the number of iterations is 4, fewer than the trial count of the traditional method. The traditional method also identifies the leakage area well, but the area it identifies as containing leaking node 101 (5), Z2 (Figure 6(a)), is larger than the area Z1 found by the present method. By contrast, the proposed method both reduces the total number of trial calculations and shrinks the identifiable area by eliminating leak-free areas at each iteration.

Two nodes leaking simultaneously

One two-node leak combination occurs at nodes 65 and 281, with a leak flow of 3.6 L/s at node 65 and 2.9 L/s at node 281. The other occurs at nodes 132 and 406, with a leak flow of 3.0 L/s at node 132 and 4.7 L/s at node 406.

For each iteration, one of the identified leakage areas is selected, and a leakage coefficient of C = 0.8 is added at random to the nodes in that area to generate the leakage change matrix. k-means clustering then splits the leakage area into two parts.

For each leak-node combination, 30 leakage-coefficient combinations are generated within the leakage coefficient range. As before, the first iteration generates 11,250 independent leak samples at random. The sample size per iteration decreases as the number of nodes in the leakage area decreases: if the leakage area identified in the previous iteration has 375 nodes, 11,250 leak samples are generated, and by the same proportion an area of 50 nodes yields 1,500 leak samples. The identification process of the iterative method for two simultaneously leaking nodes is shown in Figures 7 and 8, which show different identification results.

For two simultaneous leaks, feature selection also removes unnecessary features when the model is trained at each iteration, but its benefit relative to the single-node case diminishes as the iterations proceed. In Figure 7, iteration 1 needs 5 sensors and iteration 4 needs 13; in Figure 8, iteration 1 needs 5 sensors and iteration 4 needs 12. Over the whole run, Figure 7 needs 13 distinct sensors and Figure 8 needs 17. A leak combination whose nodes are close together needs fewer sensors in total than one whose nodes are far apart. In both cases the number of selected sensors is below the total number of sensors, but the reduction is smaller than for a single-node leak.

As shown in Figure 7, when nodes 65 and 281 leak, the recognition accuracies of the first four sub-iterations are 99.41%, 98.83%, 98.57%, and 99.76%, which gives a final recognition accuracy of 99.41% × 98.83% × 98.57% × 99.76% = 96.61%. The fifth sub-iteration has an accuracy of 97.22%, so the final recognition accuracy becomes 96.61% × 97.22% = 93.92% < 95% and the iteration stops. The leakage area and the number of leaks inside it are thereby determined.

As shown in Figure 8, when nodes 132 (9) and 406 (8) leak, the accuracies of the first seven sub-iterations are 99.41%, 99.52%, 99.64%, 99.78%, 98.88%, 99.36%, and 99.67%, which gives a final recognition accuracy of 96.32%. The eighth iteration has a recognition accuracy of 97.12%, so the final recognition accuracy becomes 96.32% × 97.12% = 93.55% < 95% and the iteration stops. The leakage areas and the number of leaks inside each are thereby determined.

As shown in Figure 9(a), when nodes 65 (6) and 281 (7) leak, the leakage area identified by the present invention is Z3; as shown in Figure 9(b), when nodes 132 (9) and 406 (8) leak, the leakage areas identified are Z4 and Z5. Nodes 65 (6) and 281 (7) are close together, so they are hard to tell apart. Nodes 132 (9) and 406 (8) are farther apart, so their leakage features are more distinct and easier to separate than those of nodes 65 (6) and 281 (7), and the run for nodes 132 (9) and 406 (8) therefore goes through more iterations.

The present invention proposes a method for identifying leakage areas in a water supply network based on iterative machine learning. In each iteration, k-means first splits one of the already identified leakage areas into two parts. The combination types of all leaking nodes for the iteration are then determined, and each combination type serves as one class label of the random forest classifier. Next, leakage coefficients are added at random to nodes in the leakage areas according to the combination types to generate leak samples, and the classifier model is trained on those samples. If the trained model meets the iteration criterion, the leakage feature is fed in and the next iteration proceeds; otherwise the iteration ends and the leakage areas and the number of leaking nodes in each are output. The method was applied to a water supply network and its performance was evaluated. The results show that the method identifies the leaking areas and the number of leaking nodes they contain well, and that it improves the efficiency and accuracy of leak detection.
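The class labels produced when one area is split in two can be enumerated as below. This is a sketch based on the enumeration given in the claim (0 leaking nodes in the first sub-area and all of them in the second, then 1 in the first, and so on); the count of n + 1 labels is inferred from that enumeration rather than read from the claim's formula image, and the function name is hypothetical.

```python
def split_labels(leaks_in_area: int):
    """All ways to distribute the area's leaking nodes over its two sub-areas."""
    return [(k, leaks_in_area - k) for k in range(leaks_in_area + 1)]

print(split_labels(1))  # [(0, 1), (1, 0)]
print(split_labels(2))  # [(0, 2), (1, 1), (2, 0)]
```

For a single leak the classifier therefore chooses between two labels, and for two leaks in the same area between three.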

Claims (1)

1. A method for identifying leakage areas in a water supply network based on iterative machine learning, characterized in that the method comprises the following steps:
(1) l nodes leak simultaneously, and the leakage feature is ΔS_l;
(2) the leakage areas identified in the (β-1)-th iteration are [Figure FDA0002477495410000011], and each leakage area contains [Figure FDA0002477495410000012] leaking nodes, i = 1, 2, …, w;
(3) one leakage area [Figure FDA0002477495410000013] is selected from the w leakage areas; leakage area [Figure FDA0002477495410000014] contains [Figure FDA0002477495410000015] leaking nodes; the same leakage coefficient C is added to each node in leakage area [Figure FDA0002477495410000016] to generate the leakage change matrix [Figure FDA0002477495410000017];
(4) according to the leakage change matrix [Figure FDA0002477495410000018], k-means clustering splits leakage area [Figure FDA0002477495410000019] into two clusters, area [Figure FDA00024774954100000110] and area [Figure FDA00024774954100000111]; the remaining (w-1) unclustered areas and the numbers of leaking nodes they contain are unchanged, so the β-th iteration has (w+1) areas in total;
(5) the combination types of leaking nodes for the β-th iteration are generated; for the (w-1) unclustered areas, the number of leaking nodes inside each area stays unchanged; for area [Figure FDA00024774954100000113] and area [Figure FDA00024774954100000114], which contain [Figure FDA00024774954100000112] leaking nodes, there are [Figure FDA00024774954100000115] leak combinations in all, namely: 0 leaking nodes in [Figure FDA00024774954100000116] [Figure FDA00024774954100000117] leaking nodes in area [Figure FDA00024774954100000118] 1 leaking node in [Figure FDA00024774954100000119] leaking nodes in area [Figure FDA00024774954100000120] leaking nodes in [Figure FDA00024774954100000121] 0 leaking nodes in area [Figure FDA00024774954100000122]; this iteration therefore has [Figure FDA00024774954100000123] different labels in total;
(6) the leak samples for the β-th iteration are generated; [Figure FDA00024774954100000125] nodes are selected at random from area [Figure FDA00024774954100000124] and [Figure FDA00024774954100000127] nodes from area [Figure FDA00024774954100000126]; for the (w-1) unclustered areas, [Figure FDA00024774954100000129] nodes are selected from each area [Figure FDA00024774954100000128], [Figure FDA00024774954100000130]; this yields l simultaneously leaking nodes, recorded as one leak sample; ε different leak samples are generated in total; the set of leak samples is denoted T, and the total number of features is N_PQ, comprising N_P pressure sensors and N_Q flow sensors, where N_PQ = N_P + N_Q;
(7) the hydraulic simulation results of the leak events serve as the training samples of the classifier; MDA (mean decrease in accuracy) is used to compute the importance of the classifier model's features, the features are then ranked by importance, and features are removed in order from least to most important on the principle that the accuracy of the trained model does not decrease; the MDA computation consists of training a random forest classifier and computing each feature's mean decrease in accuracy;
the random forest classifier is trained as follows:
(a) the leakage areas identified in the (β-1)-th iteration are [Figure FDA0002477495410000021], and each leakage area contains [Figure FDA0002477495410000022] leaking nodes, i = 1, 2, ..., w; define [Figure FDA0002477495410000023], where l is the total number of leaking nodes; then, according to the leakage matrix [Figure FDA0002477495410000025] of leakage area [Figure FDA0002477495410000024], leakage area [Figure FDA0002477495410000026] is clustered into two parts, area [Figure FDA0002477495410000027] and area [Figure FDA0002477495410000028], and the combination of leakage areas is [Figure FDA0002477495410000029]; for area [Figure FDA00024774954100000211] and area [Figure FDA00024774954100000212], which contain [Figure FDA00024774954100000210] leaking nodes, there are [Figure FDA00024774954100000213] leak combinations in all, while the numbers of leaking nodes in the other (w-1) unclustered areas stay unchanged, so this iteration has [Figure FDA00024774954100000214] random forest classifier labels; the combination types of leaking nodes are: 0 leaking nodes in area [Figure FDA00024774954100000215] [Figure FDA00024774954100000216] leaking nodes in area [Figure FDA00024774954100000217] 1 leaking node in [Figure FDA00024774954100000218] leaking nodes in area [Figure FDA00024774954100000219] [Figure FDA00024774954100000220] leaking nodes in area [Figure FDA00024774954100000221] 0 leaking nodes in area [Figure FDA00024774954100000222]; [Figure FDA00024774954100000224] nodes are selected at random from area [Figure FDA00024774954100000223], [Figure FDA00024774954100000226] nodes from area [Figure FDA00024774954100000225], and [Figure FDA00024774954100000228] nodes from area [Figure FDA00024774954100000227], where [Figure FDA00024774954100000229]; this yields l simultaneously leaking nodes, recorded as one leak sample; suppose there are ε different leak samples; the set of leak samples is denoted T, and the total number of features is N_PQ, comprising N_P pressure sensors and N_Q flow sensors, N_PQ = N_P + N_Q;
(b) training subsets T_tr1, T_tr2, …, T_trM are created by sampling with replacement from the leak samples, where M is the number of classification trees; the training subset T_tri is drawn at random from the leak samples T, and the training subset of each classification tree has size ε, the same as the full set of leak samples, so a training subset contains repeated samples; the part of the leak samples T that is not drawn is called the out-of-bag (OOB) data and is used to evaluate the accuracy of each tree; because the accuracy of a random forest classifier increases with the number of classification trees and levels off at a constant, the default value M = 500 is chosen;
(c) for each node of the training subset T_tri, 1 ≤ i ≤ 500, m sub-features are selected at random from the N_PQ features to build the classification tree, 1 ≤ m ≤ N_PQ; the default value of m is [Figure FDA00024774954100000230], and it is used to compute the Gini index of each feature; given the training subset T_tri and continuous features N_D, D = 1, 2, ..., m, the training subset T_tri has f classes of samples, [Figure FDA00024774954100000231] and class h contains |f_h| samples; feature N_D takes r distinct values, which are sorted in ascending order and labeled R = {R_1, R_2, …, R_r}; a split point t divides N_D into two subsets [Figure FDA0002477495410000031] and [Figure FDA0002477495410000032], where [Figure FDA0002477495410000033] contains the samples with values not greater than t and [Figure FDA0002477495410000034] [Figure FDA0002477495410000035] contains the samples with values greater than t; for adjacent values R_e and R_(e+1), every split point within [R_e, R_(e+1)] produces the same partition, so there are (r-1) candidate split points; the Gini index of the training subset T_tri at split point t of the feature is [Figure FDA0002477495410000036], where p_f is the probability that a sample belongs to class f and [Figure FDA0002477495410000037] is the probability that a randomly selected sample is misclassified; the larger the Gini index, the more likely a sample is to be misclassified; the split point t with the smallest Gini index and the corresponding feature are selected for the split, and each branch is then built by repeating the above process;
(d) every classification tree contributes to the recognition accuracy on the training subset T_tri, so the recognition accuracy of the trained model is the average recognition accuracy of the 500 classification trees, [Figure FDA0002477495410000038];
MDA is computed as follows: for each classification tree T_tri of the random forest classifier model, 1 ≤ i ≤ 500, the OOB error is computed from the OOB data and denoted OOB_b1; the out-of-bag sample values of feature F, 1 ≤ F ≤ N_PQ, are randomly permuted and the out-of-bag error is computed again, denoted OOB_b2; with 500 trees in the forest, the importance of feature F is expressed as: [Figure FDA0002477495410000039];
(8) the [Figure FDA00024774954100000310] combination types of leaking nodes are used as the class labels of the classifier model, and the random forest classifier is trained on the selected features; if the final recognition accuracy Acc_β of the model at the β-th iteration, that is, the product of the recognition accuracies of the models trained at each iteration, [Figure FDA00024774954100000311], is greater than 95%, the leakage feature is fed to the trained random forest classifier and the next iteration proceeds; if the final recognition accuracy is less than 95%, the iteration stops and the identified leakage areas and the number of leaking nodes in each leakage area are output.
CN202010369142.0A 2020-05-02 2020-05-02 Water supply pipe network leakage area identification method based on iterative machine learning Active CN111553811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010369142.0A CN111553811B (en) 2020-05-02 2020-05-02 Water supply pipe network leakage area identification method based on iterative machine learning

Publications (2)

Publication Number Publication Date
CN111553811A true CN111553811A (en) 2020-08-18
CN111553811B CN111553811B (en) 2022-09-20

Family

ID=72001791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010369142.0A Active CN111553811B (en) 2020-05-02 2020-05-02 Water supply pipe network leakage area identification method based on iterative machine learning

Country Status (1)

Country Link
CN (1) CN111553811B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160356666A1 (en) * 2015-06-02 2016-12-08 Umm Al-Qura University Intelligent leakage detection system for pipelines
CN107506781A (en) * 2017-07-06 2017-12-22 浙江工业大学 A kind of Human bodys' response method based on BP neural network
JP2019028839A (en) * 2017-08-01 2019-02-21 国立研究開発法人情報通信研究機構 Classifier, classifier learning method, classifier classification method
WO2020041204A1 (en) * 2018-08-18 2020-02-27 Sf17 Therapeutics, Inc. Artificial intelligence analysis of rna transcriptome for drug discovery

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115389121A (en) * 2022-08-24 2022-11-25 中国核动力研究设计院 Electric valve leakage mode identification method and device based on random forest
CN115389121B (en) * 2022-08-24 2023-10-24 中国核动力研究设计院 Electric valve leakage mode identification method and device based on random forest
CN115628776A (en) * 2022-10-25 2023-01-20 杭州电子科技大学 Water supply pipe network abnormal data detection method
CN116797051A (en) * 2023-08-24 2023-09-22 青岛海洋地质研究所 Ocean carbon leakage point number evaluation method based on multi-distance spatial cluster analysis
CN116797051B (en) * 2023-08-24 2023-11-14 青岛海洋地质研究所 Ocean carbon leakage point number evaluation method based on multi-distance spatial cluster analysis
EP4597063A1 (en) * 2024-01-30 2025-08-06 Grohe AG System and method for detecting leakages in a fluid-bearing structure

Also Published As

Publication number Publication date
CN111553811B (en) 2022-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant