CN109816033A

CN109816033A - A method for identifying station-area users based on optimized supervised learning

Info

Publication number: CN109816033A
Application number: CN201910095251.5A
Authority: CN
Inventors: 唐明; 何仲潇; 王剑; 王枭; 汪晓华
Original assignee: Sichuan Energy Internet Research Institute EIRI Tsinghua University
Current assignee: Sichuan Energy Internet Research Institute EIRI Tsinghua University
Priority date: 2019-01-31
Filing date: 2019-01-31
Publication date: 2019-05-28

Abstract

The invention relates to the field of data analysis and discloses a method for identifying station-area users based on optimized supervised learning. The method includes: determining the users whose station-user topology is known, together with the station area and phase to which each belongs; assigning labels to the user data according to that station area and phase; establishing a training set, a validation set, and a test set; determining the parameter k of the KNN model by cross-validation and completing the training of the model; and using the trained model and the determined k to classify the voltage data to be identified, thereby identifying the station-area users. The invention converts the task from unsupervised to supervised learning, sets up the training, validation, and test sets in a reasonable way, and determines k by cross-validation, so that the station area and phase to which a user belongs are identified accurately and effectively. This thoroughly resolves the problem of attributing users across station areas and lays a foundation for guiding the operation, maintenance, emergency repair, technical upgrading, and planning of low-voltage station areas.

Description

A method for identifying station-area users based on optimized supervised learning

Technical Field

The invention relates to the field of data analysis, and in particular to a method for identifying station-area users based on optimized supervised learning.

Background Art

Accurate basic station-area records are an important foundation for a series of advanced applications such as station-area line-loss analysis, distribution-network fault location, dispatch of emergency repair work orders, and three-phase imbalance analysis. However, because China's power system started late and its early development planning was incomplete, distribution transformers are currently scattered and distribution lines are intricate. In addition, over many years of operation, grid companies have lost records, updated them late, or kept them incomplete, so station-area records are often inaccurate: for a small number of end users, the actual transformer-to-user connection does not match the station-area records. Disordered records prevent many advanced applications from being carried out effectively and seriously hinder grid companies' construction of the smart grid. An efficient, stable, and accurate station-area topology identification method is therefore urgently needed to lay a foundation for guiding the operation, maintenance, emergency repair, technical upgrading, and planning of low-voltage station areas.

Traditional station-area user identification falls into two categories: manual identification and the use of dedicated station-area identification equipment. Manual identification relies on utility personnel visiting the site and checking the attribution of users household by household, which is time-consuming, laborious, and extremely inefficient. Dedicated equipment mainly consists of station-area user identifiers, most of which identify station-area information based on whether direct power-line carrier communication is possible, or on current pulse technology. Carrier signals propagate to neighboring station areas through common-ground, common-high-voltage, and parallel-wiring coupling; although attenuated, they can still reach nearby meters under an adjacent transformer, so the "cross-station-area" crosstalk problem remains. User identification based on a hybrid of power-line carrier and pulse carrier solves the crosstalk caused by a shared high-voltage side, a shared ground, and shared cable trenches, but it still requires manual measurement, and the use of current clamps may pose safety hazards, so it cannot meet the needs of intelligent distribution station areas.

In recent years, the rapid development of Internet of Things technology has opened an uplink channel for massive smart-meter data, giving grid companies access to large volumes of high-density data. Some researchers fuse and statistically analyze the electrical parameters of station-area distribution transformers and of user terminals to identify station-area users. The existing techniques fall into two main types:

1. Using smart-meter measurements, the similarity between the data measured at each user's meter and the low-voltage-side data of each transformer is computed, and the transformer with the highest similarity determines the user's station area and phase. In some cases, however, the similarities differ only slightly and are hard to tell apart.

2. Exploiting the high correlation among voltage data from energy-metering devices in the same station area, the k-means algorithm clusters the user voltage data to identify station-area users (see published patent application CN106156792A). Clustering, however, is unsupervised learning: it looks for natural groupings of the observed samples based on the internal structure of the data. When data quality is low, identification accuracy is low and the results are unreliable.

Summary of the Invention

The technical problem to be solved by the invention is as follows. In view of the problems above, and considering that the State Grid has already determined the station area and phase of some users through traditional identification methods, these users can serve as training samples so that the users to be identified are classified by supervised learning. The invention provides a method for identifying station-area users based on optimized supervised learning, which improves the accuracy and efficiency of identification while reducing hardware and labor costs, and lays a good foundation for guiding the operation, maintenance, emergency repair, technical upgrading, and planning of low-voltage station areas.

The technical solution adopted by the invention is as follows:

A method for identifying station-area users based on optimized supervised learning, comprising the following steps:

Step S1: acquire the low-voltage-side voltage data of the station-area transformers and the voltage data of the user meters to be identified;

Step S2: preprocess the acquired voltage data;

Step S3: determine the users whose station-user topology is known, together with the station area and phase to which each belongs; determine the label of each user's data according to that station area and phase; establish a training set, a validation set, and a test set; determine the parameter k of the KNN model by cross-validation; and complete the training of the model;

Step S4: use the trained model and the determined k to classify the voltage data of the users to be identified, thereby identifying the station-area users, and output the identification result.

Further, in step S1 the low-voltage-side voltage data of the l station-area transformers comprise, for the f-th transformer at the n-th sampling instant, the phase-A, phase-B, and phase-C low-voltage-side voltages; the voltage data of the m user meters to be identified comprise, for the f-th meter at the n-th sampling instant, the meter voltage.

Further, in step S2, when the dimensionality of the data is large, dimensionality reduction is applied so that the multidimensional data are reduced to a few principal components for analysis, improving computational efficiency; when the dimensionality is small, the data are processed directly without reduction.
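
The reduction to principal components can be illustrated with a minimal sketch. It extracts only the first principal component by power iteration, which is one possible way to do this and is not stated in the source; the function names `first_principal_component` and `project` are hypothetical, and rows are assumed to be meters or transformer phases with one column per sampling instant.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (column means, dominant principal direction) of `rows`.

    rows: list of equal-length lists, at least two rows.
    Power iteration is an illustrative choice, not the patent's stated algorithm.
    """
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    vec = [1.0] * dim  # deterministic start; must not be orthogonal to the answer
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # the data have no variance along the current vector
        vec = [x / norm for x in nxt]
    return means, vec

def project(rows, means, vec):
    """Score of each row along the principal direction (one value per row)."""
    return [sum((r[d] - means[d]) * vec[d] for d in range(len(vec))) for r in rows]

# Made-up two-dimensional data lying exactly on the line y = 2x.
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, vec = first_principal_component(rows)
scores = project(rows, means, vec)
```

Keeping several components would require repeating the procedure on the residual data (deflation) or using a linear-algebra library; the sketch leaves that out.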

Further, determining the parameter k of the KNN model by cross-validation and completing the training of the model in step S3 specifically comprises the following steps:

Step S3.1: select part of the user voltage data with known station-user relationships and the corresponding labels, together with the transformer voltage data and the corresponding labels, as the training set; select another part of the user voltage data with known station-user relationships and the corresponding labels as the validation set; and use the remaining user voltage data with known station-user relationships and the corresponding labels as the test set;

Step S3.2: with the training data and labels known, determine the distance metric, input the validation data, traverse all candidate values of k, classify the user voltage data in the validation set with the optimized KNN model, evaluate the classification accuracy on the validation set for each k, and select the k with the highest accuracy as the input parameter;

Step S3.3: judge whether the k determined in the previous step satisfies the predetermined target condition. If it does, continue to the next step; if it does not, return to step S3.2 and additionally classify the test-set data with the trained model to further confirm that k is reasonable.
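
The sweep over candidate values of k in step S3.2 can be sketched as follows. The classifier is passed in as a callable `classify(k, sample)`, a hypothetical interface chosen so the sketch stays independent of any particular KNN implementation; `select_k` and the label strings are likewise illustrative names.

```python
def select_k(classify, validation, k_values):
    """Evaluate each candidate k on the validation set.

    classify(k, sample) -> predicted label (hypothetical callable).
    validation: list of (sample, true_label) pairs.
    Returns (best accuracy, list of k values reaching it, accuracy per k).
    """
    accuracy_by_k = {}
    for k in k_values:
        hits = sum(1 for sample, label in validation
                   if classify(k, sample) == label)
        accuracy_by_k[k] = hits / len(validation)
    best = max(accuracy_by_k.values())
    best_ks = [k for k in k_values if accuracy_by_k[k] == best]
    return best, best_ks, accuracy_by_k

# Stub classifier for illustration: only k = 3 labels every sample correctly.
truth = ["T1-A", "T1-B", "T2-A", "T2-A"]
validation = [(i, label) for i, label in enumerate(truth)]

def stub_classify(k, sample):
    return truth[sample] if (k == 3 or sample < 2) else "wrong"

best, best_ks, accuracy_by_k = select_k(stub_classify, validation, [1, 3, 5])
```

Returning every k that reaches the best accuracy, rather than a single value, makes it easy to see whether the maximum is shared by several candidates.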

Further, in step S3.1 the training, validation, and test sets account for 80%, 10%, and 10% of the total data, respectively.
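
A minimal sketch of such an 80/10/10 split is given below. The random shuffle, the fixed seed, and the function name `split_dataset` are assumptions made for illustration; the source does not say how the samples are assigned to the three sets.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle labelled samples and cut them into train/validation/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

samples = list(range(100))  # stand-ins for labelled user voltage records
train, val, test = split_dataset(samples)
```

In the method described here, the labelled transformer voltage data would additionally be placed in the training set, since the validation and test sets contain user data only.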

Further, classifying the user voltage data in the validation set with the optimized KNN model specifically comprises:

S3.2.1: compute the distance between the validation sample and each training sample, using the chosen distance metric;

S3.2.2: sort the training samples in increasing order of distance;

S3.2.3: select the k training samples with the smallest distances;

S3.2.4: count how often each class occurs among these k samples;

S3.2.5: return the most frequent class among the k samples as the predicted class of the validation sample.
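
Steps S3.2.1 to S3.2.5 can be sketched as a short function. The Euclidean distance (`math.dist`) is used here only as a default placeholder; any other distance function with the same two-argument form can be passed in. The name `knn_predict`, the label strings, and the voltage values are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k, distance=math.dist):
    """Classify `query` by majority vote among its k nearest training samples.

    train: list of (vector, label) pairs.
    """
    # S3.2.1 and S3.2.2: compute distances and sort in increasing order.
    ranked = sorted(train, key=lambda item: distance(item[0], query))
    # S3.2.3: keep the k nearest samples.
    nearest = ranked[:k]
    # S3.2.4: count the classes among them.
    counts = Counter(label for _, label in nearest)
    # S3.2.5: return the most frequent class.
    return counts.most_common(1)[0][0]

train = [
    ([220.0, 221.0], "T1-A"),
    ([220.5, 221.2], "T1-A"),
    ([230.0, 229.0], "T2-B"),
    ([230.4, 229.5], "T2-B"),
    ([229.8, 229.1], "T2-B"),
]
label_1 = knn_predict(train, [220.2, 221.1], k=3)
label_2 = knn_predict(train, [230.1, 229.2], k=3)
```

When two classes are equally frequent, `Counter.most_common` returns the one encountered first, which here is the class of the nearer sample; the source does not specify a tie-breaking rule.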

Further, the distance metric is one of the correlation coefficient, cosine similarity, and Euclidean distance.

Further, the distance metrics are defined as follows:

When the distance $L_{pq}$ is defined by cosine similarity,

$$L_{pq} = 1 - \frac{z_p z_q'}{\sqrt{(z_p z_p')(z_q z_q')}}$$

where $z_p'$ is the transpose of the row vector $z_p$ and $z_q'$ is the transpose of the row vector $z_q$.

When the distance $L_{pq}$ is defined by the correlation coefficient,

$$L_{pq} = 1 - \frac{(z_p - \bar{z}_p\mathbf{1})(z_q - \bar{z}_q\mathbf{1})'}{\sqrt{(z_p - \bar{z}_p\mathbf{1})(z_p - \bar{z}_p\mathbf{1})'}\,\sqrt{(z_q - \bar{z}_q\mathbf{1})(z_q - \bar{z}_q\mathbf{1})'}}$$

where $\bar{z}_p$ and $\bar{z}_q$ are the means of the entries of $z_p$ and $z_q$, and $\mathbf{1}$ is the unit row vector (all entries equal to 1).

Further, in step S3.3, k is judged against the predetermined target condition; k is considered not to satisfy the condition when at least one of the following two situations occurs:

(1) the highest accuracy corresponds to several values of k;

(2) the highest accuracy corresponds to only one value of k;

In that case the test set is used in step S3.3 to further confirm that k is reasonable and unique.
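
One way to carry out this confirmation is sketched below, under stated assumptions: the accuracies per k on the validation and test sets are already available as dictionaries, the candidate with the highest test accuracy is preferred, and a remaining tie goes to the smaller k. The preference rule and the name `confirm_k` are assumptions; the source only says that the test set is used to confirm k.

```python
def confirm_k(val_accuracy, test_accuracy):
    """Choose k among the validation-best candidates using test accuracy.

    val_accuracy, test_accuracy: dicts mapping k -> accuracy in [0, 1].
    Returns (chosen k, list of validation-best candidates).
    """
    best_val = max(val_accuracy.values())
    candidates = [k for k, acc in val_accuracy.items() if acc == best_val]
    # Higher test accuracy first; smaller k breaks any remaining tie (assumption).
    chosen = min(candidates, key=lambda k: (-test_accuracy[k], k))
    return chosen, candidates

# Made-up accuracies: k = 3 and k = 5 tie on the validation set.
val_accuracy = {1: 0.90, 3: 0.95, 5: 0.95}
test_accuracy = {1: 0.99, 3: 0.90, 5: 0.93}
chosen, candidates = confirm_k(val_accuracy, test_accuracy)
```

Note that k = 1 is not chosen despite its high test accuracy, because only candidates that were best on the validation set are considered.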

Compared with the prior art, the technical solution above has the following beneficial effects:

1. The method uses the KNN algorithm to identify station-area users, converting the task from unsupervised to supervised learning. It avoids the unreliable identification, low efficiency, and safety hazards of traditional methods, reduces hardware and labor costs, and yields more reliable and more accurate results.

2. The method sets up the training, validation, and test sets in a reasonable way and determines the parameter k of the KNN model by cross-validation, which further improves the performance of the algorithm and the accuracy of station-area user identification.

3. The method measures the distance between users with the correlation coefficient and cosine similarity, which better reflect the consistent voltage trends of a transformer phase and the users on the same phase of the same station area. This further improves identification accuracy, so that the station area and phase to which a user belongs are identified accurately and effectively, the problem of attributing users across station areas is thoroughly resolved, and a foundation is laid for guiding the operation, maintenance, emergency repair, technical upgrading, and planning of low-voltage station areas.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the topology of a typical station-area transformer and its user meters.

Figure 2 is a flowchart of the method of the invention for identifying station-area users based on optimized supervised learning.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Figure 1 shows the topological connection between a typical station-area transformer and its user meters. Users in a distribution station area are supplied in a radial topology. Because the system load and operating state differ from moment to moment, the voltage at each user fluctuates. Since a station-area transformer and the meters of users on the same phase are electrically connected, the user-side voltage rises as the transformer's outlet voltage rises; the two are highly correlated and follow closely matching trends. In other words, users on the same phase of the same station area show very similar voltage fluctuation patterns, whereas users in different station areas are electrically far apart and their voltage fluctuations are much less similar.

Users on the same phase of the same station area thus have highly similar voltage fluctuation patterns, while users in different station areas do not. Given also that traditional identification methods have already determined the station area and phase of some users, unknown users can be classified with the KNN algorithm, achieving accurate identification of station-area users. That is, one only needs a certain amount of voltage data from the transformer low-voltage sides and from the meters of the users to be identified, plus the station area and phase of the known users, to identify the station-user topology by big-data analysis. This effectively addresses the unreliable identification, low efficiency, and safety hazards of traditional identification methods and of unsupervised learning methods.

Using the conventional KNN algorithm directly for station-user topology identification, however, still has the following drawbacks:

1. There is no fixed rule of thumb for choosing k.

Choosing a small k amounts to predicting from training instances in a small neighborhood: the training error decreases, and only training instances close or similar to the input instance influence the prediction, but the generalization error increases. In other words, a smaller k makes the overall model more complex and prone to overfitting. Choosing a large k amounts to predicting from training instances in a large neighborhood: the generalization error can decrease, but the training error increases, because training instances far from (dissimilar to) the input instance also influence the prediction and can cause errors. A larger k makes the overall model simpler.

2. Distance is usually measured with the Euclidean distance.

The Euclidean distance measures the absolute distance between points in a multidimensional space and reflects absolute differences in individual values. The invention, by contrast, rests on the consistency of the voltage trends of a transformer phase and the users on the same phase of the same station area, and emphasizes the consistency of the voltage fluctuation pattern. The Euclidean distance is therefore not well suited to this scenario.
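
The point can be illustrated with a small numerical sketch. The voltage values are made up, and the cosine and correlation distances are assumed to be one minus the corresponding similarity, a common convention that the surviving text does not spell out. A user whose voltage follows the transformer with a constant offset is far away in Euclidean terms but has a correlation distance of essentially zero.

```python
import math

def euclidean(p, q):
    """Absolute distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    """One minus the cosine similarity (assumed convention)."""
    dot = sum(a * b for a, b in zip(p, q))
    return 1.0 - dot / math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))

def correlation_distance(p, q):
    """One minus the Pearson correlation: cosine distance of centered series."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    return cosine_distance([a - mp for a in p], [b - mq for b in q])

# Made-up voltages (V) at five sampling instants.
transformer = [230.0, 231.5, 229.0, 232.0, 230.5]
same_phase_user = [v - 6.0 for v in transformer]     # same trend, constant drop
other_user = [230.5, 229.5, 231.0, 229.0, 230.0]     # similar level, opposite trend

d_euclid_same = euclidean(transformer, same_phase_user)
d_euclid_other = euclidean(transformer, other_user)
d_corr_same = correlation_distance(transformer, same_phase_user)
d_corr_other = correlation_distance(transformer, other_user)
```

With these numbers the Euclidean distance ranks the unrelated user as closer to the transformer, while the correlation distance ranks the same-phase user as closer, which is the behavior the argument above relies on.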

An embodiment of the invention therefore provides a method for identifying station-area users based on optimized supervised learning. As shown in Figure 2, the method includes the following steps:

Step a: acquire the low-voltage-side voltage data of the station-area transformers and the voltage data of the user meters to be identified. The voltage data of the l station-area transformers comprise, for the f-th transformer at the n-th sampling instant, the phase-A, phase-B, and phase-C low-voltage-side voltages; the voltage data of the m user meters to be identified comprise, for the f-th meter at the n-th sampling instant, the meter voltage.

Note that this embodiment uses voltage data for identification. In another embodiment, current data or other power-related data may be chosen according to the application, provided they can be used to identify the station-user topology; the invention is not limited in this respect.

Step b: preprocess the acquired data. In one embodiment, when the dimensionality of the data is large, a common dimensionality-reduction algorithm can be applied to the transformer and user voltage data, reducing the multidimensional voltage data to a few principal components for analysis and improving computational efficiency. In another embodiment, when the dimensionality is small, the data can be processed directly without reduction to improve accuracy.

Step c: determine the users whose station-user topology is known, together with the station area and phase to which each belongs, and determine the label of each user's data accordingly. Select part of the user voltage data with known station-user relationships and the corresponding labels (that is, the transformer and phase to which each user belongs), together with the transformer voltage data and labels, as the training set; select another part of the user voltage data with known station-user relationships and the corresponding labels as the validation set; and use the remaining user voltage data with known station-user relationships and the corresponding labels as the test set.

In one embodiment, the training, validation, and test sets account for 80%, 10%, and 10% of the data, respectively. In another embodiment, the proportions can be set differently according to the situation; for example, when the amount of data is very large, the shares of the validation and test sets can be reduced. The invention is not limited in this respect.

Step d: with the training data and labels known, determine the distance metric, input the validation data, traverse all candidate values of k, classify the user voltage data in the validation set with the optimized KNN model, evaluate the classification accuracy on the validation set for each k, and select the k with the highest accuracy as the input parameter.

Classifying the user voltage data in the validation set with the optimized KNN model specifically comprises:

d1: compute the distance between the validation sample and each training sample, using the chosen distance metric. The distance may be based on the correlation coefficient, cosine similarity, or Euclidean distance. In tests, the correlation coefficient performed better than cosine similarity, which in turn performed better than the Euclidean distance. The distances are defined as follows:

(a) When the distance $L_{pq}$ is defined by the Euclidean distance,

$$L_{pq} = \sqrt{\sum_{d=1}^{n'} (z_{pd} - z_{qd})^2}$$

where $n'$ is the data dimension of a sample, $z_{pd}$ is the d-th coordinate of the p-th row vector, and $z_{qd}$ is the d-th coordinate of the q-th row vector.

(b) When the distance $L_{pq}$ is defined by cosine similarity,

$$L_{pq} = 1 - \frac{z_p z_q'}{\sqrt{(z_p z_p')(z_q z_q')}}$$

where $z_p'$ is the transpose of the row vector $z_p$ and $z_q'$ is the transpose of the row vector $z_q$.

(c) When the distance $L_{pq}$ is defined by the correlation coefficient,

$$L_{pq} = 1 - \frac{(z_p - \bar{z}_p\mathbf{1})(z_q - \bar{z}_q\mathbf{1})'}{\sqrt{(z_p - \bar{z}_p\mathbf{1})(z_p - \bar{z}_p\mathbf{1})'}\,\sqrt{(z_q - \bar{z}_q\mathbf{1})(z_q - \bar{z}_q\mathbf{1})'}}$$

where $\bar{z}_p = \frac{1}{n'}\sum_{d=1}^{n'} z_{pd}$, $\bar{z}_q = \frac{1}{n'}\sum_{d=1}^{n'} z_{qd}$, and $\mathbf{1}$ is the unit row vector (all entries equal to 1).

d2: sort the training samples in increasing order of distance;

d3: select the k training samples with the smallest distances;

d4: count how often each class occurs among these k samples;

d5: return the most frequent class among the k samples as the predicted class of the validation sample.

Step e: judge whether the k determined in the previous step satisfies the predetermined target condition. If it does, continue to the next step; if it does not, return to step d and additionally test the trained model on the test set to further confirm that k is reasonable.

After a value of k has been determined from the validation data in step d, the following two situations may still occur:

(1) the highest accuracy corresponds to several values of k;

(2) the highest accuracy corresponds to only one value of k.

In that case k is considered not to satisfy the predetermined target condition, and the test set is used in step e to further confirm that k is reasonable and unique.

Step f: use the trained model and the determined k to classify the voltage data of the users to be identified, thereby identifying the station-area users, and return the identification result.

Finally, the users assigned to the same class as a given phase of a station-area transformer are the users that belong to that phase of that transformer.
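
As a small illustration of this last step, the predicted labels can be grouped so that each transformer phase lists the meters assigned to it. The label format `(transformer_id, phase)`, the meter identifiers, and the function name `group_by_transformer_phase` are hypothetical; the source does not prescribe how the result is stored.

```python
from collections import defaultdict

def group_by_transformer_phase(predictions):
    """Group meters by their predicted (transformer, phase) label.

    predictions: dict mapping meter_id -> (transformer_id, phase).
    Returns a dict mapping each label to a sorted list of meter ids.
    """
    groups = defaultdict(list)
    for meter_id, label in predictions.items():
        groups[label].append(meter_id)
    return {label: sorted(meters) for label, meters in groups.items()}

# Made-up classification output for three meters.
predictions = {"M3": ("T1", "A"), "M1": ("T1", "A"), "M2": ("T2", "C")}
membership = group_by_transformer_phase(predictions)
```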

The invention is not limited to the specific embodiments described above. It extends to any new feature or new combination of features disclosed in this specification, and to any new method or process step or new combination of steps disclosed. Insubstantial changes or improvements made by those skilled in the art without departing from the spirit of the invention fall within the scope of protection of the claims.

Claims

1. A method for identifying station-area users based on optimized supervised learning, characterized in that it comprises the following steps:

Step S1: acquiring the low-voltage-side voltage data of the station-area transformers and the voltage data of the user meters to be identified;

Step S2: preprocessing the acquired voltage data;

Step S3: determining the users whose station-user topology is known, together with the station area and phase to which each belongs; determining the label of each user's data according to that station area and phase; establishing a training set, a validation set, and a test set; determining the parameter k of the KNN model by cross-validation; and completing the training of the model;

Step S4: using the trained model and the determined k to classify the voltage data of the users to be identified, thereby identifying the station-area users, and outputting the identification result.

2. The method for identifying station-area users based on supervised learning as claimed in claim 1, characterized in that, in step S1, the low-voltage-side voltage data of one station-area transformer comprise the A-phase, B-phase and C-phase low-voltage-side voltages of the f-th transformer at the n-th time; and the voltage data of the m user meters to be identified comprise the voltage of the f-th user's meter to be identified at the n-th time.

3. The method for identifying station-area users based on supervised learning as claimed in claim 1, characterized in that, in step S2, when the dimension of the data to be processed is large, dimensionality reduction is applied so that the multidimensional data are converted into a few principal components for analysis, improving the computational efficiency of the algorithm; and when the dimension of the data is small, the data are processed directly without dimensionality reduction.

4. The method for identifying station-area users based on supervised learning as claimed in claim 1, characterized in that determining the k parameter of the KNN model by cross-validation and completing the training of the model in step S3 specifically comprises the following steps:

Step S3.1, selecting part of the user voltage data with known station-user relationships and their labels, together with the transformer voltage data and their labels, as the training set; selecting another part of the user voltage data with known station-user relationships and their labels as the validation set; and using the remaining user voltage data with known station-user relationships and their labels as the test set;

Step S3.2, with the data and labels of the training set known, determining the distance metric, inputting the validation-set data, traversing all candidate k values, classifying the user voltage data in the validation set with the optimized KNN model, evaluating the accuracy of the validation-set classification for each k value, and selecting the k value with the highest accuracy as the input parameter;

Step S3.3, judging whether the k value determined in the previous step satisfies the predetermined target condition; when the condition is satisfied, proceeding to the next step; when it is not satisfied, returning to step S3.2; and further classifying the test-set data with the trained model to confirm the rationality of the k value.
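The k search of step S3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: Euclidean distance is picked from the metrics the claims allow, and the function names, voltage values and string labels such as `"T1-A"` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length voltage sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, sample, k):
    """Majority label among the k training samples nearest to `sample`."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], sample))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation samples whose predicted label is correct."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(eval_x, eval_y))
    return hits / len(eval_x)


def best_k_values(train_x, train_y, val_x, val_y, k_candidates):
    """Return every k tied for the highest validation accuracy, plus all scores."""
    scores = {k: accuracy(train_x, train_y, val_x, val_y, k) for k in k_candidates}
    top = max(scores.values())
    return [k for k in k_candidates if scores[k] == top], scores


# Hypothetical two-point voltage sequences for two phase labels.
train_x = [[220.0, 221.0], [220.5, 221.5], [221.0, 220.5],
           [230.0, 231.0], [230.5, 231.5], [231.0, 230.5]]
train_y = ["T1-A", "T1-A", "T1-A", "T1-B", "T1-B", "T1-B"]
val_x = [[220.2, 221.1], [230.2, 231.1]]
val_y = ["T1-A", "T1-B"]

best, scores = best_k_values(train_x, train_y, val_x, val_y, [1, 3, 5])
```

Returning all tied k values rather than a single one keeps the information that step S3.3 needs when it checks the result against the test set.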

5. The method for identifying station-area users based on supervised learning as claimed in claim 4, characterized in that, in step S3.1, the training set, the validation set and the test set account for 80%, 10% and 10% of the total data, respectively.
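An 80/10/10 partition can be sketched as below. The claim fixes only the proportions; the random shuffle, the fixed seed and the function name `split_80_10_10` are assumptions made for illustration.

```python
import random


def split_80_10_10(samples, seed=0):
    """Shuffle the samples, then cut them into 80% / 10% / 10% parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical sample identifiers standing in for labelled voltage records.
train, val, test = split_80_10_10(range(100))
```

When the number of samples is not a multiple of ten, the leftover samples fall into the test set in this sketch.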

6. The method for identifying station-area users based on supervised learning as claimed in claim 4 or 5, characterized in that classifying the user voltage data in the validation set with the optimized KNN model specifically comprises:

S3.2.1, determine the distance metric and calculate the distance between the validation sample and each training sample;

S3.2.2, sort the training data according to the increasing relationship of the distance values;

S3.2.3, select the first k points with the smallest distance value in the training data;

S3.2.4, count and determine the frequency of occurrence of the category to which the first k points belong;

S3.2.5, return the category with the highest frequency among the first k points as the predicted category of the validation sample.
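Steps S3.2.1 to S3.2.5 map directly onto a short function. The sketch below assumes Euclidean distance (one of the metrics named in claim 7); the function name, the sample values and the labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, sample, k):
    """Classify one sample by majority vote among its k nearest training samples."""
    # S3.2.1: distance from the sample to every training sample (Euclidean here).
    distances = [math.dist(row, sample) for row in train_x]
    # S3.2.2: order the training indices by increasing distance.
    order = sorted(range(len(train_x)), key=lambda i: distances[i])
    # S3.2.3: keep the k nearest points.
    nearest = order[:k]
    # S3.2.4: count how often each category occurs among them.
    counts = Counter(train_y[i] for i in nearest)
    # S3.2.5: return the most frequent category.
    return counts.most_common(1)[0][0]


# Hypothetical training data: voltage sequences labelled by transformer phase.
train_x = [[220.0, 221.0], [220.5, 221.5], [221.0, 220.5],
           [230.0, 231.0], [230.5, 231.5], [231.0, 230.5]]
train_y = ["T1-A", "T1-A", "T1-A", "T1-B", "T1-B", "T1-B"]

predicted = knn_classify(train_x, train_y, [220.3, 221.0], k=3)
```

If two categories are equally frequent, `Counter.most_common` returns the one whose member was encountered first, which here is the category of the nearer point; the claims do not specify a tie rule, so this behaviour is an artefact of the sketch.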

7. The method for identifying station-area users based on supervised learning as claimed in claim 6, characterized in that the distance metric is one of the correlation coefficient, cosine similarity, and Euclidean distance.
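The three metric options can be sketched as plain functions. Turning a similarity into a distance as one minus the similarity is a common convention and is assumed here, not quoted from the claims; the function names are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    """One minus the cosine of the angle between p and q (assumed convention)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def correlation_distance(p, q):
    """One minus the Pearson correlation: cosine distance of mean-centred data.

    Undefined (division by zero) when either sequence is constant.
    """
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    return cosine_distance([a - mean_p for a in p], [b - mean_q for b in q])
```

Correlation-based distance ignores each meter's average voltage level and compares only the shape of the fluctuations, whereas Euclidean distance is sensitive to both.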

8. The method for identifying station-area users based on supervised learning as claimed in claim 7, characterized in that the distance metric is defined as follows:

When the distance L_pq is defined by cosine similarity,

In the formula, z_p′ is the transpose of the row vector z_p, and z_q′ is the transpose of the row vector z_q.
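A commonly used form of the cosine distance, written with the symbols of this claim, is given below. It is assumed here rather than quoted from the claim, and the one-minus-similarity convention in particular is an assumption.

```latex
L_{pq} = 1 - \frac{z_p z_q'}{\sqrt{\left(z_p z_p'\right)\left(z_q z_q'\right)}}
```

Under this form, L_pq is 0 when the two voltage sequences are proportional with a positive factor and grows as the angle between them widens.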

9. The method for identifying station-area users based on supervised learning as claimed in claim 7, characterized in that the distance metric is defined as follows:

When the distance L_pq is defined by the correlation coefficient,

In the formula, is a unit row vector.
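A commonly used form of the correlation distance is given below as an assumption, not as a quotation of the claim. The symbol \mathbf{1} is taken to be a 1×n row vector of ones, n to be the number of time points, and \bar{z}_p to be the mean of z_p; all three are illustrative choices.

```latex
L_{pq} = 1 - \frac{\left(z_p - \bar{z}_p \mathbf{1}\right)\left(z_q - \bar{z}_q \mathbf{1}\right)'}
{\sqrt{\left(z_p - \bar{z}_p \mathbf{1}\right)\left(z_p - \bar{z}_p \mathbf{1}\right)'}\,
 \sqrt{\left(z_q - \bar{z}_q \mathbf{1}\right)\left(z_q - \bar{z}_q \mathbf{1}\right)'}},
\qquad
\bar{z}_p = \frac{1}{n}\, z_p \mathbf{1}'
```

This is the cosine distance applied to mean-centred sequences, so it compares the shape of the voltage fluctuations and not their absolute level.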

10. The method for identifying station-area users based on supervised learning as claimed in claim 4 or 5, characterized in that, when judging in step S3.3 whether the k value satisfies the predetermined target condition, the k value is considered not to satisfy the predetermined target condition when at least one of the following two conditions holds:

(1) The case with the highest accuracy corresponds to multiple k values;

(2) The case with the highest accuracy corresponds to only 1 k value;

In that case, the test set is used in step S3.3 to further confirm the rationality and uniqueness of the k value.
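One way to confirm the k value on the test set is sketched below. The claim does not state how a single k is chosen; preferring the highest test accuracy and then the smallest k is an assumption, and the function name and accuracy figures are hypothetical.

```python
def confirm_k(candidates, test_accuracy):
    """Pick one k from the validation-stage candidates using test-set accuracy.

    candidates: k values with the highest validation accuracy.
    test_accuracy: callable mapping k to its accuracy on the test set.
    """
    scored = [(test_accuracy(k), k) for k in candidates]
    best = max(acc for acc, _ in scored)
    # Assumed tie rule: among equally accurate k values, take the smallest.
    return min(k for acc, k in scored if acc == best)


# Hypothetical test-set accuracies for three tied candidates.
hypothetical_scores = {3: 0.95, 5: 0.97, 7: 0.97}
chosen = confirm_k([3, 5, 7], hypothetical_scores.get)
```

With a single candidate the function returns it unchanged, so the same call covers both conditions listed in the claim.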