CN116109002A

CN116109002A - Gas equipment early warning system and method based on big data analysis

Info

Publication number: CN116109002A
Application number: CN202310179929.4A
Authority: CN
Inventors: 齐研科; 杨颖�; 贺喜; 蔡雨耕; 王一先; 赵家骏
Original assignee: Chongqing Hezhong Huiran Technology Co ltd
Current assignee: Chongqing Hezhong Huiran Technology Co ltd
Priority date: 2023-02-28
Filing date: 2023-02-28
Publication date: 2023-05-12

Abstract

The invention relates to the technical field of gas early warning, in particular to a gas equipment early warning system and method based on big data analysis, wherein the method comprises the following steps: data acquisition and reading: firstly, historical pressure data are collected to an HDFS through Hadoop, and then a data set of the HDFS is read through Spark; data preprocessing: firstly converting character type features into numerical value types, filtering redundant features, and automatically identifying category features and continuity features by setting the maximum value of category numbers; dividing data: dividing the data set into a training set and a testing set; parameter setting: setting a Spark MLlib decision tree regression model parameter; conversion and assembly: feature conversion of the decision tree regression model is carried out, and the decision tree regression model training assembly is assembled on a production line; training and predicting: and training a decision tree regression model, and carrying out early warning through a prediction result of the decision tree regression model. The invention solves the technical problems that the prior art cannot accurately detect and early warn in time.

Description

Gas equipment early warning system and method based on big data analysis

技术领域technical field

本发明涉及燃气预警技术领域，具体涉及一种基于大数据分析的燃气设备预警系统及方法。The invention relates to the technical field of gas early warning, in particular to a gas equipment early warning system and method based on big data analysis.

背景技术Background technique

随着城市建设和经济建设的飞速发展，以及人民生活水平的普遍提高和石油化学工业的发展，天然气作为优质高效的清洁能源，逐步成为城镇燃气的主导气源，促进了社会经济的发展，减少了对环境的污染。天然气由气态低分子烃和非烃气体混合组成，包括甲烷(85％)和少量乙烷(9％)、丙烷(3％)、氮(2％)和丁烷(1％)，主要用途是作燃料，可制造炭黑、化学药品和液化石油气，由天然气生产的丙烷、丁烷也是现代工业的重要原料。城市中使用天然气的居民建筑越来越多、范围越来越广，特别是人员密集的居民建筑，日常生活方面使用天然气的高峰时段相对集中，如果天然气泄漏，很容易造成严重的安全事故。目前，只能笼统对燃气使用过程中的异常情况进行检测，无法精确地进行检测并及时预警，使得工作人员整改效率低。With the rapid development of urban construction and economic construction, as well as the general improvement of people's living standards and the development of the petrochemical industry, natural gas, as a high-quality and efficient clean energy, has gradually become the dominant source of urban gas, promoting social and economic development, reducing pollution to the environment. Natural gas is composed of a mixture of gaseous low molecular weight hydrocarbons and non-hydrocarbon gases, including methane (85%) and a small amount of ethane (9%), propane (3%), nitrogen (2%) and butane (1%). As fuel, it can produce carbon black, chemicals and liquefied petroleum gas. Propane and butane produced from natural gas are also important raw materials for modern industry. There are more and more residential buildings using natural gas in cities, especially in densely populated residential buildings. The peak hours of using natural gas in daily life are relatively concentrated. If natural gas leaks, it is easy to cause serious safety accidents. At present, it is only possible to detect abnormalities in the process of gas use in general, and it is impossible to accurately detect and give timely warnings, which makes the rectification efficiency of the staff low.

发明内容Contents of the invention

本发明提供一种基于大数据分析的燃气设备预警系统及方法，解决了现有技术无法精确地进行检测并及时预警的技术问题。The invention provides a gas equipment early warning system and method based on big data analysis, which solves the technical problem that the prior art cannot accurately detect and timely early warning.

本发明提供的基础方案为：基于大数据分析的燃气设备预警方法，包括：The basic solution provided by the present invention is: a gas equipment early warning method based on big data analysis, including:

S1、数据采集与读取：先通过Hadoop采集燃气设备的历史压力数据到HDFS，然后通过Spark读取HDFS的数据集；S1. Data collection and reading: first collect the historical pressure data of gas equipment to HDFS through Hadoop, and then read the HDFS data set through Spark;

S2、数据预处理：先把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征；S2. Data preprocessing: first convert character-type features into numeric types and filter redundant features, and then automatically identify category features and continuity features by setting the maximum number of categories;

S3、数据划分：将数据集划分为训练集与测试集；S3. Data division: divide the data set into training set and test set;

S4、参数设置：设置Spark MLlib决策树回归模型参数；S4, parameter setting: set Spark MLlib decision tree regression model parameters;

S5、转换与组装：进行决策树回归模型的特征转换，以及将决策树回归模型训练组装在流水线上；S5. Transformation and assembly: perform feature transformation of the decision tree regression model, and assemble the decision tree regression model training on the assembly line;

S6、训练与预测：进行决策树回归模型训练，并通过决策树回归模型的预测结果进行预警。S6. Training and prediction: conduct decision tree regression model training, and give early warning through the prediction results of the decision tree regression model.

本发明的工作原理及优点在于：通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集，把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征，将数据集划分为训练集与测试集，设置决策树回归模型参数，进行决策树回归模型涉及的特征转换，以及将模型训练组装在流水线上，进行模型训练，并将预测模型的结果进行预警，相较于目前现有技术来说，本方案通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集利用Spark MLlib决策数树回归算法训练数据集，可以把训练好的结果用于正式数据的比对，能够精确地进行检测并及时预警。The working principle and advantages of the present invention are: collect historical pressure data to HDFS through Hadoop, then read the data set of HDFS through Spark, convert character-type features into numerical types and filter redundant features, and then set the number of categories Automatically identify category features and continuous features, divide the data set into training set and test set, set the parameters of the decision tree regression model, perform feature conversion involved in the decision tree regression model, and assemble the model training on the pipeline. Model training and early warning of the results of the prediction model. Compared with the current existing technology, this solution collects historical pressure data to HDFS through Hadoop, and then reads the HDFS data set through Spark and uses Spark MLlib decision number tree regression algorithm The training data set can use the trained results for formal data comparison, and can accurately detect and give timely warnings.

本发明通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集利用Spark MLlib决策数树回归算法训练数据集，可以把训练好的结果用于正式数据的比对，解决了现有技术无法精确地进行检测并及时预警的技术问题。The present invention collects historical pressure data to HDFS through Hadoop, and then reads the data set of HDFS through Spark and uses the Spark MLlib decision tree regression algorithm to train the data set. The trained results can be used for formal data comparison, which solves the existing problem Technical problems that technology cannot accurately detect and give timely warnings.

进一步，S1中，压力数据采用的字段包括采集时的温度、压力、采集时间、是否用气高峰。Further, in S1, the fields used for the pressure data include the temperature, pressure, collection time, and whether the gas consumption is peak or not during collection.

有益效果在于：温度、压力、用气高峰与燃气危险相关性大，同时易于检测，有利于精确地进行检测并及时预警。The beneficial effect is that temperature, pressure, gas consumption peaks are highly correlated with gas hazards, and at the same time, it is easy to detect, which is conducive to accurate detection and timely early warning.

进一步，S2中，将压力数据的压力特征作为标志，把不同值小于等于24的特征识别为类别特征，把压力数据的不同值大于24的特征识别为连续性特征，并对分类特征索引化或数值化。Further, in S2, the pressure feature of the pressure data is used as a mark, and the features with different values less than or equal to 24 are identified as category features, and the features with different values of pressure data greater than 24 are identified as continuous features, and the classification features are indexed or Numerical.

有益效果在于：不同值就是同一个算式得出不同的结果，比如一个结果是正数，一个结果是负数，由于数据集中大部分是分类特征，有些是连续型特征，使用决策树模型回归时，通过设定类别个数的最大值自动识别类别特征与连续性特征，并对分类特征索引化或数值化，有利于进行数据集的划分与决策树回归模型的训练；分类特征是用来表示分类的特征，它不像数值类特征是连续的，分类特征是离散的，比如说IP地址、用户的账号ID等；在关系数据库中，索引是一种单独的、物理的对数据库表中一列或多列的值进行排序的一种存储结构，它是某个表中一列或若干列值的集合和相应的指向表中物理标识这些值的数据页的逻辑指针清单，索引化的作用相当于生产类似图书的目录，可以根据目录中的页码快速找到所需的内容；数值化是指将许多复杂多变的信息转变为可以度量的数字、数据，便于进行储存、分析与处理。The beneficial effect is that different values mean that the same formula can give different results. For example, one result is a positive number and the other is a negative number. Since most of the data sets are categorical features and some are continuous features, when using the decision tree model regression, pass Set the maximum number of categories to automatically identify category features and continuous features, and index or digitize the classification features, which is conducive to the division of data sets and the training of decision tree regression models; classification features are used to represent classification Features, unlike numerical features, which are continuous, classification features are discrete, such as IP addresses, user account IDs, etc.; in a relational database, an index is a separate, physical pair of one or more columns in a database table It is a storage structure for sorting the values of columns, which is a collection of one or several column values in a certain table and the corresponding list of logical pointers to the data pages that physically identify these values in the table. The function of indexing is equivalent to producing similar In the catalog of books, you can quickly find the required content according to the page number in the catalog; digitization refers to transforming many complex and changeable information into measurable numbers and data, which is convenient for storage, analysis and processing.

进一步，还包括S7、数据存储：将S6中决策树回归模型的预测结果存入Redis，用于预警。Further, it also includes S7 and data storage: the prediction results of the decision tree regression model in S6 are stored in Redis for early warning.

有益效果在于：将预测结果存入Redis，便于比对以随时用于预警。The beneficial effect is that the prediction result is stored in Redis, which is convenient for comparison and can be used for early warning at any time.

进一步，S3中，将数据集划分为训练集与测试集时，随机进行划分。Further, in S3, when the data set is divided into a training set and a test set, the division is performed randomly.

有益效果在于：随机划分更具有统计学意义，有利于提高决策树回归模型训练的精确性。The beneficial effect is that the random division has more statistical significance, and is beneficial to improving the accuracy of the decision tree regression model training.

基于上述基于大数据分析的燃气设备预警方法，本发明还提供一种基于大数据分析的燃气设备预警系统，包括：Based on the above gas equipment early warning method based on big data analysis, the present invention also provides a gas equipment early warning system based on big data analysis, including:

数据采集与读取模块，用于先通过Hadoop采集历史压力数据到HDFS，然后通过Spark读取HDFS的数据集；The data acquisition and reading module is used to first collect historical pressure data to HDFS through Hadoop, and then read the HDFS data set through Spark;

所述数据采集与读取模块连接有数据预处理模块，所述数据预处理模块用于先把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征；The data acquisition and reading module is connected with a data preprocessing module, and the data preprocessing module is used to first convert character-type features into numeric types and filter redundant features, and then automatically set the maximum number of categories Identify categorical features and continuous features;

所述数据预处理模块连接有数据划分模块，所述数据划分模块用于将数据集划分为训练集与测试集；The data preprocessing module is connected with a data division module, and the data division module is used to divide the data set into a training set and a test set;

所述数据划分模块连接有参数设置模块，所述参数设置模块用于设置SparkMLlib决策树回归模型参数；Described data division module is connected with parameter setting module, and described parameter setting module is used for setting SparkMLlib decision tree regression model parameter;

所述参数设置模块连接有转换与组装模块，所述转换与组装模块用于进行决策树回归模型涉及的特征转换，以及将决策树回归模型训练组装在流水线上；The parameter setting module is connected with a conversion and assembly module, and the conversion and assembly module is used for performing the feature conversion involved in the decision tree regression model, and assembling the decision tree regression model training on the assembly line;

所述转换与组装模块连接有训练与预测模块，所述训练与预测模块用于进行决策树回归模型训练，并通过决策树回归模型的预测结果进行预警；The conversion and assembly module is connected with a training and prediction module, and the training and prediction module is used for decision tree regression model training, and carries out early warning through the prediction result of the decision tree regression model;

所述训练与预测模块连接有数据存储模块，所述数据存储模块用于将决策树回归模型的预测结果存入Redis用于预警。The training and prediction module is connected with a data storage module, and the data storage module is used to store the prediction results of the decision tree regression model into Redis for early warning.

本发明的工作原理及优点在于：通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集，把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征，将数据集划分为训练集与测试集，设置决策树回归模型参数，进行决策树回归模型涉及的特征转换，以及将模型训练组装在流水线上，进行模型训练，并将预测模型的结果进行预警，相较于目前现有技术来说，本方案通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集，利用Spark MLlib的决策树回归算法训练数据集，把训练好的结果放入Redis用于正式数据的比对，能够精确地进行检测并及时预警。The working principle and advantages of the present invention are: collect historical pressure data to HDFS through Hadoop, then read the data set of HDFS through Spark, convert character-type features into numerical types and filter redundant features, and then set the number of categories Automatically identify category features and continuous features, divide the data set into training set and test set, set the parameters of the decision tree regression model, perform feature conversion involved in the decision tree regression model, and assemble the model training on the pipeline. Model training and early warning of the results of the prediction model. Compared with the current existing technology, this solution collects historical pressure data to HDFS through Hadoop, then reads the HDFS data set through Spark, and uses Spark MLlib's decision tree regression Algorithm training data sets, put the trained results into Redis for formal data comparison, and can accurately detect and give timely warnings.

附图说明Description of drawings

图1为本发明基于大数据分析的燃气设备预警方法实施例的流程图。Fig. 1 is a flow chart of an embodiment of the early warning method for gas equipment based on big data analysis in the present invention.

具体实施方式Detailed ways

下面通过具体实施方式进一步详细的说明：Further detailed explanation through specific implementation mode below:

实施例1Example 1

实施例基本如附图1所示，基于大数据分析的燃气设备预警方法，包括：The embodiment is basically shown in Figure 1, the early warning method for gas equipment based on big data analysis, including:

S1、数据采集与读取：先通过Hadoop采集历史压力数据到HDFS，然后通过Spark读取HDFS的数据集；所采用的字段包括采集时的温度、压力、采集时间、是否用气高峰，温度、压力、用气高峰与燃气危险相关性大，同时易于检测，有利于精确地进行检测并及时预警，比如说，采集时的温度(tempreture)、压力(press)、具体时间(精确到秒)(date)、月份(month)、是否用气高峰(is_peak)等；S1. Data collection and reading: first collect historical pressure data to HDFS through Hadoop, and then read the HDFS data set through Spark; the fields used include temperature, pressure, collection time, peak gas consumption, temperature, Pressure, gas consumption peaks are highly correlated with gas hazards, and are easy to detect, which is conducive to accurate detection and timely warning, for example, temperature (tempreture), pressure (press), specific time (accurate to seconds) ( date), month (month), whether the peak gas consumption (is_peak), etc.;

在本实施例中，Hadoop是一个由Apache基金会所开发的分布式系统基础架构，用户可以在不了解分布式底层细节的情况下开发分布式程序，充分利用集群的威力进行高速运算和存储，Hadoop实现了一个分布式文件系统(Distributed File System)，也即是HDFS(Hadoop Distributed File System)，HDFS具有高容错性，能提供高吞吐量(highthroughput)来访问应用程序的数据，适合那些有着超大数据集(large data set)的应用程序；Hadoop框架最核心的设计就是HDFS和MapReduce，HDFS为海量的数据提供存储，MapReduce为海量的数据提供计算，其中，MapReduce是一种编程模型，用于大规模数据集(大于1TB)的并行运算，"Map(映射)"和"Reduce(归约)"是它们的主要思想，都是从函数式编程语言里借来，具有从矢量编程语言里借来的特性，极大地方便了编程人员在不会分布式并行编程的情况下，将自己的程序运行在分布式系统上，当前的软件实现是指定一个Map(映射)函数，用来把一组键值对映射成一组新的键值对，指定并发的Reduce(归约)函数，用来保证所有映射的键值对中的每一个共享相同的键组；Spark，也即Apache Spark，是一种专为大规模数据处理而设计的快速通用的计算引擎，Spark是UC Berkeley AMP lab(加州大学伯克利分校的AMP实验室)所开源的类Hadoop MapReduce的通用并行框架，Spark拥有Hadoop MapReduce的优点，但不同于MapReduce的是——Job中间输出结果可以保存在内存中，不再需要读写HDFS，因此Spark能更好地适用于数据挖掘与机器学习等需要迭代的MapReduce的算法。In this embodiment, Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, and make full use of the power of the cluster for high-speed computing and storage. Hadoop Implemented a distributed file system (Distributed File System), that is, HDFS (Hadoop Distributed File System), HDFS has high fault tolerance, can provide high throughput (high throughput) to access application data, suitable for those with large data The core design of the Hadoop framework is HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides calculation for massive data. Among them, MapReduce is a programming model for large-scale Parallel computing of data sets (greater than 1TB), "Map (mapping)" and "Reduce (reduction)" are their main ideas, both borrowed from functional programming languages, with features borrowed from vector programming languages feature, which greatly facilitates programmers to run their programs on distributed systems without distributed parallel programming. The current software implementation is to specify a Map (mapping) function, which is used to convert a set of key values To map a new set of key-value pairs, specify a concurrent Reduce (reduction) function to ensure that each of all mapped key-value pairs shares the same key group; Spark, also known as Apache Spark, is a dedicated A fast and general-purpose computing engine designed for large-scale data processing. Spark is a general-purpose Hadoop MapReduce-like parallel framework open sourced by UC Berkeley AMP lab (AMP Lab of the University of California, Berkeley). Spark has the advantages of Hadoop MapReduce, but different The advantage of MapReduce is that the intermediate output results of jobs can be stored in memory, and there is no need to read and write HDFS. Therefore, Spark is more suitable for MapReduce algorithms that require iteration, such as data mining and machine learning.

S2、数据预处理：先把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征；在本实施例中，将压力特征作为标志，具体操作如下，首先：S2. Data preprocessing: first convert character-type features into numeric types and filter redundant features, and then automatically identify category features and continuity features by setting the maximum number of categories; in this embodiment, pressure features As a sign, the specific operation is as follows, first:

1.将温度的特征转换为浮点类型1. Convert the characteristics of temperature to floating point type

val datal＝rawdata.select(val datal = rawdata.select(

rawdata("tempreture").cast("Double")，rawdata("tempreture").cast("Double"),

2.将是否用气高峰的特征转换为浮点类型2. Convert the feature of whether the peak gas consumption is into a floating point type

rawdata("is_peak").cast("Double")，rawdata("is_peak").cast("Double"),

3.将具体时间的特征转换为浮点类型3. Convert the characteristics of specific time to floating point type

rawdata("date").cast("String")，rawdata("date").cast("String"),

4.将月份的特征转换为浮点类型4. Convert the feature of the month to a floating point type

rawdata("month").cast("Double")，rawdata("month").cast("Double"),

5.将压力的特征转换为浮点类型，并修改列名5. Convert the characteristics of the pressure to a floating point type and modify the column name

rawdata("press").cast("Double").alias("label"))rawdata("press").cast("Double").alias("label"))

其次，数据集中大部分是分类特征，有些值是连续型特征，例如温度，使用决策树回归模型时，可以通过设定类别个数的最大值，自动识别哪些特征作为类别特征，哪些作为连续性特征，这里把不同值小于等于24的特征作为类别特征，大于24的特征视为连续性特征，并对分类特征索引化或数值化；例如说：Secondly, most of the data sets are categorical features, and some values are continuous features, such as temperature. When using the decision tree regression model, you can automatically identify which features are categorical features and which are continuous by setting the maximum number of categories. Features, here the features with different values less than or equal to 24 are regarded as categorical features, and the features greater than 24 are regarded as continuous features, and the categorical features are indexed or numericalized; for example:

Val featureIndexer＝newVal featureIndexer = new

VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(24)VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(24)

使用线性回归算法前，需要对类别特征使用One Hot Encoder转换为二元向量(或one-hot编码)将前4个类别字段或特征转换为二元向量，如下：Before using the linear regression algorithm, you need to use One Hot Encoder to convert the category features to binary vectors (or one-hot encoding) to convert the first 4 category fields or features into binary vectors, as follows:

val data2＝new One Hot Encoder().setInputCol("").setutputCol("")val data2 = new One Hot Encoder().setInputCol("").setoutputCol("")

由于One Hot Encoder不是Estimator，这里需要对采用回归算法的数据另外进行处理，先建立一个流水线，把以上转换组装到这个流水线上，如下：Since the One Hot Encoder is not an Estimator, the data using the regression algorithm needs to be processed separately. First, a pipeline is established, and the above conversions are assembled into this pipeline, as follows:

val pipeline en＝new Pipeline().setStages(Array(data2,data3,data4))pipelineval pipeline en = new Pipeline().setStages(Array(data2,data3,data4))pipeline

把原来的1个及转换后的3个二元特征向量，拼接成一个feature向量，命名为features_lr：Splice the original 1 and converted 3 binary feature vectors into a feature vector, named features_lr:

VectorAssembler().setInputCols(Array("is_peakVec","dateVec","monthVec","tempretur e",)).setOutputCol("features_lr")VectorAssembler().setInputCols(Array("is_peakVec","dateVec","monthVec","tempreture",)).setOutputCol("features_lr")

S3、数据划分：将数据集划分为训练集与测试集；在将数据集划分为训练集与测试集时，随机进行划分用于决策树回归模型，这样更具有统计学意义，有利于提高决策树回归模型训练的精确性；如下：S3. Data division: divide the data set into training set and test set; when dividing the data set into training set and test set, randomly divide it into decision tree regression model, which is more statistically significant and conducive to improving decision-making The accuracy of tree regression model training; as follows:

val Array(trainingData,testData)＝data1.randomSplit(Array(0.7,0.3),12)val Array(trainingData,testData)＝data1.randomSplit(Array(0.7,0.3),12)

对data_lr数据集进行随机划分，用于线性回归模型：Randomly partition the data_lr dataset for linear regression models:

val Array(trainingData_lr,testData_lr)＝data_lr.randomSplit(Array(0.7,0.3),12)val Array(trainingData_lr,testData_lr)=data_lr.randomSplit(Array(0.7,0.3),12)

S4、参数设置：设置Spark MLlib决策树回归模型参数；比如说：S4, parameter setting: set Spark MLlib decision tree regression model parameters; for example:

featuresCol:特征列名，默认值为features；featuresCol: feature column name, the default value is features;

labelCol:标签列名，默认为label；labelCol: label column name, the default is label;

predictionCol:预测结果列名，默认为prediction；predictionCol: prediction result column name, the default is prediction;

maxDepth:树的最大深度，默认值为5，maxDepth: the maximum depth of the tree, the default value is 5,

maxBins:连续特征离散化的最大数量，以及选择每个节点分裂特征的方式，默认值为32；maxBins: The maximum number of discretization of continuous features, and the way to select the split features of each node, the default value is 32;

minInstancesPerNode:分裂后各节点最少包含的实例数量，默认值为1；minInstancesPerNode: the minimum number of instances contained in each node after splitting, the default value is 1;

minInfoGain:分裂节点时所需最小信息增益，默认值为0.0；minInfoGain: The minimum information gain required when splitting a node, the default value is 0.0;

maxMemoryInMB:分配给直方图聚合的最大内存，默认为256MB,maxMemoryInMB: The maximum memory allocated for histogram aggregation, the default is 256MB,

cacheNodeIds＝False,cacheNodeIds=False,

checkpointInterval:设置检査点间隔(>＝1)，或不设置检查点(-1)默认值为10checkpointInterval: set checkpoint interval (>=1), or not set checkpoint (-1) default value is 10

impurity:计算信息增益的准则，默认值为variance；impurity: the criterion for calculating information gain, the default value is variance;

seed:随机种子，默认为None，seed: random seed, default is None,

varianceCol:预测的有偏样本偏差的列名，默认无；varianceCol: the column name of the predicted biased sample deviation, the default is none;

val dt＝newval dt = new

DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("indexedFeatures").setMa xBins(64).setMaxDepth(15)DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("indexedFeatures").setMaxBins(64).setMaxDepth(15)

MLlib从功能上说与Scikit-Learn等机器学习库非常类似，但计算引擎采用的是Spark，即所有计算过程均实现了分布式；MLlib is functionally similar to machine learning libraries such as Scikit-Learn, but the calculation engine uses Spark, that is, all calculation processes are distributed;

S5、转换与组装：进行决策树回归模型涉及的特征转换，以及将决策树回归模型训练组装在流水线上；如下：S5. Transformation and assembly: Perform feature transformation involved in the decision tree regression model, and assemble the decision tree regression model training on the pipeline; as follows:

val pipeline_lr＝new Pipeline().setStages(Array(assembler_lr,lr))val pipeline_lr = new Pipeline().setStages(Array(assembler_lr,lr))

S6、训练与预测：进行决策树回归模型训练，并通过决策树回归模型的预测结果进行预警；如下：S6. Training and prediction: conduct decision tree regression model training, and give early warning through the prediction results of the decision tree regression model; as follows:

训练模型：val model＝pipeline.fit(trainingData)Training model: val model = pipeline.fit(trainingData)

预测模型：val predictions＝model.transform(testData)Prediction model: val predictions = model.transform(testData)

S7、数据存储：将决策树回归模型的预测结果存入Redis用于预警；S7, data storage: store the prediction results of the decision tree regression model in Redis for early warning;

将预测结果存入Redis，便于比对以随时预警，Redis(Remote DictionaryServer)，即远程字典服务，是一个开源的使用ANSI C语言编写、支持网络、可基于内存亦可持久化的日志型、Key-Value数据库，并提供多种语言的API。Store the prediction results in Redis, which is convenient for comparison and early warning at any time. Redis (Remote DictionaryServer), that is, remote dictionary service, is an open source log type, Key -Value database, and provide APIs in multiple languages.

在本实施例中，通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集，把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征，将数据集划分为训练集与测试集，设置决策树回归模型参数，进行决策树回归模型涉及的特征转换，以及将模型训练组装在流水线上，进行模型训练，并将预测模型的结果进行预警，相较于目前现有技术来说，本方案通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集利用Spark MLlib决策数树回归算法训练数据集，可以把训练好的结果用于正式数据的比对，能够精确地进行检测并及时预警；除此之外，压力设备采集的数据也可以通过Flink CDC实时同步到Kafka，用FlinkSQL实时处理数据，然后用处理的数据和存放在Redis的数据进行匹配，一旦匹配上就通过短信或者电话方式告警。In this embodiment, collect historical pressure data to HDFS through Hadoop, then read the HDFS data set through Spark, convert character-type features into numeric types and filter redundant features, and then set the maximum number of categories Automatically identify category features and continuous features, divide the data set into training set and test set, set the parameters of the decision tree regression model, perform feature conversion involved in the decision tree regression model, and assemble the model training on the pipeline for model training. And the results of the prediction model will be warned. Compared with the current existing technology, this solution collects historical pressure data to HDFS through Hadoop, and then reads the HDFS data set through Spark and uses Spark MLlib decision number tree regression algorithm to train the data set , the trained results can be used for formal data comparison, which can accurately detect and give timely warnings; in addition, the data collected by pressure equipment can also be synchronized to Kafka in real time through Flink CDC, and the data can be processed in real time with FlinkSQL. Then use the processed data to match the data stored in Redis, and once matched, send an alarm by text message or phone.

实施例2Example 2

与实施例1不同之处仅在于，基于上述基于大数据分析的燃气设备预警方法，本发明还提供一种基于大数据分析的燃气设备预警系统，包括：The only difference from Embodiment 1 is that, based on the above gas equipment early warning method based on big data analysis, the present invention also provides a gas equipment early warning system based on big data analysis, including:

数据预处理模块，用于先把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征；The data preprocessing module is used to convert character-type features into numerical types and filter redundant features, and then automatically identify category features and continuity features by setting the maximum number of categories;

数据划分模块，用于将数据集划分为训练集与测试集；The data division module is used to divide the data set into a training set and a test set;

参数设置模块，用于设置Spark MLlib决策树回归模型参数；Parameter setting module, used to set Spark MLlib decision tree regression model parameters;

转换与组装模块，用于进行决策树回归模型涉及的特征转换，以及将决策树回归模型训练组装在流水线上；The conversion and assembly module is used to perform feature conversion involved in the decision tree regression model, and to assemble the decision tree regression model training on the pipeline;

训练与预测模块，用于进行决策树回归模型训练，并通过决策树回归模型的预测结果进行预警；The training and forecasting module is used for training the decision tree regression model and giving early warning through the prediction results of the decision tree regression model;

数据存储模块，用于将决策树回归模型的预测结果存入Redis用于预警。The data storage module is used to store the prediction results of the decision tree regression model in Redis for early warning.

在本实施例中，通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集，把字符型的特征转换为数值类型并过滤冗余特征，再通过设定类别个数的最大值自动识别类别特征与连续性特征，将数据集划分为训练集与测试集，设置决策树回归模型参数，进行决策树回归模型涉及的特征转换，以及将模型训练组装在流水线上，进行模型训练，并将预测模型的结果进行预警，相较于目前现有技术来说，本方案通过Hadoop采集历史压力数据到HDFS，再通过Spark读取HDFS的数据集，利用Spark MLlib的决策树回归算法训练数据集，把训练好的结果放入Redis用于正式数据的比对，能够精确地进行检测并及时预警。In this embodiment, collect historical pressure data to HDFS through Hadoop, then read the HDFS data set through Spark, convert character-type features into numeric types and filter redundant features, and then set the maximum number of categories Automatically identify category features and continuous features, divide the data set into training set and test set, set the parameters of the decision tree regression model, perform feature conversion involved in the decision tree regression model, and assemble the model training on the pipeline for model training. And the results of the prediction model will be warned. Compared with the current existing technology, this solution collects historical pressure data to HDFS through Hadoop, then reads the HDFS data set through Spark, and uses Spark MLlib's decision tree regression algorithm to train data. Set, put the trained results into Redis for formal data comparison, which can accurately detect and give timely warnings.

以上所述的仅是本发明的实施例，方案中公知的具体结构及特性等常识在此未作过多描述，所属领域普通技术人员知晓申请日或者优先权日之前发明所属技术领域所有的普通技术知识，能够获知该领域中所有的现有技术，并且具有应用该日期之前常规实验手段的能力，所属领域普通技术人员可以在本申请给出的启示下，结合自身能力完善并实施本方案，一些典型的公知结构或者公知方法不应当成为所属领域普通技术人员实施本申请的障碍。应当指出，对于本领域的技术人员来说，在不脱离本发明结构的前提下，还可以作出若干变形和改进，这些也应该视为本发明的保护范围，这些都不会影响本发明实施的效果和专利的实用性。本申请要求的保护范围应当以其权利要求的内容为准，说明书中的具体实施方式等记载可以用于解释权利要求的内容。What is described above is only an embodiment of the present invention, and the common knowledge such as the specific structure and characteristics known in the scheme is not described too much here, and those of ordinary skill in the art know all the common knowledge in the technical field to which the invention belongs before the filing date or the priority date Technical knowledge, being able to know all the existing technologies in this field, and having the ability to apply conventional experimental methods before this date, those of ordinary skill in the art can improve and implement this plan based on their own abilities under the inspiration given by this application, Some typical known structures or known methods should not be obstacles for those of ordinary skill in the art to implement the present application. It should be pointed out that for those skilled in the art, under the premise of not departing from the structure of the present invention, some modifications and improvements can also be made, which should also be regarded as the protection scope of the present invention, and these will not affect the implementation of the present invention. Effects and utility of patents. The scope of protection required by this application shall be based on the content of the claims, and the specific implementation methods and other records in the specification may be used to interpret the content of the claims.

Claims

1. The gas equipment early warning method based on big data analysis is characterized by comprising the following steps:

s1, data acquisition and reading: firstly, acquiring historical pressure data of gas equipment to an HDFS through Hadoop, and then reading a data set of the HDFS through Spark;

s2, data preprocessing: firstly converting character type features into numerical value types, filtering redundant features, and automatically identifying category features and continuity features by setting the maximum value of category numbers;

s3, data division: dividing the data set into a training set and a testing set;

s4, parameter setting: setting a Spark MLlib decision tree regression model parameter;

s5, conversion and assembly: feature conversion of the decision tree regression model is carried out, and the decision tree regression model training assembly is assembled on a production line;

s6, training and predicting: and training a decision tree regression model, and carrying out early warning through a prediction result of the decision tree regression model.

2. The gas equipment early warning method based on big data analysis according to claim 1, wherein the fields adopted by the pressure data in S1 include temperature, pressure, acquisition time, and whether gas peak is used or not at the time of acquisition.

3. The gas equipment pre-warning method based on big data analysis according to claim 2, wherein the pressure characteristics of the pressure data are used as a sign in S2, the characteristics with different values smaller than or equal to 24 of the pressure data are identified as category characteristics, the characteristics with different values larger than 24 are identified as continuity characteristics, and the category characteristics are indexed or digitized.

4. The gas equipment early warning method based on big data analysis according to claim 3, wherein the data set is divided into a training set and a test set in S3 randomly.

5. The gas equipment early warning method based on big data analysis according to claim 4, further comprising S7, data storage: and (3) storing the prediction result of the decision tree regression model in the S6 into Redis for early warning.

6. Gas equipment early warning system based on big data analysis, its characterized in that includes:

the data acquisition and reading module is used for acquiring historical pressure data to the HDFS through Hadoop and then reading a data set of the HDFS through Spark;

the data acquisition and reading module is connected with a data preprocessing module, and the data preprocessing module is used for converting character type features into numerical value types and filtering redundant features, and automatically identifying category features and continuity features by setting the maximum value of category numbers;

the data preprocessing module is connected with a data dividing module, and the data dividing module is used for dividing a data set into a training set and a testing set;

the data dividing module is connected with a parameter setting module, and the parameter setting module is used for setting parameters of a Spark MLlib decision tree regression model;

the parameter setting module is connected with a conversion and assembly module, and the conversion and assembly module is used for carrying out feature conversion related to the decision tree regression model and assembling the decision tree regression model training on the assembly line;

the conversion and assembly module is connected with a training and predicting module, and the training and predicting module is used for training the decision tree regression model and performing early warning through the predicting result of the decision tree regression model;

the training and predicting module is connected with a data storage module, and the data storage module is used for storing the predicting result of the decision tree regression model into the Redis for early warning.