CN109344201A

CN109344201A - A system and method for evaluating database performance load based on machine learning

Info

Publication number: CN109344201A
Application number: CN201811207264.9A
Authority: CN
Inventors: 张明明; 钱琳; 俞俊; 朱广新; 邵星星
Original assignee: Information And Communication Branch Of Jiangsu Electric Power Co Ltd; NARI Group Corp; NARI Technology Co Ltd
Current assignee: Information And Communication Branch Of Jiangsu Electric Power Co Ltd; NARI Group Corp; NARI Technology Co Ltd
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2019-02-15

Abstract

The invention provides a database performance load evaluation system and evaluation method based on machine learning. Feature data on performance and load are collected in large volume, processed, and used as a training set for a machine learning algorithm, which generates a performance and load learning model. This model is then used to make predictions on newly generated feature data and thereby evaluate the performance and load of the database. On the one hand, the conclusions of the evaluation system are more reasonable than an expert model's analysis of the features, important performance load indicators are less likely to be missed, and database performance and load problems are located more accurately; on the other hand, the knowledge and ability required of database operation and maintenance personnel are reduced, which can greatly save labor costs and improve work efficiency.

Description

A database performance load evaluation system and method based on machine learning

Technical field

The present invention relates to a database performance load evaluation system and method based on machine learning. It belongs to the field of artificial intelligence and machine learning, and concerns database operation and maintenance.

Background art

Database applications require low response times and high performance, so performance load testing is carried out before a database application is deployed. However, as the application keeps running, the number of users and the volume of data both grow, which drives up the performance pressure and load on the database. Without timely intervention and handling, this can lead to failures of varying severity, such as system breakdown. Real-time monitoring of database performance and load has therefore become very important.

Many performance load monitoring tools for databases exist today, but they generally just display a few key operating indicators and sometimes fail to reflect the true operating condition of the database. When a database shows high response times and low performance, analysis by an expert DBA is usually still needed; these monitoring tools contribute little to locating and analyzing the problem.

Performance expert models and load expert models, which are commonly used in the field, are a considerable improvement over ordinary performance load monitoring tools in analyzing database problems and locating faults, but they still have the following disadvantages:

1) An expert model is the accumulation of many years of senior DBA knowledge and experience. This places very high demands on the working experience and proficiency of database operation and maintenance personnel, which in effect raises labor costs.

2) Besides rule definitions, an expert model also contains complex scripts and code. This requires a stable development and operations team and makes the model sensitive to staff turnover.

3) An expert model is the accumulated experience of senior DBAs, and it is uncertain whether the indicators and rules it defines truly fit the performance and load of the database, because each DBA understands the indicators and rules somewhat differently. If a low-probability abnormal situation occurs that the expert model's indicators do not cover, a senior DBA has to find the factors affecting performance and load in a large amount of historical data, a process that is often very time-consuming.

In summary, a new technical solution is needed to solve the above problems.

Summary of the invention

Object of the invention: the present invention discloses a database performance load evaluation system and evaluation method based on machine learning, which trains a machine learning algorithm on data to generate a performance load learning model and uses this model to make predictions on, and evaluate, new data. On the one hand, the conclusions of the evaluation system are more reasonable than an expert model's analysis of the features, important performance load indicators are less likely to be omitted, and database performance and load problems are located more accurately; on the other hand, the knowledge and ability required of database operation and maintenance personnel are reduced, which can greatly save labor costs and improve working efficiency.

Technical solution: to achieve the above object, the database performance load evaluation system based on machine learning of the present invention may adopt the following technical solution:

A database performance load evaluation system based on machine learning, comprising:

A data acquisition module, for capturing feature data from database AWR reports and database logs;

A data preprocessing module, for deleting single-value features, features with missing values, and highly correlated features from the feature data, and also for filling in missing data and normalizing the data;

A training data model module, for training on the data to generate a model;

A model evaluation module, for evaluating the model on the validation set according to the evaluation indicators of the different machine learning models;

A model tuning module, for automatically tuning the model by adjusting its hyperparameters;

A model prediction module, for making predictions with the model, divided into offline prediction and online prediction. Offline prediction uses the test set split off from the training data; online prediction uses real-time data collected by the data collector, which must be normalized by the scaler before prediction.

Further, in the data preprocessing module, single-value feature deletion means: when all values in a feature column are identical, the column is deleted directly;

Missing feature deletion means: when the proportion of missing values in a feature column reaches a specific threshold, the feature column is deleted;

Highly correlated feature deletion means: the correlation between feature variables is calculated with the Pearson correlation coefficient and a correlation matrix is built; when the correlation between two features is found to exceed a certain threshold, one of them is deleted;

Missing data filling includes: filling with the data of the row preceding the missing value; filling with the mean of the feature column containing the missing value; and, with performance and load divided into grades A, B, C, D, and E, filling missing values with the mean of the corresponding grade.

Data normalization means: the original feature data are normalized and scaled, using min-max normalization and standard deviation normalization; the scaler used during normalization is saved locally, and when newly collected data are to be predicted, the same scaler must be used to normalize them to the same degree.

Further, in the training data model module, the data set is divided into a training set, a validation set, and a test set, which account for 70%, 15%, and 15% of the data set respectively;

The system trains the model with an ensemble algorithm: the gradient boosting tree algorithm (GBM) obtains a boosted tree model based on gradient descent. The regression model uses the trainer LGBMRegressor, and the classification model uses the trainer LGBMClassifier.

Further, early stopping and CheckPoint technology are used during model evaluation;

Early stopping: training is stopped ahead of schedule. When the validation error on the validation set is found to no longer change during training, training stops automatically after k rounds, where k is a preset value;

CheckPoint technology: this technique automatically saves the best model obtained during training.

Further, the hyperparameters in the model tuning module include:

Number of iteration rounds: epochs

Learning rate: learning_rate

Tree depth: max_depth

Maximum number of leaves per learner: max_leaves

Number of weak learners: n_estimators

Loss function: objective.

Beneficial effects: the database performance load evaluation system based on machine learning of the present invention extracts multiple feature indicators of performance and load from the AWR reports and database logs of multiple database instances. After data processing (cleaning and conversion), this large volume of data is used as a training set for machine learning, which finally generates a performance and load learning model; the model is used to make predictions on newly generated feature data and thereby evaluate the performance and load condition of the database. The root-mean-square error of the system's performance regression model and load regression model is about 4 to 6, and the accuracy of the performance classification model and load classification model is about 99.3%, which meets the requirements of production use. The system is an innovative application of artificial intelligence in database applications; it reduces enterprise labor costs and achieves good economic benefit.

Corresponding to the above database performance load evaluation system based on machine learning, the present invention also provides a technical solution for a database performance load evaluation method based on machine learning:

A database performance load evaluation method based on machine learning, comprising the following steps:

(1) Data acquisition: feature data are captured from database AWR reports and database logs;

(2) Data preprocessing: single-value features, features with missing values, and highly correlated features are deleted from the feature data, while missing data are filled in and the data are normalized;

(3) Model training: the data are used for training to generate a model;

(4) Model evaluation: the model is evaluated on the validation set according to the evaluation indicators of the different machine learning models;

(5) Model tuning: the model is tuned automatically by adjusting its hyperparameters;

(6) Model prediction: divided into offline prediction and online prediction. Offline prediction uses the test set split off from the training data; online prediction uses real-time data collected by the data collector, which must be normalized by the scaler before prediction.

In step (2), single-value feature deletion means: when all values in a feature column are identical, the column is deleted directly;

In step (3), the data set is divided into a training set, a validation set, and a test set, which account for 70%, 15%, and 15% of the data set respectively;

In step (4), early stopping and CheckPoint technology are used;

The hyperparameters in step (5) include:

Number of iteration rounds: epochs

Learning rate: learning_rate

Tree depth: max_depth

Maximum number of leaves per learner: max_leaves

Number of weak learners: n_estimators

Loss function: objective.

Beneficial effects: corresponding to the beneficial effects of the above database performance load evaluation system based on machine learning, the conclusions reached by this load evaluation method are more reasonable than an expert model's analysis of the features, important performance load indicators are less likely to be omitted, and database performance and load problems are located more accurately; in addition, the knowledge and ability required of database operation and maintenance personnel are reduced, which can greatly save labor costs and improve working efficiency.

Brief description of the drawings

Fig. 1 is a flow chart of the machine-learning-based database performance load evaluation method used in this evaluation system.

Detailed description of embodiments

Refer to Fig. 1.

The present invention provides a database performance load evaluation system based on machine learning, comprising:

A training data model module, for training on the data to generate a model;

In the data preprocessing module, single-value feature deletion means: when all values in a feature column are identical, the column is deleted directly;

In the training data model module, the data set is divided into a training set, a validation set, and a test set, which account for 70%, 15%, and 15% of the data set respectively;

Early stopping and CheckPoint technology are used during model evaluation;

The hyperparameters in the model tuning module include:

Number of iteration rounds: epochs

Learning rate: learning_rate

Tree depth: max_depth

Maximum number of leaves per learner: max_leaves

Number of weak learners: n_estimators

Loss function: objective

With reference to Fig. 1, the machine-learning-based database performance load evaluation method in this evaluation system comprises:

1. Obtain the data set.

A PostgreSQL database is used to store and access the performance load indicators from the AWR reports and logs of the Oracle database, with data captured every 3 minutes on average. The feature data in the AWR table have 62 dimensions and the feature data in the log table have 200 dimensions. A Python script saves the data in the PostgreSQL tables to a local CSV file in order of capture time.
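
The export step could be sketched roughly as follows. This is only an illustration using the Python standard library: the rows are assumed to have already been read from the database, and the column names `capture_time`, `db_time`, and `log_errors` are hypothetical, not taken from the patent.

```python
import csv
import io
from datetime import datetime

def write_samples_sorted(rows, out_file, time_key="capture_time"):
    """Write sample rows to CSV in ascending order of capture time."""
    ordered = sorted(rows, key=lambda r: datetime.fromisoformat(r[time_key]))
    # Collect the union of column names in first-seen order, time column first.
    fieldnames = [time_key]
    for row in ordered:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    # restval="" leaves a blank cell where a row lacks a feature (a missing value).
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(ordered)
    return fieldnames

# Hypothetical samples captured about three minutes apart, deliberately out of order.
rows = [
    {"capture_time": "2018-10-17T10:06:00", "db_time": 12.5, "log_errors": 0},
    {"capture_time": "2018-10-17T10:00:00", "db_time": 9.1, "log_errors": 1},
    {"capture_time": "2018-10-17T10:03:00", "db_time": 10.4},
]
buffer = io.StringIO()  # stands in for the local CSV file
columns = write_samples_sorted(rows, buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch self-contained; passing an open file object instead would produce the local CSV file.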

2. Data preprocessing.

Single-value feature deletion: when all values in a feature column are identical, the column does not help the model fit and only adds computation, so it can be deleted directly.

Missing feature deletion: when the proportion of missing values in a feature column reaches a specific threshold, the column is deleted. In this system the missing threshold is set to 60%.

Highly correlated feature deletion: the correlation between feature variables is calculated with the Pearson correlation coefficient and a correlation matrix is built; when the correlation between two features is found to exceed a certain threshold, one of them is deleted. In this system the correlation threshold is set to 0.9.
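
The three deletion rules can be illustrated with a small plain-Python sketch. It assumes the features are held as a dictionary of equal-length columns with `None` marking a missing value, compares the absolute value of the correlation with the threshold, and keeps the earlier column of a correlated pair; these choices and the column names are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def prune_features(columns, missing_threshold=0.6, corr_threshold=0.9):
    """Return the names of the columns kept after the three deletion rules."""
    candidates = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(set(present)) <= 1:
            continue  # single-value feature
        if 1 - len(present) / len(values) >= missing_threshold:
            continue  # missing ratio reaches the threshold
        candidates.append(name)
    kept = []
    for name in candidates:
        # Drop a column if it is highly correlated with one already kept.
        if all(abs(pearson(columns[name], columns[other])) <= corr_threshold
               for other in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns.
columns = {
    "cpu_pct":   [10.0, 20.0, 30.0, 40.0, 50.0],
    "cpu_ratio": [0.1, 0.2, 0.3, 0.4, 0.5],       # perfectly correlated with cpu_pct
    "constant":  [1.0, 1.0, 1.0, 1.0, 1.0],       # single value
    "sparse":    [None, None, None, 2.0, None],   # 80% missing
    "io_wait":   [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = prune_features(columns)
```

With the thresholds from the text (60% and 0.9), only `cpu_pct` and `io_wait` survive in this toy data set.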

Missing data filling: missing data are unavoidable in the data acquisition process, and the best that can be done is to approximate the missing values as closely as possible. This system uses three methods:

Filling with the data of the row preceding the missing value.

Filling with the mean of the feature column containing the missing value.

With performance and load divided into grades A, B, C, D, and E, filling missing values with the mean of the corresponding grade.
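
As a rough illustration of the three filling methods, the sketch below works on a single feature column given as a Python list, with `None` marking a missing value. It assumes every column and every grade has at least one observed value; the sample values and grades are made up.

```python
def fill_previous(values):
    """Fill each missing value with the value from the preceding row."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last  # stays None when there is no earlier row to copy from
        filled.append(v)
        last = v
    return filled

def fill_column_mean(values):
    """Fill missing values with the mean of the observed values in the column."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def fill_grade_mean(values, grades):
    """Fill missing values with the mean of the rows that share the same grade."""
    sums, counts = {}, {}
    for v, g in zip(values, grades):
        if v is not None:
            sums[g] = sums.get(g, 0.0) + v
            counts[g] = counts.get(g, 0) + 1
    return [sums[g] / counts[g] if v is None else v
            for v, g in zip(values, grades)]

# Hypothetical feature column and the grade assigned to each row.
values = [4.0, None, 8.0, None, 6.0]
grades = ["A", "A", "B", "B", "A"]

by_previous = fill_previous(values)        # [4.0, 4.0, 8.0, 8.0, 6.0]
by_mean = fill_column_mean(values)         # [4.0, 6.0, 8.0, 6.0, 6.0]
by_grade = fill_grade_mean(values, grades) # [4.0, 5.0, 8.0, 8.0, 6.0]
```

The grade-based variant differs from the column mean only where the grades have different averages, as in the second and fourth rows here.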

Data normalization: the original feature data are normalized and scaled. This system uses two methods: min-max normalization and standard deviation normalization (zero-mean normalization). The scaler used during normalization is saved locally, and when newly collected data are to be predicted, the same scaler must be used to normalize them to the same degree.
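
A minimal sketch of fitting a scaler on training data, saving it locally, and reusing it for a new sample is given below. It is written in plain Python and stores the parameters as JSON; the file name, the JSON format, and the use of the population standard deviation are assumptions, since the text does not specify how the scaler is stored.

```python
import json
import os
import tempfile

def fit_scaler(columns):
    """Record per-column min, max, mean, and (population) standard deviation."""
    params = {}
    for name, values in columns.items():
        n = len(values)
        mean = sum(values) / n
        std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
        params[name] = {"min": min(values), "max": max(values),
                        "mean": mean, "std": std}
    return params

def min_max(value, p):
    """Min-max normalization: map the training range onto [0, 1]."""
    span = p["max"] - p["min"]
    return (value - p["min"]) / span if span else 0.0

def z_score(value, p):
    """Zero-mean normalization: subtract the mean, divide by the std."""
    return (value - p["mean"]) / p["std"] if p["std"] else 0.0

# Hypothetical training column.
train = {"cpu_pct": [10.0, 20.0, 30.0, 40.0, 50.0]}
scaler = fit_scaler(train)

# Save the fitted scaler locally, then reload it as the prediction side would.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scaler_params.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(scaler, f)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# A newly collected value is scaled with the training-time parameters.
new_value = 35.0
scaled_min_max = min_max(new_value, loaded["cpu_pct"])  # (35 - 10) / 40
scaled_z = z_score(new_value, loaded["cpu_pct"])        # (35 - 30) / sqrt(200)
```

The point of persisting the parameters is that new data are scaled with the statistics of the training data rather than with statistics recomputed on the new sample.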

In the classification model, performance grades and load grades must be assigned by interval according to the performance score and load score, as follows:

3. Train on the data to generate a model

The data set is divided into a training set, a validation set, and a test set, which account for 70%, 15%, and 15% of the data set respectively.
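
The 70/15/15 split might look like the following plain-Python sketch. Whether the rows are shuffled or split chronologically is not stated in the text; a seeded shuffle is assumed here purely for illustration.

```python
import random

def split_dataset(rows, train=0.70, valid=0.15, seed=0):
    """Split rows into training, validation, and test sets (test gets the rest)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # assumption: random rather than time-ordered
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

# 200 hypothetical sample indices.
train_set, valid_set, test_set = split_dataset(range(200))
sizes = (len(train_set), len(valid_set), len(test_set))  # (140, 30, 30)
```

The validation set is the one used for model evaluation and tuning; the test set is held back for offline prediction.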

4. Model evaluation

Early stopping and CheckPoint technology are used during model evaluation.

Early stopping: training is stopped ahead of schedule. When the validation error on the validation set is found to no longer change during training, training stops automatically after k rounds, where k is configurable. This avoids wasted training rounds and improves training efficiency.

CheckPoint technology: this technique automatically saves the best model obtained during training, so that training results are preserved in time even if an exception interrupts training.
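
The interaction of the two techniques can be sketched with a generic training loop. The callbacks `train_one_round` and `validate` are hypothetical placeholders for the real trainer, the validation errors are made up, and the checkpoint is kept in memory here, whereas a real run would write it to disk.

```python
import copy

def train_with_early_stopping(train_one_round, validate, max_rounds, patience):
    """Train until the validation error has not improved for `patience` rounds.

    Returns the best model seen (the checkpoint), its validation error,
    and the number of rounds actually run.
    """
    best_error, best_model, best_round = float("inf"), None, -1
    rounds_run = 0
    for round_idx in range(max_rounds):
        model = train_one_round(round_idx)
        error = validate(model)
        rounds_run = round_idx + 1
        if error < best_error:
            # Checkpoint: keep a copy of the best model so far.
            best_error, best_round = error, round_idx
            best_model = copy.deepcopy(model)
        elif round_idx - best_round >= patience:
            break  # early stop: k rounds without improvement
    return best_model, best_error, rounds_run

# Made-up validation errors per round; the error stops improving after round 2.
errors = [9.0, 7.0, 6.0, 6.0, 6.1, 6.0, 5.0]
best_model, best_error, rounds_run = train_with_early_stopping(
    train_one_round=lambda i: {"round": i},      # stand-in for one boosting round
    validate=lambda model: errors[model["round"]],
    max_rounds=len(errors),
    patience=3,                                  # the k from the text
)
```

In this example training stops after six rounds and the checkpoint from round 2 is returned, so the later rounds are never run.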

5. Model tuning

The model tuning function uses hyperopt to adjust the hyperparameters of the model. The main parameters include:

Number of iteration rounds: epochs

Learning rate: learning_rate

Tree depth: max_depth

Maximum number of leaves per learner: max_leaves

Number of weak learners: n_estimators

Loss function: objective
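
To show the general shape of automated tuning over these parameters, the sketch below uses plain random search as a simple stand-in; it is not the tuning library named above. The search ranges and the toy objective are hypothetical; in practice the objective would train a model with the given parameters and return its validation error.

```python
import random

# Hypothetical search ranges for four of the listed hyperparameters.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, -0.5),
    "max_depth": lambda rng: rng.randint(3, 10),
    "max_leaves": lambda rng: rng.randint(8, 128),
    "n_estimators": lambda rng: rng.randint(50, 500),
}

def random_search(objective, space, n_trials, seed=0):
    """Sample parameter sets at random and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

def toy_objective(params):
    # Placeholder for "train with these parameters, return validation error".
    return (params["max_depth"] - 6) ** 2 + abs(params["learning_rate"] - 0.05)

best_params, best_loss = random_search(toy_objective, SEARCH_SPACE, n_trials=50)
```

A dedicated tuning library would typically replace the uniform sampling with a guided search, but the loop of proposing parameters, scoring them on the validation set, and keeping the best is the same.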

6. Model prediction

Offline prediction: the tuned model makes predictions on the test set split off from the training data.

Real-time prediction: data are captured from the database in real time and predicted; the prediction result is displayed and the new data are inserted into the historical data, to be used as training data when the model is later updated.
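
The real-time path can be sketched as follows: scale the new sample with the saved scaler, predict, and append the sample to the history. The scaler contents, the feature names, and `stub_model` are hypothetical placeholders for the stored scaler and the trained model, and min-max scaling is assumed.

```python
def scale_sample(sample, scaler):
    """Min-max scale one sample using the (min, max) pairs saved at training time."""
    scaled = {}
    for name, value in sample.items():
        low, high = scaler[name]
        scaled[name] = (value - low) / (high - low) if high > low else 0.0
    return scaled

def predict_online(sample, scaler, model, history):
    """Scale a new sample, predict its score, and keep it for later retraining."""
    score = model(scale_sample(sample, scaler))
    history.append({**sample, "predicted_score": score})
    return score

# Hypothetical saved scaler: feature name -> (min, max) seen during training.
scaler = {"cpu_pct": (0.0, 100.0), "active_sessions": (0.0, 200.0)}

def stub_model(features):
    # Stand-in for the trained regression model; returns a score out of 100.
    return 100.0 * (1.0 - 0.5 * features["cpu_pct"]
                    - 0.5 * features["active_sessions"])

history = []
score = predict_online({"cpu_pct": 50.0, "active_sessions": 100.0},
                       scaler, stub_model, history)
```

The history list plays the role of the historical data store from which a later training set is drawn.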

In addition, the present invention can be implemented in many specific ways, and the above is only a preferred embodiment. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and these shall also be regarded as falling within its scope of protection. Any component not specified in this embodiment can be implemented with existing technology.

Claims

1. A system for evaluating database performance load based on machine learning, comprising:

A data acquisition module, used to capture feature data from database AWR reports and database logs;

The data preprocessing module is used to delete single-value features, missing features, and high-correlation features in the feature data, and at the same time, it is used to fill in missing data and normalize data;

The training data model module is used to train on the data to generate a model;

The model evaluation module is used to evaluate the model using the validation set according to the evaluation indicators of different machine learning models;

The model tuning module is used to automatically tune the model and adjust the hyperparameters of the model;

The model prediction module is used to make predictions with the model, which is divided into offline prediction and online prediction; offline prediction refers to prediction on the test set separated from the training set; online prediction refers to prediction on the real-time data collected by the data collector, which needs to be normalized by the scaler before prediction.

2. The database performance load evaluation system according to claim 1, wherein, in the data preprocessing module, the single-value feature deletion is: when all the feature values of a certain column are the same, directly delete;

The deletion of missing features is: when the missing ratio of the feature column reaches a certain threshold, the feature column is deleted;

Deletion of highly correlated features is: Calculate the correlation between feature variables through the Pearson correlation coefficient, build a correlation matrix, and delete one of the features when the feature correlation is found to be higher than a certain threshold;

Filling of missing values includes: filling with the data of the row preceding the missing value; filling with the average value of the feature column containing the missing value; performance and load are divided into grades A, B, C, D, and E, and missing values are filled according to the average value of each grade.

Data normalization is: normalize and scale the original feature data, using min-max normalization and standard deviation normalization; the scaler in the data normalization process is saved locally, and when newly collected data are predicted, the scaler needs to be used to normalize the data to the same degree.

3. The database performance load evaluation system according to claim 2, wherein in the training data model module, the data set is divided into a training set, a verification set and a test set. The training set, validation set and test set account for 70%, 15% and 15% of the dataset respectively;

This system uses an ensemble algorithm to train the model, and the gradient boosting tree algorithm GBM obtains the boosted tree model based on the gradient descent algorithm. The regression model uses the trainer LGBMRegressor, and the classification model uses the trainer LGBMClassifier.

4. The database performance load evaluation system according to claim 3, wherein early stopping and CheckPoint technology are used in the model evaluation process;

Early stopping: training is stopped in advance; when the validation error on the validation set is seen to no longer change during the training process, the training automatically stops after k rounds, where k is a preset value;

CheckPoint technology: This technology automatically saves the optimal training model during the training process.

5. The database performance load evaluation system according to claim 4, wherein the hyperparameters in the model tuning module comprise:

Number of iteration rounds: epochs

Learning rate: learning_rate

Tree depth: max_depth

Maximum number of leaves per learner: max_leaves

Number of weak learners: n_estimators

Loss function: objective.

6. A method for evaluating database performance load based on machine learning, comprising the following steps:

(1) Data acquisition: capture feature data from database AWR reports and database logs;

(2) Data preprocessing: delete single-value features, features with missing values, and highly correlated features from the feature data, and at the same time fill in missing data and normalize the data;

(3) Model training: train on the data to generate a model;

(4) Model evaluation: according to the evaluation indicators of different machine learning models, use the validation set to evaluate the model;

(5) Model tuning: automatically tune the model and adjust the hyperparameters of the model;

(6) Model prediction: divided into offline prediction and online prediction; offline prediction refers to prediction on the test set separated from the training set; online prediction refers to prediction on the real-time data collected by the data collector, which needs to be normalized by the scaler before making predictions.

7. The database performance load evaluation system according to claim 6, wherein in step (2), the single-value feature deletion is: when all the feature values of a certain column are the same, the column is directly deleted;

8. The database performance load evaluation system according to claim 2, wherein in step (3), the data set is divided into a training set, a validation set, and a test set, which account for 70%, 15%, and 15% of the data set respectively;

9. The database performance load evaluation system according to claim 3, wherein in step (4), early stopping and CheckPoint technology are used;

10. The database performance load evaluation system according to claim 4, wherein the hyperparameters in step (5) comprise:

Number of iteration rounds: epochs

Learning rate: learning_rate

Tree depth: max_depth

Maximum number of leaves per learner: max_leaves

Number of weak learners: n_estimators

Loss function: objective.