A database performance and load evaluation system and method based on machine learning
Technical field
The present invention relates to a machine-learning-based system and method for evaluating database performance and load. It belongs to the machine learning field of artificial intelligence and relates to database operation and maintenance.
Background technique
Database applications require low response times and high performance, so performance and load testing is carried out before a database application is deployed. As the application keeps running, however, the number of users and the volume of data grow, and the performance demands and load on the database rise with them. Without timely intervention and handling, failures of varying severity, such as system crashes, may occur. Real-time monitoring of database performance and load has therefore become very important.
Many database performance and load monitoring tools exist today, but they generally do no more than display a few key operating metrics, which sometimes fail to reflect the true operating condition of the database. When the database shows high response times and low performance, analysis by an expert DBA is usually still required; the monitoring tools contribute little to locating and analyzing the problem.
Compared with ordinary performance and load monitoring tools, the performance expert models and load expert models commonly used in the field are a considerable improvement at analyzing database problems and locating faults, but they still have the following disadvantages:
1) An expert model is the accumulation of years of senior DBA knowledge and experience. It places high demands on the working experience and proficiency of database operation and maintenance personnel, which in effect raises labor costs.
2) Besides rule definitions, an expert model also contains complex scripts and code. This requires stable development and operation and maintenance teams and makes the model sensitive to staff turnover.
3) An expert model is built from the accumulated experience of senior DBAs. Whether the metrics and rules it defines truly fit the performance and load of the database is uncertain, because each DBA understands the metrics and rules somewhat differently. If a low-probability anomaly occurs that the model's metrics do not cover, a senior DBA has to discover the factors affecting performance and load from a large amount of historical data, which is often very time-consuming.
In summary, a new technical solution is needed to solve the above problems.
Summary of the invention
Purpose of the invention: the present invention discloses a database performance and load evaluation system and evaluation method based on machine learning. A machine learning algorithm is trained on collected data to produce performance and load learning models, and these models are used to predict and evaluate new data. On the one hand, the conclusions of the evaluation system are more reasonable than an expert model's analysis of features, are less likely to omit important performance and load metrics, and locate database performance and load problems more accurately. On the other hand, the system lowers the knowledge and skill required of database operation and maintenance personnel, which can greatly reduce labor costs and improve working efficiency.
Technical solution: to achieve the above purpose, the database performance and load evaluation system based on machine learning of the present invention may adopt the following technical solution:
A database performance and load evaluation system based on machine learning, comprising:
a data acquisition module, configured to capture feature data from database AWR reports and database logs;
a data preprocessing module, configured to delete single-value features, features with excessive missing data, and highly correlated features from the feature data, and also to fill in missing data and normalize the data;
a model training module, configured to train on the data and generate a model;
a model evaluation module, configured to evaluate the model on the validation set according to the evaluation metrics of the different machine learning models;
a model tuning module, configured to tune the model automatically by adjusting its hyperparameters;
a model prediction module, configured to make predictions with the model, divided into offline prediction and online prediction; offline prediction uses the test set split off from the training data; online prediction uses real-time data collected by the data collector, which must be normalized by the scaler before prediction.
Further, in the data preprocessing module, single-value feature deletion means: when all values in a feature column are identical, the column is deleted directly;
missing feature deletion means: when the proportion of missing values in a feature column reaches a specified threshold, the column is deleted;
highly correlated feature deletion means: the correlation between feature variables is calculated with the Pearson correlation coefficient and a correlation matrix is built; when the correlation between two features is higher than a given threshold, one of them is deleted;
missing data filling includes: filling with the value from the row preceding the missing value; filling with the mean of the feature column containing the missing value; and, with performance and load divided into grades A, B, C, D and E, filling the missing value with the mean for the corresponding grade;
data normalization means: the original feature data is scaled using min-max normalization and standard deviation (zero-mean) normalization; the scaler fitted during normalization is saved locally and must be used to apply the same scaling to newly collected data before it is predicted.
Further, in the model training module, the data set is divided into a training set, a validation set and a test set, which account for 70%, 15% and 15% of the data set respectively;
the system trains its models with an ensemble algorithm: the gradient boosting machine (GBM) algorithm obtains a boosted tree model by gradient descent. The regression models use the trainer LGBMRegressor, and the classification models use the trainer LGBMClassifier.
Further, early stopping and checkpointing are used during model evaluation;
early stopping: training is stopped ahead of schedule; when the validation error on the validation set is found to have stopped changing during training, training stops automatically after k rounds, where k is a preset value;
checkpointing: the best model found during training is saved automatically.
Further, the hyperparameters in the model tuning module include:
Number of iteration rounds: epochs
Learning rate: learning_rate
Maximum tree depth: max_depth
Maximum number of leaves per learner: max_leaves
Number of weak learners: n_estimators
Loss function: objective.
Beneficial effects: the database performance and load evaluation system based on machine learning of the present invention extracts multiple performance and load feature metrics from the AWR reports and database logs of multiple database instances. After data processing (cleaning and transformation), this large body of data is used as the training set for machine learning, producing performance and load learning models. The models are then used to make predictions on newly generated feature data and so evaluate the performance and load condition of the database. The root-mean-square error of the system's performance regression model and load regression model is about 4 to 6, and the accuracy of the performance classification model and load classification model reaches about 99.3%, which meets the requirements of production use. The system is an innovative application of artificial intelligence to databases; it reduces an enterprise's labor costs and delivers good economic benefit.
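The two figures quoted above are a root-mean-square error for the regression models and an accuracy for the classification models. As a minimal sketch of how such figures are conventionally computed (the function names are illustrative and not taken from the system), assuming the true and predicted values are plain Python lists of equal length:

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def accuracy(labels_true, labels_pred):
    """Fraction of samples whose predicted grade equals the true grade."""
    if not labels_true or len(labels_true) != len(labels_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)
```

An accuracy of 0.993 returned by such a function corresponds to the 99.3% figure when expressed as a percentage.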
Corresponding to the above database performance and load evaluation system based on machine learning, the present invention also provides a technical solution for a database performance and load evaluation method based on machine learning:
A database performance and load evaluation method based on machine learning, comprising the following steps:
(1) data acquisition: feature data is captured from database AWR reports and database logs;
(2) data preprocessing: single-value features, features with excessive missing data, and highly correlated features are deleted from the feature data, missing data is filled in, and the data is normalized;
(3) model training: the data is used to train and generate a model;
(4) model evaluation: the model is evaluated on the validation set according to the evaluation metrics of the different machine learning models;
(5) model tuning: the model is tuned automatically by adjusting its hyperparameters;
(6) model prediction: divided into offline prediction and online prediction; offline prediction uses the test set split off from the training data; online prediction uses real-time data collected by the data collector, which must be normalized by the scaler before prediction.
In step (2), single-value feature deletion means: when all values in a feature column are identical, the column is deleted directly;
missing feature deletion means: when the proportion of missing values in a feature column reaches a specified threshold, the column is deleted;
highly correlated feature deletion means: the correlation between feature variables is calculated with the Pearson correlation coefficient and a correlation matrix is built; when the correlation between two features is higher than a given threshold, one of them is deleted;
missing data filling includes: filling with the value from the row preceding the missing value; filling with the mean of the feature column containing the missing value; and, with performance and load divided into grades A, B, C, D and E, filling the missing value with the mean for the corresponding grade;
data normalization means: the original feature data is scaled using min-max normalization and standard deviation (zero-mean) normalization; the scaler fitted during normalization is saved locally and must be used to apply the same scaling to newly collected data before it is predicted.
In step (3), the data set is divided into a training set, a validation set and a test set, which account for 70%, 15% and 15% of the data set respectively;
the models are trained with an ensemble algorithm: the gradient boosting machine (GBM) algorithm obtains a boosted tree model by gradient descent. The regression models use the trainer LGBMRegressor, and the classification models use the trainer LGBMClassifier.
In step (4), early stopping and checkpointing are used;
early stopping: training is stopped ahead of schedule; when the validation error on the validation set is found to have stopped changing during training, training stops automatically after k rounds, where k is a preset value;
checkpointing: the best model found during training is saved automatically.
The hyperparameters in step (5) include:
Number of iteration rounds: epochs
Learning rate: learning_rate
Maximum tree depth: max_depth
Maximum number of leaves per learner: max_leaves
Number of weak learners: n_estimators
Loss function: objective.
Beneficial effects: corresponding to the beneficial effects of the above database performance and load evaluation system based on machine learning, the conclusions reached by this load evaluation method are more reasonable than an expert model's analysis of features, are less likely to omit important performance and load metrics, and locate database performance and load problems more accurately; the method also lowers the knowledge and skill required of database operation and maintenance personnel, which can greatly reduce labor costs and improve working efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the machine-learning-based database performance and load evaluation method used in the evaluation system.
Specific embodiment
Referring to Fig. 1,
the present invention provides a database performance and load evaluation system based on machine learning, comprising:
a data acquisition module, configured to capture feature data from database AWR reports and database logs;
a data preprocessing module, configured to delete single-value features, features with excessive missing data, and highly correlated features from the feature data, and also to fill in missing data and normalize the data;
a model training module, configured to train on the data and generate a model;
a model evaluation module, configured to evaluate the model on the validation set according to the evaluation metrics of the different machine learning models;
a model tuning module, configured to tune the model automatically by adjusting its hyperparameters;
a model prediction module, configured to make predictions with the model, divided into offline prediction and online prediction; offline prediction uses the test set split off from the training data; online prediction uses real-time data collected by the data collector, which must be normalized by the scaler before prediction.
In the data preprocessing module, single-value feature deletion means: when all values in a feature column are identical, the column is deleted directly;
missing feature deletion means: when the proportion of missing values in a feature column reaches a specified threshold, the column is deleted;
highly correlated feature deletion means: the correlation between feature variables is calculated with the Pearson correlation coefficient and a correlation matrix is built; when the correlation between two features is higher than a given threshold, one of them is deleted;
missing data filling includes: filling with the value from the row preceding the missing value; filling with the mean of the feature column containing the missing value; and, with performance and load divided into grades A, B, C, D and E, filling the missing value with the mean for the corresponding grade;
data normalization means: the original feature data is scaled using min-max normalization and standard deviation (zero-mean) normalization; the scaler fitted during normalization is saved locally and must be used to apply the same scaling to newly collected data before it is predicted.
In the model training module, the data set is divided into a training set, a validation set and a test set, which account for 70%, 15% and 15% of the data set respectively;
the system trains its models with an ensemble algorithm: the gradient boosting machine (GBM) algorithm obtains a boosted tree model by gradient descent. The regression models use the trainer LGBMRegressor, and the classification models use the trainer LGBMClassifier.
Early stopping and checkpointing are used during model evaluation;
early stopping: training is stopped ahead of schedule; when the validation error on the validation set is found to have stopped changing during training, training stops automatically after k rounds, where k is a preset value;
checkpointing: the best model found during training is saved automatically.
The hyperparameters in the model tuning module include:
Number of iteration rounds: epochs
Learning rate: learning_rate
Maximum tree depth: max_depth
Maximum number of leaves per learner: max_leaves
Number of weak learners: n_estimators
Loss function: objective
Referring to Fig. 1, the machine-learning-based database performance and load evaluation method used in the evaluation system includes:
1. Obtain the data set.
A PostgreSQL database is used to store the performance and load metrics taken from the Oracle database's AWR reports and logs. Data is captured on average every 3 minutes. The feature data in the AWR data table has 62 dimensions and the feature data in the log table has 200 dimensions. A Python script saves the data from the PostgreSQL tables to a local CSV file in order of capture time.
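The export step described above can be sketched as follows. This is only an illustration under stated assumptions: the rows are taken to have already been fetched from PostgreSQL as dictionaries, and the column name capture_time and the function name are hypothetical, not taken from the system.

```python
import csv
from datetime import datetime


def write_samples_sorted(rows, path, time_field="capture_time"):
    """Write metric rows to a CSV file, ordered by capture time.

    rows is a list of dicts that share the same keys; time_field holds
    an ISO-style timestamp string (hypothetical column name).
    Returns the number of rows written.
    """
    ordered = sorted(rows, key=lambda r: datetime.fromisoformat(r[time_field]))
    fieldnames = list(ordered[0].keys()) if ordered else [time_field]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(ordered)
    return len(ordered)
```

Sorting on the parsed timestamp rather than on the raw string keeps the order correct even if the timestamp format is not zero-padded consistently.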
2. Data preprocessing.
Single-value feature deletion: when all values in a feature column are identical, the column does nothing to help the model fit and only adds computation, so it can be deleted directly.
Missing feature deletion: when the proportion of missing values in a feature column reaches a specified threshold, the column is deleted. In this system the missing-value threshold is set to 60%.
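The two deletion rules above can be sketched in a few lines. The sketch assumes the data is held as a dictionary mapping column names to lists of values, with None marking a missing value; this representation and the function name are assumptions made for illustration.

```python
def drop_uninformative_columns(columns, missing_threshold=0.6):
    """Drop columns that are mostly missing or hold a single value.

    columns maps a feature name to its list of values (None = missing).
    A column is dropped when its missing ratio reaches missing_threshold,
    or when all of its non-missing values are identical.
    """
    kept = {}
    for name, values in columns.items():
        if not values:
            continue  # an empty column carries no information
        missing_ratio = sum(v is None for v in values) / len(values)
        if missing_ratio >= missing_threshold:
            continue
        distinct = {v for v in values if v is not None}
        if len(distinct) <= 1:
            continue
        kept[name] = values
    return kept
```

The comparison uses "greater than or equal" because the text says the column is deleted when the ratio reaches the threshold.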
Highly correlated feature deletion: the correlation between feature variables is calculated with the Pearson correlation coefficient and a correlation matrix is built. When the correlation between two features is higher than a given threshold, one of them is deleted. In this system the correlation threshold is set to 0.9.
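A minimal sketch of this filter follows. It assumes complete numeric columns (no missing values) held in a dictionary, compares the absolute value of the coefficient with the threshold, and keeps the first column of each correlated pair; these three choices are assumptions, since the text does not specify them.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Correlation of every pair of columns, keyed by (name_a, name_b)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not too correlated with one already kept."""
    matrix = correlation_matrix(columns)
    kept = []
    for name in columns:
        if all(abs(matrix[(name, other)]) <= threshold for other in kept):
            kept.append(name)
    return {name: columns[name] for name in kept}
```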
Missing data filling: missing data is unavoidable during data acquisition, and the system can only approximate the missing values as closely as possible. Three methods are used in this system:
Fill with the value from the row preceding the missing value.
Fill with the mean of the feature column containing the missing value.
With performance and load divided into grades A, B, C, D and E, fill the missing value with the mean for the corresponding grade.
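The three filling methods can be sketched as below, one function per method. The sketch assumes a feature column is a list with None for missing values and, for the third method, a parallel list giving the grade of each row; the function names are illustrative.

```python
def forward_fill(values):
    """Replace each missing value with the value from the preceding row."""
    filled, last = [], None
    for value in values:
        if value is None:
            value = last  # stays None if nothing precedes it
        filled.append(value)
        last = value
    return filled


def mean_fill(values):
    """Replace missing values with the mean of the column's present values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def grade_mean_fill(values, grades):
    """Replace missing values with the mean of rows sharing the same grade."""
    totals = {}
    for value, grade in zip(values, grades):
        if value is not None:
            total, count = totals.get(grade, (0.0, 0))
            totals[grade] = (total + value, count + 1)
    filled = []
    for value, grade in zip(values, grades):
        if value is None and grade in totals:
            total, count = totals[grade]
            value = total / count
        filled.append(value)
    return filled
```

A leading missing value has no preceding row, so forward_fill leaves it as None; one of the other two methods can then be applied to it.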
Data normalization: the original feature data is scaled. This system uses two methods: min-max normalization and standard deviation normalization (zero-mean normalization). The scaler fitted during normalization is saved locally and must be used to apply the same scaling to newly collected data before it is predicted.
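The point of saving the scaler is that new data must be scaled with the parameters learned from the training data, not with parameters recomputed from the new data. A minimal sketch, assuming one scaler per feature column stored as JSON (the storage format and all names here are assumptions):

```python
import json
import math


def fit_scaler(values, method="minmax"):
    """Learn scaling parameters from training values of one feature."""
    if method == "minmax":
        low, high = min(values), max(values)
        return {"method": method, "offset": low, "scale": (high - low) or 1.0}
    if method == "zscore":
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        return {"method": method, "offset": mean, "scale": std or 1.0}
    raise ValueError("unknown method: " + method)


def transform(scaler, values):
    """Apply previously learned parameters to training or new values."""
    return [(v - scaler["offset"]) / scaler["scale"] for v in values]


def save_scaler(scaler, path):
    """Persist the scaler locally so that prediction can reuse it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(scaler, handle)


def load_scaler(path):
    """Load a scaler saved by save_scaler."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With min-max scaling, a new value outside the training range maps outside [0, 1]; the sketch does not clip it.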
For the classification models, performance grades and load grades are assigned by dividing the performance score and the load score into intervals, as follows:
3. Train on the data to generate the model.
The data set is divided into a training set, a validation set and a test set, which account for 70%, 15% and 15% of the data set respectively.
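The 70/15/15 division can be sketched as follows. The sketch splits the rows in their existing order without shuffling; whether the system shuffles before splitting is not stated, so this is an assumption, as is the function name.

```python
def split_dataset(rows, train=0.70, val=0.15):
    """Split rows into training, validation and test parts.

    The test part receives whatever remains after the training and
    validation parts, so no row is lost to rounding.
    """
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    train_set = rows[:n_train]
    val_set = rows[n_train:n_train + n_val]
    test_set = rows[n_train + n_val:]
    return train_set, val_set, test_set
```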
The system trains its models with an ensemble algorithm: the gradient boosting machine (GBM) algorithm obtains a boosted tree model by gradient descent. The regression models use the trainer LGBMRegressor, and the classification models use the trainer LGBMClassifier.
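To show what "a boosted tree model obtained by gradient descent" means, the following teaching sketch fits one-split trees (stumps) to the residuals of a squared-error loss on a single feature. It is not LightGBM's implementation and not the system's code, which uses LGBMRegressor and LGBMClassifier; all names here are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals."""
    candidates = sorted(set(x))[:-1]
    if not candidates:
        raise ValueError("x needs at least two distinct values")
    best = None
    for threshold in candidates:
        left = [r for v, r in zip(x, residuals) if v <= threshold]
        right = [r for v, r in zip(x, residuals) if v > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbm(x, y, n_estimators=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual.
        residuals = [t - p for t, p in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if v <= threshold else right_mean)
            for p, v in zip(predictions, x)
        ]
    return base, learning_rate, stumps


def predict_gbm(model, value):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if value <= threshold else right
        for threshold, left, right in stumps
    )
```

In this sketch n_estimators and learning_rate play the same roles as the hyperparameters of the same names listed for tuning: more weak learners or a larger step reduce the training error faster.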
4. Evaluate the model.
Early stopping and checkpointing are used during model evaluation.
Early stopping: training is stopped ahead of schedule. When the validation error on the validation set is found to have stopped changing during training, training stops automatically after k rounds, where k is configurable. This avoids wasted training rounds and improves training efficiency.
Checkpointing: the best model found during training is saved automatically, so the training result is preserved even if an exception interrupts training.
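Early stopping and checkpointing can be combined in one training loop. The sketch below interprets "stops after k rounds" as "stops once the validation error has not improved for k consecutive rounds", which is one common reading and an assumption here. The callables train_one_round and validate are hypothetical stand-ins for the real trainer, and the checkpoint is kept in memory rather than written to disk.

```python
import copy


def train_with_early_stopping(train_one_round, validate, max_rounds, k):
    """Train round by round, keeping the best model seen so far.

    train_one_round(model) returns the model after one more round
    (model is None on the first call); validate(model) returns the
    validation error. Training stops once k rounds pass without
    improvement, or after max_rounds.
    """
    best_error = float("inf")
    best_model = None
    best_round = -1
    model = None
    for round_index in range(max_rounds):
        model = train_one_round(model)
        error = validate(model)
        if error < best_error:
            best_error, best_round = error, round_index
            best_model = copy.deepcopy(model)  # checkpoint the best model
        elif round_index - best_round >= k:
            break  # no improvement for k rounds
    return best_model, best_round, best_error
```

Because the checkpoint is taken whenever the error improves, the returned model is the best one even though training continued for k further rounds.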
5. Model tuning.
The model tuning function adjusts the model's hyperparameters using hyopt. The main parameters include:
Number of iteration rounds: epochs
Learning rate: learning_rate
Maximum tree depth: max_depth
Maximum number of leaves per learner: max_leaves
Number of weak learners: n_estimators
Loss function: objective
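Automated tuning amounts to searching a space of hyperparameter values for the combination with the lowest validation error. The sketch below uses plain random search from the standard library as a stand-in; it is not the tuning tool named above. The candidate values in SEARCH_SPACE and the evaluate callable are hypothetical.

```python
import random

# Hypothetical candidate values for some of the listed hyperparameters.
SEARCH_SPACE = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "max_leaves": [15, 31, 63],
    "n_estimators": [100, 300, 500],
}


def random_search(evaluate, space, n_trials=20, seed=0):
    """Sample n_trials parameter sets and keep the lowest-scoring one.

    evaluate(params) is expected to train a model with params and
    return its validation error (lower is better).
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice evaluate would train on the training set and score on the validation set, so that the test set stays untouched until the final offline prediction.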
6. Model prediction.
Offline prediction: the tuned model makes predictions on the test set split off from the training data.
Real-time prediction: data is captured from the database in real time and predicted; the prediction result is displayed and the new data is inserted into the historical data, so that it can serve as training data when the model is later updated.
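The real-time path can be sketched as three steps: scale the new sample with the saved scaler, predict, and append the sample to the history. In this sketch the scaler is a dictionary of per-feature offset and scale values, the model is any callable, and the history is an in-memory list; all of these are assumptions for illustration.

```python
def predict_online(sample, scalers, model, history):
    """Scale one freshly captured sample, predict, and record it.

    sample maps feature names to raw values; scalers maps feature names
    to the offset and scale learned during training; model is a callable
    that takes the scaled features and returns a prediction.
    """
    scaled = {
        name: (sample[name] - params["offset"]) / params["scale"]
        for name, params in scalers.items()
    }
    prediction = model(scaled)
    # Keep the raw sample with its prediction for later retraining.
    history.append({**sample, "prediction": prediction})
    return prediction
```

Storing the raw values rather than the scaled ones lets a later retraining run fit a fresh scaler on the enlarged history.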
In addition, the present invention can be implemented in many specific ways, and the above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the scope of protection of the present invention. Any component not explicitly described in this embodiment can be implemented with the prior art.