Summary of the invention
Purpose of the invention: in view of the above problems, the present invention proposes a disk storage capacity prediction method based on time series models, which predicts the changes in storage demand that a system may face in the future, so as to enable supervision and early warning for the entire storage environment.
Technical solution: to achieve the purpose of the invention, the present invention adopts the following technical solution: a disk storage capacity prediction method based on time series models, comprising the steps of:
(1) establishing a database of storage disk capacity usage and monitoring the operation of the storage system, so as to provide the data needed for prediction;
(2) reading the database to obtain historical data, processing the data, and constructing a prediction model based on time series models to perform prediction;
(3) comparing the capacity prediction result with the remaining capacity, assessing whether a predetermined threshold is met, and deciding whether to issue an early warning that reminds operation and maintenance personnel to maintain the disk system.
Further, step (2) comprises:
(2.1) extracting performance data using the attribute identification number and the collection time of the indicator as conditions, to obtain disk usage data for a past period of time;
(2.2) analyzing and processing the data;
(2.3) building a model from the processed data, then testing and revising the model;
(2.4) predicting with the model that has passed testing;
(2.5) measuring the prediction accuracy of the model with three statistical indicators: mean absolute error, root mean square error, and mean absolute percentage error.
Further, step (2.2) comprises:
(2.2.1) performing periodicity analysis to explore the stationarity of the data;
(2.2.2) cleaning the data by removing duplicate disk capacity records and treating the disk capacity of all servers as a fixed value;
(2.2.3) reconstructing attributes by merging the three attribute values NAME, TARGET_ID, and ENTITY to construct a new attribute.
Further, step (2.3) comprises:
(2.3.1) plotting the autocorrelation and partial autocorrelation functions of the data to judge its stationarity and determine the model to use;
(2.3.2) if the autocorrelation tails off and the partial autocorrelation cuts off, using the AR algorithm; if the autocorrelation cuts off and the partial autocorrelation tails off, using the MA algorithm; if both the autocorrelation and the partial autocorrelation tail off, using the ARMA algorithm;
(2.3.3) estimating the parameters of the model by the maximum likelihood method;
(2.3.4) for each candidate model, determining the model order with the BIC information criterion to fix the parameters p and q, thereby selecting the optimal model.
Beneficial effects: the intelligent disk capacity prediction technique of the present invention selects different time series models according to the characteristics of the historical data, which avoids the limitations of predicting with a single method, meets the storage prediction needs of different disk systems, and offers strong adaptability and broad applicability.
The highly intelligent prediction technique of the present invention was able to predict disk capacity fairly accurately in actual tests, enables effective use of the disk system, and reduces the labor and financial costs of system operation and maintenance.
Detailed description of the embodiments
The technical solution of the present invention is further described below with reference to an embodiment.
Disk storage capacity demand depends not only on the usage history over a past period but also on the current running state of the system. A prediction method based only on historical data (such as the earlier disk capacity prediction methods based on historical monitoring data) is therefore insufficient for the disk capacity demand forecasting problem, and an effective capacity prediction method calls for a prediction model with a feedback correction function.
To address these two points, the historical disk data is cleaned and reconstructed, and a reasonable prediction model is built by time series analysis, yielding a disk storage capacity prediction method based on time series models. The method predicts the amount of disk space used on application system servers and provides customized early warnings to administrators.
The disk storage capacity prediction method based on time series models according to the present invention predicts the changes in storage demand that a system may face in the future, so as to enable supervision and early warning for the entire storage environment, and comprises the steps of:
(1) A database of storage disk capacity usage is established and the operation of the storage system is monitored, so as to provide the data needed for prediction;
(2) The prediction module based on time series models reads the database to obtain historical data, then performs data processing, model construction, and prediction, and assesses whether a predetermined threshold is met;
Step (2) comprises four main steps:
(2.1) data acquisition;
To extract the disk data, performance data is selected using the attribute identification number (TARGET_ID) and the collection time of the indicator (COLLECTTIME) as conditions. This step mainly uses system commands to directly obtain the disk usage values for a past period of time.
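The extraction condition above can be sketched as a parameterized query. This is only an illustration under stated assumptions: the table name `disk_usage`, the `VALUE` column, the use of SQLite, and the sample rows are all hypothetical; only TARGET_ID and COLLECTTIME come from the description.

```python
import sqlite3

def fetch_usage(conn, target_id, start, end):
    """Return (COLLECTTIME, VALUE) rows for one TARGET_ID within [start, end]."""
    cur = conn.execute(
        "SELECT COLLECTTIME, VALUE FROM disk_usage "
        "WHERE TARGET_ID = ? AND COLLECTTIME BETWEEN ? AND ? "
        "ORDER BY COLLECTTIME",
        (target_id, start, end),
    )
    return cur.fetchall()

# Hypothetical in-memory table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE disk_usage (TARGET_ID INTEGER, COLLECTTIME TEXT, VALUE REAL)"
)
conn.executemany(
    "INSERT INTO disk_usage VALUES (?, ?, ?)",
    [
        (184, "2014-10-01", 100.0),
        (184, "2014-10-02", 101.5),
        (183, "2014-10-01", 500.0),   # different TARGET_ID, filtered out
        (184, "2014-11-20", 120.0),   # outside the time window, filtered out
    ],
)
rows = fetch_usage(conn, 184, "2014-10-01", "2014-10-31")
```

Storing COLLECTTIME as ISO-formatted text keeps the `BETWEEN` comparison chronological; a real monitoring database may use a native timestamp type instead.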
(2.2) data processing;
1. Perform periodicity analysis to explore the stationarity of the data;
The present invention builds its model by time series analysis, and modeling requires exploring the stationarity of the data. A time series plot gives a preliminary view of stationarity. The used space on the server disks is plotted with one day as the unit. Under normal conditions disk usage shows no periodicity; instead it grows slowly and exhibits a trend, so the data is preliminarily judged to be non-stationary.
2. Clean the data: remove duplicate disk capacity records and treat the disk capacity of all servers as a fixed value, which simplifies model-based early warning;
In actual operation, the monitoring system collects disk information at scheduled times every day. Under normal conditions, however, the capacity attribute of a disk is constant, so the raw disk data contains duplicate disk capacity records. During data cleaning these duplicates are removed and the disk capacity of all servers is treated as a fixed value.
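A minimal sketch of this cleaning step follows. The record layout (keys `disk`, `time`, `kind`, `value`) is a hypothetical simplification chosen for illustration, not the schema of the monitoring table.

```python
def clean(records):
    """Split raw records into a fixed capacity per disk and a usage series.

    records: iterable of dicts with hypothetical keys
             'disk', 'time', 'kind' ('capacity' or 'used') and 'value'.
    """
    capacity = {}   # disk -> single fixed capacity value
    usage = {}      # disk -> list of (time, used) pairs
    for r in records:
        if r["kind"] == "capacity":
            # Capacity is re-collected every day; keep the first value only.
            capacity.setdefault(r["disk"], r["value"])
        else:
            usage.setdefault(r["disk"], []).append((r["time"], r["value"]))
    return capacity, usage

# Made-up sample: the capacity record repeats daily, usage changes.
raw = [
    {"disk": "C", "time": "d1", "kind": "capacity", "value": 1000},
    {"disk": "C", "time": "d1", "kind": "used", "value": 400},
    {"disk": "C", "time": "d2", "kind": "capacity", "value": 1000},
    {"disk": "C", "time": "d2", "kind": "used", "value": 410},
]
capacity, usage = clean(raw)
```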
3. Reconstruct attributes: merge the three attribute values NAME, TARGET_ID, and ENTITY to construct a new attribute;
In data storage, disk capacity is expressed in KB. The disk information of each server can be distinguished by the three attributes NAME, TARGET_ID, and ENTITY in the table, and these three attributes are constant for each server, so the three attribute values can be merged to construct a new attribute.
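The attribute merge can be illustrated as follows. The separator, the `VALUE` column, and the sample values are assumptions made for the sketch; NAME, TARGET_ID, ENTITY, and COLLECTTIME are the attribute names given in the description.

```python
def merged_attribute(row, sep="|"):
    """Build one key from the three constant attributes of a record."""
    return sep.join(str(row[k]) for k in ("NAME", "TARGET_ID", "ENTITY"))

def group_by_merged(rows):
    """Group (COLLECTTIME, VALUE) pairs under the merged attribute."""
    series = {}
    for row in rows:
        key = merged_attribute(row)
        series.setdefault(key, []).append((row["COLLECTTIME"], row["VALUE"]))
    return series

# Made-up sample rows for illustration only.
rows = [
    {"NAME": "disk_used", "TARGET_ID": 184, "ENTITY": "C:",
     "COLLECTTIME": "d1", "VALUE": 400},
    {"NAME": "disk_used", "TARGET_ID": 184, "ENTITY": "D:",
     "COLLECTTIME": "d1", "VALUE": 900},
    {"NAME": "disk_used", "TARGET_ID": 184, "ENTITY": "C:",
     "COLLECTTIME": "d2", "VALUE": 410},
]
series = group_by_merged(rows)
```

Each merged key then identifies one time series that can be modeled on its own.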
(2.3) Model construction;
The processed data is divided into two parts: modeling sample data and model verification data. The last 5 data points are selected as verification data, and the remaining data serve as modeling sample data.
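The split described above amounts to holding out the tail of the series. A small sketch, in which the function name and the sample values are hypothetical:

```python
def split_series(values, n_verify=5):
    """Return (modeling sample, verification data); the last n_verify points are held out."""
    if len(values) <= n_verify:
        raise ValueError("series too short to hold out verification data")
    return values[:-n_verify], values[-n_verify:]

# Made-up daily usage values.
daily_used = [10, 11, 12, 13, 14, 15, 16, 17]
train, verify = split_series(daily_used)
```

Holding out the most recent points, rather than random ones, preserves the time order that the model relies on.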
1. Model selection;
Because ARIMA/ARMA models require a stationary time series, a stationarity test is needed. The present invention plots the autocorrelation and partial autocorrelation functions of the data to judge its stationarity and determine the model to use. If the autocorrelation tails off and the partial autocorrelation cuts off, the AR algorithm is used; if the autocorrelation cuts off and the partial autocorrelation tails off, the MA algorithm is used; if both tail off, the ARMA algorithm is used. The model parameters are estimated by the maximum likelihood method. For each candidate model, the order is determined with the BIC information criterion to fix the parameters p and q, thereby selecting the optimal model.
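The selection rule can be sketched in plain Python. The functions below compute the sample autocorrelation and, via the Durbin-Levinson recursion, the sample partial autocorrelation; `choose_model` encodes the rule stated above. Deciding whether a function "tails off" or "cuts off" is left to inspection of the plot, so it is passed in as a flag. All function names, and the simulated AR(1) series with coefficient 0.7, are assumptions made for illustration.

```python
import random

def acf(x, max_lag):
    """Sample autocorrelation r_0..r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    c0 = sum(v * v for v in d)
    return [sum(d[t] * d[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(x, max_lag):
    """Sample partial autocorrelation via the Durbin-Levinson recursion."""
    r = acf(x, max_lag)
    out = [1.0]
    prev = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out

def choose_model(acf_tails_off, pacf_tails_off):
    """Map the tail-off / cut-off pattern to a model family."""
    if acf_tails_off and not pacf_tails_off:
        return "AR"
    if not acf_tails_off and pacf_tails_off:
        return "MA"
    if acf_tails_off and pacf_tails_off:
        return "ARMA"
    return "undetermined"  # both cut off: not covered by the stated rule

# Simulated AR(1) series: ACF decays gradually, PACF is near zero after lag 1.
random.seed(0)
x = [0.0]
for _ in range(1999):
    x.append(0.7 * x[-1] + random.gauss(0.0, 1.0))
r = acf(x, 5)
pr = pacf(x, 5)
```

For this simulated series, `r[1]` should be close to 0.7 while `pr[2]` should be close to 0, which is the pattern that leads to `choose_model(True, False)`.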
To determine whether the original data sequence contains a stochastic trend or a deterministic trend, a stationarity test must be performed on the data; otherwise "spurious regression" may occur. The present invention performs the stationarity test with the ADF (augmented Dickey-Fuller) method. For example, if the original sequence becomes stationary after first-order differencing, the value of d is set to 1.
The AR model is the autoregressive model; the MA model is the moving average model; the ARMA model is the autoregressive moving average model; the ARIMA model is the autoregressive integrated moving average model, that is, a differenced ARMA model.
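AR model
If any value of a time series can be expressed by the following regression equation, the time series follows an autoregressive process of order p, denoted AR(p):
$$x_t = \varphi_1 x_{t-1} + \varphi_2 x_{t-2} + \cdots + \varphi_p x_{t-p} + u_t$$
where $x_t, x_{t-1}, x_{t-2}, \ldots, x_{t-p}$ are the indicator values recorded at different time points, $\varphi_1, \varphi_2, \ldots, \varphi_p$ are the autoregressive coefficients, and $u_t$ is the white noise of the time series.
$x_t = \varphi_1 x_{t-1} + u_t$ is called the first-order autoregressive process AR(1), and $x_t = \varphi_1 x_{t-1} + \varphi_2 x_{t-2} + u_t$ is called the second-order autoregressive process AR(2).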
The AR model thus exploits the correlation between earlier and later values (autocorrelation): it builds a regression equation relating earlier values to later values in order to predict, hence the name autoregressive process. The white noise here can be understood as the random fluctuation of the time series values; these fluctuations have zero mean, so they tend to cancel out when summed.
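As an illustration of how an autoregressive model produces forecasts, the sketch below iterates the regression equation with given coefficients. Future white-noise terms are replaced by their expected value of zero. The function name and the coefficient values are hypothetical; in practice the coefficients would come from the parameter estimation step.

```python
def ar_forecast(history, coeffs, steps):
    """Iterate x_t = c_1*x_{t-1} + ... + c_p*x_{t-p}, with future noise set to 0."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        # coeffs[0] multiplies the most recent value, coeffs[1] the one before, ...
        nxt = sum(c * values[-1 - i] for i, c in enumerate(coeffs))
        values.append(nxt)
        forecasts.append(nxt)
    return forecasts

ar1 = ar_forecast([1.0, 2.0], [0.5], steps=2)        # AR(1)
ar2 = ar_forecast([1.0, 2.0], [0.5, 0.25], steps=1)  # AR(2)
```

Each forecast is fed back in as if it were an observation, which is why multi-step forecasts from a stationary AR model decay toward the series mean (zero in this mean-free sketch).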
MA model
If any value of a time series can be expressed by the following regression equation, the time series follows a moving average process of order q, denoted MA(q):
$$x_t = u_t + \theta_1 u_{t-1} + \theta_2 u_{t-2} + \cdots + \theta_q u_{t-q}$$
where $u_t, u_{t-1}, u_{t-2}, \ldots, u_{t-q}$ are the white noise terms at different time points, $\theta_1, \theta_2, \theta_3, \ldots, \theta_q$ are the moving average coefficients, and $x_t$ is the indicator value at time point t.
The indicator value at a given time point is thus a weighted sum of the white noise sequence. If the regression equation contains only two lagged white noise terms, the process is the second-order moving average process MA(2). Comparing the two processes shows that the moving average process can supplement the autoregressive process by handling the white noise term of the autoregressive equation; combining the two gives the autoregressive moving average process.
ARMA model
The autoregressive moving average model consists of two parts, an autoregressive part and a moving average part, and therefore has two orders. It is denoted ARMA(p, q), where p is the autoregressive order and q is the moving average order. Its regression equation is:
$$x_t = \varphi_1 x_{t-1} + \cdots + \varphi_p x_{t-p} + u_t + \theta_1 u_{t-1} + \cdots + \theta_q u_{t-q}$$
The regression equation shows that the autoregressive moving average model combines the advantages of the AR and MA models. In the ARMA model, the autoregressive process quantifies the relationship between the current data and earlier data, while the moving average process handles the random variation term. The model is therefore more effective and more commonly used.
ARIMA model
The ARIMA model can be used to analyze homogeneous nonstationary time series. Here "homogeneous" means that the originally nonstationary time series becomes a stationary time series after being differenced d times. Although many time series are themselves nonstationary, the new time series obtained by differencing (subtracting the indicator values at adjacent time points) is stationary. The differenced ARMA model is therefore written as ARIMA(p, d, q), where p is the autoregressive order, d is the number of differences, and q is the moving average order.
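Differencing and its inverse can be sketched as follows. The `integrate_forecast` helper, which turns forecasts of the first-differenced series back into levels, covers only the d = 1 case; both function names and the sample values are assumptions made for illustration.

```python
def difference(x, d=1):
    """Apply first-order differencing d times (adjacent values subtracted)."""
    for _ in range(d):
        x = [b - a for a, b in zip(x, x[1:])]
    return x

def integrate_forecast(last_value, diff_forecast):
    """Undo one round of differencing: accumulate forecast increments onto the last level."""
    levels = []
    level = last_value
    for step in diff_forecast:
        level += step
        levels.append(level)
    return levels

used = [10, 12, 15, 19]                 # made-up, steadily growing usage
first = difference(used)                # d = 1
second = difference(used, d=2)          # d = 2
levels = integrate_forecast(used[-1], [5, 6])
```

Each round of differencing shortens the series by one point, so a series of length n leaves n - d points for fitting the ARMA part.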
2. Model testing;
After the model is determined, its residual sequence is tested to see whether it is white noise. If it is not, useful information still remains in the residuals, and the model must be revised or the information extracted further.
To verify whether the useful information in the sequence has been fully extracted, a white noise test is performed on the sequence. If the sequence is found to be a white noise sequence, the useful information has been fully extracted and what remains is purely random disturbance, which cannot be predicted or used. The present invention performs the white noise test with the LB (Ljung-Box) statistic.
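The LB statistic can be computed directly from the sample autocorrelations as Q = n(n+2) Σ r_k² / (n−k) over lags k = 1..h. The sketch below is a plain-Python illustration; the function name, the choice of h = 5, and the alternating test sequence are assumptions, and the decision threshold is shown only as a comment.

```python
def ljung_box_q(residuals, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n-k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    d = [v - mean for v in residuals]
    c0 = sum(v * v for v in d)
    total = 0.0
    for k in range(1, h + 1):
        r_k = sum(d[t] * d[t - k] for t in range(k, n)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total

# A strongly autocorrelated sequence that is clearly not white noise.
alternating = [1.0, -1.0] * 10
q_stat = ljung_box_q(alternating, h=5)

# Under the white-noise hypothesis Q is approximately chi-square distributed.
# With 5 degrees of freedom the 5% critical value is about 11.07, so a Q far
# above it rejects white noise. For residuals of a fitted ARMA(p, q) model the
# degrees of freedom are usually reduced to h - p - q.
```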
(2.4) model prediction;
1. The model that has passed testing is used to predict the values for the next 5 days, which are then compared with the actual values. The last 5 data points were withheld from modeling earlier and are used here to verify the predictions.
2. To evaluate how well the time series prediction model performs, this experiment uses three statistical indicators of prediction accuracy: mean absolute error, root mean square error, and mean absolute percentage error. These three indicators reflect the prediction accuracy of the algorithm from different perspectives.
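The three indicators can be written out as below. MAPE is returned as a fraction (multiply by 100 for a percentage) and assumes no actual value is zero. The function names and sample numbers are illustrative only.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction; actual values must be non-zero."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up actual and predicted values.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
errors = (mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

MAE and RMSE are in the unit of the data (KB here), with RMSE penalizing large misses more heavily, while MAPE is unit-free and so allows comparison across disks of different sizes.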
(3) The result given by the capacity prediction is compared with the remaining capacity to decide whether to issue an early warning that reminds operation and maintenance personnel to maintain or expand the disk system.
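The comparison in step (3) might be sketched as follows. The 90% usage threshold, the function name, and the sample numbers are hypothetical; the description only states that a predetermined threshold is used.

```python
def first_warning_day(predicted_used, capacity, threshold=0.9):
    """Return the first forecast day (1-based) whose predicted usage ratio
    reaches the threshold, or None if no warning is needed."""
    for day, used in enumerate(predicted_used, start=1):
        if used / capacity >= threshold:
            return day
    return None

# Made-up 3-day forecasts against a fixed capacity of 100 units.
warn = first_warning_day([80, 85, 91], capacity=100)
quiet = first_warning_day([10, 20, 30], capacity=100)
```

Returning the day rather than a plain flag tells operation and maintenance personnel how much lead time remains before the threshold is expected to be reached.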
By adopting the above technical solution, the present invention can effectively address the dependence of demand forecasting on historical information, and was able to predict disk capacity fairly accurately in actual tests.
The above intelligent disk capacity prediction technique selects different time series models according to the characteristics of the historical data, which avoids the limitations of predicting with a single method, meets the storage prediction needs of different disk systems, and offers strong adaptability and broad applicability.
The highly intelligent prediction technique provided by the present invention enables effective use of the disk system and gives early warning of service downtime caused by insufficient disk space, which substantially reduces manual effort and lowers the labor and financial costs of system operation and maintenance.