IT operation and maintenance service management method based on big data
Technical Field
The invention relates to the field of IT operation and maintenance management, in particular to an IT operation and maintenance service management method based on big data.
Background
In today's digital and information-intensive environments, the IT infrastructure of an enterprise has become the basic support for its operations and strategies. IT operation and maintenance service management based on big data is becoming an important means for organizations to ensure system stability, security and efficient operation. With the rapid development of the internet of things (IoT), cloud computing, and big data technologies, the amount of data facing enterprises has increased explosively, and the operating environment has become more complex and dynamic. These changes present new challenges, including: how to collect and process massive data effectively, detect system abnormalities rapidly, predict system performance trends accurately, and make intelligent decisions in a changeable environment. Meanwhile, the security and reliability of enterprise systems directly affect their market competitiveness and customer satisfaction.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide an IT operation and maintenance service management method based on big data, which not only effectively improves the operating efficiency and reliability of the system, but also enables effective management of the complexity of IT operation and maintenance in a dynamically changing environment.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
An IT operation and maintenance service management method based on big data comprises the following steps:
S1, collecting data from logs, monitoring tools and user behaviors, and preprocessing by using a Pandas library of Python, wherein the preprocessing comprises data cleaning and missing value processing;
S2, carrying out anomaly detection on the preprocessed data set by combining K-Means and DBSCAN, carrying out secondary data cleaning based on the anomaly detection result, and optimizing the data cleaning strategy;
S3, predicting the trend of the system performance indexes with an ARIMA model based on the data after the secondary data cleaning;
S4, constructing a static user portrait according to the user behavior data by using a hierarchical clustering algorithm, and dynamically updating the user portrait to adapt to the latest behavior data based on path dynamic analysis;
S5, optimizing resource allocation and task scheduling by using a reinforcement learning algorithm based on the trend of the system performance indexes and the user portraits;
S6, integrating external security intelligence, applying it to security event analysis, and automatically modifying the protection strategy according to the security situation by utilizing an adaptive strategy adjustment mechanism.
Further, S1 is specifically:
using Logstash as a log collector, extracting log data from different servers and application programs, and sending the log data to Elasticsearch for storage and indexing;
collecting network and system performance indexes by utilizing Prometheus and Zabbix;
deriving user access and interaction data from a user behavior analysis tool;
using Apache Kafka as a data streaming platform, and transmitting the data of all data sources to a database in real time;
data is loaded from the database using the Pandas library; duplicate values are removed using the drop_duplicates() function of Pandas, unnecessary rows or columns are filtered out according to certain conditions, and missing values are processed with forward filling.
Further, S2 is specifically:
Carrying out standardization processing on the preprocessed data, carrying out cluster analysis on the standardized data by using the K-Means algorithm, selecting a cluster number, initializing cluster centers, and repeating the following two steps until convergence:

c_i = argmin_j ||x_i − μ_j||²;

μ_j = (1/|C_j|) Σ_{x_i ∈ C_j} x_i;

wherein C_j is the set of all member samples of cluster j, and μ_j is the center of the j-th cluster; x_i is the feature vector of the i-th data point, and c_i is the cluster to which the i-th data point is assigned;
calculating the Euclidean distance from each point to the center of the cluster to which it belongs:

d(x_i, μ_j) = √( Σ_{k=1}^{m} (x_{i,k} − μ_{j,k})² );

where m is the total dimension of the feature space; x_{i,k} represents the value of data point x_i in the k-th dimension; μ_{j,k} represents the value of cluster center μ_j in the k-th dimension;
finding outliers according to a distance threshold:

x_d is an outlier if d(x_d, μ_{c_d}) > d_u;

wherein c_d represents the cluster to which data point x_d belongs; x_d represents an outlier; d_u denotes the distance threshold;
performing density clustering on the data by DBSCAN: for each data point, counting the number of points in its ε (eps) neighborhood; if the number of points reaches min_samples, the point is a core point; a point which does not meet the core-point standard but belongs to the neighborhood of a certain core point is a boundary point; a point which is neither a core point nor a boundary point is a noise point, and the noise point is marked as abnormal;
in combination with the abnormal data sets obtained from K-Means and DBSCAN, intersection and independent portions are analyzed and secondary cleaning is performed to improve the quality of the data sets.
Further, in combination with the abnormal data sets obtained from K-Means and DBSCAN, the intersection and independent portions are analyzed and secondary cleaning is performed to improve the quality of the data sets, as follows: identifying data points marked as abnormal by both K-Means and DBSCAN, treating them as clear outliers, and correcting the numerical data by using a data correction strategy; for the data points identified as abnormal only by K-Means, checking whether they are edge values, and filtering them out if the analysis is not affected; for the data points identified as abnormal only by DBSCAN, confirming whether the data is normal and, if so, retaining it; for abnormal data that is uncorrectable or invalid for analysis, removing it from the data set;
further, S3 is specifically:
checking the stationarity of the data by the ADF (Augmented Dickey-Fuller) test; if the data is not stationary, applying differencing operations to remove trend and seasonal effects;
dividing data into a training set and a testing set, wherein the training set is used for model fitting, and the testing set is used for model verification;
Selecting appropriate parameters (p, d, q) according to the behavior characteristics of the autocorrelation function ACF and the partial autocorrelation function PACF; p is the autoregressive partial order; d is the degree of difference; q is the moving average partial order;
constructing an ARIMA model with the selected parameters and fitting it to the denoised training data to obtain parameter estimates and model fitting results:

y_t = c + Σ_{i=1}^{p} φ_i·y_{t−i} + Σ_{j=1}^{q} θ_j·ε_{t−j} + ε_t;

wherein y_t is the observation of the time series at time t; c is a constant term; φ_i are the coefficients of the autoregressive part and p is the order of the autoregressive part; y_{t−i} is the observation i steps in the past; θ_j are the coefficients of the moving average part and q is the order of the moving average part; ε_{t−j} is a past error term; ε_t is the error term at time t;
adopting the Ljung-Box test to check whether the residuals are white noise, so as to confirm that the model has captured the data structure:

Q = n(n + 2) · Σ_{k=1}^{h} ρ̂_k² / (n − k);

wherein Q is the Ljung-Box statistic; n is the sample size, i.e. the number of observations of the time series data; ρ̂_k is the autocorrelation coefficient of the residuals at lag k; h is the number of lags tested.
Further, S4 is specifically:
extracting key features from the user behavior log, and constructing the key features into feature vectors for cluster analysis;
Analyzing the user behavior data by using a Ward variance minimization method to obtain a clustering result;
classifying users into different groups according to the clustering result, calculating the characteristic mean value of each group, and forming a characteristic set representing the group portrait to obtain a static user portrait;
analyzing a user behavior path, and identifying an interaction mode of a user by modeling the user behavior as a state transition process, and adding the interaction mode as a dynamic characteristic into a static user portrait to obtain a dynamic user portrait;
The user portrait is adjusted by using a real-time data updating mechanism, the dynamic updating is carried out by a weighted average method, and new and old data are combined:
Updated_Profile=α×Old_Profile+(1−α)×New_Data;
wherein updated_profile represents the Updated user representation, old_profile represents the original user representation, new_data represents the New Data, and α is the weight.
Further, the Ward variance minimization method is adopted to analyze the user behavior data, and a clustering result is obtained, and the method is specifically as follows:
extracting key behavior characteristics, including session duration, page access number and click rate, to form a feature matrix X, each row of which corresponds to one user; using the Euclidean distance as the standard distance metric, calculating the distance between different users, and constructing a distance matrix;
clustering by the Ward algorithm according to the calculated distance matrix: computing the dispersion of each cluster and merging two clusters at a time such that the increment of the within-cluster sum of squared errors (WCSS) after merging is minimal, generating a cluster tree and constructing a dendrogram:

WCSS(C) = Σ_{x ∈ C} ||x − μ_C||²; ΔWCSS = WCSS(C_a ∪ C_b) − WCSS(C_a) − WCSS(C_b);

wherein ΔWCSS is the WCSS increment; x is a user in cluster C; μ_C is the mean of cluster C;
based on the dendrogram, the cluster number is determined by the CH (Calinski-Harabasz) index and the Dunn index:

CH = [tr(B_k)/(k − 1)] / [tr(W_k)/(N − k)];

wherein B_k is the between-cluster scatter matrix and W_k is the within-cluster scatter matrix; k is the cluster number; N is the total number of data points;

the Dunn index is the ratio of the minimum distance between two nearest-neighbor clusters to the maximum cluster diameter.
Further, the behavior path of the user is analyzed, and the interaction mode of the user in the application is identified by modeling the user behavior as a state transition process (such as a Markov chain); all possible user behaviors are determined as states in the Markov chain;
counting the frequency of each behavior transition, and calculating the transition probability between states:

P(s_i → s_j) = C(s_i, s_j) / Σ_k C(s_i, s_k);

wherein P(s_i → s_j) is the probability of transitioning from state s_i to state s_j; C(·, ·) is a transition counter; Σ_k C(s_i, s_k) represents the total number of transitions out of state s_i;
Based on the calculated transition probabilities between the states, a transition matrix is constructed, and a high-frequency transition path is identified by using the transition matrix, so that a user interaction mode is determined.
Further, S5 is specifically:
Defining a state space as a combination of a system load level and a user path position, and defining an action set as a resource adjustment and user guidance strategy; combining the user portrait information with the system state characteristics to form the state input of reinforcement learning, so that the scheduling decision is more personalized; the action space is adjusted according to different user groups;
The bonus function R (s, a) combines resource utilization and path completion rate:
R(s,a)=w1×Resource_Efficiency(s,a)+w2×Path_Completion_Success(s);
wherein Resource_Efficiency(s, a) is the resource utilization rate; Path_Completion_Success(s) is the path completion rate; w_1 and w_2 are the corresponding weights;
Q value update:

Q(s, a) ← Q(s, a) + α·[R(s, a) + γ·max_{a'} Q(s', a') − Q(s, a)];

wherein Q(s, a) is the value estimate of performing action a in state s; α is the learning rate; γ is the discount factor weighting the present value of future rewards, with 0 ≤ γ < 1; max_{a'} Q(s', a') is the maximum expected Q value over all possible actions a' in the next state s';
Collecting behavior data and performance indexes, obtaining an optimization strategy through Q-Learning adjustment strategy and training, and improving resource allocation efficiency and user experience based on the optimization strategy;
implementing an ε-greedy policy in the real-time streaming model to balance exploration and exploitation in action selection.
Further, S6 is specifically:
and setting performance and experience thresholds by using a CEP module of the Flink, analyzing the data stream in real time, and triggering an alarm or automatically adjusting rules when abnormality occurs.
packaging feedback information into the Flink pipeline, and automatically adjusting the Q-Learning learning rate and the update frequency of the state features according to the trigger rules.
The invention has the following beneficial effects:
1. The invention not only effectively improves the running efficiency and reliability of the system, but also enables effective management of the complexity of IT operation and maintenance in a dynamically changing environment.
2. According to the invention, abnormal data is identified and corrected through a multi-stage data optimization method and through secondary cleaning, and the overall quality of a data set is improved through deleting or expanding operation, so that the data is ensured to be more accurate and more efficiently added with an analysis flow for decision making, and the intelligent improvement of efficient and accurate abnormal detection and data cleaning is realized;
3. The invention can effectively identify different user behavior patterns by using the Ward method and construct user portraits on that basis; accurate feature extraction combined with the Euclidean distance realizes efficient data segmentation and group analysis; by introducing path analysis, the invention dynamically responds to changes in user state that a static portrait would miss, identifies the behavior patterns of users in the system, and embodies them in user portrait updating and resource optimization decisions; finally, by combining a Markov process with a reinforcement learning algorithm, the invention can cope with the problem of coordinating user experience and performance in complex application environments, realizes personalized resource management and user guidance mechanisms, and improves system efficiency and user satisfaction.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and specific examples:
Referring to fig. 1, in this embodiment, an IT operation and maintenance service management method based on big data is provided, which includes the following steps:
S1, collecting data from various sources, including system logs, application program logs, network monitoring tools (such as Prometheus and Zabbix) and user behavior logs, and preprocessing the data by using the Pandas library of Python, including data cleaning and missing value processing;
S2, carrying out anomaly detection on the preprocessed data set by combining K-Means and DBSCAN, carrying out secondary data cleaning based on the anomaly detection result, and optimizing the data cleaning strategy;
S3, predicting the trend of the system performance indexes with an ARIMA model based on the data after the secondary data cleaning;
S4, constructing a static user portrait from the user behavior data by using a hierarchical cluster analysis algorithm, and dynamically updating the user portrait to adapt to the latest behavior data based on dynamic path analysis;
S5, optimizing resource allocation and task scheduling by using a reinforcement learning algorithm (such as Q-Learning) based on the trend of the system performance indexes and the user portraits;
S6, integrating external security intelligence (CTI), applying it to security event analysis, and automatically modifying the protection strategy according to the security situation by utilizing an adaptive strategy adjustment mechanism.
In this embodiment, S1 is specifically:
using Logstash as a log collector, extracting log data from different servers and application programs, and sending the log data to Elasticsearch for storage and indexing;
collecting network and system performance indexes by utilizing Prometheus and Zabbix;
deriving user access and interaction data from a user behavior analysis tool;
using Apache Kafka as a data streaming platform, and transmitting the data of all data sources to a database in real time;
data is loaded from the database using the Pandas library; duplicate values are removed using the drop_duplicates() function of Pandas, unnecessary rows or columns are filtered out according to certain conditions, and missing values are processed with forward filling.
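As an illustrative sketch (not part of the claimed method), the Pandas preprocessing described above can be written roughly as follows; the column names status and cpu and the filter condition are assumptions for illustration only:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Clean raw monitoring records as described in S1.
    df = df.drop_duplicates()         # remove duplicate records
    df = df[df["status"] != "DEBUG"]  # filter out unneeded rows by a condition
    df = df.ffill()                   # process missing values with forward filling
    return df
```

The order of the three operations follows the description above; in practice the filter condition depends on the data source.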
In this embodiment, S2 is specifically:
Carrying out standardization processing on the preprocessed data, carrying out cluster analysis on the standardized data by using the K-Means algorithm, selecting a cluster number, initializing cluster centers, and repeating the following two steps until convergence:

c_i = argmin_j ||x_i − μ_j||²;

μ_j = (1/|C_j|) Σ_{x_i ∈ C_j} x_i;

wherein C_j is the set of all member samples of cluster j, and μ_j is the center of the j-th cluster; x_i is the feature vector of the i-th data point, and c_i is the cluster to which the i-th data point is assigned;
calculating the Euclidean distance from each point to the center of the cluster to which it belongs:

d(x_i, μ_j) = √( Σ_{k=1}^{m} (x_{i,k} − μ_{j,k})² );

where m is the total dimension of the feature space; x_{i,k} represents the value of data point x_i in the k-th dimension; μ_{j,k} represents the value of cluster center μ_j in the k-th dimension;
finding outliers according to a distance threshold:

x_d is an outlier if d(x_d, μ_{c_d}) > d_u;

wherein c_d represents the cluster to which data point x_d belongs; x_d represents an outlier; d_u denotes the distance threshold;
performing density clustering on the data by DBSCAN: for each data point, counting the number of points in its ε (eps) neighborhood; if the number of points reaches min_samples, the point is a core point; a point which does not meet the core-point standard but belongs to the neighborhood of a certain core point is a boundary point; a point which is neither a core point nor a boundary point is a noise point, and the noise point is marked as abnormal;
in combination with the abnormal data sets obtained from K-Means and DBSCAN, intersection and independent portions are analyzed and secondary cleaning is performed to improve the quality of the data sets.
In this embodiment, the intersection and independent portions are analyzed in combination with the abnormal data sets obtained from K-Means and DBSCAN, and secondary cleaning is performed to improve the quality of the data sets, specifically as follows: identifying data points marked as abnormal by both K-Means and DBSCAN, treating them as clear outliers, and correcting the numerical data by using a data correction strategy (such as interpolation or median substitution); for the data points identified as abnormal only by K-Means, checking whether they are edge values, and filtering them out if the analysis is not affected; for the data points identified as abnormal only by DBSCAN, confirming whether the data is normal and, if so, retaining it; for abnormal data that is uncorrectable or invalid for analysis, removing it from the data set;
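A minimal sketch of the combined K-Means/DBSCAN anomaly detection using scikit-learn; the threshold rule d_u = mean + 2·std and all parameter values are assumptions for illustration, not prescribed by the method:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def detect_anomalies(X, n_clusters=3, eps=0.5, min_samples=5):
    # K-Means: distance of each point to its own cluster center
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    d_u = dist.mean() + 2 * dist.std()  # assumed distance threshold d_u
    km_out = set(np.where(dist > d_u)[0])

    # DBSCAN: label -1 marks noise points
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    db_out = set(np.where(db.labels_ == -1)[0])

    both = km_out & db_out     # clear outliers: correct or remove
    km_only = km_out - db_out  # check for edge values, possibly filter
    db_only = db_out - km_out  # verify; usually retained
    return both, km_only, db_only
```

The three returned sets correspond to the three handling branches of the secondary cleaning described above.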
in this embodiment, S3 is specifically:
checking the stationarity of the data by the ADF (Augmented Dickey-Fuller) test; if the data is not stationary, applying differencing operations to remove trend and seasonal effects;
dividing data into a training set and a testing set, wherein the training set is used for model fitting, and the testing set is used for model verification;
Selecting appropriate parameters (p, d, q) according to the behavior characteristics of the autocorrelation function ACF and the partial autocorrelation function PACF; p is the autoregressive partial order; d is the degree of difference; q is the moving average partial order;
constructing an ARIMA model with the selected parameters and fitting it to the denoised training data to obtain parameter estimates and model fitting results:

y_t = c + Σ_{i=1}^{p} φ_i·y_{t−i} + Σ_{j=1}^{q} θ_j·ε_{t−j} + ε_t;

wherein y_t is the observation of the time series at time t; c is a constant term; φ_i are the coefficients of the autoregressive part and p is the order of the autoregressive part; y_{t−i} is the observation i steps in the past; θ_j are the coefficients of the moving average part and q is the order of the moving average part; ε_{t−j} is a past error term; ε_t is the error term at time t;
adopting the Ljung-Box test to check whether the residuals are white noise, so as to confirm that the model has captured the data structure:

Q = n(n + 2) · Σ_{k=1}^{h} ρ̂_k² / (n − k);

wherein Q is the Ljung-Box statistic; n is the sample size, i.e. the number of observations of the time series data; ρ̂_k is the autocorrelation coefficient of the residuals at lag k; h is the number of lags tested.
In this embodiment, S4 is specifically:
extracting key features, such as session duration, page dwell time and click count, from the user behavior log, and constructing them into feature vectors for cluster analysis;
Analyzing the user behavior data by using a Ward variance minimization method to obtain a clustering result;
Classifying users into different groups according to the clustering result, calculating the characteristic mean value of each group, and forming a characteristic set representing the group portrait to obtain a static user portrait; (user portraits include demographics, consumption habits, traffic usage patterns, personalized interest preferences, etc.);
Analyzing a user behavior path, and identifying an interaction mode of the user by modeling the user behavior as a state transition process (such as a Markov chain), and adding the interaction mode as a dynamic characteristic into a static user portrait to obtain a dynamic user portrait;
adjusting the user portrait by using a real-time data updating mechanism (updating the feature matrix and the state transition probabilities in real time), dynamically updating by a weighted average method, and combining new and old data:
Updated_Profile=α×Old_Profile+(1−α)×New_Data;
wherein updated_profile represents the Updated user representation, old_profile represents the original user representation, new_data represents the New Data, and α is the weight.
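The weighted-average update above is a one-line computation; a sketch with NumPy (the feature vectors and the value α = 0.8 are illustrative only):

```python
import numpy as np

def update_profile(old_profile, new_data, alpha=0.8):
    # Updated_Profile = alpha * Old_Profile + (1 - alpha) * New_Data
    return alpha * np.asarray(old_profile) + (1 - alpha) * np.asarray(new_data)
```

A larger α weights the historical portrait more heavily; in practice α would be tuned to how quickly user behavior drifts.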
In this embodiment, a Ward variance minimization method is adopted to analyze user behavior data, and a clustering result is obtained, which is specifically as follows:
extracting key behavior characteristics, including session duration, page access number and click rate, to form a feature matrix X, each row of which corresponds to one user; using the Euclidean distance as the standard distance metric, calculating the distance between different users, and constructing a distance matrix;
clustering by the Ward algorithm according to the calculated distance matrix: computing the dispersion of each cluster and merging two clusters at a time such that the increment of the within-cluster sum of squared errors (WCSS) after merging is minimal, generating a cluster tree and constructing a dendrogram:

WCSS(C) = Σ_{x ∈ C} ||x − μ_C||²; ΔWCSS = WCSS(C_a ∪ C_b) − WCSS(C_a) − WCSS(C_b);

wherein ΔWCSS is the WCSS increment; x is a user in cluster C; μ_C is the mean of cluster C;
based on the dendrogram, the cluster number is determined by the CH (Calinski-Harabasz) index and the Dunn index:

CH = [tr(B_k)/(k − 1)] / [tr(W_k)/(N − k)];

wherein B_k is the between-cluster scatter matrix and W_k is the within-cluster scatter matrix; k is the cluster number; N is the total number of data points;

the Dunn index is the ratio of the minimum distance between two nearest-neighbor clusters to the maximum cluster diameter.
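Ward clustering and cutting the dendrogram into a chosen number of clusters can be sketched with SciPy; the feature matrix and cluster count are assumed inputs, and the CH/Dunn index selection step is omitted here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ward_segments(features, n_clusters):
    # Build the Ward linkage: each merge minimizes the WCSS increment
    Z = linkage(features, method="ward")
    # Cut the dendrogram into n_clusters groups; labels are 1-based
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

The per-group feature means forming the static portrait would then be computed over rows sharing a label.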
In this embodiment, the behavior path of the user is analyzed, and the interaction mode of the user in the application is identified by modeling the user behavior as a state transition process (such as a Markov chain); all possible user behaviors are determined as states in the Markov chain;
counting the frequency of each behavior transition, and calculating the transition probability between states:

P(s_i → s_j) = C(s_i, s_j) / Σ_k C(s_i, s_k);

wherein P(s_i → s_j) is the probability of transitioning from state s_i to state s_j; C(·, ·) is a transition counter; Σ_k C(s_i, s_k) represents the total number of transitions out of state s_i;
Based on the calculated transition probabilities between the states, a transition matrix is constructed, and a high-frequency transition path is identified by using the transition matrix, so that a user interaction mode is determined.
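A sketch of estimating the transition probabilities P(s_i → s_j) from recorded behavior paths; the state names used in the example are illustrative:

```python
from collections import Counter

def transition_probabilities(paths):
    # C(s_i, s_j): count each observed transition along every path
    counts = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[(a, b)] += 1
    # Total number of transitions leaving each state s_i
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c
    # P(s_i -> s_j) = C(s_i, s_j) / sum_k C(s_i, s_k)
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}
```

High-probability entries of the resulting matrix correspond to the high-frequency transition paths used to determine the interaction mode.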
In this embodiment, S5 is specifically:
Defining a state space as a combination of a system load level and a user path position, and defining an action set as a resource adjustment and user guidance strategy; combining the user portrait information with the system state characteristics to form the state input of reinforcement learning, so that the scheduling decision is more personalized; the action space is adjusted according to different user groups;
The bonus function R (s, a) combines resource utilization and path completion rate:
R(s,a)=w1×Resource_Efficiency(s,a)+w2×Path_Completion_Success(s);
wherein Resource_Efficiency(s, a) is the resource utilization rate; Path_Completion_Success(s) is the path completion rate; w_1 and w_2 are the corresponding weights;
Q value update:

Q(s, a) ← Q(s, a) + α·[R(s, a) + γ·max_{a'} Q(s', a') − Q(s, a)];

wherein Q(s, a) is the value estimate of performing action a in state s; α is the learning rate; γ is the discount factor weighting the present value of future rewards, with 0 ≤ γ < 1; max_{a'} Q(s', a') is the maximum expected Q value over all possible actions a' in the next state s';
Collecting behavior data and performance indexes, obtaining an optimization strategy through Q-Learning adjustment strategy and training, and improving resource allocation efficiency and user experience based on the optimization strategy;
implementing an ε-greedy policy in the real-time streaming model to balance exploration and exploitation in action selection.
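A minimal tabular sketch of the Q-value update and ε-greedy action selection of S5; the state/action names and the hyperparameter values are assumptions for illustration:

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon explore a random action, otherwise exploit the best
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

In the method, the state would encode the system load level and user path position, and the reward would be R(s, a) as defined above.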
In this embodiment, S6 is specifically:
And setting performance and experience thresholds by using a CEP (complex event processing) module of the Flink, analyzing the data stream in real time, and triggering an alarm or automatically adjusting rules when abnormality occurs.
packaging feedback information into the Flink pipeline, and automatically adjusting the Q-Learning learning rate and the update frequency of the state features according to the trigger rules.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the invention in any way, and any person skilled in the art may make modifications or alterations to the disclosed technical content to the equivalent embodiments. However, any simple modification, equivalent variation and variation of the above embodiments according to the technical substance of the present invention still fall within the protection scope of the technical solution of the present invention.