CN119603175A - Server failure prediction method, device, electronic device and storage medium - Google Patents

Server failure prediction method, device, electronic device and storage medium

Info

Publication number
CN119603175A
Authority
CN
China
Prior art keywords
server
time
fault
operation index
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202411724931.6A
Other languages
Chinese (zh)
Inventor
林榆云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Technologies Co Ltd
Priority to CN202411724931.6A
Publication of CN119603175A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/149 Network analysis or design for prediction of maintenance
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application provides a server fault prediction method and apparatus, an electronic device, and a storage medium. Time series data corresponding to an operation index of a running server is obtained and divided into multiple time windows, and a target dynamic feature is obtained for each time window. The failure probability of the server is then predicted from the target dynamic features of the time windows under the operation index, and a server fault alarm indication is output when the failure probability is greater than a preset threshold. Because the failure probability is predicted from the time series data of the operation index, that is, from the operation of the server before the current moment, possible server failures are predicted before they occur.

Description

Server fault prediction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of fault prediction technologies, and in particular, to a method and apparatus for predicting a server fault, an electronic device, and a storage medium.
Background
Currently, the baseboard management controller (BMC) included in a server provides basic fault detection and alarm functions: once the server has failed, the BMC can detect the fault and raise an alarm to remind the relevant personnel to maintain the server. However, the BMC can only detect a fault after it has occurred; it cannot predict a fault that has not yet occurred.
Disclosure of Invention
In view of the above, the present application provides a server failure prediction method, apparatus, electronic device, and storage medium, so as to predict a server failure.
The technical scheme provided by the application is as follows:
According to an embodiment of the first aspect of the present application, there is provided a server failure prediction method, including:
obtaining time series data corresponding to an operation index of the server during operation, wherein the operation index is used for describing the operation condition of the server;
dividing the time series data corresponding to the operation index into a plurality of time windows, and obtaining, for each time window, a target dynamic feature describing the operation index, wherein the target dynamic feature is used for identifying local fluctuations and short-term trends of the operation index within that time window; and
predicting the failure probability of the server based on the target dynamic features corresponding to the time windows under the operation index, and outputting a server fault alarm indication when the failure probability is greater than a preset threshold.
Optionally, the obtaining time series data corresponding to the operation index of the server during operation includes:
Acquiring operation indexes acquired in a set time period before the current moment by a baseboard management controller BMC deployed in the server;
The operation index comprises one or more of CPU temperature, fan rotating speed, memory utilization rate and power supply power.
Optionally, the target dynamic feature comprises a CNN feature alone, or the CNN feature together with at least one of a statistical feature and a frequency domain feature, wherein the statistical feature is used for identifying the overall trend and distribution of the operation index, the frequency domain feature is used for identifying potential periodic fluctuations of the operation index, and the CNN feature is determined by the following steps:
For each time window under the operation index, carrying out standardized processing on time sequence data included in the time window to obtain preprocessed time sequence data;
And inputting the preprocessed time sequence data into a trained convolutional neural network CNN to obtain CNN characteristics corresponding to the time window.
Optionally, predicting the fault probability of the server based on the target dynamic feature corresponding to each time window under the operation index includes:
Inputting target dynamic characteristics corresponding to each time window under the operation index into a trained long-short-term memory network LSTM to obtain LSTM network output, wherein the LSTM network output is used for representing long-term dependency relationship between time sequence data corresponding to the operation index;
Inputting the LSTM network output into the deployed fully connected layer to obtain the failure probability of the server.
Optionally, the LSTM network is obtained by training based on time series data, a system log and server history fault information corresponding to an operation index obtained in a training stage when the server operates;
The system log comprises one or more of an operation log, an event log and an SDS log, and the server history fault information comprises the history fault record of the server or other servers and the running condition of the server in a designated time period before and after the fault occurrence time.
Optionally, the method further comprises:
When the server is detected to be faulty, acquiring operation indexes acquired in a set time period before the moment of the fault, and training the LSTM network by taking the operation indexes acquired in the set time period as training data.
According to an embodiment of the second aspect of the present application, there is provided a server failure prediction apparatus including:
The data acquisition unit is used for acquiring time sequence data corresponding to the running index of the server in running, wherein the running index is used for describing the running condition of the server;
The feature acquisition unit is used for dividing the time series data corresponding to the operation index into a plurality of time windows and obtaining, for each time window, a target dynamic feature describing the operation index, wherein the target dynamic feature is used for identifying local fluctuations and short-term trends of the operation index within each time window; the target dynamic feature comprises a CNN feature alone, or the CNN feature together with at least one of a statistical feature and a frequency domain feature; the statistical feature is used for identifying the overall trend and distribution of the operation index; and the frequency domain feature is used for identifying potential periodic fluctuations of the operation index;
The fault prediction unit is used for predicting the fault probability of the server based on the target dynamic characteristics corresponding to each time window under the operation index, and outputting a server fault alarm indication when the fault probability is greater than a preset threshold.
Optionally, the data obtaining unit is specifically configured to:
Acquiring operation indexes acquired in a set time period before the current moment by a baseboard management controller BMC deployed in the server;
the operation indexes comprise one or more of CPU temperature, fan rotating speed, memory utilization rate and power supply power;
And/or the target dynamic feature comprises a CNN feature alone, or the CNN feature together with at least one of a statistical feature and a frequency domain feature, wherein the statistical feature is used for identifying the overall trend and distribution of the operation index, the frequency domain feature is used for identifying potential periodic fluctuations of the operation index, and the feature obtaining unit is specifically used for:
For each time window under the operation index, carrying out standardized processing on time sequence data included in the time window to obtain preprocessed time sequence data;
inputting the preprocessed time sequence data into a trained convolutional neural network CNN to obtain CNN characteristics corresponding to the time window;
And/or, the fault prediction unit is specifically configured to:
Inputting target dynamic characteristics corresponding to each time window under the operation index into a trained long-short-term memory network LSTM to obtain LSTM network output, wherein the LSTM network output is used for representing long-term dependency relationship between time sequence data corresponding to the operation index;
Inputting the LSTM network output to the deployed full-connection layer to obtain the fault probability of the server;
And/or the LSTM network is obtained by training based on time sequence data, system logs and server history fault information which are acquired in a training stage and correspond to the operation indexes during the operation of the server;
the system log comprises one or more of an operation log, an event log and an SDS log, and the server history fault information comprises the history fault records of the server or other servers and the running conditions of the servers in a designated time period before and after the fault occurrence time;
And/or, the fault prediction unit is further configured to:
When the server is detected to be faulty, acquiring operation indexes acquired in a set time period before the moment of the fault, and training the LSTM network by taking the operation indexes acquired in the set time period as training data.
According to an embodiment of a third aspect of the present application, there is provided an electronic apparatus including:
a processor; and
a computer-readable storage medium having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the method according to the first aspect.
According to an embodiment of a fourth aspect of the present application, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the method according to the first aspect.
According to the above technical scheme, time series data corresponding to an operation index of the server during operation is obtained and divided into a plurality of time windows, and the target dynamic feature corresponding to each time window is obtained. The failure probability of the server is then predicted based on the target dynamic features corresponding to the time windows under the operation index, and a server fault alarm indication is output when the failure probability is greater than the preset threshold. Because the failure probability is predicted from the time series data of the operation index, that is, from the operation of the server before the current moment, a possible failure of the server is predicted before it occurs.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart of a server failure prediction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the overall structure of a server failure prediction network according to an embodiment of the present application;
FIG. 3 is a schematic overall flowchart of a server failure prediction method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a block diagram of a server failure prediction apparatus according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solution provided by the embodiments of the present application and make the above objects, features and advantages of the embodiments of the present application more obvious, the technical solution in the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a server failure prediction method according to an embodiment of the present application.
As shown in FIG. 1, the method may include the following steps:
Step 101, obtaining time series data corresponding to the operation index of the server during operation.
In this embodiment, possible failures of the server are predicted from the data of the server before the current moment. Therefore, the relevant parameters of the server up to the current moment, such as the time series data corresponding to the operation index of the server during operation, must first be obtained.
The length of the time period corresponding to the acquired time series data may be set according to practical situations, which is not limited by the present application.
As one embodiment, the method for obtaining time series data corresponding to the running index of the server at the running time may include:
acquiring operation indexes acquired in a set time period before the current moment by a baseboard management controller BMC deployed in a server;
The operation index may include one or more of a CPU temperature, a fan speed, a memory usage rate, and a power supply power.
In this embodiment, the operation indexes that describe the operation condition of the server, such as CPU temperature, fan speed, memory usage rate, and power supply power, may be collected by the baseboard management controller BMC deployed in the server, or alternatively by a server management tool or a remote monitoring platform.
So far, the description of step 101 is ended, and step 102 is performed as follows.
Step 102, dividing the time series data into a plurality of time windows according to the time series data corresponding to the operation index, and obtaining the target dynamic characteristics for describing the operation index under each time window.
After the time series data corresponding to the operation indexes are collected, the time series data corresponding to each index can be subjected to sliding window processing, namely, the time series data corresponding to each index is divided into a plurality of time windows.
As an embodiment, the lengths of the time windows may be set to the same fixed length, and the time series data included in each time window may overlap to improve the utilization of the time series data. In this case, the overlapping data helps to capture more local variations and timing information.
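The fixed-length, overlapping windowing described above can be illustrated with a minimal sketch. The function name `sliding_windows` and the parameters `window_len` and `stride` are illustrative assumptions, not names used by the application; choosing a stride smaller than the window length produces the overlap.

```python
import numpy as np

def sliding_windows(series, window_len, stride):
    """Split a 1-D series into fixed-length windows.

    A stride smaller than window_len makes consecutive windows overlap.
    (Names and parameters are hypothetical, chosen for illustration.)
    """
    series = np.asarray(series, dtype=float)
    if window_len <= 0 or stride <= 0:
        raise ValueError("window_len and stride must be positive")
    if len(series) < window_len:
        # Not enough samples for even one window.
        return np.empty((0, window_len))
    starts = range(0, len(series) - window_len + 1, stride)
    return np.stack([series[s:s + window_len] for s in starts])

# Ten samples of a made-up operation index, windows of 4 with 50% overlap.
cpu_temp = np.arange(10, dtype=float)
windows = sliding_windows(cpu_temp, window_len=4, stride=2)
```

With these example values, each window shares its last two samples with the first two samples of the next window.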
Further, after the time series data is divided into a plurality of time windows, for each time window, a target dynamic feature corresponding to the time window may be obtained.
Because the time sequence data corresponding to the operation index is divided to obtain a plurality of time windows, and the target dynamic characteristics are determined according to the time sequence data corresponding to each time window, the target dynamic characteristics can identify the local fluctuation and short-term trend of the operation index under each time window.
As an embodiment, the target dynamic feature may include only the CNN feature, or may include the CNN feature together with at least one of a statistical feature and a frequency domain feature, where the statistical feature is used to identify the overall trend and distribution of the index, and the frequency domain feature is used to identify potential periodic fluctuations of the index.
As an embodiment, before the target dynamic feature corresponding to a time window is obtained, the time series data included in each time window may be normalized, so that the different units and scales of the various operation indexes do not distort the features.
Specifically, the time series data included in each time window may be normalized by the following formula:
$$x' = \frac{x - \mu}{\sigma}$$
where $x'$ is the data obtained after normalization, $x$ is the raw data, $\mu$ is the mean of the time series data included in the time window, and $\sigma$ is the standard deviation of the time series data included in the time window.
The above normalization processing is performed for each value of the time-series data included in each time window, and the time-series data after the normalization processing can be obtained. In the present embodiment, the process of normalizing time-series data is actually preprocessing the data before feature extraction, and the normalized time-series data is subsequently written as preprocessed time-series data.
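A minimal sketch of this per-window normalization follows: each value has the window mean subtracted and is divided by the window standard deviation. The small constant `eps` is an assumption added here to avoid division by zero for a constant window; it is not mentioned in the application.

```python
import numpy as np

def zscore_window(window, eps=1e-8):
    """Normalize one time window to zero mean and unit standard deviation.

    eps is a hypothetical safeguard for windows whose values are all equal.
    """
    window = np.asarray(window, dtype=float)
    mu = window.mean()      # mean of the window
    sigma = window.std()    # standard deviation of the window
    return (window - mu) / (sigma + eps)

z = zscore_window([1.0, 2.0, 3.0, 4.0, 5.0])
```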
As one example, CNN characteristics may be determined by:
For each time window under the operation index, carrying out standardized processing on time sequence data included in the time window to obtain preprocessed time sequence data;
And inputting the preprocessed time sequence data into the trained convolutional neural network CNN to obtain CNN characteristics corresponding to the time window.
In this embodiment, the preprocessed time-series data may be used as an input of a trained CNN network, so as to perform feature extraction on the preprocessed time-series data through the CNN network, to obtain a CNN feature corresponding to the time window.
As one example, CNN features may be extracted according to the following formula:
$$y_{i,j} = \sum_{m} \sum_{n} x_{i+m,\,j+n} \, w_{m,n}$$
where $y_{i,j}$ denotes the resulting CNN feature, $x_{i+m,j+n}$ denotes the value of the input data (in matrix form) at position $(i+m, j+n)$, and $w_{m,n}$ denotes the value of the convolution kernel $w$ (in matrix form) at position $(m, n)$.
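The convolution described above, in which each output value is the sum of products between a patch of the input matrix and the kernel, can be sketched in plain NumPy as follows. This shows a single kernel without padding, stride, bias, or activation; a trained CNN would stack several such layers, and the function name `conv2d_valid` is an assumption for illustration.

```python
import numpy as np

def conv2d_valid(x, w):
    """Compute y[i, j] = sum over (m, n) of x[i + m, j + n] * w[m, n].

    No padding is applied, so the output is smaller than the input.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the patch at (i, j).
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return y

x = np.arange(9, dtype=float).reshape(3, 3)
w = np.ones((2, 2))
y = conv2d_valid(x, w)
```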
As one example, the statistical features may include the mean, standard deviation, maximum, minimum, and similar features, where the mean and standard deviation may be determined by the following formulas:
$$\mathrm{Mean} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where Mean represents the mean, $n$ represents the total number of data included in the time window, and $x_i$ represents the value of the $i$-th data in the time window.
$$\mathrm{StdDev} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
where StdDev denotes the standard deviation, $n$ denotes the total number of data included in the time window, $x_i$ denotes the value of the $i$-th data in the time window, and $\mu$ is the mean (i.e., Mean) of the time series data included in the time window.
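The statistical features named above can be computed per window as in the following minimal sketch. It assumes the population form of the standard deviation (dividing by n), which is NumPy's default; the function name and the ordering of the output vector are illustrative choices.

```python
import numpy as np

def statistical_features(window):
    """Return [mean, standard deviation, maximum, minimum] of one window.

    np.std divides by n (population form) by default; that choice is an
    assumption of this sketch.
    """
    window = np.asarray(window, dtype=float)
    return np.array([window.mean(), window.std(), window.max(), window.min()])

f_stats = statistical_features([2, 4, 4, 4, 5, 5, 7, 9])
```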
As one example, the frequency domain features may be determined by the Fourier transform:
$$X(f) = \sum_{t=0}^{N-1} x(t) \, e^{-j 2 \pi f t / N}$$
where $X(f)$ is the frequency domain feature of a time window, $x(t)$ is the time series data included in the time window, $N$ is the number of data points in the time window, and $e^{-j2\pi ft/N}$ is the standard Fourier transform kernel, which is not described further herein. The frequency domain features can be used to identify potential periodic fluctuations in the time series data included in the time window.
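As a hedged illustration of the frequency domain features, the sketch below takes the magnitudes of the first few non-constant Fourier components of a window. Using magnitudes, skipping the constant (zero-frequency) term, and the `n_bins` parameter are all assumptions of this sketch; the application does not specify how the transform output is turned into a feature vector.

```python
import numpy as np

def frequency_features(window, n_bins=4):
    """Return magnitudes of the first n_bins non-constant frequency bins.

    n_bins is a hypothetical parameter; np.fft.rfft computes the discrete
    Fourier transform of a real-valued sequence.
    """
    window = np.asarray(window, dtype=float)
    spectrum = np.fft.rfft(window)
    magnitudes = np.abs(spectrum)
    return magnitudes[1:1 + n_bins]  # skip the zero-frequency term

# A window containing exactly two cycles of a sine wave over 16 samples.
t = np.arange(16)
f_freq = frequency_features(np.sin(2 * np.pi * 2 * t / 16), n_bins=4)
```

For this example the largest magnitude falls in the bin for two cycles per window, which is how a periodic fluctuation would show up in the feature vector.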
In this embodiment, the CNN feature may be used alone as the target dynamic feature, or at least one of the statistical feature and the frequency domain feature may be combined with the CNN feature to form a higher-dimensional feature vector that serves as the target dynamic feature. The combined feature is better able to identify the overall trend and distribution of the index as well as its potential periodic fluctuations, which in turn improves the accuracy of server failure prediction.
As an embodiment, the method for combining the statistical feature and the frequency domain feature with the CNN feature to obtain the target dynamic feature may include:
$$f_{\mathrm{fusion}} = [f_{\mathrm{CNN}}, f_{\mathrm{stats}}, f_{\mathrm{freq}}]$$
where $f_{\mathrm{fusion}}$ denotes the target dynamic feature, $f_{\mathrm{CNN}}$ denotes the CNN feature, $f_{\mathrm{stats}}$ denotes the statistical feature, and $f_{\mathrm{freq}}$ denotes the frequency domain feature. Similarly, the CNN feature may be combined with the statistical feature only, or with the frequency domain feature only, which is not limited by the present application.
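The combination above amounts to concatenating the feature vectors. A minimal sketch follows, in which the statistical and frequency domain parts are optional so that the CNN feature can also be used alone; the function name `fuse_features` is an assumption.

```python
import numpy as np

def fuse_features(f_cnn, f_stats=None, f_freq=None):
    """Concatenate the CNN feature with any optional features provided."""
    parts = [np.asarray(f_cnn, dtype=float).ravel()]
    if f_stats is not None:
        parts.append(np.asarray(f_stats, dtype=float).ravel())
    if f_freq is not None:
        parts.append(np.asarray(f_freq, dtype=float).ravel())
    return np.concatenate(parts)

f_fusion = fuse_features([0.1, 0.2], f_stats=[5.0, 2.0], f_freq=[8.0])
f_cnn_only = fuse_features([0.1, 0.2])
```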
To this end, the description of step 102 is ended, and step 103 is performed as follows.
Step 103, predicting the failure probability of the server based on the target dynamic features corresponding to the time windows under the operation index, and outputting a server fault alarm indication when the failure probability is greater than a preset threshold.
After determining the target dynamic characteristics corresponding to each time window under the operation index, the fault probability of the server can be predicted based on the target dynamic characteristics corresponding to each time window under the operation index.
As one embodiment, the method for predicting the failure probability of the server based on the target dynamic characteristics corresponding to each time window under the operation index may include:
Inputting the target dynamic characteristics corresponding to each time window under the operation index into a trained long-short-term memory network LSTM to obtain LSTM network output, wherein the LSTM network output is used for representing the long-term dependency relationship between time sequence data corresponding to the operation index;
Inputting the LSTM network output into the deployed fully connected layer to obtain the failure probability of the server.
In this embodiment, a trained LSTM network may be deployed in advance in a server, and long-term dependencies between time-series data corresponding to operation indexes are captured through the LSTM network, and further, the long-term dependencies between the time-series data corresponding to the operation indexes are converted into failure probabilities of the server through a full connection layer, so as to implement prediction of server failures.
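The final stage, mapping the LSTM output to a failure probability and comparing it with the preset threshold, might look like the sketch below. It assumes a single-output fully connected layer followed by a sigmoid, which is one common way to obtain a value between 0 and 1; the application states only that a fully connected layer produces the probability. The weights, the hidden-state values, and the default threshold of 0.5 are made up for illustration.

```python
import numpy as np

def failure_probability(h_last, w_fc, b_fc):
    """Map the last LSTM output vector to a probability in (0, 1).

    Assumes one fully connected output unit with a sigmoid activation.
    """
    logit = float(np.dot(w_fc, h_last) + b_fc)
    return 1.0 / (1.0 + np.exp(-logit))

def should_alarm(prob, threshold=0.5):
    """Raise an alarm only when the probability exceeds the threshold."""
    return bool(prob > threshold)

h_last = np.array([1.0, 1.0, 1.0])   # hypothetical LSTM output
w_fc = np.ones(3)                    # hypothetical trained weights
p = failure_probability(h_last, w_fc, b_fc=0.0)
alarm = should_alarm(p, threshold=0.5)
```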
In this embodiment, the LSTM network is obtained by training based on time-series data, a system log, and server history fault information corresponding to an operation index obtained during operation of the server in a training phase;
The system log comprises one or more of an operation log, an event log and an SDS log, and the server history fault information comprises the history fault record of the server or other servers and the server running conditions in a designated time period before and after the fault occurrence time.
It should be noted that the time series data corresponding to the operation index obtained in the training stage all come from periods in which the health state of the server is already known (i.e., whether the server failed in the period). On this basis, the time series data corresponding to the operation index can be used as training data, and the health state (normal or failed) of the server in the period can be used as the training label, to train the CNN network and the LSTM network.
In the training phase, in addition to the time-series data corresponding to the period in which the health state of the server is already clarified (i.e., whether the server fails in the period) as sample data, the CNN network and the LSTM network may be trained using the system log (operation log, event log, SDS log, etc.) and the server history failure information as sample data.
It is easy to understand that training the CNN network and the LSTM network by using the system log and the server history fault information (including the history fault record of the present server or other servers and the running conditions of the servers within a specified period of time before and after the fault occurrence time) as sample data can provide more dimensional references for the faults and parameters before and after the faults that may occur to the server, so that the network prediction accuracy obtained by the final training is higher.
In this embodiment, the LSTM can effectively capture long-term dependencies in time-series data corresponding to the operation index through its gating mechanism (forget gate, input gate, and output gate). This capability makes LSTM suitable for predicting future server failures and provides early warning for administrators to take timely safeguards.
The LSTM network is briefly described as follows:
The LSTM network includes a forget gate, an input gate, and an output gate for adjusting the cell states in the LSTM network (history information stored in the network) to output the current hidden state (server failure prediction result).
Specifically, the forget gate decides, according to the current operating condition of the server, whether to forget certain information from the previous moment. For example, when the CPU temperature suddenly increases, the forget gate may decide to retain more past temperature information in order to capture a trend that may lead to overheating.
The output formula of the forgetting gate is as follows:
ft=σ(Wf·[ht-1,xt]+bf)
Where f t denotes the output of the forget gate, which determines how much information in the cell state needs to be "forgotten" or discarded at the last time. In hardware state prediction, this can be seen as determining how much the current hardware state affects the future state.
Sigmoid activates the function with an output value between 0 and 1.0 indicates complete forgetfulness, and 1 indicates complete retention.
And W f, a weight matrix which represents parameters used for calculating forgetting doors in the neural network of the layer and is used for adjusting the forgetting degree according to input data and the hiding state at the last moment. The matrix is obtained by optimizing the tuning parameters in the training process.
And h t-1, the hidden state of the previous moment contains the memory information of the network on the hardware state at the previous moment, and the memory information is calculated by the prediction output or the hidden state at the previous moment.
X t input data at the current time, here hardware state data collected from the BMC platform, such as temperature, load, etc., enters LSTM after preprocessing and feature extraction.
And b f, controlling the forgotten deviation by forgetting the bias item of the door, and helping the model to adjust the forgetting degree.
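The forget gate formula can be illustrated with a minimal Python sketch. This is not the patent's implementation: the matrix sizes (2 hidden units, 2 input features), the weight values, and the helper names `sigmoid`, `affine`, and `forget_gate` are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W·v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f·[h_{t-1}, x_t] + b_f), one value per cell-state unit."""
    concat = h_prev + x_t  # [h_{t-1}, x_t] as a list concatenation
    return [sigmoid(z) for z in affine(W_f, b_f, concat)]

# Hypothetical values: 2 hidden units, 2 normalized inputs (e.g. temperature, load).
W_f = [[0.5, -0.3, 0.8, 0.1],
       [0.0, 0.2, -0.4, 0.6]]
b_f = [1.0, 1.0]
h_prev = [0.1, -0.2]
x_t = [0.9, 0.3]

f_t = forget_gate(W_f, b_f, h_prev, x_t)
print(f_t)  # each entry lies strictly between 0 (forget) and 1 (retain)
```

Each entry of `f_t` scales the matching entry of the previous cell state, so values near 1 keep the stored history and values near 0 discard it.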
In this embodiment, the input gate is used to write important information at the current time into the cell state. For example, if a sudden rise in CPU power consumption is detected, the input gate will calculate the cell state that needs to be updated to reflect the effect that such a change may have on future failure predictions.
The output formula of the input gate is as follows:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
Candidate cell state:
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
Where $i_t$ denotes the output of the input gate, which determines how much of the input data at the current moment needs to be written to the cell state. For hardware state prediction tasks, this means deciding which hardware data (e.g., sensor readings, load changes) is helpful for the current prediction.
$W_i$: the weight matrix of the input gate, which determines how the combination of the input data at the current moment and the hidden state at the previous moment affects the output of the input gate, and thereby the updating of the cell state.
$\tilde{C}_t$: the candidate cell state, a "potential" update value that represents the candidate new information in the current input and hidden state. The tanh activation function ensures that its values lie between -1 and 1.
$W_C$: the weight matrix of the candidate cell state, which, together with the input data and the hidden state at the previous moment, determines the new information to be added to the cell state.
$b_i$ and $b_C$: the bias terms of the input gate and the candidate cell state, which help to adjust the effect of the input on cell state updates.
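As a hedged illustration of the input gate and the candidate cell state, the sketch below computes both from the same concatenated vector. The function names `preactivation`, `input_gate`, and `candidate_state`, as well as all numeric values, are hypothetical and not taken from the application.

```python
import math

def preactivation(W, b, h_prev, x_t):
    """Compute W·[h_{t-1}, x_t] + b, shared by the gate and the candidate."""
    concat = h_prev + x_t
    return [sum(w * v for w, v in zip(row, concat)) + b_k for row, b_k in zip(W, b)]

def input_gate(W_i, b_i, h_prev, x_t):
    """i_t: sigmoid output in (0, 1), how much new information to write."""
    return [1.0 / (1.0 + math.exp(-z)) for z in preactivation(W_i, b_i, h_prev, x_t)]

def candidate_state(W_C, b_C, h_prev, x_t):
    """Candidate cell state: tanh output in (-1, 1), the proposed new content."""
    return [math.tanh(z) for z in preactivation(W_C, b_C, h_prev, x_t)]

# Hypothetical example: 1 hidden unit, 2 inputs (e.g. CPU power, CPU temperature).
h_prev = [0.2]
x_t = [1.5, -0.4]  # a sudden rise in the first reading (made-up normalized values)
i_t = input_gate([[0.1, 0.9, 0.2]], [0.0], h_prev, x_t)
c_tilde = candidate_state([[0.3, 0.7, -0.5]], [0.1], h_prev, x_t)
print(i_t, c_tilde)
```

The gate and the candidate use separate weights: the candidate proposes what to store, while the gate decides how much of that proposal is actually written.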
Further, through the output results of the forgetting gate and the input gate, the cell state can be updated, so that the model can be ensured to accurately accumulate and update the historical information of the running condition of the server, and the accuracy of fault prediction is improved.
The cell state update formula is:
$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$$
Where $C_t$ denotes the cell state at the current moment, which contains all long-term memory from the previous moment up to the current moment. It is the result of the interaction of the forget gate and the input gate. In hardware prediction, the cell state represents the history of hardware states and is updated based on the current input; for example, it may reflect the accumulated CPU operation history.
$C_{t-1}$: the cell state at the previous moment, which represents the hardware state memory at the previous moment.
$f_t \cdot C_{t-1}$: the part controlled by the forget gate, which determines the information to be retained from the cell state at the previous moment.
$i_t \cdot \tilde{C}_t$ (where $\tilde{C}_t$ is the candidate cell state): the part controlled by the input gate, which determines which part of the input data needs to be added to the cell state at the current moment.
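A short worked example may make the update concrete. The numbers are invented for illustration, and $\tilde{C}_t$ denotes the candidate cell state: a forget gate output of 0.9 keeps most of the old state, while an input gate output of 0.2 admits only a small part of the candidate.

```latex
% Assumed scalar values, for illustration only:
% f_t = 0.9, C_{t-1} = 0.5, i_t = 0.2, \tilde{C}_t = -0.4
\begin{aligned}
C_t &= f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \\
    &= 0.9 \times 0.5 + 0.2 \times (-0.4) \\
    &= 0.45 - 0.08 \\
    &= 0.37
\end{aligned}
```

For vector-valued states, the same arithmetic is applied element by element.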
In this embodiment, the output gate is used to control the output of the LSTM unit, i.e. generate a prediction result at the current time, for example, predict the probability that the server is in an abnormal state.
The output gate control signal formula:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
Hidden state update formula:
$$h_t = o_t \cdot \tanh(C_t)$$
Where $o_t$ denotes the control signal of the output gate, which determines which information in the cell state will eventually be output to the next layer or as the final prediction result.
$h_t$: the hidden state, which is also the output at the current moment and represents the prediction of the server state at the next moment.
$W_o$: the weight matrix of the output gate, which controls the contribution of the hidden state to the current prediction.
$b_o$: the bias term of the output gate, which adjusts the output offset.
This concludes the description of the LSTM network.
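Putting the three gates together, one LSTM time step and its repeated application over a sequence might be sketched as follows. This is a simplified illustration under assumed names: the parameter dictionary layout (`"f"`, `"i"`, `"c"`, `"o"` mapping to weight and bias pairs) and the functions `lstm_step` and `run_lstm` are hypothetical, and a practical system would normally rely on a deep learning framework instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W·v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(p, h_prev, c_prev, x_t):
    """One LSTM time step; p maps 'f', 'i', 'c', 'o' to (W, b) pairs."""
    z = h_prev + x_t                                   # [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*p["f"], z)]       # forget gate
    i = [sigmoid(a) for a in affine(*p["i"], z)]       # input gate
    g = [math.tanh(a) for a in affine(*p["c"], z)]     # candidate cell state
    o = [sigmoid(a) for a in affine(*p["o"], z)]       # output gate
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

def run_lstm(p, sequence, hidden_size):
    """Feed a sequence of input vectors through the cell, starting from zeros."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x_t in sequence:
        h, c = lstm_step(p, h, c, x_t)
    return h

# Hypothetical parameters: 2 hidden units, 1 input feature per time step.
W = [[0.2, -0.1, 0.5], [0.3, 0.4, -0.6]]
params = {k: (W, [0.0, 0.0]) for k in ("f", "i", "c", "o")}
final_hidden = run_lstm(params, [[0.1], [0.4], [0.9]], hidden_size=2)
print(final_hidden)
```

The cell state `c` is carried from step to step, which is how information from early time windows can still influence the final hidden state.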
After the LSTM network features output by the LSTM network are obtained, the features also need to be passed to a fully connected layer, so that the long-term dependency features captured by the LSTM can be linearly transformed and mapped by the fully connected layer to a server fault prediction.
As one example, the output of the fully connected layer may be calculated by the following formula:
$$y = \mathrm{Softmax}(W_d \cdot h_t + b_d)$$
Where $y$ is the predicted server fault probability, Softmax() is the activation function, $h_t$ is the hidden state output by the LSTM network, $W_d$ is the weight matrix of the fully connected layer, and $b_d$ is the bias vector.
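The fully connected layer with Softmax can be sketched in a few lines of Python. The two-class layout (index 0 for "normal", index 1 for "fault"), the weight values, and the function names are assumptions made for illustration; the application does not specify the number of output classes.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def fully_connected(W_d, b_d, h_t):
    """y = Softmax(W_d·h_t + b_d), mapping the LSTM hidden state to probabilities."""
    logits = [sum(w * h for w, h in zip(row, h_t)) + b for row, b in zip(W_d, b_d)]
    return softmax(logits)

# Hypothetical layer: 2 hidden units in, 2 classes out (0 = normal, 1 = fault).
W_d = [[0.4, -0.6],
       [-0.4, 0.6]]
b_d = [0.1, -0.1]
h_t = [0.2, -0.5]

y = fully_connected(W_d, b_d, h_t)
fault_probability = y[1]
print(y, fault_probability)
```

Because Softmax normalizes the outputs, the entries of `y` sum to 1 and the "fault" entry can be read directly as a probability.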
It is easy to understand that, during training, parameters such as the weight matrices and bias terms in the formulas of the input gate, the forget gate, and the output gate of the LSTM model, as well as those of the fully connected layer, can all be adjusted.
After predicting the failure probability of the server, if the failure probability is determined to be greater than a preset threshold, a server failure alarm indication can be output to inform relevant personnel to take precautions.
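The threshold check can be sketched as below. The default threshold of 0.8, the message wording, and the function name `failure_alarm` are placeholders; the application only states that the threshold is preset.

```python
from typing import Optional

def failure_alarm(fault_probability: float, threshold: float = 0.8) -> Optional[str]:
    """Return an alarm message if the probability exceeds the threshold, else None.

    The 0.8 default is a hypothetical value, not one given in the source.
    """
    if not 0.0 <= fault_probability <= 1.0:
        raise ValueError("fault_probability must lie in [0, 1]")
    if fault_probability > threshold:  # strictly greater, as in the description
        return (f"ALARM: predicted server fault probability "
                f"{fault_probability:.2f} exceeds threshold {threshold:.2f}")
    return None

print(failure_alarm(0.93))  # an alarm message
print(failure_alarm(0.40))  # None
```

In practice the threshold trades missed faults against false alarms, so it would typically be tuned on validation data.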
This ends the description of step 103.
In this embodiment, a cross-entropy loss function may be used when training both the CNN network and the LSTM network, to measure the classification error:
$$Loss = -\sum_i \hat{y}_i \log(y_i)$$
Where Loss is the loss value, $y_i$ is the predicted value, and $\hat{y}_i$ is the true value.
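For reference, the standard categorical cross-entropy can be computed as in the sketch below, assuming one-hot true labels and predicted class probabilities. The function names and the clipping constant `eps` are illustrative choices, not details from the application.

```python
import math

def cross_entropy(y_pred, y_true, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot true label.

    eps guards against log(0); its value is an implementation assumption.
    """
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y_pred, y_true))

def mean_cross_entropy(batch_pred, batch_true):
    """Average the per-sample loss over a batch."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_pred, batch_true)]
    return sum(losses) / len(losses)

# Hypothetical batch: class 0 = normal, class 1 = fault.
preds = [[0.9, 0.1], [0.3, 0.7]]
labels = [[1, 0], [0, 1]]
print(mean_cross_entropy(preds, labels))
```

The loss is small when the probability assigned to the true class is close to 1 and grows without bound as that probability approaches 0.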
In this embodiment, an optimization algorithm, such as the Adam optimizer or the SGD optimizer, may be used to adjust the model parameters so as to minimize the loss function, and metrics such as the accuracy, precision, and recall of the model may be evaluated on a validation set in order to adjust the hyperparameters of the model.
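The validation metrics mentioned here can be computed from binary labels as in the following sketch. The label convention (1 = fault, 0 = normal), the function name, and the sample data are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = fault, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when no positives are predicted/present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up validation labels and predictions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```

For failure prediction, recall (the share of real faults that were flagged) is often watched closely, since faults are usually rare and accuracy alone can look high even when most of them are missed.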
As one embodiment, when a server is detected to fail, operation indexes acquired in a set time period before the time of failure are obtained, and the operation indexes acquired in the set time period are used as training data to train the LSTM network.
When the server is detected to be faulty, the actually-occurring server fault condition can be fed back to the network, and the relevant parameters of the network are updated so as to improve the accuracy of prediction.
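One way to collect such post-failure training data is sketched below. The sample layout (a list of `(timestamp, value)` pairs), the half-open time interval, and the function name `pre_failure_sample` are assumptions; the application does not specify how the data is stored or labeled.

```python
def pre_failure_sample(samples, failure_time, lookback):
    """Select readings in [failure_time - lookback, failure_time) and label them.

    samples: list of (timestamp_seconds, value) pairs, a hypothetical layout.
    Returns (values, label), where label 1 marks a window that preceded a fault.
    """
    start = failure_time - lookback
    values = [v for t, v in samples if start <= t < failure_time]
    return values, 1

# Made-up CPU temperature readings taken every 10 seconds.
readings = [(0, 61.0), (10, 62.5), (20, 66.0), (30, 71.5), (40, 79.0), (50, 88.0)]
window, label = pre_failure_sample(readings, failure_time=50, lookback=30)
print(window, label)
```

The labeled window can then be appended to the training set and used for a further training pass of the LSTM network.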
This concludes the description of the server failure prediction method in fig. 1.
In the present application, time-series data corresponding to the operation indexes of the server during operation is obtained; the time-series data corresponding to each index is divided into a plurality of time windows, and the target dynamic feature corresponding to each time window is obtained; the fault probability of the server is then predicted based on the target dynamic features corresponding to the time windows under the operation indexes, and a server fault alarm indication is output when the fault probability is greater than the preset threshold. Because the fault probability is predicted from the time-series data of the operation indexes, that is, from the running condition of the server before the current moment, a possible fault of the server can be predicted before it occurs.
Specifically, by deploying a CNN network and an LSTM network in the server, the application can process the collected multidimensional time-series data (such as temperature, power consumption, and voltage) of key hardware in the server, so as to predict the fault probability of the server. The CNN network extracts features from the multidimensional time-series data, capturing local patterns and spatial relations in the data; these are combined with frequency-domain features and statistical features to form the target dynamic features. The LSTM network further processes the target dynamic features, performing time-series modeling and capturing long-term dependencies in the data.
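The statistical and frequency-domain features are not spelled out in this passage. One plausible set, given purely as an assumption, is the mean, standard deviation, extremes, and linear trend of a window, plus the dominant non-zero frequency bin of its discrete Fourier transform:

```python
import cmath
import math
import statistics

def statistical_features(window):
    """Overall level, spread and linear trend of one window (needs >= 2 samples)."""
    n = len(window)
    mean = statistics.fmean(window)
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    # Least-squares slope of value against sample index.
    slope = sum((t - t_mean) * (v - mean) for t, v in enumerate(window)) / denom
    return {"mean": mean, "std": statistics.pstdev(window),
            "min": min(window), "max": max(window), "slope": slope}

def dominant_frequency_bin(window):
    """Index of the strongest non-DC component of the discrete Fourier transform."""
    n = len(window)

    def magnitude(k):
        return abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                       for t, v in enumerate(window)))

    return max(range(1, n // 2 + 1), key=magnitude)

# Synthetic window: two full cycles of a sine wave over 16 samples.
wave = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
print(statistical_features(wave)["std"], dominant_frequency_bin(wave))
```

A dominant bin of `k` corresponds to a fluctuation that repeats `k` times per window, which is one way to expose periodic behavior such as fan speed cycling.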
The server fault prediction method provided by the application is described below through a specific embodiment.
Example 1
Referring to fig. 2 and fig. 3, fig. 2 is a schematic diagram of an overall structure of a server failure prediction network according to an embodiment of the present application, and fig. 3 is a schematic flow chart of an overall server failure prediction method according to an embodiment of the present application.
The server fault prediction method provided by the embodiment comprises the following steps:
Step 301, data acquisition.
In this embodiment, the time-series data, such as the CPU temperature and the power consumption time-series data, corresponding to a plurality of indicators in the server operation may be collected by the BMC.
Step 302, data preprocessing.
In this embodiment, after the collected time-series data is divided into a plurality of time windows, the time-series data included in each time window may be preprocessed, for example, the time-series data may be normalized, and the normalized time-series data may be used as input data.
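The windowing and normalization in step 302 might look like the following sketch. The window size, the step between windows (overlapping or not), and the choice of z-score normalization are assumptions; the application only states that the data is divided into time windows and normalized.

```python
import statistics

def split_windows(series, size, step):
    """Cut a series into fixed-size windows; step < size gives overlapping windows."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, step)]

def zscore(window):
    """Normalize a window to zero mean and unit variance (zeros if constant)."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]

# Made-up CPU temperature series, 4-sample windows advancing by 2 samples.
cpu_temp = [60, 61, 63, 62, 66, 70, 75, 81, 88, 95]
windows = [zscore(w) for w in split_windows(cpu_temp, size=4, step=2)]
print(len(windows), windows[0])
```

Normalizing per window removes the absolute level of each reading, so the downstream network sees the shape of the fluctuation rather than the raw units of each sensor.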
Step 303, feature extraction.
In this embodiment, the time-series data after the normalization processing may be used as an input in the input layer (corresponding to the input layer in fig. 2), and input into the CNN network, and the processing such as pooling and node expansion may be performed through the CNN network (corresponding to the CNN layer in fig. 2), so as to obtain the target dynamic feature.
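To show what convolution and pooling do to a single window, here is a minimal one-dimensional sketch with hand-written kernels. A trained CNN would learn its kernels; the kernels, the ReLU activation, the pool size, and the function names here are all hypothetical.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, the operation CNN layers usually compute."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Keep positive responses, zero out negative ones."""
    return [max(0.0, v) for v in values]

def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial group is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def cnn_features(window, kernels, pool_size=2):
    """Apply each kernel, then ReLU and pooling, and flatten into one vector."""
    features = []
    for kernel in kernels:
        features.extend(max_pool(relu(conv1d(window, kernel)), pool_size))
    return features

# Hand-picked kernels: one responds to rising values, one to falling values.
rising, falling = [-1, 0, 1], [1, 0, -1]
print(cnn_features([1, 2, 3, 4, 5], [rising, falling]))
```

On a steadily rising window only the "rising" kernel produces a non-zero feature, which illustrates how convolution picks up local fluctuation and short-term trend.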
Step 304, fault prediction.
In this embodiment, the extracted multiple target dynamic features may be input into an LSTM network (corresponding to the LSTM layer in fig. 2), so as to obtain LSTM network features, so as to characterize long-term dependency relationships between the target dynamic features.
The LSTM network features are further input to the fully connected layer (corresponding to the fully connected layer in fig. 2), which converts them into the server failure prediction probability, and the probability is output by the output layer (corresponding to the output layer in fig. 2).
This concludes the description of embodiment 1.
Referring to fig. 4, fig. 4 is a schematic block diagram of an electronic device according to an embodiment of the present application. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, and a computer readable storage medium, and may of course also include other hardware required by the service. The processor reads the corresponding computer program instructions from the computer readable storage medium and runs them, forming the terminal interaction device at the logical level. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present application; that is, the execution subject of the following processing flows is not limited to logic units, but may also be hardware or logic devices.
Referring to fig. 5, fig. 5 is a block diagram of a server failure prediction apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus may include a data obtaining unit 501, a feature obtaining unit 502, and a failure prediction unit 503. Specifically, the device comprises:
A data obtaining unit 501, configured to obtain time series data corresponding to an operation index of the server during operation;
The feature obtaining unit 502 is configured to divide time series data corresponding to an operation index into a plurality of time windows, and obtain a target dynamic feature for describing the operation index under each time window;
The fault prediction unit 503 is configured to predict a fault probability of the server based on the target dynamic feature corresponding to each time window under the operation index, and output a server fault alarm indication if the fault probability is greater than a preset threshold.
Optionally, the data obtaining unit 501 is specifically configured to:
acquiring operation indexes acquired in a set time period before the current moment by a baseboard management controller BMC deployed in a server;
The operation indexes comprise one or more of CPU temperature, fan rotating speed, memory utilization rate and power supply power;
And/or the target dynamic feature includes a CNN feature, or at least one of a statistical feature and a frequency domain feature, and the CNN feature, where the statistical feature is used to identify an overall trend and a distribution situation of the operation index, and the frequency domain feature is used to identify a potential periodic fluctuation situation of the operation index, and the feature obtaining unit 502 is specifically configured to:
For each time window under the operation index, carrying out standardized processing on time sequence data included in the time window to obtain preprocessed time sequence data;
Inputting the preprocessed time sequence data into a trained convolutional neural network CNN to obtain CNN characteristics corresponding to the time window;
And/or, the fault prediction unit 503 is specifically configured to:
Inputting the target dynamic characteristics corresponding to each time window under the operation index into a trained long-short-term memory network LSTM to obtain LSTM network output, wherein the LSTM network output is used for representing the long-term dependency relationship between time sequence data corresponding to the operation index;
Inputting LSTM network output to the deployed full connection layer to obtain the fault probability of the server;
and/or the LSTM network is obtained by training based on time sequence data, system logs and server history fault information which are acquired in a training stage and correspond to operation indexes during operation of the server;
The system log comprises one or more of an operation log, an event log and an SDS log, and the server history fault information comprises the history fault record of the server or other servers and the server running conditions in a designated time period before and after the fault occurrence time;
And/or, the fault prediction unit 503 is further configured to:
when the server is detected to be faulty, acquiring operation indexes acquired in a set time period before the moment of the fault, and training the LSTM network by taking the operation indexes acquired in the set time period as training data.
This concludes the description of the server failure prediction apparatus in fig. 5.
Correspondingly, the embodiment of the application also provides a computer readable storage medium, and a plurality of computer instructions are stored on the computer readable storage medium, and when the computer instructions are executed, the method disclosed by the example of the application can be realized.
By way of example, the above-described computer-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions and data. For example, the computer readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state disk, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the present application.

Claims (10)

1. A method for predicting server failure, the method comprising:
obtaining time sequence data corresponding to an operation index of the server during operation, wherein the operation index is used for describing the operation condition of the server;
Dividing the time sequence data into a plurality of time windows aiming at the time sequence data corresponding to the operation index, and obtaining a target dynamic characteristic for describing the operation index under each time window, wherein the target dynamic characteristic is used for identifying local fluctuation and short-term trend of the operation index under each time window;
Predicting the fault probability of the server based on the target dynamic characteristics corresponding to each time window under the operation index, and outputting a server fault alarm indication under the condition that the fault probability is larger than a preset threshold.
2. The method according to claim 1, wherein the obtaining time-series data corresponding to an operation index of the server at the time of operation includes:
Acquiring operation indexes acquired in a set time period before the current moment by a baseboard management controller BMC deployed in the server;
The operation index comprises one or more of CPU temperature, fan rotating speed, memory utilization rate and power supply power.
3. The method of claim 1, wherein the target dynamic characteristics include CNN characteristics, or at least one of statistical characteristics and frequency domain characteristics, and CNN characteristics, wherein the statistical characteristics are used for identifying overall trends and distribution conditions of the operation index, the frequency domain characteristics are used for identifying potential periodic fluctuation conditions of the operation index, and wherein the CNN characteristics are determined by:
For each time window under the operation index, carrying out standardized processing on time sequence data included in the time window to obtain preprocessed time sequence data;
And inputting the preprocessed time sequence data into a trained convolutional neural network CNN to obtain CNN characteristics corresponding to the time window.
4. The method of claim 1, wherein predicting the failure probability of the server based on the target dynamic characteristics corresponding to each time window under the operation index comprises:
Inputting target dynamic characteristics corresponding to each time window under the operation index into a trained long-short-term memory network LSTM to obtain LSTM network output, wherein the LSTM network output is used for representing long-term dependency relationship between time sequence data corresponding to the operation index;
And inputting the LSTM network output to the deployed fully connected layer to obtain the fault probability of the server.
5. The method of claim 4, wherein the LSTM network is trained based on time series data, system logs, and server history fault information corresponding to an operation index of the server during operation obtained in a training phase;
The system log comprises one or more of an operation log, an event log and an SDS log, and the server history fault information comprises the history fault record of the server or other servers and the running condition of the server in a designated time period before and after the fault occurrence time.
6. The method according to claim 4, characterized in that the method further comprises:
When the server is detected to be faulty, acquiring operation indexes acquired in a set time period before the moment of the fault, and training the LSTM network by taking the operation indexes acquired in the set time period as training data.
7. A server failure prediction apparatus, comprising:
The data acquisition unit is used for acquiring time sequence data corresponding to the running index of the server in running, wherein the running index is used for describing the running condition of the server;
The feature acquisition unit is used for dividing the time sequence data corresponding to the operation index into a plurality of time windows and obtaining target dynamic features for describing the operation index under each time window, wherein the target dynamic features are used for identifying local fluctuation and short-term trend of the operation index under each time window, the target dynamic features comprise CNN features, or at least one of statistical features and frequency domain features, and the CNN features, the statistical features are used for identifying the overall trend and distribution condition of the operation index, and the frequency domain features are used for identifying the potential periodic fluctuation condition of the operation index;
The fault prediction unit is used for predicting the fault probability of the server based on the target dynamic characteristics corresponding to each time window under the operation index, and outputting a server fault alarm indication when the fault probability is greater than a preset threshold.
8. The apparatus according to claim 7, wherein the data obtaining unit is specifically configured to:
Acquiring operation indexes acquired in a set time period before the current moment by a baseboard management controller BMC deployed in the server;
the operation indexes comprise one or more of CPU temperature, fan rotating speed, memory utilization rate and power supply power;
And/or the target dynamic feature comprises a CNN feature, or at least one of a statistical feature and a frequency domain feature, and the CNN feature, wherein the statistical feature is used for identifying the overall trend and distribution situation of the operation index, the frequency domain feature is used for identifying the potential periodic fluctuation situation of the operation index, and the feature obtaining unit is specifically used for:
For each time window under the operation index, carrying out standardized processing on time sequence data included in the time window to obtain preprocessed time sequence data;
inputting the preprocessed time sequence data into a trained convolutional neural network CNN to obtain CNN characteristics corresponding to the time window;
And/or, the fault prediction unit is specifically configured to:
Inputting target dynamic characteristics corresponding to each time window under the operation index into a trained long-short-term memory network LSTM to obtain LSTM network output, wherein the LSTM network output is used for representing long-term dependency relationship between time sequence data corresponding to the operation index;
Inputting the LSTM network output to the deployed full-connection layer to obtain the fault probability of the server;
And/or the LSTM network is obtained by training based on time sequence data, system logs and server history fault information which are acquired in a training stage and correspond to the operation indexes during the operation of the server;
the system log comprises one or more of an operation log, an event log and an SDS log, and the server history fault information comprises the history fault records of the server or other servers and the running conditions of the servers in a designated time period before and after the fault occurrence time;
And/or, the fault prediction unit is further configured to:
When the server is detected to be faulty, acquiring operation indexes acquired in a set time period before the moment of the fault, and training the LSTM network by taking the operation indexes acquired in the set time period as training data.
9. An electronic device, comprising:
Processor, and
A computer readable storage medium having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the method of any of claims 1 to 6.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the method of any of claims 1 to 6.
CN202411724931.6A 2024-11-27 2024-11-27 Server failure prediction method, device, electronic device and storage medium Withdrawn CN119603175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411724931.6A CN119603175A (en) 2024-11-27 2024-11-27 Server failure prediction method, device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN119603175A true CN119603175A (en) 2025-03-11

Family

ID=94835483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411724931.6A Withdrawn CN119603175A (en) 2024-11-27 2024-11-27 Server failure prediction method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN119603175A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120386700A (en) * 2025-06-26 2025-07-29 苏州元脑智能科技有限公司 Server power supply detection method and device, storage medium and electronic equipment
CN120950339A (en) * 2025-10-16 2025-11-14 广州大一互联网络科技有限公司 A server fault monitoring method and system based on data analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109814523A (en) * 2018-12-04 2019-05-28 合肥工业大学 Fault diagnosis method based on CNN-LSTM deep learning method and multi-attribute time series data
CN110348513A (en) * 2019-07-10 2019-10-18 北京华电天仁电力控制技术有限公司 A kind of Wind turbines failure prediction method based on deep learning
CN118550791A (en) * 2024-05-10 2024-08-27 北京声智科技有限公司 Cloud server operation and maintenance management method, device, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120386700A (en) * 2025-06-26 2025-07-29 苏州元脑智能科技有限公司 Server power supply detection method and device, storage medium and electronic equipment
CN120386700B (en) * 2025-06-26 2025-08-26 苏州元脑智能科技有限公司 Method and device for detecting server power supply, storage medium and electronic equipment
CN120950339A (en) * 2025-10-16 2025-11-14 广州大一互联网络科技有限公司 A server fault monitoring method and system based on data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20250311