TWI725543B - A method for predicting system anomalies caused by rare events

Info

Publication number: TWI725543B
Application number: TW108132773A
Authority: TW
Inventors: 林彥呈; 林風; 葉恩豪; 鄭枸澺; 陳人豪; 鄭建威
Original assignee: 中華電信股份有限公司
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2021-04-21
Also published as: TW202111535A

Abstract

The present invention is a method for predicting rare events that cause system anomalies. It predicts whether a system anomaly is about to occur by combining machine learning with the various data collected from the system. The data pre-processing and the data feature selection and processing proposed in the present invention give the original monitoring data a better representation. Without changing the existing system, this data processing further improves the accuracy of the prediction model and reduces the model training time. In addition, combining machine learning with a complete model training mechanism greatly reduces human participation, allows the whole mechanism to be automated, and identifies the most likely cause of a system anomaly, so that anomalies can be prevented in advance quickly and effectively.

Description

Methods of predicting rare events that cause system anomalies

The present invention relates to technology for predicting system anomalies and, more particularly, to a method for predicting rare events that cause system anomalies.

In a typical operational system, the probability of a system anomaly is very low, but once an anomaly occurs it usually persists for a long time before it can be resolved. Observation of long-term system monitoring data and analysis of anomaly characteristics show that such anomalies last a long time yet almost never happen. In other words, the difficulty of this kind of prediction lies in highly imbalanced data: anomalous samples are extremely scarce compared with normal ones. Predicting system anomalies from such a small amount of data is difficult for system administrators, so data collection and consolidation, feature extraction, and data-driven prediction become important tasks for them.

In summary, a technique that predicts system anomalies quickly and effectively, especially extremely rare ones, and that lets the entire prediction mechanism run fully automatically without changing the existing system and with reduced human involvement, is a goal that those skilled in the art are eager to achieve.

The object of the present invention is to provide a method for predicting system anomalies. The technique has three main aspects: data pre-processing, data feature selection and processing, and machine learning. The system monitoring data is integrated, converted to numerical form, and normalized in order to build a model for predicting system anomalies, so that system anomalies can be predicted successfully.

To achieve the above and other objects, the present invention provides a method for predicting rare events that cause system anomalies, comprising: collecting original monitoring data from each server and integrating the original monitoring data with time as the primary key, so that the original monitoring data is merged into a single record; filling in missing parts of the original monitoring data; converting text in the original monitoring data to numerical values, so that all of the original monitoring data is numerical; normalizing all values in the original monitoring data; taking the continuous data within a period of time in the original monitoring data as a basis, clustering and discretizing the original monitoring data and counting the total amount of continuous data within that period, the counts serving as parameters that represent the original monitoring data, and then training on the parameters that represent the original monitoring data to build an anomaly classification prediction model; and using the model characteristics of the anomaly classification prediction model to obtain at least one parameter that causes anomalies, so that the possible cause of a system anomaly is predicted from the at least one parameter and the parameters highly correlated with it.

In the foregoing method, the step of filling in missing parts of the original monitoring data includes filling the missing parts with a specific value, and filling continuous values with the mean or a specific value; and the step of converting text in the original monitoring data to numerical values replaces with an index number the missing parts that were originally filled with the specific value.

In the foregoing method, the step of converting text in the original monitoring data to numerical values assigns the different kinds of text content consecutive numerical values starting from 0.

In the foregoing method, the step of clustering the original monitoring data uses the k-means clustering algorithm.

In the foregoing method, the step of training on the parameters that represent the original monitoring data includes using the random forest algorithm, setting a range for the hyperparameters of the random forest algorithm, and performing a grid search that searches the parameters at fixed intervals within that range.

In the foregoing method, the step of training on the parameters that represent the original monitoring data to build the anomaly classification prediction model further includes using under-sampling to reduce the amount of original monitoring data.

In the foregoing method, the step of training on the parameters that represent the original monitoring data includes using the random forest algorithm such that, for consecutively anomalous data, only the first time point is judged anomalous and the remaining time points are treated as normal.

In the foregoing method, the step of using the model characteristics of the anomaly classification prediction model to obtain at least one parameter that causes anomalies refers to performing feature selection with the information gain (Information Gain) algorithm, so that the parameters with the greater influence on system anomalies serve as the at least one parameter.

In the foregoing method, the step of using the model characteristics of the anomaly classification prediction model to obtain at least one parameter that causes anomalies further includes using a linear gate unit to control how strongly the previous prediction influences the result, and using a residual network to feed that result back into the result currently being predicted.

In summary, the present invention combines the various data collected from the system with machine learning to predict whether a system anomaly is about to occur. The data pre-processing and the data feature selection and processing described herein give the original data a better representation, place no special requirements on the system environment, and require no modification of the existing system. The data processing further improves the accuracy of the anomaly classification prediction model and reduces the model training time. By combining machine learning with a complete model training mechanism, human involvement is greatly reduced, the whole mechanism can run fully automatically, and the most likely cause of a system anomaly is identified, so that anomalies can be prevented in advance. The method of the present invention therefore improves on existing mechanisms: even when system anomalies are extremely rare, it can still predict most of the errors quickly and effectively.

S11~S17: Steps

S30~S39: Processes

Figure 1 is a step diagram of the method of the present invention for predicting rare events that cause system anomalies.

Figure 2 is a schematic diagram illustrating the classification problem considered in the data pre-processing of the present invention.

Figure 3 is a flowchart of the present invention for predicting rare events that cause system anomalies.

Figure 4 is a schematic diagram illustrating how the present invention determines whether the model needs to be updated.

Figure 5 is a schematic diagram illustrating the data selection and consolidation of the present invention.

Figure 6 is a schematic diagram illustrating the missing-data handling of the present invention.

Figure 7 is a schematic diagram illustrating the CPU usage of a system in the normal state and shortly before an anomaly, in an embodiment of the present invention.

Figure 8 is a schematic diagram illustrating the discretization of continuous data in the present invention.

Figures 9A and 9B are schematic diagrams illustrating the processing of prediction target data in the present invention.

Figure 10 is a schematic diagram illustrating the construction of a residual network using a linear gate unit in the present invention.

The technical content of the present invention is described below by way of specific embodiments. Those skilled in the art can readily understand the advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other, different embodiments.

Figure 1 is a step diagram of the method of the present invention for predicting rare events that cause system anomalies. In step S11, original monitoring data is collected from each server and integrated with time as the primary key, so that it is merged into a single record. In this step a data validity period can be set so that the prediction model is trained only on data within that period. For example, a status table records the prediction results of the past hour and a threshold for retraining the model is defined, so that it can be determined quickly whether the current prediction model still predicts correctly. The timestamps in all of the data are therefore aligned, and the original monitoring data that the monitoring units collected from the servers is then joined using the timestamp as the reference key, so that it is merged into a single record.
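The timestamp-keyed integration of step S11 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the server names, metric names, and the `merge_by_timestamp` helper are hypothetical, and timestamps are assumed to be already aligned.

```python
from collections import defaultdict

def merge_by_timestamp(sources):
    """Join per-server records into one row per timestamp.

    `sources` maps a (hypothetical) server name to a list of
    (timestamp, {metric: value}) pairs.
    """
    merged = defaultdict(dict)
    for server, records in sources.items():
        for timestamp, metrics in records:
            for name, value in metrics.items():
                # Prefix with the server name so columns do not collide.
                merged[timestamp][f"{server}.{name}"] = value
    # One row per timestamp, in time order; a metric that was not
    # reported at a timestamp is simply absent from that row.
    return [{"timestamp": ts, **merged[ts]} for ts in sorted(merged)]

sources = {
    "web01": [(1, {"cpu": 0.42}), (2, {"cpu": 0.91})],
    "db01": [(1, {"connections": 17})],
}
rows = merge_by_timestamp(sources)
```

Here `db01` reported nothing at timestamp 2, so the second row has a gap, which is the situation that step S12 addresses.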

In step S12, the missing parts of the original monitoring data are filled in. Specifically, data is sometimes missing from the original monitoring data for unforeseeable reasons, so this step fills in the missing parts.

In one embodiment, filling in the missing parts of the original monitoring data includes filling the missing parts with a specific value (such as -1), and filling continuous values with the mean or a specific value (such as -1). The value 0 is not used as the fill value mainly so that filled entries remain distinguishable from actual readings of 0.
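A minimal sketch of the two filling strategies described for step S12, assuming missing readings are represented as `None`. The `fill_missing` helper and its `strategy` argument are hypothetical names used only for illustration.

```python
from statistics import mean

MISSING = -1  # sentinel from the text; distinct from a real reading of 0

def fill_missing(values, strategy="sentinel"):
    """Replace None entries with -1, or with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean" and observed:
        fill = mean(observed)
    else:
        fill = MISSING
    return [fill if v is None else v for v in values]

cpu = [0.2, None, 0.4, 0.0]
filled_sentinel = fill_missing(cpu)               # gap becomes -1
filled_mean = fill_missing(cpu, strategy="mean")  # gap becomes the mean of 0.2, 0.4, 0.0
```

The genuine reading of 0.0 at the end of the series stays untouched in both cases, which is why the sentinel must differ from 0.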

In step S13, text in the original monitoring data is converted to numerical values, so that all of the original monitoring data is numerical. The prediction model must be built from numerical data. Original monitoring data that is already numerical needs no processing and can be used directly; original monitoring data that is text must be represented numerically, so the text in the original monitoring data is converted to numerical values.

In one embodiment, converting text in the original monitoring data to numerical values assigns the different kinds of text content consecutive numerical values starting from 0. In other words, the text data is re-encoded as the sequence 0, 1, 2, ... according to its distinct content and thereby becomes discrete numerical data.

In addition, if the missing parts were filled with a specific value (such as -1) in step S12, then in this step the entries originally filled with that specific value are replaced with an index number.
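The numbering scheme of step S13 can be sketched as below. Codes are assigned in order of first appearance, which is an assumption, since the text only requires a sequence starting from 0. Treating the -1 fill value as one more category that receives its own index is one possible reading of the replacement described above; `encode_categories` is a hypothetical helper.

```python
def encode_categories(values):
    """Map each distinct value to an integer, numbering from 0 in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes

# -1 marks a gap that was filled in step S12; it receives its own index here.
status = ["OK", -1, "WARN", "OK"]
encoded, codes = encode_categories(status)
```

After encoding, the filled gap is no longer the literal -1 but an ordinary category index alongside the indices for "OK" and "WARN".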

In step S14, all values in the original monitoring data are normalized. Each monitored item has its own numerical range. If the data were used to train the anomaly classification prediction model without processing, items with larger values would have a greater influence on the model, even though every item is expected to be equally important, and the model would fail to achieve the expected result. This step therefore normalizes all values.

For normalization, each value can, for example, have the mean subtracted and then be divided by the standard deviation, so that the values are adjusted to the interval 0 to 1. For discrete data recorded as index numbers, one-hot encoding can be used instead: for example, if the original data takes the three values 0, 1 and 2, three-dimensional vectors (1,0,0), (0,1,0) and (0,0,1) are used to record them.
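The following sketch illustrates the normalization options of step S14 under stated assumptions. Mean and standard-deviation scaling centers the values on 0 with unit spread and does not by itself bound them to 0 to 1; min-max scaling, included here as an assumption rather than something the description names, is one common way to obtain that interval. All function names are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(code, n_categories):
    """Encode an index number as a vector with a single 1."""
    return tuple(1 if i == code else 0 for i in range(n_categories))

z = standardize([10.0, 20.0, 30.0])
scaled = min_max([10.0, 20.0, 30.0])
vectors = [one_hot(c, 3) for c in (0, 1, 2)]
```

The `vectors` list reproduces the three-value example from the text, with index 0 mapped to (1,0,0), index 1 to (0,1,0), and index 2 to (0,0,1).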

In step S15, taking the continuous data within a period of time in the original monitoring data as a basis, the original monitoring data is clustered and discretized, and the total amount of continuous data within that period is counted to serve as parameters that represent the original monitoring data. In this step the original monitoring data is clustered with the k-means clustering algorithm; the number of data points falling in each class is then counted, and these counts are used as the representation of the data.
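A minimal sketch of step S15 for a single metric: one-dimensional k-means assigns each reading to a level, and the per-level counts within each time window become the features. The deterministic initialization, the non-overlapping windows, and all names are assumptions made for illustration.

```python
from collections import Counter

def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's algorithm on scalars; returns the cluster centers in ascending order."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

def discretize(values, centers):
    """Replace each value with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: abs(v - centers[i])) for v in values]

def window_counts(levels, k, window):
    """Count how many samples of each level fall in each consecutive window."""
    rows = []
    for start in range(0, len(levels) - window + 1, window):
        counts = Counter(levels[start:start + window])
        rows.append([counts.get(level, 0) for level in range(k)])
    return rows

cpu = [5, 8, 7, 50, 55, 95, 90, 92, 6, 52]
centers = kmeans_1d(cpu, k=3)
levels = discretize(cpu, centers)                  # 0 = low, 1 = medium, 2 = high
features = window_counts(levels, k=3, window=5)    # one count vector per window
```

Each row of `features` summarizes a window as (low count, medium count, high count), which is a much more compact description than the raw readings.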

In step S16, the parameters that represent the original monitoring data are trained on to build the anomaly classification prediction model. When the model is built, the training data is first balanced, for example by under-sampling the more numerous events to reduce the amount of data, after which the anomaly classification prediction model is built.
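The balancing step can be sketched as random under-sampling of the majority (normal) class. The `ratio` parameter, the fixed seed, and the `under_sample` helper are hypothetical; the description does not state a particular sampling ratio.

```python
import random

def under_sample(rows, labels, majority_label=0, ratio=1.0, seed=0):
    """Keep every minority row and a random subset of the majority rows.

    `ratio` is the number of majority rows kept per minority row.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    keep = min(len(majority), int(len(minority) * ratio))
    # Sorting the kept indices preserves the original time order.
    chosen = sorted(minority + rng.sample(majority, keep))
    return [rows[i] for i in chosen], [labels[i] for i in chosen]

labels = [0] * 98 + [1] * 2          # 98 normal samples, 2 anomalies
rows = [[i] for i in range(100)]
x_bal, y_bal = under_sample(rows, labels)
```

With the default ratio the 98-to-2 imbalance becomes 2-to-2, at the cost of discarding most normal samples.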

In one embodiment, the random forest algorithm can be used to train on the parameters of the original monitoring data. When the anomaly classification prediction model is built, a range is set for the hyperparameters of the random forest algorithm and a grid search is performed, searching the parameters at fixed intervals within that range. Validation data is then used to select the better models. The final prediction is made jointly by these better models, and the final result is obtained by majority vote combined with a pass threshold.
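The grid enumeration and the ensemble decision can be sketched without any particular machine learning library. The hyperparameter names and ranges below are hypothetical examples, the model fitting itself is omitted, and interpreting "majority vote plus threshold" as "the share of positive votes must exceed a threshold" is an assumption.

```python
from itertools import product

def grid(ranges):
    """Enumerate every combination of hyperparameter values at fixed steps.

    `ranges` maps a hyperparameter name to (start, stop, step), stop inclusive.
    """
    names = list(ranges)
    axes = [range(start, stop + 1, step) for start, stop, step in ranges.values()]
    return [dict(zip(names, combo)) for combo in product(*axes)]

def vote(predictions, threshold=0.5):
    """Flag an anomaly when the share of models voting 1 exceeds the threshold."""
    return int(sum(predictions) / len(predictions) > threshold)

# Hypothetical search ranges for two random forest hyperparameters.
candidates = grid({"n_estimators": (100, 300, 100), "max_depth": (4, 8, 2)})

# Votes from three selected models for one time point (illustrative values).
decision = vote([1, 1, 0], threshold=0.5)
```

Raising the threshold above 0.5 trades missed anomalies for fewer false alarms, which matches the later remark about filtering predictions to reach an acceptable false alarm rate.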

In step S17, the model characteristics of the anomaly classification prediction model are used to obtain at least one important parameter that causes anomalies, so that the possible cause of a system anomaly is predicted from the important parameter or parameters and the parameters highly correlated with them. Specifically, feature selection is performed with the information gain (Information Gain, InfoGain) algorithm, and the parameters with the greater influence on system anomalies are taken as the important parameters. That is, the InfoGain algorithm is used to interpret the features of the trained anomaly classification prediction model and thereby find which parameters in the model have the greater influence on the causes of system anomalies.
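Information gain for a discrete feature is the entropy of the labels minus the expected entropy of the labels after splitting on that feature. The sketch below computes it directly on toy data; the feature names and values are hypothetical, and how the invention applies the measure to a trained model is not detailed in the text.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

labels = [0, 0, 1, 1]
cpu_level = [0, 0, 2, 2]     # separates the two classes perfectly
disk_level = [0, 1, 0, 1]    # carries no information about the label
gains = {
    "cpu_level": information_gain(cpu_level, labels),
    "disk_level": information_gain(disk_level, labels),
}
ranking = sorted(gains.items(), key=lambda item: item[1], reverse=True)
```

Features at the top of `ranking` are the candidates for "parameters with the greater influence" in the sense used above.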

In one embodiment, when the random forest algorithm is used to train on the parameters that represent the original monitoring data, only the first time point of consecutively anomalous data is judged anomalous and the remaining time points are treated as normal. Predicting an anomaly after it has already occurred is of little value, and a system anomaly usually persists for some time once it occurs. The present invention is therefore designed to predict the time point just before a run of consecutive anomalies begins, so for consecutive anomalies only the first time point is judged anomalous and the remaining time points are treated as normal.
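The labeling rule for consecutive anomalies can be sketched as follows, assuming the raw data carries a 0/1 anomaly flag per time point. `onset_labels` is a hypothetical helper name.

```python
def onset_labels(anomaly_flags):
    """Keep a 1 only at the first time point of each run of consecutive anomalies."""
    labels = []
    previous = 0
    for flag in anomaly_flags:
        labels.append(1 if flag and not previous else 0)
        previous = flag
    return labels

flags = [0, 0, 1, 1, 1, 0, 1, 1]
onsets = onset_labels(flags)
```

The two anomalous runs in `flags` each contribute a single positive label, at their first time point.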

Step S17 further includes using a linear gate unit to control how strongly the previous prediction influences the result, and using a residual network to feed that result back into the result currently being predicted. Because neighbouring points in time-series data are related, taking the result before the current time point into account helps the current prediction. The present invention therefore uses a linear gate unit to control how strongly the previous prediction influences the result, and uses the residual network to feed that result back into the current prediction.
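The description does not give equations for the linear gate unit or the residual connection, so the following is only one possible reading, written as a sketch: a linear function of the previous combined result is added to the current model score through a skip connection, and the sum is clipped to [0, 1]. The `weight` and `bias` values and the function name are hypothetical and would be learned or tuned in practice.

```python
def gated_residual(scores, weight=0.5, bias=0.0):
    """Combine each model score with a linearly gated copy of the previous result.

    Hypothetical form: combined_t = clip(score_t + weight * combined_{t-1} + bias, 0, 1).
    """
    outputs = []
    previous = 0.0
    for score in scores:
        gate = weight * previous + bias                  # linear gate on the previous result
        combined = min(1.0, max(0.0, score + gate))      # residual (skip) addition, then clip
        outputs.append(combined)
        previous = combined
    return outputs

outputs = gated_residual([0.2, 0.6, 0.5])
```

With `weight` set to 0 the previous prediction is ignored entirely, and larger weights let an earlier high score raise the following ones.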

In addition, the method of the present invention for predicting rare events that cause system anomalies further includes retraining the anomaly classification prediction model. After monitoring data has been collected for some time, the environment and the programs deployed on the system may have changed considerably, so the old anomaly classification prediction model may no longer be able to predict system anomalies in the new environment. The method can then return to step S11 and train a new prediction model.

Specifically, the technology used in the present invention can be divided into three technical aspects: data pre-processing, data feature selection and processing, and machine learning. The difficulty in predicting extremely rare system anomalies lies in highly imbalanced data, that is, anomalous samples are extremely scarce. The technical issues considered in each of the three aspects, and the processing approach chosen for each, are described below.

The first aspect is data pre-processing, which corresponds to steps S11-S14 in Figure 1. The following points must be considered: (1) The problem to be predicted is a classification problem. The many kinds of system anomaly could be treated as a single class, since the ultimate goal is to predict that an anomaly will occur, so the problem can be framed as either binary or multi-class classification; experiments show that framing it as a multi-class problem catches the various errors more effectively. (2) Similarly, anomaly detection can be treated either as a supervised classification problem or as unsupervised learning. Experimental results show that supervised learning works better than unsupervised learning, so the problem can be treated as a supervised multi-class classification problem, as shown for example in Figure 2. (3) System anomalies often persist for some time, and predicting an anomaly that has already occurred is of little value, so the present invention predicts only the very beginning of a run of consecutive anomalies. (4) Data often has missing parts. Missing data is commonly filled with the value from the previous time point or by linear interpolation between the neighbouring time points, but experiments showed that neither approach is suitable here, because the data is often missing precisely because some other server has crashed. The present invention therefore fills these gaps with a new state.

As can be seen from the above, the main goal of the data pre-processing is to process the original monitoring data so that both the data and the prediction target have a better representation. Unlike the data presented in ordinary reports, the input to a machine learning model (such as the anomaly classification prediction model of the present invention) must meet different requirements. The most common requirement is standardization of the values. Without it, data from different monitored items would influence the machine learning model to different degrees simply because of their original value distributions, which would violate the assumption that every item is equally important to the prediction target.

The second aspect is data feature selection and processing, which corresponds to step S15 in Figure 1. The following points must be considered: (1) Raw data features are often very generic figures, such as the maximum number of connections allowed and the number of connections in use; such features can be combined into a connection utilization rate (utilization = connections in use / maximum connections allowed). (2) The values of a data feature can be clustered to produce a new data feature; for example, CPU usage can be divided into low, medium and high levels. This procedure is discretization. (3) Statistics must then be computed over the discretized data. For example, given the CPU usage level (low, medium or high) in each of 10 time slots, a share of high levels above a certain threshold indicates that an error is likely to occur. (4) Selecting the important data features is also indispensable. Similar data features must first be merged to find the true dimensionality of the data, for example with principal component analysis (PCA), a statistical method that simplifies data sets, before the data features are selected further.
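Points (1) and (3) above can be sketched as two small feature functions. The level encoding (0 = low, 1 = medium, 2 = high), the 0.6 threshold, and the sample figures are hypothetical values chosen only for illustration.

```python
def utilization(in_use, max_allowed):
    """Connection utilization = connections in use / maximum connections allowed."""
    return in_use / max_allowed if max_allowed else 0.0

def high_share(levels, high_level=2):
    """Fraction of time slots in a window whose level is 'high'."""
    return sum(1 for level in levels if level == high_level) / len(levels)

rate = utilization(in_use=180, max_allowed=200)

# Ten time slots of CPU usage levels: 0 = low, 1 = medium, 2 = high.
window = [0, 1, 2, 2, 2, 1, 2, 2, 2, 2]
at_risk = high_share(window) > 0.6   # 0.6 is a hypothetical threshold
```

Seven of the ten slots in `window` are high, so this window would be flagged under the assumed threshold.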

As can be seen from the above, the main goal of data feature selection and processing is to learn the distribution and characteristics of the original monitoring data by observing it, and then to design simpler ways of describing it. Compared with training a machine learning prediction model directly on the original data, the simplified description reduces what the prediction model has to learn, because the more important parts are already emphasized. The prediction model can then learn more of the causes that may affect the prediction label, which improves the accuracy of the final predictions. The handling of the prediction label is also important. In practice a system may show brief anomalies caused by a reset of system rules, the rollout of a new program version, a machine restart, and so on, but these conditions should not be classified as system anomalies. The prediction label can therefore be restricted to cases in which consecutive anomalous events occur, so that conditions that should count as normal are not predicted as anomalies and do not degrade the accuracy of the prediction model.

The third aspect is machine learning, which corresponds to steps S16-S17 in Figure 1. Catching errors in highly imbalanced data is usually hard to achieve with a single machine learning model. The present invention selects the random forest algorithm, sweeps the parameters by grid search to find the best ones, and uses time-series cross-validation to assess the quality of the model while respecting chronological order. In addition, because the previous prediction is highly correlated with the current one, a residual network is introduced, and a simple linear gate unit brings the previous prediction result into the analysis for the current prediction. A threshold is then applied to the model output to filter out unsuitable predictions, so that most errors can be detected at an acceptable false alarm rate. Finally, the model results are combined with algorithms such as information gain (Information Gain) and Gini impurity (Gini Impurity) to determine the most likely causes of an anomaly (data features), which serve as a reference for preventing system anomalies.
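Time-series cross-validation differs from ordinary k-fold validation in that every training set precedes its validation set in time. A minimal sketch with an expanding training window follows; the fold-size rule and the `time_series_splits` name are assumptions, since the description does not specify the split scheme.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_indices, validation_indices), with training always preceding validation."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = fold * i
        validation_end = min(train_end + fold, n_samples)
        yield list(range(train_end)), list(range(train_end, validation_end))

splits = list(time_series_splits(n_samples=8, n_splits=3))
```

Each successive split trains on a longer prefix of the data and validates on the block that immediately follows it, so no future sample leaks into training.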

As can be seen from the above, the main goal of the machine learning aspect is to choose a suitable algorithm and suitable hyperparameters. The present invention selects the random forest algorithm and chooses its hyperparameters by grid search, repeatedly checking against validation data until the better hyperparameter combinations are selected. The prediction models built from these better combinations then predict jointly, and the final result is obtained by majority vote combined with a pass threshold. A residual network is added around this model so that the previous prediction result can be used in the current analysis. Furthermore, beyond predicting system anomalies, the present invention uses the information gain (InfoGain) algorithm to interpret the features of the trained prediction model and find which parameters in the model have the greater influence on the causes of system anomalies.

In addition, because the system environment and the programs deployed on it change over time, an old prediction model may, after a long period, no longer be able to predict system anomalies in the new environment. In other embodiments, therefore, a new prediction model can be retrained at regular intervals, for example every six months, using newly collected monitoring data.

In summary, the data pre-processing aspect includes data selection and merging, missing data handling, data digitization, and data normalization. The data feature selection and processing aspect includes discretization of continuous data and processing of the prediction target data. The machine learning aspect includes building the model with an algorithm, determining the anomalous items, and retraining the prediction model.

Integrating the foregoing procedures yields the flowchart of Figure 3, in which the data pre-processing aspect covers processes S30-S33, the data feature selection and processing aspect covers processes S34-S35, and the machine learning aspect covers processes S36-S39. More specifically, process S30 is data selection and merging, process S31 is missing data handling, process S32 is data digitization, process S33 is data normalization, process S34 is discretization of continuous data, process S35 is processing of the prediction target data, process S36 is building the model with the random forest algorithm, process S37 is building the residual network with a gated linear unit, process S38 is determination of the anomalous items, and process S39 is retraining of the prediction model. These are explained further below with a concrete embodiment.

The present invention collects data from the system to train the machine learning model, but data that is too old is not suitable for training: system problems are gradually repaired, and errors from long ago are unlikely to recur. The data therefore has a limited useful life, and so does the model trained on it, which means the model must be retrained continually with new data. Generally, when the system has not changed substantially, collected data remains useful for about half a year. After a substantial system change, the types of system anomaly may be completely different, so data may need to be collected and the model trained again to maintain accuracy.

Process S30 is data selection and merging. The present invention uses a simple decision model to judge whether the prediction model needs updating, as shown in Figure 4 and with reference to Table 1 below. A small table records the predictions made within one hour. Each entry is indexed by the minute value divided by 5 (integer part); for example, the prediction made at minute 18 belongs to entry 3. Because the model predicts the state 10 minutes ahead, whether an entry was correct can be determined 10 minutes later: T is entered if the prediction was correct and F otherwise. When the table is first filled, every field holds the empty value ψ.

In this example the prediction model is started during minutes 0~4, so filling begins at field 0. The simple state transition diagram of Figure 4 describes the filling logic: the state of each field is independent, but the values S and V are shared. Each time the table is filled, the sum (S) and the count of non-empty fields (V) are updated as well. When V=12 the table contains no empty values, and the model accuracy S/V can be computed. When the accuracy falls below the configured threshold (50% in this embodiment), retraining the model should be considered. Because this check is invoked continuously, this approach keeps the additional system overhead to a minimum.
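The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `AccuracyTable`, its methods, and the incremental update of S and V are assumptions, since the state transition diagram of Figure 4 is not reproduced in the text.

```python
from typing import List, Optional


class AccuracyTable:
    """One-hour table of 12 five-minute slots (hypothetical helper)."""

    def __init__(self, slots: int = 12, threshold: float = 0.5) -> None:
        # None plays the role of the empty value ψ.
        self.cells: List[Optional[bool]] = [None] * slots
        self.s = 0  # S: number of slots currently holding T (correct)
        self.v = 0  # V: number of non-empty slots
        self.threshold = threshold

    def record(self, minute: int, correct: bool) -> None:
        """Store the verdict for the prediction made at `minute`."""
        idx = (minute % 60) // 5  # e.g. minute 18 -> slot 3
        old = self.cells[idx]
        if old is None:
            self.v += 1          # slot was empty: one more non-empty value
        elif old:
            self.s -= 1          # overwrite a previous T
        if correct:
            self.s += 1
        self.cells[idx] = correct

    def should_retrain(self) -> bool:
        """Accuracy S/V is only judged once the table has no empty slot."""
        if self.v < len(self.cells):
            return False
        return self.s / self.v < self.threshold
```

Because S and V are updated in constant time on every write, the check itself is a single division, which is consistent with the stated goal of minimal overhead.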

Next, a data validity period is set (half a year in this embodiment), and the prediction model is trained only on data within that period. The timestamps in all data are then aligned downward in units of 5 minutes; for example, everything from 0 min 0 s to 4 min 59 s is assigned the timestamp 0 min 0 s. Finally, the monitoring data collected from each monitoring unit are joined on the timestamp as the key and integrated into a single record. As shown in Figure 5, the data include the web server, the database server, and low-level system data (where OK means no problem and Failure means failure), all integrated with time as the primary key.
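A minimal sketch of the downward alignment and the timestamp join, using only the standard library. The function names and the `source.field` naming of merged columns are illustrative assumptions.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Record = Tuple[datetime, Dict[str, object]]


def floor_to_5min(ts: datetime) -> datetime:
    """Align a timestamp downward to the nearest 5-minute boundary."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def merge_sources(sources: Dict[str, List[Record]]) -> Dict[datetime, Dict[str, object]]:
    """Join records from several monitoring units on the aligned timestamp."""
    merged: Dict[datetime, Dict[str, object]] = {}
    for name, records in sources.items():
        for ts, fields in records:
            row = merged.setdefault(floor_to_5min(ts), {})
            for key, value in fields.items():
                row[f"{name}.{key}"] = value  # prefix avoids column clashes
    return merged
```

If two records from the same source fall into the same 5-minute slot, this sketch keeps the later one; the source does not say how such collisions are resolved.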

Process S31 is missing data handling. Monitoring data often contain many missing values, which can be handled in several ways: filling with the value 0 directly; for continuous values (for example, memory usage), filling with the mean; or filling with a value that never otherwise occurs, such as -1. Because data are sometimes missing because the server has crashed, a never-occurring value should be filled in that case to mark this abnormal state, as shown in Figure 6.

Specifically, data may also go missing because the system is too busy to respond to the probe program, in which case filling with the mean by interpolation is more appropriate. To decide between filling with the mean and filling with a never-occurring value, the present invention fills values by sampling. The data from one month are analyzed, and the numbers of missing values that should be filled with -1 and with the mean are counted as N and A respectively. The ratio for filling with -1 is then N/(N+A), and the ratio for filling with the mean is A/(N+A). Assuming that missing data collected in the future follow the same distribution as the missing data in that month, each time a missing value is encountered a random draw is made, using these ratios as probabilities, to decide whether to fill with -1 or with the mean.
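The sampling rule can be sketched as below. The function name `impute` and its parameters are hypothetical; `n_sentinel` and `n_mean` stand for the counts N and A from the one-month analysis.

```python
import random
from typing import Optional


def impute(value: Optional[float], mean: float, n_sentinel: int, n_mean: int,
           rng: random.Random, sentinel: float = -1.0) -> float:
    """Fill a missing value with -1 or with the mean, chosen at random.

    The sentinel is chosen with probability N/(N+A) and the mean with
    probability A/(N+A), where N = n_sentinel and A = n_mean.
    """
    if value is not None:
        return value  # nothing to fill
    p_sentinel = n_sentinel / (n_sentinel + n_mean)
    return sentinel if rng.random() < p_sentinel else mean
```

For instance, with N=1 and A=3, roughly a quarter of the missing values would be filled with -1 and the rest with the mean.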

Process S32 is data digitization. The data contain two main kinds of feature: continuous values, such as CPU usage and memory usage, and discrete values, such as the server state. As explained earlier, training the model requires all inputs to be numeric, so states are converted to numbers, for example OK to 0 and Failure to 1. In addition, discrete values in this embodiment are never negative, so the value -1 that was filled in for missing data is replaced with the last code, which is 2 in this example, as shown in Figure 7.
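A small sketch of this mapping, assuming the category order ("OK", "Failure") is fixed in advance; the function name `encode_states` is illustrative.

```python
from typing import List, Sequence, Union


def encode_states(values: Sequence[Union[str, int]],
                  categories: Sequence[str] = ("OK", "Failure"),
                  missing: int = -1) -> List[int]:
    """Map state strings to 0, 1, ... and the missing marker to the last code."""
    code = {name: i for i, name in enumerate(categories)}
    missing_code = len(categories)  # 2 when there are two known states
    return [missing_code if v == missing else code[v] for v in values]
```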

Process S33 is data normalization. Values in the raw data often differ widely in scale. For weight-based algorithms such as the support vector machine (SVM) or neural network (NN), normalizing the values, either by scaling them to between 0 and 1 (max-min normalization) or by subtracting the mean and dividing by the standard deviation (zero-mean normalization), makes the model easier to train. Processing the data with this standard normalization procedure gives every feature the same initial importance, leaving the model to determine the importance of each feature.
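Both normalizations can be written in a few lines. This sketch uses the population standard deviation and returns zeros for a constant column; both choices are assumptions, as the source does not specify them.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly into [0, 1] (max-min normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def zero_mean_normalize(values: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```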

Weight-based algorithms treat numeric values as ordered, but some discrete values have no inherent order, such as the server state that has just been converted from text to a number. To treat every state fairly, one-hot encoding can be applied, so the three original values 0, 1, and 2 become the three-dimensional vectors [1,0,0], [0,1,0], and [0,0,1]. This gives each of the three states its own independent weight during training. By contrast, decision-tree algorithms (decision tree, random forest) can use the equality operator, and boosting-type algorithms (AdaBoost, gradient boosting) are more flexible to train, so for these the above steps are not strictly necessary and can be applied selectively.
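As a brief illustration of the encoding described above (the function name `one_hot` is an assumption):

```python
from typing import List


def one_hot(code: int, num_classes: int) -> List[int]:
    """Return a vector with a single 1 at position `code`."""
    if not 0 <= code < num_classes:
        raise ValueError("code out of range")
    return [1 if i == code else 0 for i in range(num_classes)]
```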

Process S34 is discretization of continuous data. This process discretizes some continuous features, such as CPU usage. First, the continuous CPU usage values are clustered with the k-means algorithm and then discretized. As shown in Figure 8, k-means is configured to divide the data into three classes, so the CPU usage data are grouped into 0%~30%, 30%~70%, and 70%~100%. Dividing the data this way means that if three centroids are sought for the distribution of the CPU data, the data split into these three classes. It also means that these three classes suffice to describe the characteristics of the original data without using the original, more complex values, which reduces the training burden of the machine learning model. Next, the number of times each of the three classes occurred over a longer preceding period (for example, from one hour ago to now) is counted, and these counts become three features. As shown in the figure, the three values for the record in which the system stays normal are [7,1,2], and those for the record in which the system is about to become anomalous are [3,2,5].
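The clustering and counting steps can be sketched with a small one-dimensional k-means. The quantile-based initialization, the function names, and the sample values in the usage note are assumptions made for illustration; a library implementation of k-means would normally be used instead.

```python
from typing import List, Sequence


def nearest(value: float, centroids: Sequence[float]) -> int:
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values: Sequence[float], k: int, iters: int = 50) -> List[float]:
    """Lloyd's algorithm in one dimension with quantile initialization."""
    s = sorted(values)
    centroids = [float(s[(2 * i + 1) * len(s) // (2 * k)]) for i in range(k)]
    for _ in range(iters):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centroids)].append(v)
        # An empty cluster keeps its previous centroid.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if new == centroids:
            break
        centroids = new
    return centroids


def fingerprint(window: Sequence[float], centroids: Sequence[float]) -> List[int]:
    """Count how often each cluster occurs in a time window."""
    counts = [0] * len(centroids)
    for v in window:
        counts[nearest(v, centroids)] += 1
    return counts
```

With centroids near 15, 50, and 85, a window of ten readings such as 5, 10, 12, 20, 8, 15, 25, 50, 90, 95 yields the count vector [7, 1, 2], matching the form of the example in the text.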

Recording the data this way is like turning the system metrics into a fingerprint. Different system states tend to produce different kinds of fingerprint, while the same or similar system states produce fingerprints that are similar or even identical. This method has the following advantages: (1) states from further in the past can be taken into account; (2) the number of features does not grow much; (3) the frequency with which a state occurs is considered rather than the time at which it occurs. These special features therefore help the model capture information from further back in time and so increase prediction accuracy.

Process S35 is processing of the prediction target data. This process extracts the future state (for example, 10 minutes ahead) of the server to be predicted and uses it as the prediction label (predict_status), as shown in Figure 9A. System anomalies usually persist for a continuous period, so once an anomaly has already occurred, predicting the anomalies that follow has little value; the anomaly should be predicted before the continuous run begins. In other words, only the accuracy of predicting the first anomaly matters, so for a run of consecutive anomalous records only the first record is kept, as shown in Figure 9B. If normal records lie between runs of consecutive anomalies, the proportion of the normal records relative to the anomalous runs before and after them is calculated. If the normal records account for 10% or less of the total of these three groups, the whole span can be treated as a single continuous event.

The data processing of process S35 is executed in two stages. The first stage scans all the data to determine which anomalies are to be treated as continuous anomalies; only in the second stage are records actually removed. Note that in the first stage two or more continuous-anomaly intervals may overlap, for example anomalous-normal-anomalous-normal-anomalous. In that case the first three groups may be treated as one continuous anomaly and the last three groups as another. Because the middle anomalous run appears in both intervals, all five groups are treated as the same continuous anomaly and are reduced together in the second stage.
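The two stages can be sketched as follows. This is one possible reading of the description, with labels encoded as 1 for anomalous and 0 for normal. The function names are hypothetical, and the sketch assumes that the 10% test is applied to each pair of adjacent original runs and that every record of a merged event except the first is removed.

```python
from typing import List, Sequence, Tuple

Run = Tuple[int, int]  # half-open index interval [start, end)


def anomaly_runs(labels: Sequence[int]) -> List[Run]:
    """Find maximal runs of consecutive anomalous labels."""
    runs, start = [], None
    for i, y in enumerate(labels):
        if y == 1 and start is None:
            start = i
        elif y == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


def merge_runs(runs: Sequence[Run], max_normal_ratio: float = 0.10) -> List[Run]:
    """Stage 1: chain adjacent runs whose normal gap is at most 10% of the span."""
    if not runs:
        return []
    merged, prev = [runs[0]], runs[0]
    for cur in runs[1:]:
        gap = cur[0] - prev[1]
        total = (prev[1] - prev[0]) + gap + (cur[1] - cur[0])
        if gap / total <= max_normal_ratio:
            merged[-1] = (merged[-1][0], cur[1])  # overlapping intervals chain
        else:
            merged.append(cur)
        prev = cur
    return merged


def keep_first_of_each_event(labels: Sequence[int]) -> List[int]:
    """Stage 2: return the indices kept after reducing each event to its first record."""
    drop = set()
    for start, end in merge_runs(anomaly_runs(labels)):
        drop.update(range(start + 1, end))
    return [i for i in range(len(labels)) if i not in drop]
```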

Process S36 is building the model with the random forest algorithm. This process trains a random forest model on the data. To handle the imbalance caused by the extreme rarity of anomalies, the class_weight parameter must be set to balanced, and under-sampling is applied during training, that is, the excess normal data are reduced so that the model can concentrate on learning the anomalous data. Grid search is used to sweep all the important random forest parameters over specified ranges. For example, n_estimators ranges from 1 to 20 in steps of 1, giving 20 values for this parameter. All parameter combinations are searched, roughly tens of thousands in total, after which the good models are selected, usually no more than 200. These models are then used together for prediction and their results are combined as an ensemble by threshold voting. This resembles majority voting, except that an input is predicted as anomalous only when the number of models predicting an anomaly exceeds the threshold.
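The grid enumeration and the voting rule can be sketched without any machine learning library. The function names are hypothetical, `max_depth` is included only as an illustrative second parameter, and the comparison `>=` is an assumption because the source wording about the threshold is ambiguous.

```python
from itertools import product
from typing import Dict, Iterable, List, Sequence


def parameter_grid(ranges: Dict[str, Iterable]) -> List[Dict[str, object]]:
    """Enumerate every combination of the given parameter ranges."""
    names = list(ranges)
    return [dict(zip(names, combo))
            for combo in product(*(list(ranges[n]) for n in names))]


def threshold_vote(votes: Sequence[int], threshold: int) -> int:
    """Predict an anomaly (1) when at least `threshold` models vote for it.

    Each vote is 1 (anomalous) or 0 (normal), one per selected model.
    """
    return 1 if sum(votes) >= threshold else 0
```

Setting the threshold below half the number of models makes the ensemble more sensitive than a plain majority vote, at the cost of more false alarms.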

Process S37 is building the residual network with a gated linear unit. Because consecutive records in time-series data are correlated, results from before the current time point help the current prediction. The present invention therefore proposes a residual network built with a gated linear unit to feed past prediction results back into the result currently being predicted, as shown in Figure 10. Specifically, the residual network first concatenates the performance parameters (X) used by the main prediction model with the previous prediction result (Y') to form a vector of dimension (X+Y'). A simple linear transformation with (X+Y')*2 parameters then converts this vector into a two-dimensional vector (a,b). Finally, the residual is computed as R=a*σ(b), where σ(x) is the sigmoid function 1/(1+exp(-x)), which maps its input into the range (0,1), so σ(b) controls how much of a becomes the final residual. The final current prediction Y_final is obtained by adding the main model's prediction Y to the residual R. The (X+Y')*2 parameters of the residual network are likewise adjusted by the backpropagation algorithm using the loss value computed by the loss function.
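The forward computation of the gated residual can be sketched as below, treating Y' as a single scalar and omitting bias terms, in line with the stated parameter count. The function names and the weight vectors `w_a` and `w_b` are illustrative; training by backpropagation is not shown.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual(x: Sequence[float], y_prev: float,
                   w_a: Sequence[float], w_b: Sequence[float]) -> float:
    """Compute R = a * sigmoid(b) from the concatenated vector [X, Y']."""
    z = list(x) + [y_prev]                     # concatenate X and Y'
    a = sum(w * v for w, v in zip(w_a, z))     # first output of the linear map
    b = sum(w * v for w, v in zip(w_b, z))     # second output (the gate input)
    return a * sigmoid(b)


def final_prediction(y_main: float, x: Sequence[float], y_prev: float,
                     w_a: Sequence[float], w_b: Sequence[float]) -> float:
    """Y_final = Y + R."""
    return y_main + gated_residual(x, y_prev, w_a, w_b)
```

When the gate input b is strongly negative, σ(b) approaches 0 and the previous prediction contributes almost nothing; when b is strongly positive, nearly all of a passes through.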

Process S38 is determination of the anomalous items. This process applies the Information Gain (InfoGain) algorithm to the trained model for feature selection, picking out the more important features, that is, the factors most likely to cause system anomalies. When the model predicts that an anomaly is likely to occur soon, these factors can be checked first, so that the anomaly is prevented before it occurs. In addition, this way of analyzing the prediction model can automatically identify the monitoring items that most affect system anomalies and then establish rules automatically.
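As a minimal sketch of the underlying quantity, the information gain of a single discrete feature with respect to the anomaly label can be computed as follows. How the patent applies it to the trained model is not detailed in the text, so this only illustrates the standard definition; the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

Ranking features by this value in descending order gives a candidate list of the monitoring items to inspect first.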

Process S39 is retraining of the prediction model. As stated at the outset, when enough time has passed or the system has changed substantially, processes S30 through S37 must be repeated: the old data are removed, new data are added, and the model is retrained.

In summary, the method proposed by the present invention for predicting rare events that cause system anomalies can detect system anomalies at an acceptable false alarm rate even when such anomalies are extremely rare, and can make predictions without any change to the existing system. After a possible error has been predicted, the results of the machine learning algorithm are used to identify its most likely causes, thereby preventing the error from occurring.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the present invention. The scope of protection of the present invention should therefore be as set out in the claims below.

S11~S17‧‧‧Steps

Claims

A method for predicting rare events that cause system anomalies, comprising: collecting original monitoring data from each server, and integrating the original monitoring data into a single data set using time as the key; filling in the missing parts of the original monitoring data; digitizing the text parts of the original monitoring data by numbering the different kinds of text content sequentially starting from 0, so that the original monitoring data consist entirely of numerical data; normalizing all values in the original monitoring data; grouping and discretizing the original monitoring data on the basis of the continuous data within a period of time, and counting the totals of the continuous data within the period of time as parameters representing the original monitoring data; training on the parameters representing the original monitoring data to establish an anomaly classification prediction model; and using model characteristics of the anomaly classification prediction model to obtain at least one parameter that causes an anomaly, and predicting the possible cause of the system anomaly from the at least one parameter and the parameters highly correlated with it.

The method for predicting rare events that cause system anomalies according to claim 1, wherein the step of filling in the missing parts of the original monitoring data includes filling the missing parts with a specific value, and filling continuous values with the average or with the specific value.

The method for predicting rare events that cause system anomalies according to claim 2, wherein the step of digitizing the text parts of the original monitoring data includes replacing, with a number, the specific value used to fill the missing parts.

The method for predicting rare events that cause system anomalies according to claim 1, wherein the step of grouping the original monitoring data uses the k-means clustering algorithm to group the original monitoring data.

The method for predicting rare events that cause system anomalies according to claim 1, wherein the step of training on the parameters representing the original monitoring data includes using a random forest algorithm, setting a range for the hyperparameters of the random forest algorithm, and then using grid search to search the parameters within the range at a fixed interval.

The method for predicting rare events that cause system anomalies according to claim 1, wherein the step of training on the parameters representing the original monitoring data to establish an anomaly classification prediction model includes using under-sampling to reduce the amount of the original monitoring data.

A method for predicting rare events that cause system anomalies, comprising: collecting original monitoring data from each server, and integrating the original monitoring data into a single data set using time as the key; filling in the missing parts of the original monitoring data; digitizing the text parts of the original monitoring data so that the original monitoring data become numerical data; normalizing all values in the original monitoring data; grouping and discretizing the original monitoring data on the basis of the continuous data within a period of time, and counting the totals of the continuous data within the period of time as parameters representing the original monitoring data; training on the parameters representing the original monitoring data to establish an anomaly classification prediction model; and using model characteristics of the anomaly classification prediction model to obtain at least one parameter that causes an anomaly, and predicting the possible cause of the system anomaly from the at least one parameter and the parameters highly correlated with it, wherein the step of training on the parameters representing the original monitoring data includes using a random forest algorithm and, for consecutive anomalous data, treating only the first time point as anomalous and the remaining time points as normal.

The method for predicting rare events that cause system anomalies according to claim 1 or 7, wherein the step of using model characteristics of the anomaly classification prediction model to obtain at least one parameter that causes an anomaly uses the Information Gain algorithm to perform feature selection, and takes the parameters with the greater influence on the system anomaly as the at least one parameter.

The method for predicting rare events that cause system anomalies according to claim 1 or 7, wherein the step of using model characteristics of the anomaly classification prediction model to obtain at least one parameter that causes an anomaly includes using a gated linear unit to control the degree to which the previous prediction data influence the result, and using a residual network to feed that result back into the result currently being predicted.