TWI789003B - Service anomaly detection and alerting method, apparatus using the same, storage media for storing the same, and computer software program for generating service anomaly alert - Google Patents
- Publication number
- TWI789003B (TW110133749A)
- Authority
- TW
- Taiwan
- Prior art keywords
- monitored
- monitoring
- abnormal
- group
- systems
Description
The present invention relates to a detection and alerting method for analyzing whether a service of a monitored system is abnormal, and in particular to an intelligent service anomaly detection and alerting method and to devices and systems that use the method.
In a typical enterprise, online services, whether internal or external, usually involve several to dozens of systems and hundreds of software and hardware modules running on servers. Every day (or even every minute or every second) these may record a very large number of system monitoring values or a huge volume of log files. When a service or system anomaly, an information security attack, or a similar event occurs, system administrators or operations and maintenance (O&M) personnel must interpret (or inspect) the monitoring values and logs to find the cause of the anomaly and eliminate it. In traditional information technology (IT) operations, system administrators define rules for abnormal events based on past experience. However, as IT infrastructure and cloud services spread and expand, system architectures and operating environments become more complex; incorrect or overly complex rules often trigger large numbers of false alerts, which exhaust system administrators and may even cause a serious threat to be overlooked through negligence. In addition, traditional O&M personnel can often only deal with problems passively, after an abnormal event or signal has occurred.
In recent years, O&M that relied on human judgment has not only progressed to automated monitoring; many O&M methods, enterprises, and services have also introduced artificial intelligence (AI) and machine learning into IT infrastructure and O&M management. For example, historical system load monitoring values (such as, but not limited to, central processing unit (CPU) usage or memory load) are used with machine learning to train a normal system load curve and tolerance values. When a real-time monitoring value later deviates from the tolerance, a system alert can be triggered, which shortens O&M and response time.
Other techniques use AI for correlation analysis of abnormal events: algorithms, text analysis, and similar techniques are applied to historical logs to cluster seemingly unrelated events and uncover their hidden correlations. For example, if it is found that seemingly unrelated conditions such as a service web page outage (recorded as a Hypertext Transfer Protocol (HTTP) error response code), excessive CPU usage, and abnormally low page views often occur together, an abnormal-event correlation algorithm can be used to perform root cause analysis on these events, which may make early warning possible in the future and improve IT O&M efficiency.
One prior-art approach fully collects three kinds of heterogeneous data, namely the health status of all devices within an organization, the organization's network traffic volume, and various logs (including information security events), and performs correlation analysis on them, which saves the time spent on manual comparison and lookup and makes use of AI trend algorithms. Then, based on the collected historical data of the logs and/or network traffic volume, a dynamic baseline is learned automatically, and the logs and/or traffic data arriving each minute are continuously compared against it, so as to detect in real time events in which the hit count, the number of traffic packets, or the number of bytes surges abnormally, together with the source Internet Protocol (IP) address (usually the attacker) and the destination IP address (usually the target). This approach does not require thresholds to be set manually one by one, so it makes O&M and information security protection easier.
Another prior-art approach compares abnormal events and then uses machine learning algorithms to sort out events that exhibit similar behavior, automatically detecting whether the latency of a system service has spiked, whether the system error rate has risen, and even whether the network of a public cloud vendor is abnormal. With this approach, users do not need to set alert trigger conditions; the system automatically monitors the platform for abnormal performance events.
The AI-based detection described above collects only the system monitoring data of a single company or service. The domain knowledge of such AI-based O&M cannot be linked to real-world business operations, and it does not take into account that different service systems have identical or different characteristics, so monitoring accuracy still has room for improvement.
In accordance with the purpose of the present invention, an embodiment of the present invention provides a service anomaly detection and alerting method that is executed on a computer device connected to a plurality of monitored systems. The service anomaly/risk detection and alerting method includes: receiving operational data and system data of one of the monitored systems corresponding to a service, and performing data preprocessing on the operational data and the system data of the monitored system to obtain a plurality of state parameters of the monitored system; clustering each of the state parameters of the monitored system to obtain a cluster label corresponding to each of the state parameters of the monitored system; grouping the monitored system according to the cluster labels of its state parameters to obtain a group number corresponding to the monitored system; detecting, according to at least one of the state parameters of the monitored system, whether the monitored system exhibits an abnormal behavior; and, when the abnormal behavior is detected in the monitored system, determining whether more than a specific number of the monitored systems in the group corresponding to the group number of the monitored system also exhibit the abnormal behavior, and generating an alert if no more than the specific number of the monitored systems in that group also exhibit the abnormal behavior.
An embodiment of the present invention further provides a service anomaly detection and alerting method that is similar to the foregoing method, except that the monitored systems are grouped in advance, so the method has no corresponding grouping steps.
An embodiment of the present invention further provides a device for detecting anomalies and issuing alerts, which is configured with a plurality of units to execute the above service anomaly/risk detection and alerting method. An embodiment of the present invention also provides a storage medium for storing a plurality of program codes associated with the above service anomaly detection and alerting method.
An embodiment of the present invention further provides a plurality of computer software programs for determining that an abnormal event has occurred in a monitored system and generating an anomaly alert for the abnormal event.
In summary, the service anomaly/risk detection and alerting method of the embodiments of the present invention, the cloud device using the method, and the storage medium storing the method can accurately detect service anomalies/risks.
For a further understanding of the techniques, means, and effects of the present invention, reference may be made to the following detailed description and the accompanying drawings, through which the objects, features, and concepts of the present invention can be understood thoroughly and concretely. However, the following detailed description and drawings are provided only for reference and to illustrate implementations of the present invention, and are not intended to limit the present invention.
1: anomaly alert system
11: cloud device
121~12N: monitored systems
111: data preprocessing unit
112: individual parameter clustering unit
113: individual grouping unit
114: peer-group comparison unit
115: detection unit
116: alert unit
S31~S45: steps
The accompanying drawings are provided so that a person having ordinary skill in the art to which the present invention pertains may further understand the present invention, and they are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present invention and, together with the description, serve to explain the principles of the present invention.
FIG. 1 is a block diagram of an anomaly alert system that uses the service anomaly detection and alerting method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a cloud device that uses the service anomaly detection and alerting method according to an embodiment of the present invention.
FIG. 3 is a flowchart of the service anomaly detection and alerting method operating in the interpretation mode according to an embodiment of the present invention.
FIG. 4 is a flowchart of the service anomaly detection and alerting method operating in the modeling mode according to an embodiment of the present invention.
Reference will now be made in detail to exemplary embodiments of the present invention, which are illustrated in the accompanying drawings. Wherever possible, the same reference numerals are used in the drawings and the description to refer to the same or similar parts. In addition, the exemplary embodiments are only one way of implementing the design concept of the present invention, and none of the following examples is intended to limit the present invention.
Prior-art approaches that detect service anomalies with AI generally have the following technical problems: (1) They cannot be linked to real-world business operations. For example, traffic to a news website often surges during the daily lunch break, and such a surge is normal; if administrators and AI alerting treat the rise in CPU usage and system load during this period as abnormal, false alerts may be issued. (2) General IT monitoring values or historical logs are mostly infrastructure O&M data from a single system or a single enterprise. Different service systems have different characteristics, and even similar systems may produce different monitoring records or logs across enterprises in different industries. If machine learning models are not built for the specific business or service characteristics, the accuracy of preventive alerts is often insufficient. (3) Traditional AI needs a large amount of data to train a model, but service anomalies do not occur often. Using only infrastructure and environment data therefore leaves machine learning without enough abnormal data and labels to train a model; even if a curve of the normal service state is trained, a relatively conservative alert threshold must still be set to cope with unexpected risks.
To solve the above technical problems, the service anomaly detection and alerting method provided by the embodiments of the present invention, and the devices and systems using the method, take both operational data and system data into account when detecting and alerting on service anomalies. The monitored systems that provide services are grouped according to operational data (for example, business operation data and public data such as social media and public news) and system data (for example, system software and hardware information and run logs) in order to determine their service characteristics. To further improve the accuracy of detection and alerting, when an anomaly is detected, the method further determines whether the detected abnormal behavior also occurs frequently within the group sharing the same service characteristics, in which case it is ordinary behavior rather than a true anomaly. In other words, the method determines whether the detected abnormal behavior is the same as or similar to an ordinary behavior that occurs frequently, or an abnormal behavior that occurs infrequently, within the group of the same service characteristics. Therefore, compared with the prior art, the service anomaly detection and alerting method provided by the embodiments of the present invention, and the devices and systems using the method, have the beneficial technical effects of improving system monitoring efficiency and reducing O&M risk.
In several embodiments of the present invention, a service broadly refers to a digitalized service formed by the software and hardware of information equipment, including online commerce systems, enterprise operation systems, and the like. Alternatively, a service may broadly refer to a digitalized interactive information system, such as online purchase transactions for physical or virtual goods, financial transactions, message exchange and publication, or uploading, browsing, and downloading of audio, video, images, and text. A service can be identified and reflected by its actual workload, and service-related data may be structured or unstructured. Such data may be any operational or system data related to the service, for example input, output, and computation data processed by the service or system; configuration, runtime, and monitoring data of the service or system itself; or data derived from the foregoing.
In several embodiments of the present invention, the collected service-related data include operational data and system data. The operational data include at least one of real-time order count (Gross Merchandise Volume, GMV), revenue, number of online users, page views (PV), repeat visitors (RV), visit count (Unique Visitor, UV), number of IP addresses, traffic source, region, device used, user browsing events, operation behavior on the service website or application, and customer service call records. The system data include at least one of system log files, infrastructure operation data, and system metrics. Infrastructure may refer to IT equipment, its software and hardware, or the environment or architecture of that software and hardware. Infrastructure operation data may refer to the energy consumption data, traffic data, data on other resources used, or performance data of the infrastructure that the system uses at run time to provide these services. System metrics may refer to data on the resources used by, or the performance of, the monitored system that provides these services, for example at least one of CPU usage, memory usage, I/O count (Read/Write PS), network outbound/inbound traffic, packet outbound/inbound volume, number of elastically launched machines/clusters, swap count, queries per second (QPS), responses per second (RPS), number of database connections, and machine response time. The present invention is not limited to the above types of operational data and system data.
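Refer first to FIG. 1, which is a block diagram of an anomaly alert system that uses the service anomaly detection and alerting method according to an embodiment of the present invention. The present invention can also be applied to an on-premises system or to a hybrid of on-premises and cloud systems; for example, the cloud device 11 of FIG. 1 may be replaced by an on-premises device, a hybrid cloud device, or the like. The anomaly alert system 1 of this embodiment includes a cloud device 11, and a plurality of monitored systems 121~12N are communicatively connected to the cloud device 11, for example directly or indirectly through wired or wireless connections. The monitored systems 121~12N may include at least one of the following: services provided by a system; workloads carried by a system; devices such as servers that provide services, networks, network-related devices (for example switches and gateways), firewalls, or storage devices; components of such devices, such as CPUs, memory, and I/O ports; and virtual machines, containers, virtual private clouds, databases, software programs, and the like running on such devices. The monitored systems 121~12N may include cloud devices, on-premises devices, terminal devices, and the like. The monitored systems 121~12N may belong to several different service providers, or all to the same service provider, or some of the monitored systems 121~12N may belong to one service provider while others belong to another service provider.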
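The monitored systems 121~12N provide the various services described above, and the service characteristics of these services are not necessarily all the same. The service characteristics of a service may be labeled manually in advance (when the service type is known beforehand, the systems can be grouped directly; for example, monitored systems involved in online shopping services are labeled as one group, while monitored systems involved in online consulting services are labeled as another group), or determined by other supervised machine learning methods. Alternatively, the service characteristics need not be known in advance: after data are collected, state parameters are derived from the collected data and interpreted with an unsupervised machine learning clustering method, where the state parameters include operational and system data such as CPU usage, number of online users, revenue, page views, memory usage, and the number of inputs/outputs. After unsupervised clustering, each cluster can be labeled, and each label can represent the service characteristics of a cluster. Labels may be assigned automatically as ordinal variables, type or categorical variables, index values, unique values, numbers, and so on; for example, labels may be obtained automatically from the clustering model in a programmatic way. A label may also carry a description. For example, a cluster may be labeled or described by the commercial nature of the service as "live streaming", "portal site", "e-commerce platform", "forum", and so on; by load or traffic trend as "evening-intensive", "active on weekends", "winter period", "office-worker pattern", and so on; by technical type or monitored parameters such as I/O trend or connection frequency; or by features of anomalies that the cluster has experienced, such as overload, underload, overclocking, timeout, or spikes. Services that were manually labeled in advance with the same service characteristics may end up in groups of different service characteristics after data-driven grouping, and services that were manually labeled with different service characteristics may end up in the same group. The cloud device 11 receives the operational data and system data provided by the monitored systems 121~12N and uses them to monitor whether the monitored systems 121~12N have service anomalies, thereby achieving service anomaly detection and alerting. The operational data and system data are as described above and are not repeated here.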
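Refer next to FIG. 2, which is a block diagram of a cloud device that uses the service anomaly detection and alerting method according to an embodiment of the present invention. As shown in FIG. 2, the cloud device 11 includes a data preprocessing unit 111, a state parameter clustering unit 112, an individual grouping unit 113, a peer-group comparison unit 114, and a detection unit 115. The functional units 111~115 may be implemented by hardware circuits and the execution of software programs, but the present invention is not limited to any particular implementation of the functional units 111~115. The data preprocessing unit 111 is signal-connected to the state parameter clustering unit 112 and the detection unit 115, the state parameter clustering unit 112 is signal-connected to the individual grouping unit 113, and the individual grouping unit 113 is signal-connected to the peer-group comparison unit 114. The cloud device 11 may further include an alert unit 116 signal-connected to the peer-group comparison unit 114.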
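The data preprocessing unit 111 receives the operational data and system data of each of the monitored systems 121~12N and performs data preprocessing on them. For example, the preprocessing may perform text analysis on system log files to build keyword frequencies at different points in time. The access log file (Access.log) or error log file (Error.log) of a monitored web system in each time period contains keywords such as "emerg", "alert", "err", and "warning" (the keywords may differ between monitored systems). TF-IDF (Term Frequency-Inverse Document Frequency) analysis is performed to obtain term frequencies, which are then used as different state parameters; for example, the frequency or number of memory overflows can be learned by parsing the error log file. The preprocessing may also record infrastructure operation metrics such as CPU usage, memory usage, and I/O count, or anonymize the business operation data output by customer services. The purpose of the preprocessing is to obtain, from the operational data and system data, a plurality of state parameters for each of the monitored systems 121~12N (a state parameter is a quantitative indicator of how the resources of a monitored system are used or how the system performs), for use in the subsequent grouping by service characteristics. In some embodiments, 1000 state parameters are observed with a seven-day week as one observation cycle. If the value of each state parameter is collected once per minute, each state parameter has 7*24*60 = 10080 data points (each data point holding one parameter value).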
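The keyword-frequency step can be illustrated with a minimal sketch. It assumes the log lines have already been bucketed into time windows, and it uses a smoothed IDF formula chosen for illustration, since the description does not specify a TF-IDF variant. The function name `keyword_tfidf` and the sample log lines are hypothetical.

```python
import math
from collections import Counter

# Keywords named in the description; real systems may use different ones.
KEYWORDS = ("emerg", "alert", "err", "warning")

def keyword_tfidf(windows):
    """windows: list of lists of log lines, one inner list per time window.
    Returns one {keyword: tf-idf weight} dict per window."""
    counts = []
    for lines in windows:
        c = Counter()
        for line in lines:
            for token in line.lower().split():
                token = token.strip("[]:,")
                if token in KEYWORDS:
                    c[token] += 1
        counts.append(c)
    n = len(windows)
    # Document frequency: number of windows in which each keyword appears.
    df = {k: sum(1 for c in counts if c[k] > 0) for k in KEYWORDS}
    result = []
    for c in counts:
        total = sum(c.values())
        row = {}
        for k in KEYWORDS:
            tf = c[k] / total if total else 0.0
            idf = math.log((1 + n) / (1 + df[k])) + 1.0  # smoothed IDF (an assumption)
            row[k] = tf * idf
        result.append(row)
    return result

# Hypothetical sample: three time windows of error-log lines.
sample_windows = [["[err] disk full", "[warning] slow response"], ["[err] out of memory"], []]
weights = keyword_tfidf(sample_windows)

# Data points per state parameter for a one-week cycle sampled once per minute.
POINTS_PER_WEEK = 7 * 24 * 60
```

Each per-window weight could then serve as the value of one state parameter at that point in time.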
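The state parameter clustering unit 112 receives the state parameters of each of the monitored systems 121~12N, clusters each state parameter (its parameter values) based on a clustering model, and may assign a corresponding cluster label. For example, the CPU usage of the monitored system 121 is clustered based on the clustering model and given the corresponding cluster label; judging from the clustering result, the cluster labels may represent service characteristics with different CPU usage patterns, such as busier in the evening, busier in the morning, or busier at noon. In several embodiments of the present invention, the clustering model may be updated and trained periodically on real-time data, or trained in advance before use, and it may be trained or updated by unsupervised machine learning (without knowing the types or labeling in advance). For example, the CPU usage of each of the monitored systems 121~12N is obtained periodically, and a clustering model is built with K-means, hierarchical agglomerative clustering (HAC), or density-based clustering (DBSCAN) to cluster the CPU usage and assign cluster labels according to the result. The model may also be trained or updated with supervised learning algorithms that assign samples to known, labeled classes, such as a support vector machine (SVM), the k-nearest neighbors algorithm (KNN), a decision tree, or random forests.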
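As an illustration of clustering one state parameter across monitored systems, the following is a minimal pure-Python sketch of Lloyd's K-means. It assumes each system's CPU usage has been summarized as a short profile (average usage in the morning, at noon, and in the evening); the profile values, the system identifiers, and the choice of the first k points as initial centroids are assumptions made for a small deterministic example, not details from the description.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means on equal-length numeric vectors.
    Initial centroids are the first k points (a simplification for determinism)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical CPU usage profiles: (morning, noon, evening) averages per system.
cpu_profiles = {
    "sys121": [0.20, 0.30, 0.90],
    "sys122": [0.80, 0.30, 0.20],
    "sys123": [0.25, 0.20, 0.85],
    "sys124": [0.90, 0.40, 0.10],
}
cpu_labels = dict(zip(cpu_profiles, kmeans(list(cpu_profiles.values()), k=2)))
```

Here the evening-heavy systems share one cluster label and the morning-heavy systems share the other; a production setup would more likely rely on a library implementation with better initialization.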
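The individual grouping unit 113 groups the monitored systems 121~12N and assigns each of them a corresponding group number (or label, ordinal variable, type variable, index value, unique value, and so on). The grouping may be performed based on a grouping model according to the cluster labels of the state parameters of each of the monitored systems 121~12N. In some embodiments, a monitored system may be regarded as the service it provides or the workload it carries, in which case the individual grouping unit 113 can be said to group those services or workloads. A monitored system may also be regarded as an owner, customer, subscriber, enterprise, user, or service recipient, in which case the individual grouping unit 113 can be said to group those owners, customers, subscribers, enterprises, users, or service recipients. For example, when a single monitored system represents a single user, grouping the monitored systems can be regarded as grouping users; when a single monitored system represents a single service, it can be regarded as grouping services; and when a single monitored system represents a single device or software program, it can be regarded as grouping devices or software programs. For example, based on the grouping model, the cluster labels of CPU usage, number of online users, revenue, and network outbound/inbound traffic of the monitored systems 121~12N are used to decide which groups the services provided by the monitored systems 121~12N belong to. Each group indicates that the services assigned to it share some identical or similar service characteristics, and a corresponding group number, variable, or label may be assigned. Likewise, the grouping model may be trained and updated by unsupervised or supervised machine learning. Without specifying service characteristics, the cluster labels of the state parameters can be used to train an unsupervised model (for example K-means, HAC, or DBSCAN). Alternatively, services with certain known characteristics, such as e-commerce services, enterprise resource planning (ERP) systems, trading systems, or live streaming systems, are labeled in advance, and the cluster labels of the state parameters of the labeled services are used to train a supervised model (for example SVM, KNN, a decision tree, or a random forest); an unknown service can then be grouped with this model to infer its service characteristics.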
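The grouping of monitored systems by their cluster labels can be sketched as follows. This is a simplified stand-in, not one of the algorithms named above: it links systems whose label vectors differ in at most `max_distance` positions (Hamming distance) and gives linked systems the same group number. The label vectors, the threshold, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two label vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def group_systems(label_vectors, max_distance=1):
    """label_vectors: {system_id: tuple of cluster labels, one per state parameter}.
    Systems within max_distance of each other are linked (single linkage via
    union-find); linked systems share a group number starting at 1."""
    ids = list(label_vectors)
    parent = {s: s for s in ids}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if hamming(label_vectors[a], label_vectors[b]) <= max_distance:
                parent[find(a)] = find(b)

    roots = {}
    return {s: roots.setdefault(find(s), len(roots) + 1) for s in ids}

# Hypothetical cluster labels per system:
# (CPU usage, online users, revenue, network traffic)
label_vectors = {
    "sys121": (3, 1, 2, 5),
    "sys122": (3, 1, 2, 4),
    "sys123": (7, 6, 1, 1),
    "sys126": (7, 6, 1, 2),
}
group_numbers = group_systems(label_vectors)
```

With these inputs the first two systems fall into group 1 and the last two into group 2, which mirrors the idea that systems with similar label patterns share service characteristics.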
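The detection unit 115 builds models from the state parameters of each of the monitored systems 121~12N and uses them to detect whether each of the monitored systems 121~12N is abnormal (for example, a current or future state parameter is predicted to exceed a threshold, or a specific abnormal event may occur in the future). The model here may be a time series model, a historical data model, and/or a risk prediction model. The historical data model of each of the monitored systems 121~12N may be built by applying a machine learning algorithm to previously collected state parameters to obtain a time series model of the system under normal conditions; the algorithm may be autoregressive integrated moving average (ARIMA), long short-term memory (LSTM), or random cut forests (RCF). For example, a time series model of the CPU usage of the monitored system 121 can be built, and its predictions indicate whether the CPU usage is currently abnormal or may become abnormal in the future. In some embodiments, if certain abnormal conditions that have already occurred are known, such as system problems, information security incidents, or attacks, the combined values of the state parameters of a monitored system can be annotated with risk values to train a risk prediction model of each of the monitored systems 121~12N at a single point in time; the risk prediction model may be trained with random forests or extreme gradient boosting (XGBoost). For example, in the risk prediction model of the monitored system 121, the risk value of a single record (whose observed parameter values might be CPU usage = 0.5, memory usage = 15 GB, number of inputs/outputs = 2500, and so on) is annotated according to whether an anomaly actually occurred. By annotating every record in the historical data set with a risk value and fitting the data set, the risk prediction model is trained, and it can then predict from new data whether the monitored system 121 is experiencing an anomaly. These risk values can form a risk value sequence, on which a time series model can be trained to predict, for example, the risk value at the next point in time. As explained below, whether an anomaly has occurred can be determined by comparing the risk values predicted by the risk prediction model and the time series model, together with the judgment of the peer-group comparison unit 114.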
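To make the idea of a learned normal curve with a tolerance concrete, here is a minimal sketch that uses a per-slot mean and standard deviation as the baseline. This is a deliberately simple stand-in and not ARIMA, LSTM, or RCF; the sample history, the period, and the parameters `k` and `min_std` are assumptions.

```python
from statistics import mean, pstdev

def fit_baseline(history, period):
    """history: past values sampled at a fixed interval, oldest first.
    period: number of samples per cycle (len(history) must be >= period).
    Returns a list of (mean, std) pairs, one per slot in the cycle."""
    slots = [history[i::period] for i in range(period)]
    return [(mean(s), pstdev(s)) for s in slots]

def is_anomalous(baseline, slot, value, k=3.0, min_std=0.01):
    """Flag a value that deviates from the slot mean by more than k standard
    deviations; min_std avoids a zero tolerance on flat history."""
    mu, sigma = baseline[slot]
    return abs(value - mu) > k * max(sigma, min_std)

# Hypothetical CPU usage history with a two-slot cycle (quiet slot, busy slot).
cpu_history = [0.20, 0.80, 0.25, 0.75, 0.22, 0.78]
cpu_baseline = fit_baseline(cpu_history, period=2)

quiet_slot_spike = is_anomalous(cpu_baseline, slot=0, value=0.90)
busy_slot_normal = is_anomalous(cpu_baseline, slot=1, value=0.79)
```

A value of 0.90 in the normally quiet slot is flagged, while 0.79 in the normally busy slot is not, which is the kind of time-aware tolerance the time series models above are meant to provide.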
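In some embodiments, for a service with certain service characteristics provided by a monitored system, an abnormal behavior detected by the detection unit 115 is not necessarily a true anomaly; within the group sharing the same service characteristics, the detected abnormal behavior may in fact be judged to be ordinary behavior. The peer-group comparison unit 114 therefore determines whether the detected abnormal behavior is abnormal or ordinary among the monitored systems of the group with the same service characteristics (that is, the same group number). For example, suppose the monitored systems 121, 122, 129, and 12N all provide online shopping services. During the Mother's Day sales period, the number of online users of the monitored system 121 is detected to surge and is therefore first judged to be an abnormal behavior. The peer-group comparison unit 114 finds that surges in online users are also detected on the monitored systems 122, 129, and 12N, so it determines that the surge detected on the monitored system 121 is ordinary behavior within the group and that the monitored system 121 is not abnormal. As another example, suppose the monitored systems 123, 126, and 129 are all financial accounting systems. During the tax filing period, the traffic sources of the monitored system 123 increase rapidly, and the risk prediction model flags a risk of attack. However, the peer-group comparison unit 114 detects that the traffic sources of the monitored systems 126 and 129 also increase rapidly, so it does not treat the detected risk as an abnormal event; that is, the monitored system 123 is not considered to be at risk of attack. The cloud device 11 may be signal-connected to an alert unit 116, which may generate an anomaly alert for an abnormal event determined as above and send the anomaly alert to a terminal device (not shown) signal-connected to the cloud device 11 so that the terminal device displays it. In this way, the cloud device 11 avoids sending false alerts to system administrators, allowing service providers to operate more efficiently.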
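Refer to FIG. 3, which is a flowchart of the service anomaly detection and alerting method operating in the interpretation mode according to an embodiment of the present invention. The method may be executed by the cloud device 11 described above. After the models have been trained or updated, the cloud device 11 can operate in the interpretation mode and perform the following steps. First, in step S31, the operational data and system data of each monitored system are received and preprocessed to produce a plurality of state parameters for each monitored system. Next, in step S32, based on the model used to cluster each state parameter into a category, each state parameter of each monitored system is clustered and assigned a cluster label. Then, in step S33, based on the model used to group services by service characteristics, each monitored system is grouped according to the cluster labels of its state parameters and assigned a group number (which may also be regarded as another set of cluster labels).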
In step S34, based on the historical data models and/or risk prediction models of each monitored system, the state parameters of each monitored system are used to detect whether the monitored system exhibits abnormal behavior. If no abnormal behavior is detected, no alert is needed; if abnormal behavior is detected, step S35 is performed. In step S35, it is determined whether the abnormal behavior detected on a monitored system is abnormal or ordinary within the group identified by its group number, that is, whether it is a true anomaly. One way to make this determination is to check whether at least a specific number of the monitored systems with the same group number have also been detected with this abnormal behavior. The specific number may be half, all, or a quarter of the monitored systems with the same group number, or another value such as "1" (one monitored system in the group) or "2" (two monitored systems in the group). In short, if more than the specific number of monitored systems in the group with the same group number are detected with the same or similar abnormal behavior, the detected behavior is not a true anomaly within the group but ordinary behavior, so no alert is issued. If no more than the specific number of monitored systems in the group are detected with the same abnormal behavior, step S36 is performed. In step S36, an anomaly alert is sent to a terminal device electrically connected to the cloud device 11, and the terminal device displays the anomaly alert to notify the corresponding system administrator or O&M personnel.
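The decision made in steps S35 and S36 can be sketched as a small function. It assumes the specific number is an absolute count and that the flagged system itself is not counted among its peers; the description leaves both choices open. The identifiers and the sample data are hypothetical.

```python
def should_alert(system_id, groups, anomalous, threshold):
    """Return True when system_id is flagged as abnormal and no more than
    `threshold` OTHER systems in the same group are flagged as well.

    groups: {system_id: group number}
    anomalous: set of system ids currently flagged by the detection step
    threshold: the 'specific number' (an absolute count in this sketch)
    """
    if system_id not in anomalous:
        return False  # nothing detected, so nothing to alert on
    group = groups[system_id]
    peers_flagged = sum(
        1
        for other, g in groups.items()
        if g == group and other != system_id and other in anomalous
    )
    # Many flagged peers means ordinary behavior for this group: suppress.
    return peers_flagged <= threshold

# Hypothetical snapshot: group 1 is online shopping, group 2 is accounting.
groups = {"sys121": 1, "sys122": 1, "sys129": 1, "sys123": 2, "sys126": 2}
anomalous = {"sys121", "sys122", "sys129", "sys123"}

alert_121 = should_alert("sys121", groups, anomalous, threshold=1)
alert_123 = should_alert("sys123", groups, anomalous, threshold=1)
alert_126 = should_alert("sys126", groups, anomalous, threshold=1)
```

In this snapshot the surge on sys121 is suppressed because two peers show it too, whereas sys123 raises an alert because no peer in its group is flagged.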
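Next, refer to FIG. 4, which is a flowchart of the service anomaly detection and alerting method operating in the modeling mode according to an embodiment of the present invention. The models used in the interpretation mode are built by the cloud device 11 in the modeling mode, whose steps are as follows. In step S41, the operational data and system data of a plurality of monitored systems are received and preprocessed to produce a plurality of state parameters for each of the monitored systems. Next, in step S42, for each kind of state parameter, a model for clustering that state parameter is built from the values of the same kind of state parameter across the monitored systems. Then, in step S43, for each monitored system, each of its state parameters is clustered and assigned a cluster label. Then, in step S44, a grouping model for the service characteristics of the monitored systems is built from the cluster labels of the state parameters of the monitored systems. Finally, in step S45, for each state parameter of each monitored system, a model for detecting anomalies is built from the sequence of values of that state parameter over a period of time.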
An embodiment of the present invention further provides a storage medium. The storage medium is non-volatile, for example a flash memory, an optical disc, or a hard disk, and stores multiple program codes that can be read by a computer device, so that the computer device reading them performs the steps of the service anomaly detection and alerting method shown in FIG. 3 and FIG. 4.
Based on the above, a practical example follows. In some embodiments, an alert is issued when the CPU utilization of a monitored system stays above 80% for more than 5 minutes, or when a single external IP address connects more than 50 times within 1 minute. However, thresholds usually depend on the business and services the system actually runs, and as the business operates the alert thresholds also need to be adjusted for different time periods rather than being fixed values. Moreover, as time passes and rules keep being added, it becomes hard for the system administrator to sort out the relationships or logic among the rules, and differences between systems that provide different services readily cause noise-induced false alarms.
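For contrast with the group-based method, the two fixed-threshold rules mentioned above can be written as a short sketch. The function names, the one-sample-per-minute assumption for CPU data, and the use of timestamps in seconds are illustrative assumptions, not details from the specification.

```python
from typing import Sequence


def cpu_rule_fires(cpu_per_minute: Sequence[float], limit: float = 0.80, minutes: int = 5) -> bool:
    """True if CPU utilization exceeds `limit` for more than `minutes` consecutive samples.

    Assumes one sample per minute.
    """
    run = 0
    for value in cpu_per_minute:
        run = run + 1 if value > limit else 0
        if run > minutes:
            return True
    return False


def ip_rule_fires(connection_seconds: Sequence[float], limit: int = 50, window: float = 60.0) -> bool:
    """True if one IP's connections number more than `limit` within any `window` seconds."""
    times = sorted(connection_seconds)
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans less than `window` seconds.
        while t - times[start] >= window:
            start += 1
        if end - start + 1 > limit:
            return True
    return False
```

Both rules hard-code their limits, which is exactly the maintenance burden the paragraph describes: every service and every time period would need its own values.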
The service anomaly detection and alerting method used by the system or device of the embodiments of the present invention can therefore solve the above technical problem. Take the state parameter "number of online users" as an example. A system running a certain type of e-commerce typically has more users online between 20:00 and 24:00, whereas for a certain type of news website the peaks typically fall at 7:00 to 9:00, 12:00 to 13:00, and 18:00 to 21:00. In the state-parameter clustering stage, each kind of state parameter can be clustered with an unsupervised learning method. For example, K-means can be used to summarize the online-user patterns of the monitored systems into 10 patterns, determine which pattern each monitored system's online-user pattern belongs to, and assign the corresponding cluster label (for example, ordinal labels 1, 2, 3, and so on). Similarly, the pattern to which each monitored system's CPU utilization belongs can be obtained with the same clustering method. Once the cluster labels of every state parameter of every monitored system are obtained, the monitored systems themselves can be clustered. Tables 1 to 3 below give concrete examples of state-parameter clustering and monitored-system clustering.
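The state-parameter clustering stage can be illustrated with a small, dependency-free K-means over 24-hour online-user profiles. This is a toy sketch: the profiles, the system names, the deterministic farthest-point initialization, and the use of 2 clusters instead of the 10 patterns mentioned above are all assumptions chosen to keep the example short.

```python
from typing import List, Sequence


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points: List[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Plain K-means; returns a 1-based cluster label for each point."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(_dist2(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: _dist2(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [lab + 1 for lab in labels]


def profile(peak_hours: set, scale: float = 1.0) -> List[float]:
    """Hypothetical normalized online-user count for each hour of the day."""
    return [scale * (1.0 if hour in peak_hours else 0.1) for hour in range(24)]


evening_peak = set(range(20, 24))                     # e-commerce style: 20:00 to 24:00
news_peak = {7, 8, 12, 18, 19, 20}                    # news style: 7-9, 12-13, 18-21
systems = {
    "shop_a": profile(evening_peak),
    "shop_b": profile(evening_peak, scale=0.9),
    "news_a": profile(news_peak),
    "news_b": profile(news_peak, scale=1.1),
}
cluster_label = dict(zip(systems, kmeans_labels(list(systems.values()), k=2)))
```

The two shop-like systems end up with one label and the two news-like systems with the other; repeating the procedure for each state parameter yields one cluster label per parameter per system.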
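As in the example of the tables above, the cluster labels of the state parameters of monitored system #1 and monitored system #2 are similar, so the two systems are given the same group number, indicating that they are classified into a group with the same service characteristics. Future observed values of the number of online users and CPU utilization of monitored system #1 can then be compared with the values predicted by the historical data models and/or risk prediction models for those parameters, to detect whether the number of online users or the CPU utilization behaves abnormally. If any state parameter behaves abnormally, it is checked whether most of the monitored systems in the group with group number 1 (that is, the other monitored systems in the group) have also been detected with abnormal behavior in the number of online users or CPU utilization. If not, the abnormal behavior detected in the number of online users or CPU utilization of monitored system #1 is an abnormal event within the group, and the system administrator or operations and maintenance personnel of monitored system #1 must be alerted. Furthermore, when the abnormal behavior detected for monitored system #1 is deemed an abnormal event, the abnormal data or the historical data preceding the anomaly can be used to train another historical data model or risk prediction model, so that the monitored systems in the same group can use that model to detect anomalies.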
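How similar label vectors lead to a shared group number can be sketched as below. The specification uses a clustering model for this second stage; the greedy rule here (join the first group whose leader differs in at most `max_diff` labels) is only a simplified stand-in, and the label values are hypothetical because the tables are not reproduced in this text.

```python
from typing import Dict, List, Tuple


def assign_group_numbers(labels: Dict[str, Tuple[int, ...]], max_diff: int = 1) -> Dict[str, int]:
    """Give systems with similar per-parameter cluster labels the same group number."""
    leaders: List[Tuple[int, ...]] = []  # first label vector seen for each group
    groups: Dict[str, int] = {}
    for name, vector in labels.items():
        for number, leader in enumerate(leaders, start=1):
            differing = sum(a != b for a, b in zip(vector, leader))
            if differing <= max_diff:
                groups[name] = number
                break
        else:
            # No existing group is close enough, so open a new one.
            leaders.append(vector)
            groups[name] = len(leaders)
    return groups


# Hypothetical labels: (online users, CPU utilization, memory usage).
state_labels = {
    "system_1": (3, 1, 2),
    "system_2": (3, 1, 4),
    "system_3": (7, 5, 9),
}
group_number = assign_group_numbers(state_labels)
```

Here `system_1` and `system_2` differ in a single label and share group number 1, while `system_3` opens group number 2.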
Based on the above, in some embodiments the operational data and system data corresponding to the monitored systems are still received and preprocessed to generate multiple state parameters for each monitored system. However, the monitored systems are grouped in advance, either manually or by an unsupervised clustering method. For each monitored system, the combined parameter values of its state parameters are labeled with a risk value according to whether actual abnormal behavior (for example, hacker intrusion, service interruption, or system crash) occurred in the past. In one embodiment, if a monitored system is known to have had a CPU utilization of 0.6, a memory usage of 20 GB, and an input/output count of 3000 when it was hacked, that record can be labeled with a risk value of 1. Other records, in which no anomaly occurred, are labeled with a risk value of 0. A machine learning algorithm (for example, XGBoost) is then used to build (train) the risk prediction model of the monitored system from the risk-labeled data; this model can predict the risk value at a future time point. Incidentally, the risk prediction model of a monitored system may be periodic, may change after a long period of time, or may change after specific events (such as software or hardware version updates, new modules added to the system, business model changes, or changes in market demand). For example, the present invention can apply time series analysis, such as seasonality and periodicity analysis, and judge the risk trend from the time series patterns learned from historical data to find how the risk value is changing, and on that basis retrain the model or adjust its hyperparameters to obtain a model suited to the new period. After the risk prediction model for a monitored system has been built, when new state parameters (parameter values) of the monitored system are obtained, a predicted risk value can be computed from them with the risk prediction model, and its range can be bounded between 0 and 1. If the predicted risk value exceeds a specific value, for example 0.5, this is equivalent to detecting abnormal behavior. It is then checked whether more than a specific number of other monitored systems in the same group also have predicted risk values above the specific value. If not, the detected abnormal behavior is treated as an anomaly (abnormal event); otherwise, it is treated as non-anomalous (non-abnormal event). In this practical example, the predicted risk value may result from the influence of many parameters whose interrelationships are too complex to capture with hand-written rules, but the implicit relationships among parameters and over time can be captured by training the model.
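The labeling and thresholding steps around the risk prediction model can be sketched as follows. Model training (for example with XGBoost) is deliberately left out; the field names, the time keys, and the predicted risk values are hypothetical and stand in for the output of a trained model.

```python
from typing import Dict, List, Set


def label_risk(records: List[dict], incident_times: Set[str]) -> List[dict]:
    """Attach risk 1 to records captured during a known incident, otherwise 0."""
    return [{**r, "risk": 1 if r["time"] in incident_times else 0} for r in records]


def abnormal_flags(predicted_risk: Dict[str, float], cutoff: float = 0.5) -> Dict[str, bool]:
    """A predicted risk above the cutoff counts as detected abnormal behavior."""
    return {name: risk > cutoff for name, risk in predicted_risk.items()}


# Training data: the first record matches the hacked-system example in the text.
records = [
    {"time": "t1", "cpu": 0.6, "memory_gb": 20, "io_count": 3000},
    {"time": "t2", "cpu": 0.3, "memory_gb": 8, "io_count": 900},
]
labeled = label_risk(records, incident_times={"t1"})

# Hypothetical model outputs for three systems in the same group.
predicted = {"sys1": 0.83, "sys2": 0.12, "sys3": 0.07}
flags = abnormal_flags(predicted)
peers_above_cutoff = sum(flag for name, flag in flags.items() if name != "sys1")
```

In this made-up group only `sys1` exceeds the cutoff and no peer does, so under the rule above its behavior would be treated as an abnormal event.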
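Incidentally, suppose the abnormal behavior detected in state parameters such as the number of online users or CPU utilization of monitored system #1 is caused by a specific campaign (for example, monitored system #1 provides an online shopping service and holds a specific online event such as an anniversary sale) or by a specific event (such as a change in CPU utilization policy). In that case the other monitored systems with group number 1 are not detected with abnormal behavior in the number of online users or CPU utilization, so the service anomaly detection and alerting method provided by the embodiments of the present invention, and the devices and systems using it, will identify the behavior detected for monitored system #1 as an anomaly. However, when unsupervised real-time learning is used, once the system administrator or operations and maintenance personnel of monitored system #1 have confirmed the actual cause of the detected anomaly, subsequently detected anomalies in the number of online users and CPU utilization of monitored system #1 can be judged together with that specific campaign. In addition, the historical data model or risk prediction model can be corrected, updated, or retrained, for example by adjusting risk values, so that the specific campaign is no longer predicted as abnormal behavior.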
In some embodiments, the present invention can be implemented as a computer software program for determining that an abnormal event has occurred in a monitored system and generating an anomaly alert for that event; the program can be loaded into a computer device such as a server. After being loaded into the computer device, the program performs the following steps: receiving a plurality of first monitoring data sets, each of which contains a plurality of monitoring data, the monitoring data containing a first monitoring parameter and a second monitoring parameter, the first monitoring parameter and the second monitoring parameter each containing a plurality of monitoring data points, each data point containing a monitoring parameter value, wherein the first monitoring data sets respectively contain the monitoring data of different monitored systems, and the monitoring data of each first monitoring data set include operational data and system data; clustering the monitoring parameter values of the first monitoring parameter of the first monitoring data sets with a first clustering model to produce a plurality of first cluster labels, so that each first monitoring data set corresponds to one of the first cluster labels, and clustering the monitoring parameter values of the second monitoring parameter of the first monitoring data sets with the first clustering model to produce a plurality of second cluster labels, so that each first monitoring data set corresponds to one of the second cluster labels; clustering the first cluster labels and the second cluster labels of the first monitoring data sets with a second clustering model to produce a plurality of third cluster labels, so that each first monitoring data set corresponds to one of the third cluster labels, wherein at least a plurality of the first monitoring data sets correspond to a particular one of the third cluster labels and form a first monitoring group, the first monitoring group contains the plurality of monitored systems corresponding to those first monitoring data sets, and the plurality of monitored systems in the first monitoring group include a first monitored system; receiving a plurality of second monitoring data sets from the first monitoring group, the second monitoring data sets being received respectively from the plurality of monitored systems in the first monitoring group and each containing the first monitoring parameter and the second monitoring parameter; based on the second monitoring data sets, using a time series algorithm to predict a plurality of first monitoring parameter predicted values and a plurality of second monitoring parameter predicted values corresponding respectively to the first monitoring parameter and the second monitoring parameter of the first monitoring group, so that each monitored system in the first monitoring group corresponds to its own first monitoring parameter predicted value and second monitoring parameter predicted value; receiving, from the plurality of monitored systems in the first monitoring group, a plurality of first monitoring parameter observed values for the first monitoring parameter, the observed values corresponding respectively to the first monitoring parameter predicted values; comparing, for each monitored system in the first monitoring group, its first monitoring parameter observed value with its first monitoring parameter predicted value, determining that the first monitoring parameter observed value of the first monitored system does not match its corresponding predicted value, and determining that, among the monitored systems in the first monitoring group other than the first monitored system, the number whose first monitoring parameter observed value does not match the corresponding predicted value is below a monitored-system count threshold, thereby determining that an abnormal event has occurred in the first monitored system, wherein the monitored-system count threshold is smaller than the number of monitored systems in the first monitoring group; and generating an anomaly alert according to the determination of the abnormal event.
In some embodiments, the steps performed by the aforementioned computer software program may include the following implementations. A monitoring data set may be received directly or indirectly from a monitored system. A monitoring data set may be a historical data set, such as a data set of what has already occurred in the monitored system or has already been collected through monitoring.
Monitoring data sets may be received at different periods or time points; for example, the first monitoring data set may be received at a first time point and the second monitoring data set at a second time point. The first and second monitoring data sets may contain monitoring data with the same feature dimensions, that is, the same monitoring parameters: when the first monitoring data set contains twenty-five monitoring parameters such as the number of online users and CPU utilization, the second monitoring data set contains the same twenty-five monitoring parameters. The first monitoring parameter and the second monitoring parameter may be different monitoring parameters of a monitored system; for example, the first may be the number of online users and the second the CPU utilization. A monitoring parameter may relate to the operation of the monitored system or to the system itself. Data preprocessing may be applied to a monitoring data set to form a plurality of monitoring parameters and/or monitoring data points. Monitoring data may refer to data produced by monitoring a monitored system, and may include the examples given in this disclosure. The monitoring parameters contained in the monitoring data may include operational data parameters and/or system data parameters, and the monitoring data may also contain state values representing the state of those parameters. One item of monitoring data may refer to one record, containing monitoring parameters, received from a monitored system. If the data are handled as a table, an item of monitoring data can take the form of a row. Monitoring data points may refer to different time points, and a plurality of monitoring data points may form a time series.
The monitoring data points may be the time points in Table 1, and the monitoring parameter values may be the corresponding parameter values of the monitored system at each time point. The first and second monitoring data sets may contain monitoring data from different periods or time points; for example, each item of monitoring data contains a time value, which can serve as the time point or as a time point within the period of interest. The first monitoring data set may contain the monitoring data of a first period, and the second monitoring data set that of a second period. The first and second clustering models may use the same clustering algorithm. They may have different hyperparameters, such as different numbers of clusters. The first and second clustering models may be distinguished only by the clustering stage in which they are applied. The relationship between monitoring data sets and cluster labels may be many-to-one, that is, one or more monitoring data sets may correspond to one cluster label. When monitoring data sets and monitored systems are in a one-to-one relationship, monitored systems and cluster labels are therefore in a many-to-one relationship, that is, one cluster label may be assigned to or correspond to one or more monitored systems. The clustering approach of this disclosure can produce a plurality of monitoring groups, each corresponding to or assigned a cluster label, such as one of the third cluster labels. Each third cluster label may be unique so that the monitoring groups can be told apart; in other words, similar monitored systems are clustered into the same group, and each monitoring group may contain one or more monitored systems.
The time series algorithm may be ARIMA, LSTM, RCF, or another algorithm suited to time series data. The second monitoring data set may be fed into the model built by the time series algorithm to produce predicted values based on the second monitoring data set. For example, the second monitoring data set may consist of the data points from the week preceding the data point to be predicted; at one data point per minute, the second monitoring data set then contains 10080 data points for a single monitoring parameter. The time series model may first be built by training on historical data, for example by using the historical data of each monitoring parameter of each monitored system to build that system's time series model, yielding a time series model for each monitored system. When the second monitoring data set contains twenty-five monitoring parameters such as the number of online users and CPU utilization, the historical data may contain the same twenty-five monitoring parameters. A confidence interval may be set so that the monitoring parameter predicted value covers a range of values; for example, the confidence level may be set to 95%. Different confidence levels may be set for different monitoring parameters. The confidence interval may be computed and set automatically by a time series module or function library. A monitoring parameter observed value may correspond to a monitoring parameter predicted value at the same time point; for example, for a given monitoring parameter, the predicted value for the next data point of a monitored system is forecast, and the observed value of that monitored system at the same data point is then obtained.
When a monitoring parameter observed value is compared with a monitoring parameter predicted value, the observed value may be compared against the confidence interval of the predicted value; if the observed value is not within the confidence interval, it can be judged not to match the predicted value. An observed value not being within the confidence interval means that it falls outside the interval, that is, above the upper bound or below the lower bound. The comparison may also involve computing or receiving multiple values at multiple time points and comparing them as a whole. For example, a time series of monitoring parameter predicted values and the corresponding time series of monitoring parameter observed values may be built over the same period, each containing the predicted or observed values at multiple time points within that period, and the two time series are then compared. The observed values can be judged not to match the predicted values if, between the two series, one observed/predicted pair mismatches, several pairs mismatch, several consecutive pairs mismatch, several consecutive pairs mismatch and a further pair mismatches later, or several consecutive pairs mismatch and are immediately followed by another mismatching pair. Alternatively, if one pair mismatches but the following consecutive pairs all match, it can be judged that no mismatch between observed and predicted values has occurred. The monitored-system count threshold may be set according to the number of monitored systems in the monitoring group.
The monitored-system count threshold may be set to "2", so that an abnormal event is determined when, apart from the monitored system of interest (such as the first monitored system) being judged to have an observed value that does not match its predicted value, only one other monitored system in the group, or no other monitored system, is judged to have such a mismatch. To make the criterion for an abnormal event looser, in other words to lower the sensitivity of anomaly detection, the threshold may be set higher, such as "3"; to make it stricter, the threshold may be set to "1", so that an abnormal event is determined only when the monitored system of interest is the sole system in the group judged to have a mismatch. The threshold may also be set as a proportion, for example 10%. In addition, the other monitored systems in the group, apart from the monitored system of interest, whose observed values do not match their predicted values may themselves be judged to have abnormal events, that is, each such system is judged to have an abnormal event and an alert is issued for each event; in this case, the judgment may be based on the number of monitored systems in the group judged to have abnormal events being smaller than the number not so judged, and this ratio may be set in a way similar to the threshold setting of this disclosure.
Following the principle of in-group abnormal event determination in this disclosure, the present invention also covers the converse determination based on the same principle. For example, it can be determined that, among the monitored systems in the first monitoring group other than the first monitored system, the number whose first monitoring parameter observed value matches the corresponding predicted value is above a monitored-system count threshold, thereby determining that an abnormal event has occurred in the first monitored system; setting this threshold higher reflects a stricter criterion. An anomaly alert may include a message, notification, flag, label, audio, and so on. An anomaly alert may include information about the monitored system of interest (such as the first monitored system), for example indicating that the system in which the anomaly occurred is the monitored system of interest, so that follow-up processing such as logging, statistics, and reporting can be carried out for the alert and/or the monitored system.
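The comparison of observed values against predicted confidence intervals, including one of the series-level rules described above, might look like the sketch below. The intervals are assumed to come from a forecasting model that is not shown; the function names and the run length of 3 used for "several consecutive pairs" are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# One data point per minute over one week, as in the example in the text.
POINTS_PER_WEEK = 7 * 24 * 60


def mismatches(observed: Sequence[float], intervals: Sequence[Tuple[float, float]]) -> List[bool]:
    """True wherever an observed value lies outside its predicted confidence interval."""
    return [not (low <= value <= high) for value, (low, high) in zip(observed, intervals)]


def has_consecutive_mismatches(flags: Sequence[bool], run_length: int = 3) -> bool:
    """Series-level rule: report a mismatch only after `run_length` consecutive misses."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= run_length:
            return True
    return False


# Hypothetical 95% intervals for five consecutive data points.
intervals = [(90.0, 110.0)] * 5
sustained = mismatches([100, 120, 125, 130, 105], intervals)
isolated = mismatches([100, 120, 100, 100, 100], intervals)
```

Under this rule the sustained deviation counts as a mismatch, while a single miss followed by matching values does not, which mirrors the alternative described in the text.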
In some embodiments, the present invention can be implemented as a computer software program for generating an anomaly alert which, once loaded by a computer, performs steps including the following: receiving a first set of monitoring data for each of a plurality of monitored systems, each first set containing the monitored operational data and system data of the corresponding monitored system, the operational data and system data being categorized by a plurality of state parameters; clustering the monitored systems using the state parameters of the first sets of monitoring data to produce a plurality of state-parameter cluster labels, so that each monitored system corresponds to one of the state-parameter cluster labels; clustering the monitored systems using the state-parameter cluster labels of the state parameters to produce a plurality of monitored-system cluster labels, so that each monitored system corresponds to one of the monitored-system cluster labels, wherein the monitored systems form a plurality of monitored-system clusters corresponding respectively to the monitored-system cluster labels, the monitored-system clusters include a first monitored-system cluster, and the first monitored-system cluster contains a plurality of monitored systems corresponding to the same monitored-system cluster label; for each of the state parameters, receiving a monitoring observed value for each monitored system in the first monitored-system cluster, generating for each monitored system in the first monitored-system cluster a monitoring data time series corresponding to each state parameter, and generating a monitoring predicted value from each monitoring data time series, the monitoring predicted value corresponding to the monitoring observed value on a time axis; determining whether the monitoring observed value of each monitored system in the first monitored-system cluster matches its corresponding monitoring predicted value and, if not, determining that abnormal behavior has occurred in the corresponding monitored system, and, when the number of monitored systems in the first monitored-system cluster determined to have abnormal behavior is greater than one and less than or equal to an anomaly threshold, determining that an abnormal event has occurred in the at least one monitored system exhibiting abnormal behavior, the anomaly threshold being smaller than the total number of monitored systems in the first monitored-system cluster; and, when the abnormal event is determined to have occurred, generating an anomaly alert indicating that the at least one monitored system is where the abnormal event occurred.
In some embodiments, the foregoing steps can further be implemented as follows. An abnormal behavior means or represents a determination that a parameter value does not conform; it does not necessarily refer to the abnormal event on which an anomaly alert is based. An abnormal behavior can be regarded as a preliminary or intermediate result of anomaly determination. The final result is determined by jointly evaluating multiple abnormal behaviors, for example by counting the monitored systems in which abnormal behavior occurs, and the final result may be determined to be an abnormal event, which serves as the basis for generating an anomaly alert. An abnormal behavior can serve as the anomaly determination result for a single monitored system; because the present invention can evaluate multiple monitored systems in a group as a whole, the multiple abnormal behaviors in the group can together serve as the basis for determining whether an abnormal event has occurred in a single monitored system. For example, when the first monitored system cluster contains 15 monitored systems, the anomaly threshold can be set to 50%. When at least one and no more than seven monitored systems in the cluster are determined to have abnormal behavior, those monitored systems can be determined to have abnormal events, and an anomaly alert can be generated accordingly, indicating that abnormal events have occurred in those at least one and no more than seven monitored systems. In addition, if more than seven monitored systems are determined to have abnormal behavior, an anomaly alert can instead be generated for the monitored systems in the cluster that show no abnormal behavior. That is, the monitored systems determined to have abnormal behavior are not determined to have abnormal events; the monitored systems without abnormal behavior are. In this way, the minority in the cluster whose behavior differs from that of the other monitored systems is determined to have abnormal events. Accordingly, this approach can be implemented through the following steps: determining whether the monitored observation value of each monitored system in the first monitored system cluster conforms to its corresponding monitored prediction value and, if not, determining that an abnormal behavior has occurred in the corresponding monitored system; when the number of monitored systems in the first monitored system cluster determined to have abnormal behavior is greater than the number of monitored systems in the cluster not determined to have abnormal behavior, determining that an abnormal event has occurred in the at least one monitored system not determined to have abnormal behavior; and, when the at least one abnormal event is determined to have occurred, generating an anomaly alert indicating the at least one monitored system in which the abnormal event occurred. Of course, the count-based determination within the cluster in this approach (such as a majority/minority determination, or using the abnormal behavior determination to separate the cluster into two sub-clusters) can be regarded as equivalent to setting a threshold, or can be controlled by setting a threshold. For example, for a particular monitored system cluster containing multiple monitored systems, the abnormal behavior determination of this disclosure can be used to establish a first abnormal event condition: when fewer than half of the monitored systems in the cluster have abnormal behavior, those fewer than half are determined to have abnormal events. A second abnormal event condition can also be established: when more than half of the monitored systems in the cluster have abnormal behavior, the remaining monitored systems, which were not determined to have abnormal behavior, are determined to have abnormal events. The second condition can be used on its own to determine abnormal events within the cluster, or together with the first condition. This approach can also be combined with the other threshold approaches of this disclosure to establish determination conditions for different situations, and the present invention can issue alerts for different abnormal event determinations by setting a plurality of thresholds. For example, different or tiered numeric thresholds can be set to represent abnormal events of different levels: a first-type anomaly threshold may be set to "2" and a second-type (moderate) anomaly threshold to "5". When abnormal behavior in up to two monitored systems is treated as a first-type abnormal event, the anomaly alert can indicate a first-type abnormal event; when abnormal behavior in three to five monitored systems is treated as a second-type abnormal event, the anomaly alert can indicate a second-type abnormal event; and so on. Different types of abnormal events can represent different degrees of abnormality. The intra-cluster anomaly alerting approach of this disclosure can be applied to every monitored system cluster, which ensures that each monitored system is monitored within its own cluster and that an alert is generated when an abnormal event occurs.
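The cluster-level conditions described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `abnormal_event_systems` and `event_type`, the dictionary input, and the rounding down of the percentage threshold (15 systems at 50% giving a limit of 7, as in the worked example) are choices made here, not details fixed by the text.

```python
import math


def abnormal_event_systems(behavior_flags, threshold_ratio=0.5):
    """Return the systems judged to have an abnormal event within one cluster.

    behavior_flags maps a system id to True when abnormal behavior was detected.
    First condition: between 1 and `limit` systems are abnormal -> alert on them.
    Second condition: abnormal systems outnumber the rest -> alert on the rest.
    """
    total = len(behavior_flags)
    limit = math.floor(total * threshold_ratio)  # assumed rounding: 15 * 0.5 -> 7
    abnormal = [s for s, flag in behavior_flags.items() if flag]
    normal = [s for s, flag in behavior_flags.items() if not flag]
    if 1 <= len(abnormal) <= limit:
        return abnormal
    if len(abnormal) > len(normal):
        return normal  # the minority that behaves differently from the majority
    return []


def event_type(abnormal_count, first_threshold=2, second_threshold=5):
    """Tiered thresholds: up to 2 systems -> first type, 3 to 5 -> second type."""
    if abnormal_count < 1:
        return None
    if abnormal_count <= first_threshold:
        return "type-1"
    if abnormal_count <= second_threshold:
        return "type-2"
    return None  # beyond the tiers sketched here


# Hypothetical cluster of 15 systems; systems 0..2 show abnormal behavior.
few = {f"sys-{i}": i < 3 for i in range(15)}
# Same cluster, but systems 0..8 show abnormal behavior (a majority).
many = {f"sys-{i}": i < 9 for i in range(15)}
few_result = abnormal_event_systems(few)
many_result = abnormal_event_systems(many)
```

With three abnormal systems the alert targets those three; with nine abnormal systems the alert targets the six systems that did not show abnormal behavior.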
In some embodiments, the present invention can be implemented as a computer software program for generating an anomaly alert. After a computer loads the program, it can execute steps comprising: clustering a plurality of state parameters of a plurality of monitored systems and generating a plurality of first cluster label sets, each containing a plurality of first cluster labels, such that each monitored system is assigned one first cluster label from each of the first cluster label sets, the first cluster label sets corresponding respectively to the state parameters, and the state parameters including state parameters of the operational data and system data of the monitored systems; clustering the monitored systems by the first cluster labels of the first cluster label sets and generating a second cluster label set containing a plurality of second cluster labels, such that each monitored system is assigned one second cluster label from the second cluster label set, wherein a plurality of the monitored systems form a first monitored system group whose monitored systems are all assigned the same second cluster label, the first monitored system group including a target monitored system; for the first monitored system group, forming a training data set for each monitored system in the group, each training data set including the state parameters and further including an outlier parameter that indicates an abnormal state of the corresponding monitored system; building, based on the training data sets, a first anomaly prediction model for each monitored system in the first monitored system group, each first anomaly prediction model being used to predict a first outlier parameter prediction value of the corresponding monitored system; forming an outlier parameter time series from the outlier parameter in each training data set, and building, based on the outlier parameter time series, a second anomaly prediction model for each monitored system in the first monitored system group, each second anomaly prediction model being used to predict a second outlier parameter prediction value of the corresponding monitored system; for each monitored system in the first monitored system group, using the corresponding first anomaly prediction model to predict the first outlier parameter prediction value at a target time point, using the corresponding second anomaly prediction model to predict the second outlier parameter prediction value at the target time point, and determining that abnormal behavior has occurred in the corresponding monitored system when the first outlier parameter prediction value does not conform to the second outlier parameter prediction value; determining, in this manner, that abnormal behavior has occurred in the target monitored system, and determining, in the same manner, that fewer than a specific number of the other monitored systems in the first monitored system group have abnormal behavior, the specific number being lower than the number of monitored systems in the first monitored system group; and generating an anomaly alert associated with the target monitored system.
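The final decision for the target monitored system can be sketched as below, assuming the per-system abnormal behavior flags have already been produced by comparing the two models' predictions. The function name `should_alert_target`, the dictionary input, and the parameter `max_other_abnormal` (standing in for the "specific number") are hypothetical names introduced for this illustration.

```python
def should_alert_target(target, behavior_flags, max_other_abnormal):
    """Decide whether to raise an alert associated with the target system.

    behavior_flags maps each system id in the group to True when abnormal
    behavior was determined for it. An alert is raised when the target is
    abnormal and fewer than `max_other_abnormal` other systems are abnormal.
    """
    if max_other_abnormal >= len(behavior_flags):
        # The specific number must be lower than the group size.
        raise ValueError("max_other_abnormal must be below the group size")
    if not behavior_flags.get(target, False):
        return False
    others = sum(
        1 for system, flag in behavior_flags.items() if system != target and flag
    )
    return others < max_other_abnormal


# Hypothetical group of five systems; the target and one peer are abnormal.
group = {"target": True, "a": True, "b": False, "c": False, "d": False}
alert_when_isolated = should_alert_target("target", group, max_other_abnormal=2)
alert_when_widespread = should_alert_target("target", group, max_other_abnormal=1)
```

The design intent mirrored here is that an anomaly shared by many systems in the same group is less likely to be specific to the target, so the alert is tied to the case where only a few peers deviate.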
In some embodiments, the outlier parameter of the training data set represents an outlier value or a risk value, such as the risk value described in this disclosure. The outlier parameter can contain outlier values at multiple time points or as a time series, and these values can be labeled by the presence or absence of an anomaly or risk, for example "1" for abnormal and "0" for not abnormal. Under this labeling, data labeled "1" in the training data set indicates that the monitored system's state is abnormal, and data labeled "0" indicates that it is not. The training data set can contain historical data, such as data containing outlier values or risk values. Each record in the training data set can contain data for a specific time point, so that the outlier parameter reflects the abnormal state of the monitored system at that time point. The training data set can be used to train the first anomaly prediction model, whose algorithm can be any suitable supervised machine learning algorithm. Put another way, the outlier parameter is a label on the union of the state parameters, so that all state parameters can be monitored as a whole, without monitoring each state parameter individually, and anomalies can be predicted with the first anomaly prediction model. In addition, the outlier values at the individual time points can be formed into a time series, and the resulting outlier parameter time series can be used to train the second anomaly prediction model, built for example with an LSTM algorithm. For example, for a monitored system with one data point per minute and an outlier parameter labeled for each data point, one week of labeled data yields 10,080 outlier parameter values (that is, the outlier parameter covers every data point of that week); these values form an outlier parameter time series that can be used to build a time series model. For a new record, such as a new observation, that contains state parameters but no outlier parameter, the first and second anomaly prediction models can each produce a prediction value, and a difference between those prediction values can be the basis for determining that abnormal behavior has occurred. For example, when the second anomaly prediction model serves as the baseline model, the second outlier parameter prediction value serves as the baseline prediction; if the first outlier parameter prediction value does not conform to the second outlier parameter prediction value, or falls outside the upper or lower limit of its confidence interval, abnormal behavior can be determined to have occurred.
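The arithmetic of the one-week example and the baseline comparison can be sketched as follows. The two prediction models themselves are not implemented here; the prediction values and interval bounds passed in are placeholders, and the names `BaselinePrediction`, `is_abnormal_behavior`, and `weekly_labels` are assumptions made for illustration.

```python
from dataclasses import dataclass

# One data point per minute for one week: 7 days * 24 hours * 60 minutes.
MINUTES_PER_WEEK = 7 * 24 * 60


@dataclass
class BaselinePrediction:
    """Second-model output used as baseline, with its confidence interval."""
    value: float
    lower: float
    upper: float


def is_abnormal_behavior(first_prediction, baseline):
    """Abnormal when the first model's prediction leaves the baseline interval."""
    return not (baseline.lower <= first_prediction <= baseline.upper)


# Hypothetical labeled week: 0 = not abnormal, 1 = abnormal at that minute.
weekly_labels = [0] * MINUTES_PER_WEEK
weekly_labels[5000] = 1  # a single labeled anomaly, placed arbitrarily

# Placeholder predictions for a new record at the target time point.
baseline = BaselinePrediction(value=0.1, lower=0.0, upper=0.3)
inside = is_abnormal_behavior(0.2, baseline)
outside = is_abnormal_behavior(0.9, baseline)
```

A list of 0/1 labels like `weekly_labels` is the kind of outlier parameter time series that a sequence model could be trained on; here it only demonstrates the 10,080-point count.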
In summary, compared with the prior art, the service anomaly detection and alerting method provided by the embodiments of the present invention, and the apparatus and system using this method, can offer the following advantages: (1) abnormal events need not be monitored or determined solely from IT operations data and logs, which improves the chance of early warning and reduces the likelihood of false alarms; (2) determination or training on data from services or companies of the same type avoids data insufficiency and improves model accuracy; (3) data and modeling from services or companies of the same type improve the efficiency of comparing and alerting on abnormal conditions; (4) anomalies can be predicted in advance even from non-obvious features or feature combinations, without manually defined logic; (5) the periodicity and seasonality of events, and the influence of long-term and short-term data, can be taken into account; and (6) preventive alerts based on combined time series trends, together with the risk value assessment method, avoid the subjective bias of traditional risk prediction methods and their poor efficiency and accuracy on large volumes of data. In addition, the method and the apparatus and system using it can be applied to various AI-related products and services, such as anomaly detection in e-commerce and gaming, anomaly detection in manufacturing, fraud detection in the financial statement insurance (FSI) and financial insurance industry, and automated operations for managed service providers (MSPs).
It should be understood that the examples and embodiments described herein are for illustrative purposes only. The disclosed embodiments and technical features may be combined in various ways consistent with the spirit of the present invention, and various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and scope of this application and the scope of the appended claims.
S31~S36: steps
Claims (16)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW110133749A TWI789003B (en) | 2021-09-10 | 2021-09-10 | Service anomaly detection and alerting method, apparatus using the same, storage media for storing the same, and computer software program for generating service anomaly alert |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TWI789003B (en) | 2023-01-01 |
| TW202312710A (en) | 2023-03-16 |
Family
ID=86669931
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW110133749A TWI789003B (en) | 2021-09-10 | 2021-09-10 | Service anomaly detection and alerting method, apparatus using the same, storage media for storing the same, and computer software program for generating service anomaly alert |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI789003B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TWI881786B (en) * | 2024-04-08 | 2025-04-21 | 神雲科技股份有限公司 | A method for error event recording |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20250138983A (en) * | 2024-03-14 | 2025-09-23 | 쿠팡 주식회사 | Method, apparatus, and recording medium for determining availability of an application |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101345973A (en) * | 2008-09-01 | 2009-01-14 | 中国移动通信集团山东有限公司 | Method and system for network optimization and adjustment using cell clusters in communication network |
| CN102098175A (en) * | 2011-01-26 | 2011-06-15 | 浪潮通信信息系统有限公司 | Alarm association rule obtaining method of mobile internet |
| US8676553B2 (en) * | 2008-11-19 | 2014-03-18 | Hitachi, Ltd. | Apparatus abnormality diagnosis method and system |
| CN107124298A (en) * | 2017-03-31 | 2017-09-01 | 北京奇艺世纪科技有限公司 | Alert aggregation method and system |
| TWI721693B (en) * | 2019-12-09 | 2021-03-11 | 中華電信股份有限公司 | Network behavior anomaly detection system and method based on mobile internet of things |
| TWM622216U (en) * | 2021-09-10 | 2022-01-11 | 伊雲谷數位科技股份有限公司 | Apparatuses for service anomaly detection and alerting |
- 2021-09-10 TW TW110133749A patent/TWI789003B/en active