TWI893528B - Server and error diagnosis method thereof

Info

Publication number: TWI893528B
Application number: TW112144209A
Authority: TW
Inventors: 劉濤; 金大鵬; 滿毅; 馬聰
Original assignee: 韓商韓領有限公司
Priority date: 2023-10-11
Filing date: 2023-11-16
Publication date: 2025-08-11
Also published as: TW202516348A; KR20250052267A; TW202540855A; KR102679450B1; WO2025079779A1

Abstract

The present invention provides a server error analysis method comprising the following steps: identifying a first error occurring in a first service server, the first service server being one of a plurality of service servers that make service calls to one another; determining, based on type information of the first error, whether the first error has a call dependency; selecting, based on whether the first error has the call dependency, at least a portion of error history information and resource history information of the plurality of service servers as analysis target information; and identifying a cause of the first error based on the analysis target information.

Description

Server and error analysis method thereof

The present invention relates to a method for finding the cause of an error occurring in service servers that make service calls to one another, and to a server that performs the method.

Most recent Internet services are implemented using a microservice architecture. A microservice architecture may refer to a structure that has many service servers, each performing a small function, which make chained service calls to one another to carry out a larger overall service. In a structure where such chained service calls occur, it is difficult to find the root cause when any service failure occurs. That is, when a specific error occurs, an engineer must dive deep into the problem directly to determine which preceding error triggered it. Finding the root cause of an error in a microservice architecture therefore depends on the ability of individual engineers, which can be a considerable drawback for service maintenance and management, so this problem needs to be solved.

In this regard, reference may be made to prior documents such as KR10-1684405B1.

[Problem to Be Solved by the Invention]

The disclosed embodiments aim to provide a server and an error analysis method thereof. Specifically, one objective is to automatically analyze the cause of errors that occur in chained service calls between service servers of a microservice architecture that make service calls to one another.

The technical problems to be addressed by the present embodiments are not limited to those described above; other technical problems can be inferred from the following embodiments.

[Technical Means for Solving the Problem]

One aspect of the present invention may include an error analysis method performed by a server, the method including the following steps: identifying a first error occurring in a first service server, the first service server being one of a plurality of service servers that make service calls to one another; determining, based on type information of the first error, whether the first error has a call dependency; selecting, based on whether the first error has the call dependency, at least a portion of error history information and resource history information of the plurality of service servers as analysis target information; and identifying a cause of the first error based on the analysis target information.

One embodiment of the present invention may include an error analysis method wherein, if the first error is determined to have the call dependency, the step of selecting the analysis target information selects the following two kinds of information as the analysis target information: (i) second attribute information of at least one second error, selected from the attribute information of a plurality of errors included in the error history information based on at least a portion of first attribute information of the first error; and (ii) second resource information of at least a portion of the second errors, which, within the computing-resource-related data of a second service server where the second error occurred, corresponds to the circumstances in which at least a portion of the second errors occurred, the computing-resource-related data being included in the resource history information.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the first attribute information includes at least a portion of the first error's trace ID (identifier), occurrence time, input parameters, API (application programming interface) path, server IP (Internet Protocol) address, exception handling method, error message, and the type information.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the second error is selected by calculating a similarity between at least a portion of the attribute information of the plurality of errors and at least a portion of the first attribute information, and identifying errors whose similarity is at or above a threshold.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the step of identifying the cause derives a call relationship between the first error and the second errors based on at least a portion of the second attribute information, analyzes the call relationship, and identifies as the cause a specific error among the second errors that occurred first in the call relationship and induced the other errors.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the step of identifying the cause additionally identifies, as a reference factor, a specific service that generated resource usage at or above a threshold before the specific error occurred, the specific service being identified based on specific resource information corresponding to the specific error, and the specific resource information being at least a portion of the second resource information.

Furthermore, one embodiment of the present invention may include an error analysis method wherein, when the first error is determined not to have the call dependency, the step of selecting the analysis target information selects, from the computing-resource-related data of the first service server included in the resource history information, first resource information corresponding to the circumstances in which the first error occurred as the analysis target information; and the step of identifying the cause identifies the first error itself as the cause.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the step of identifying the cause additionally identifies, as a reference factor, a specific service that generated resource usage at or above a threshold before the first error occurred, the specific service being identified based on the first resource information.

Furthermore, one embodiment of the present invention may include an error analysis method that further includes the following step: generating report information on the cause based on at least a portion of the analysis target information and transmitting it to an administrator terminal.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the report information is generated based on attribute information of the cause and resource information for the circumstances in which the cause occurred, both of which are identified based on at least a portion of the error history information and the resource history information.

Furthermore, one embodiment of the present invention may include an error analysis method wherein, if the first error is determined to have the call dependency based on the type information, the report information is further generated based on the call relationship between the cause and the first error.

Furthermore, one embodiment of the present invention may include an error analysis method wherein, when the type information of the cause indicates a database error, the report information is further generated based on computing-resource-related data of the database associated with the cause at the time the cause occurred.

Furthermore, one embodiment of the present invention may include an error analysis method wherein, when a reference factor is identified in addition to the cause, the report information is further generated based on attribute information of the reference factor and information on the relationship between the reference factor and the cause.

Furthermore, one embodiment of the present invention may include an error analysis method wherein hourly count information, indicating how many times per hour the specific error identified as the cause of the first error has been identified as a cause, is determined, and the report information is transmitted to the administrator terminal only when the hourly count information is at or above a threshold.
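The hourly-count gate described above could be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `CauseCounter`, the sliding one-hour window, and the string cause key are all assumptions introduced here, and identifications are assumed to arrive in time order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class CauseCounter:
    """Hypothetical helper: counts how often an error was identified as a cause."""

    def __init__(self, threshold: int, window: timedelta = timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        self._hits = defaultdict(deque)  # cause key -> identification timestamps

    def record(self, cause_key: str, when: datetime) -> bool:
        """Record one identification; return True if a report should be sent."""
        hits = self._hits[cause_key]
        hits.append(when)
        # Drop identifications that fall outside the sliding window.
        while hits and when - hits[0] > self.window:
            hits.popleft()
        # Report only when the count within the window reaches the threshold.
        return len(hits) >= self.threshold
```

With a threshold of 3, the first two identifications of the same cause within an hour would be suppressed and the third would trigger a report.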

Furthermore, one embodiment of the present invention may include an error analysis method wherein the report information further includes information on at least one error of note, namely an error identified as a cause most often within a specified time period.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the error history information includes attribute information of the errors occurring in each service server in the chronological order in which the errors occurred, and an error for which a threshold time or more has elapsed since its occurrence is deleted from the error history information.
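A time-ordered history with age-based deletion might look like the sketch below. The class `ErrorHistory`, its method names, and the use of an in-memory deque are assumptions for illustration only; the source describes a database, and records are assumed to be added in order of occurrence.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorHistory:
    """Hypothetical in-memory stand-in for the error history store."""

    def __init__(self, retention: timedelta):
        self.retention = retention  # the "threshold time" after which errors are deleted
        self._records = deque()     # (occurred_at, attributes), oldest first

    def add(self, occurred_at: datetime, attributes: dict) -> None:
        # Records are assumed to arrive in chronological order.
        self._records.append((occurred_at, attributes))

    def evict(self, now: datetime) -> None:
        # Delete errors for which the retention time or more has elapsed.
        while self._records and now - self._records[0][0] >= self.retention:
            self._records.popleft()

    def records(self) -> list:
        return list(self._records)
```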

Furthermore, one embodiment of the present invention may include an error analysis method wherein the resource history information includes data obtained by recording, at a specified period, at least a portion of the CPU (central processing unit), memory, and network data of each service server.

Furthermore, one embodiment of the present invention may include an error analysis method wherein the resource history information further includes data obtained by recording, at a specified period, at least a portion of the CPU, memory, and network data of a database and of the data requests received by the database, the database operating in conjunction with at least a portion of the service servers.

Another aspect of the present invention may provide a server for analyzing errors, including a memory that stores instructions and a processor connected to the memory and configured to: identify a first error occurring in a first service server, the first service server being one of a plurality of service servers that make service calls to one another; determine, based on type information of the first error, whether the first error has a call dependency; select, based on whether the first error has the call dependency, at least a portion of error history information and resource history information of the plurality of service servers as analysis target information; and identify a cause of the first error based on the analysis target information.

Yet another aspect of the present invention may provide a non-transitory computer-readable recording medium on which is recorded a program for executing the above-mentioned information providing method on a computer.

Specific details of other embodiments are included in the detailed description and the drawings.

[Effects of the Invention]

According to the proposed embodiments, one or more of the following effects can be expected.

According to the embodiments of this specification, the history of errors occurring in each server and the computing-resource-related data at the time of each error can be analyzed automatically to find and diagnose the root cause of a failure.

Furthermore, according to the embodiments of this specification, the cause of an error can be analyzed in a manner optimized case by case: for an error that has a call dependency, the analysis centers on the error history information, and for an error that does not, the analysis centers on the computing-resource-related data.

Furthermore, according to the embodiments of this specification, the root cause of the diagnosed failure can be communicated to the user so that measures can be taken against it.

The effects of the invention are not limited to those mentioned above; a person of ordinary skill in the art can clearly understand other effects not mentioned here from the description of the claims.

The terms used in the embodiments are, as far as possible, general terms in wide current use, chosen in consideration of their functions in the present invention; however, they may vary according to the intentions of those skilled in the art, precedents, the emergence of new technologies, and so on. In certain cases there are also terms arbitrarily selected by the applicant, in which case their meanings are described in detail in the corresponding part of the description. Therefore, the terms used in the present invention should be defined based on the meaning of each term and the overall content of the present invention, not simply on the name of the term.

Throughout the specification, when a constituent element is said to "include" another constituent element, this means, unless specifically stated otherwise, that further constituent elements may also be included rather than excluded.

The expression "at least one of a, b, and c" used throughout the specification may cover "a alone," "b alone," "c alone," "a and b," "a and c," "b and c," or "all of a, b, and c."

The "terminal" mentioned below can be implemented as a computer or a portable terminal that can connect to a server or another terminal over a network. Here, a computer includes, for example, a notebook computer, desktop, or laptop equipped with a web browser. A portable terminal is a wireless communication device that ensures portability and mobility and may include, for example, communication terminals based on IMT (International Mobile Telecommunication), CDMA (Code Division Multiple Access), W-CDMA (W-Code Division Multiple Access), or LTE (Long Term Evolution), as well as all types of handheld wireless communication devices such as smartphones and tablet computers.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can readily practice them. However, the present invention can be implemented in various different forms and is not limited to the embodiments described here.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

FIG. 1 is a diagram showing a server that analyzes errors according to an embodiment, together with the components it operates in conjunction with.

Referring to FIG. 1, server 100 may operate in conjunction with an error history DB (database) 300 and a resource history DB 400. The error history DB 300 stores information on errors of service servers 210, 220, and 230, and the resource history DB 400 stores computing resource usage information for service servers 210, 220, and 230 and service DB 240. FIG. 1 shows only the components relevant to the present embodiment. A person of ordinary skill in the relevant technical field will therefore understand that other general-purpose components may be included in addition to those shown in FIG. 1.

Service servers 210, 220, and 230 may be at least a portion of the service servers that make up a microservice architecture. That is, they may be at least a portion of the many service servers that each perform a small service in order to provide an overall service, and that call one another's services. For example, when a user enters the Search Detail Page (SDP) of a shopping mall website, the server communicating with the user terminal may call the microservices needed to implement, for example, a security service, a product price calculation service, and a checkout service, and send the result of the calls to the user terminal. A called microservice may in turn call yet another service. In FIG. 1, the configuration in which service server 210 calls another service server 220, which calls yet another service server 230, is one example of such a call structure. A database such as service DB 240 may also be used by service server 210. The call structure of FIG. 1 is only an example; the error analysis method of the present invention can be applied to a microservice architecture with any other call structure.

Here, the error history DB 300 and the resource history DB 400 may be databases containing the error history information and the resource history information, respectively. The error history DB 300 is described first. When an error occurs in service server 210, 220, or 230 and an error message is published, the error history DB 300 may receive that message and store the error's attribute information in a first-in, first-out form, for example as a queue. Alternatively, in a form similar to but slightly different from a queue, the error history DB 300 may include the attribute information of errors occurring in each service server 210, 220, and 230 in the error history information in the chronological order in which the errors occurred, and may delete from the error history information any error for which a threshold time or more has elapsed since its occurrence. Here, the threshold may be a value set for the purpose of analyzing the call relationships between services. According to one embodiment, in a typical microservice architecture, it may be set slightly longer than the timeout interval for data sent and received in a service call.

The items of the attribute information may include each error's trace ID, occurrence time, input parameters, service name, API path, IP address, error message, exception handling method, and type information. The trace ID may be the trace ID of the overall service carried out by executing the microservices, and the input parameters may refer to the input parameters of the corresponding service. The occurrence time may refer to the time at which each error occurred, the service name to the name of the corresponding microservice, and the IP address to the IP address of the corresponding service server. The error message may refer to the system message for the error, and the exception handling method may be information indicating how the exception was handled after the error occurred. The type information indicates which category the error belongs to, and some types may be defined as having a call dependency on another service. That is, an error that occurs because another service was called but did not operate normally may be judged to be of a type that has a call dependency. Examples of type information include Call Dependency Service Timeout, Call Dependency Service Internal Error, Call Dependency Service Invalid Request, Access Database Error, and Access Cache Error, although the types are of course not limited to these. According to one embodiment, among these types, the types that have a call dependency may be Call Dependency Service Timeout and Call Dependency Service Internal Error, and the other types may be judged not to have a call dependency. A type such as Access Database Error may be judged to be an error related to a database.
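The attribute items and the type-based dependency check described above can be sketched as a small record type. The field names, the `ErrorRecord` class, and the two helper functions are hypothetical; only the type strings and the split between call-dependent and other types follow the embodiment in the text.

```python
from dataclasses import dataclass
from datetime import datetime

# Per the embodiment above: only these two types carry a call dependency.
CALL_DEPENDENT_TYPES = frozenset({
    "Call Dependency Service Timeout",
    "Call Dependency Service Internal Error",
})
# Types treated as database-related.
DATABASE_TYPES = frozenset({"Access Database Error"})


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical container for one error's attribute information."""
    trace_id: str
    occurred_at: datetime
    input_params: str
    service_name: str
    api_path: str
    server_ip: str
    error_message: str
    exception_handling: str
    error_type: str


def has_call_dependency(error: ErrorRecord) -> bool:
    """True if the error's type is one defined as call-dependent."""
    return error.error_type in CALL_DEPENDENT_TYPES


def is_database_related(error: ErrorRecord) -> bool:
    """True if the error's type points at a database problem."""
    return error.error_type in DATABASE_TYPES
```

Keeping the type sets as data rather than hard-coded branches would make it easy to reclassify a type, for example if another embodiment treats Call Dependency Service Invalid Request as call-dependent.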

Next, the resource history DB 400 is described. It may be a database that stores, as the resource history information, infrastructure metrics for service servers 210, 220, and 230 and service DB 240, that is, data on the usage history of their computing resources. According to one embodiment, the resource history information may include data obtained by recording, at a specified period, at least a portion of the CPU, memory, and network data of at least a portion of service servers 210, 220, and 230 and service DB 240. For service DB 240, information on the data requests it received may also be included in the resource history information. Here, according to one embodiment, the CPU, memory, and network data may refer to usage, but they are not limited to this. Alternatively, in another embodiment, the data for the relevant service server or service DB may be stored in the resource history DB 400 only at the time an error occurs.
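Looking up the resource state that corresponds to a given error time in periodically recorded samples could be done as in the sketch below. The `ResourceSample` fields, their units, and the rule "latest sample at or before the error time" are assumptions; the source does not specify how a sample is matched to an error.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ResourceSample:
    """Hypothetical periodic sample for one service server or database."""
    recorded_at: datetime
    cpu_pct: float
    memory_pct: float
    network_kbps: float


def sample_at(samples: List[ResourceSample], when: datetime) -> Optional[ResourceSample]:
    """Return the latest sample recorded at or before `when`.

    `samples` must be sorted by `recorded_at`; returns None if every
    sample is later than `when`.
    """
    times = [s.recorded_at for s in samples]
    i = bisect_right(times, when)
    return samples[i - 1] if i else None
```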

The error analysis method according to an embodiment of the present invention is described below.

FIG. 2 is a flowchart illustrating an error analysis method according to an embodiment.

Referring to FIG. 2, in S210, server 100 may identify a first error occurring in a first service server, the first service server being one of a plurality of service servers that make service calls to one another. In S220, it may determine, based on the type information of the first error, whether the first error has a call dependency. In S230, it may select, based on whether the first error has a call dependency, at least a portion of the error history information and resource history information of the plurality of service servers as analysis target information. In S240, it may identify the cause of the first error based on the analysis target information. Each step is described in detail below.
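The S210 to S240 flow can be sketched end to end as follows. This is a simplified illustration under stated assumptions: errors are plain dictionaries with hypothetical keys (`trace_id`, `service`, `type`, `time`), related errors are picked by trace ID match, and the cause is taken to be the earliest related error, which are two of the embodiments the text describes. Times are hypothetical integer seconds.

```python
CALL_DEPENDENT_TYPES = {
    "Call Dependency Service Timeout",
    "Call Dependency Service Internal Error",
}


def analyze(first_error: dict, error_history: list, resource_history: dict) -> dict:
    """Sketch of S220-S240 for a first error already identified in S210."""
    # S220: decide call dependency from the type information.
    dependent = first_error["type"] in CALL_DEPENDENT_TYPES

    # S230: choose the analysis target information.
    if dependent:
        # Related ("second") errors: same trace ID, excluding the first error.
        related = [e for e in error_history
                   if e["trace_id"] == first_error["trace_id"] and e != first_error]
        services = [e["service"] for e in related]
    else:
        related = []
        services = [first_error["service"]]
    resources = {s: resource_history.get(s) for s in services}

    # S240: the earliest related error is the cause; without a call
    # dependency the first error itself is the cause.
    cause = min(related, key=lambda e: e["time"]) if related else first_error
    return {"call_dependent": dependent, "cause": cause, "resources": resources}
```

The branch in S230 mirrors the case-by-case analysis described in the text: error history drives the call-dependent path, and only the first server's resource data is used otherwise.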

First, when a first error occurs in the first service server, server 100 can identify that error. As described above, when the first service server publishes an error message because the first error has occurred, the error message is immediately sent to the error history DB 300, and the error history information stored there includes the first attribute information of the first error. Server 100 can then obtain this first attribute information from the error history DB 300.

Server 100 can then determine, based on the type information of the first error, whether the error has a call dependency. As described above, the first attribute information may include the type information of the first error, and specific types such as Call Dependency Service Timeout and Call Dependency Service Internal Error may mean that the first error has a call dependency. Based on this type information, server 100 can determine whether the first error occurred because another service was called but did not operate normally, that is, whether it has a call dependency. Depending on whether there is a call dependency, server 100 can select at least a portion of the error history information and resource history information of the plurality of service servers as the analysis target information. The flow for each case is described below.

If the first error is determined to have a call dependency, server 100 may first select, as analysis target information, the second attribute information of at least one second error, selected from the attribute information of the plurality of errors included in the error history information based on at least a portion of the first attribute information of the first error. This operation can be performed by server 100 working in conjunction with the error history DB 300. Here, the second error may be selected by calculating a similarity between at least a portion of the attribute information of the plurality of errors and at least a portion of the first attribute information, and identifying errors whose similarity is at or above a threshold. That is, at least a portion of the attribute items (the error's trace ID, occurrence time, input parameters, service name, API path, IP address, error message, exception handling method, and type information) is vectorized, each component is weighted, the similarity is calculated as a similarity between vectors, and errors at or above a preset threshold are identified as second errors. According to one embodiment, only the trace IDs may be compared, and errors with the same trace ID as the first error may be identified as second errors.
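One way to realize the weighted vector similarity described above is sketched below. The encoding is an assumption: each selected attribute becomes a sparse `field=value` component carrying that field's weight, and cosine similarity is used as the vector similarity. The source does not name a specific vectorization or similarity measure, and the function names and weights are hypothetical.

```python
import math


def vectorize(attrs: dict, weights: dict) -> dict:
    """Encode selected attributes as a sparse vector of weighted field=value components."""
    return {f"{field}={attrs[field]}": w
            for field, w in weights.items() if field in attrs}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_second_errors(first: dict, history: list, weights: dict, threshold: float) -> list:
    """Return errors from `history` whose similarity to `first` meets the threshold."""
    v1 = vectorize(first, weights)
    return [e for e in history
            if e is not first and cosine(v1, vectorize(e, weights)) >= threshold]
```

Giving the trace ID a large weight makes this behave much like the trace-ID-only embodiment while still letting other attributes break ties.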

In addition, once the second errors have been identified, server 100 may select, from the computing-resource-related data of the service server where each second error occurred, the second resource information corresponding to the circumstances in which at least a portion of the second errors occurred as analysis target information; this computing-resource-related data is included in the resource history information. That is, the CPU, memory, or network data at the time a second error occurred may be selected as analysis target information. When the type information of a second error indicates a database-related error such as Access Database Error, the resource information for the corresponding circumstances of the database that operates in conjunction with that service server may also be selected as second resource information, together with the resource information of the service server. According to one embodiment, this second resource information may likewise be selected as analysis target information.

根據一實施例，伺服器100可基於包括如上所述之第2屬性資訊之分析對象資訊，確認第1錯誤之原因。首先，伺服器100可基於第2屬性資訊中之至少一部分，導出第1及第2錯誤間之呼叫關係，並分析呼叫關係，將第2錯誤中之於呼叫關係中最先發生並誘發其他錯誤之特定錯誤確認為原因。根據一實施例，可確認第2屬性資訊中各錯誤之發生時刻資訊，將第2錯誤中最先發生之錯誤確認為特定錯誤，並將該特定錯誤確認為第1錯誤之原因。根據另一實施例，可基於第2屬性資訊中API路徑、輸入參數、發生時刻及類型資訊是否具有呼叫依存性中之至少一部分，藉由掌握彼此間之呼叫關係而產生呼叫樹，然後將位於呼叫樹之最上端之特定錯誤確認為第1錯誤之原因。參照圖3來查看如上所述之情形時之一實施例。According to one embodiment, server 100 can identify the cause of the first error based on analysis target information including the second attribute information described above. First, server 100 can derive a call relationship between the first and second errors based on at least a portion of the second attribute information, analyze the call relationship, and identify the specific error in the second error that occurred first in the call relationship and triggered the other errors as the cause. According to one embodiment, the occurrence time information of each error in the second attribute information can be confirmed, and the first error in the second error can be identified as the specific error, and the specific error can be identified as the cause of the first error. According to another embodiment, based on whether at least a portion of the API path, input parameters, occurrence time, and type information in the second attribute information has call dependencies, a call tree can be generated by understanding the call relationships between them. The specific error at the top of the call tree is then identified as the cause of the first error. Refer to FIG3 for an embodiment of the above scenario.
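
The occurrence-time embodiment above can be sketched in a few lines. This is an illustrative sketch only: the record layout and the `occurred_at` field name are assumptions, and the call-tree embodiment is not covered here.

```python
from datetime import datetime


def find_root_cause(first_error: dict, second_errors: list) -> dict:
    """Pick the earliest-occurring related error as the cause.

    Falls back to the first error itself when no related errors were selected.
    """
    if not second_errors:
        return first_error
    return min(second_errors, key=lambda e: e["occurred_at"])
```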

圖3係表示一實施例之第1錯誤及與其相關之第2錯誤之呼叫關係之例示圖。FIG3 is an exemplary diagram showing the call relationship between the first error and the second error related thereto according to an embodiment.

參照圖3，可確認於9.21. 14點45分28秒發生第1錯誤。若獲得如上所述之第1錯誤之第1屬性資訊，則如上所述，伺服器100可確認與第1錯誤相關之第2錯誤之第2屬性資訊310及320。根據第1屬性資訊110及第2屬性資訊310及320，所有追蹤ID(TraceID)均相同，為A3547D58，可確認第1錯誤與第2錯誤中較早發生於14點45分11秒者(對應於310)皆具有呼叫依存性，但第2錯誤中較晚發生者(對應於320)不具有呼叫依存性，並且可確認最快發生。於該情形時，成為第1錯誤之原因之特定錯誤可確認為與320對應之錯誤。Referring to FIG. 3, it can be confirmed that the first error occurred at 14:45:28 on September 21. Once the first attribute information of the first error is obtained, the server 100 can confirm the second attribute information 310 and 320 of the second errors related to the first error, as described above. According to the first attribute information 110 and the second attribute information 310 and 320, all trace IDs (TraceID) are the same, namely A3547D58. It can be confirmed that both the first error and the second error that occurred at 14:45:11 (corresponding to 310) are call-dependent, whereas the remaining second error (corresponding to 320) is not call-dependent and is confirmed to have occurred earliest. In this case, the specific error that caused the first error can be confirmed to be the error corresponding to 320.

又，根據一實施例，伺服器100可基於進而包括第2資源資訊之分析對象資訊，確認與特定錯誤對應之特定資源資訊。並且，伺服器100可基於特定資源資訊，將於特定錯誤發生前產生臨界值以上之資源之特定服務追加確認為參考因素。此處，臨界值可設定為系統上設定之CPU、記憶體或網路之使用量，該使用量係可能對伺服器造成負擔之量。又，對於發生特定錯誤前之特定時間範圍，可探索如上所述之特定服務。如上所述，可藉由將於發生特定錯誤前對伺服器造成負擔之特定服務確認為參考因素，從而確認間接原因。Furthermore, according to one embodiment, the server 100 can identify the specific resource information corresponding to the specific error based on analysis target information that further includes the second resource information. Based on the specific resource information, the server 100 can additionally identify, as a reference factor, a specific service that used resources at or above a threshold before the specific error occurred. Here, the threshold can be set to a system-configured level of CPU, memory, or network usage that may place a burden on the server. The specific service can be searched for within a specific time range before the specific error occurred. By identifying as a reference factor the specific service that burdened the server before the specific error occurred, an indirect cause can thus be identified.
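
The reference-factor search can be sketched as a scan over resource samples taken shortly before the error. This is a hedged illustration: the sample layout, the metric names, the 0.9 utilization thresholds, and the five-minute window are all hypothetical values chosen for the example, since the source leaves the threshold and time range to system configuration.

```python
from datetime import datetime, timedelta

# Hypothetical utilization thresholds (fraction of capacity).
THRESHOLDS = {"cpu": 0.9, "memory": 0.9, "network": 0.9}


def find_reference_factors(samples, error_time, window=timedelta(minutes=5), thresholds=THRESHOLDS):
    """Return services whose resource usage reached a threshold within the window before the error."""
    start = error_time - window
    services = []
    for sample in samples:
        in_window = start <= sample["timestamp"] < error_time
        overloaded = any(sample.get(metric, 0.0) >= limit for metric, limit in thresholds.items())
        if in_window and overloaded and sample["service"] not in services:
            services.append(sample["service"])
    return services
```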

以下，對確認為第1錯誤不具有呼叫依存性之情形之實施例進行說明。根據一實施例，於確認第1錯誤不具有呼叫依存性之情形時，可於資源歷史資訊中包括之第1伺服器之計算資源相關資料中，選擇與第1錯誤之發生情況對應之第1資源資訊作為分析對象資訊。於第1錯誤之類型資訊指示如存取資料庫錯誤之資料庫相關錯誤之情形時，與相應第1服務伺服器繫結而進行動作之資料庫之相應情況之資源資訊亦可與該第1服務伺服器之資源資訊一併被選擇為第1資源資訊。根據一實施例，如上所述之第1資源資訊亦可一併被選擇為分析對象資訊。The following describes an embodiment in which it is determined that the first error does not have call dependency. According to one embodiment, when the first error is determined to not have call dependency, the first resource information corresponding to the occurrence of the first error can be selected as the analysis target information from the computing resource-related data of the first server included in the resource history information. If the type information of the first error indicates a database-related error, such as a database access error, resource information corresponding to the database associated with the corresponding first service server and performing operations can also be selected as the first resource information along with the resource information of the first service server. According to one embodiment, the first resource information described above may also be selected as analysis target information.

此後，根據一實施例，伺服器100可將第1錯誤確認為原因。即，可確認第1錯誤其本身有問題，而並非由服務呼叫之鏈式結構所導致。又，伺服器100可基於第1資源資訊，將產生臨界值以上之資源之特定服務追加確認為參考因素。可與上述第1錯誤具有呼叫依存性之情形時之參考因素確認過程相似地，將於第1錯誤發生前對伺服器造成負擔之特定服務確認為參考因素。Afterward, according to one embodiment, server 100 can identify the first error as the cause. That is, it can be confirmed that the first error itself is a problem, and not caused by the chained structure of service calls. Furthermore, based on the first resource information, server 100 can additionally identify the specific service that generated resources exceeding the critical value as a reference factor. Similar to the reference factor identification process described above when the first error is call-dependent, specific services that placed a burden on the server before the first error occurred can be identified as reference factors.
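
The two branches (call-dependent and not call-dependent) can be summarized in a short sketch. Several simplifications are assumed here: call dependency is inferred from a type string beginning with "Call Dependency", related errors are matched by trace ID only, and resource records are filtered by service name without matching the time of occurrence. None of these details are fixed by the source.

```python
def has_call_dependency(error: dict) -> bool:
    """Assumed rule: the type information names a call-dependency error."""
    return error["type"].startswith("Call Dependency")


def select_analysis_targets(first_error: dict, error_history: list, resource_history: list) -> dict:
    """Choose error and resource records to analyze, depending on call dependency."""
    if has_call_dependency(first_error):
        # Related (second) errors: same trace ID, excluding the first error itself.
        errors = [e for e in error_history
                  if e["trace_id"] == first_error["trace_id"] and e is not first_error]
        services = {e["service"] for e in errors}
    else:
        # No call dependency: only the first service server's own resources matter.
        errors = []
        services = {first_error["service"]}
    resources = [r for r in resource_history if r["service"] in services]
    return {"errors": errors, "resources": resources}
```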

如上所述，根據一實施例，若根據是否具有呼叫依存性來確認原因，則伺服器100可基於分析對象資訊中之至少一部分，產生原因之報告資訊並傳輸至管理者終端。即，由於已自錯誤歷史DB 300及資源歷史DB 400提取出分析對象資訊，故而可基於如上所述之分析對象資訊中之至少一部分，產生所確認之原因之報告資訊並傳輸至管理者終端。As described above, according to one embodiment, if the cause is determined based on whether or not there is call dependency, server 100 can generate and transmit cause report information to the administrator terminal based on at least a portion of the analysis target information. Specifically, since analysis target information has been extracted from error history DB 300 and resource history DB 400, report information on the determined cause can be generated and transmitted to the administrator terminal based on at least a portion of the analysis target information described above.

根據一實施例，報告資訊可基於原因之屬性資訊及該原因發生情況之資源資訊而產生，該原因之屬性資訊及該原因發生情況之資源資訊係基於錯誤歷史資訊及資源歷史資訊中之至少一部分而確認。根據一實施例，於第1錯誤不具有呼叫依存性，報告資訊可包括上述第1錯誤之第1屬性資訊、即錯誤之追蹤ID、發生時刻、輸入參數、服務名稱、API路徑、IP、錯誤訊息、異常處理方式及類型資訊中之至少一部分，亦可包括第1錯誤之第1資源資訊、即第1錯誤發生時之第1服務伺服器之CPU、記憶體及網路資料等之資訊。According to one embodiment, the report information may be generated based on attribute information of a cause and resource information of a situation in which the cause occurs, wherein the attribute information of the cause and the resource information of the situation in which the cause occurs are confirmed based on at least a portion of the error history information and the resource history information. According to one embodiment, when the first error is not call-dependent, the report information may include the first attribute information of the first error, i.e., at least a portion of the error tracking ID, occurrence time, input parameters, service name, API path, IP address, error message, exception handling method, and type information. It may also include the first resource information of the first error, i.e., information such as the CPU, memory, and network data of the first service server at the time the first error occurred.

根據一實施例，於第1錯誤具有呼叫依存性之情形時，報告資訊可進而基於原因與第1錯誤間之呼叫關係而產生。即，可基於原因與第1錯誤間之呼叫關係，進而包括受原因影響之服務及受影響之API路徑之資訊。為了統一格式，即便於不具有呼叫依存性之情形時，報告資訊亦可包括受影響之服務及受影響之API路徑之資訊(但與第1屬性資訊中包括之服務名稱及API路徑相同)。參照圖4來查看如上所述之情形之一實施例。According to one embodiment, when the first error has call dependency, the report information can be further generated based on the call relationship between the cause and the first error. That is, based on the call relationship between the cause and the first error, the report can further include information about the services and API paths affected by the cause. To keep the format uniform, the report information can include the affected services and affected API paths even when there is no call dependency (in which case they are the same as the service name and API path included in the first attribute information). An example of this case is described with reference to FIG. 4.

圖4係一實施例之報告資訊之例示圖。FIG4 is an example diagram of report information according to an embodiment.

參照圖4，如上所述，報告資訊可包括第1屬性資訊中之至少一部分，即發生時刻(Timestamp，時戳)、輸入參數(Input，輸入)、服務名稱(Error Cause Service，錯誤產生服務)、API路徑(Error Cause API，錯誤產生API)、IP(Error Cause Host IP，錯誤產生主機IP)及類型資訊(Error Cause Reason，錯誤產生原因)。又，可包括受影響之服務(Impacted Services)及受影響之API路徑(Impacted APIs，受影響之APIs)之資訊、及資源資訊(Error Cause Infrastructure Metrics Data，錯誤產生基礎架構度量資料)。於該情形時，作為基於圖3之示例而產生報告資訊之示例，如上所述，原因為230服務之錯誤。因此，大部分之內容係基於230服務之屬性資訊及資源資訊而產生，並且受影響之服務及受影響之API路徑可包括210、220及230服務之所有內容。As described above with reference to FIG. 4 , the report information may include at least a portion of the first attribute information, namely, the occurrence time (timestamp), input parameters (input), service name (Error Cause Service), API path (Error Cause API), IP address (Error Cause Host IP), and error type information (Error Cause Reason). Furthermore, the report information may include information about affected services (Impacted Services), affected API paths (Impacted APIs), and resource information (Error Cause Infrastructure Metrics Data). In this case, as an example of generating report information based on the example in Figure 3, as described above, the cause is an error in service 230. Therefore, most of the content is generated based on the attribute information and resource information of service 230, and the affected services and affected API paths may include all content of services 210, 220, and 230.
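
The report layout of FIG. 4 can be modeled as a simple record. The sketch below borrows the item labels quoted above for its field names; the input dictionary keys (`occurred_at`, `input`, `service`, `api_path`, `ip`, `type`) and the `build_report` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorReport:
    timestamp: str
    input: str
    error_cause_service: str
    error_cause_api: str
    error_cause_host_ip: str
    error_cause_reason: str
    impacted_services: list = field(default_factory=list)
    impacted_apis: list = field(default_factory=list)
    infrastructure_metrics: dict = field(default_factory=dict)


def build_report(cause: dict, call_chain: list, metrics: dict) -> ErrorReport:
    """Build a report from the cause's attributes and the errors along the call chain.

    call_chain lists the cause first, followed by the errors it induced.
    Without call dependency it contains only the cause itself.
    """
    return ErrorReport(
        timestamp=cause["occurred_at"],
        input=cause["input"],
        error_cause_service=cause["service"],
        error_cause_api=cause["api_path"],
        error_cause_host_ip=cause["ip"],
        error_cause_reason=cause["type"],
        impacted_services=[e["service"] for e in call_chain],
        impacted_apis=[e["api_path"] for e in call_chain],
        infrastructure_metrics=metrics,
    )
```

Passing a single-element `call_chain` reproduces the uniform-format case in which the impacted service and API path equal those of the first attribute information.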

根據一實施例，於原因之類型資訊指示資料庫錯誤之情形時，報告資訊可進而基於與原因相關之資料庫之該原因發生時點之計算資源相關資料、即上述追加選擇之第1或第2資源資訊而產生。即，相應資料亦可包括於報告資訊中。According to one embodiment, when the cause type information indicates a database error, the report information can be further generated based on the calculated resource-related data of the database associated with the cause at the time the cause occurred, that is, the first or second resource information selected above. In other words, the corresponding data can also be included in the report information.

根據一實施例，於上述參考因素已確認之情形時，報告資訊可進而基於參考因素之屬性資訊及參考因素與原因之關係資訊而產生。即，由於參考因素為服務，故而可儲存有與上述錯誤之屬性資訊相似之資訊。因此，報告資訊可進而包括參考因素之追蹤ID、服務實行時刻、輸入參數、服務名稱、API路徑及IP等之資訊。又，原因與參考因素之關係資訊亦可包括於報告資訊中。例如，可一併包括原因與參考因素之發生/實行時刻差異等數值資訊、及參考因素是否為原因之先行服務等資訊。According to one embodiment, when the aforementioned reference factor has been identified, the report information can be further generated based on the attribute information of the reference factor and the relationship information between the reference factor and the cause. Specifically, since the reference factor is a service, information similar to the attribute information of the aforementioned errors may be stored for it. Therefore, the report information can further include information such as the reference factor's tracking ID, service execution time, input parameters, service name, API path, and IP address. The relationship information between the cause and the reference factor can also be included in the report information. For example, it can include numerical information such as the difference between the occurrence time of the cause and the execution time of the reference factor, as well as information such as whether the reference factor is a service that preceded the cause.

根據一實施例，於發生如上所述之第1錯誤時，確認作為第1錯誤之原因之特定錯誤被確認為原因之每小時次數資訊，僅於每小時次數資訊為臨界值以上之情形時，將報告資訊傳輸至管理者終端，而非立刻將報告資訊傳輸至管理者終端。由於錯誤因負載減少等其他因素而得到自行解決之情形較多，故而僅於無法自行解決之重要情況下可呼叫管理者。According to one embodiment, when the first error described above occurs, the hourly frequency information of the specific error identified as the cause of the first error is confirmed. A report is transmitted to the administrator terminal only when the hourly frequency information exceeds a critical value, rather than immediately transmitting the report to the administrator terminal. Because errors are often resolved automatically due to other factors such as load reduction, administrator calls are only made in critical situations where they cannot be resolved automatically.
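
The hourly-count gating can be sketched with a sliding window per cause. This is an assumption-laden sketch: the `CauseRateGate` class, the cause key, and the threshold value are illustrative, and the code assumes identifications are recorded in non-decreasing time order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class CauseRateGate:
    """Decide whether a report should be sent, based on how often a cause recurs."""

    def __init__(self, threshold: int, window: timedelta = timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        self._events = defaultdict(deque)  # cause key -> timestamps inside the window

    def record(self, cause_key: str, at: datetime) -> bool:
        """Record one identification of the cause; return True if a report should be sent."""
        events = self._events[cause_key]
        events.append(at)
        # Drop identifications that have fallen out of the sliding window.
        while events and events[0] <= at - self.window:
            events.popleft()
        return len(events) >= self.threshold
```

With a threshold of 3, the first two identifications of a cause within an hour stay silent and the third triggers the report, which matches the intent of notifying the administrator only when the error does not resolve on its own.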

根據一實施例，報告資訊可進而包括被確認為指定之時間段內最多原因之至少一個注意錯誤之資訊。例如，可包括確認為最多原因之TOP 10之注意錯誤之資訊。如上所述之順序可按照錯誤之類型資訊來導出，亦可按照引起錯誤之各服務來導出。參照圖5a及圖5b來查看如上所述之各實施例。According to one embodiment, the report information may further include information on at least one attention error identified as the most common cause within a specified time period. For example, information may be included on the top 10 attention errors identified as the most common causes. The above-described order can be derived based on error type information or by the services that caused the errors. See Figures 5a and 5b for a more detailed description of the various embodiments described above.

圖5a及圖5b係各實施例之注意錯誤之資訊之例示圖。FIG5a and FIG5b are diagrams illustrating examples of error-noticing information in various embodiments.

首先，可確認圖5a係按照錯誤之類型資訊來導出注意錯誤，並根據作為原因之確認次數順序而選擇之情形之示例。於圖5a之示例中，可確認呼叫依存性內部錯誤(Call Dependency Internal Error)引起了最多之164次錯誤，其次呼叫依存性逾時(Call Dependency Timeout)引起了85次之錯誤。於發生除此之外之其他錯誤時，如上所述，伺服器100可將作為原因之次數順序靠前之錯誤之類型作為注意錯誤而報告給管理者。First, it can be seen that Figure 5a illustrates an example of deriving attention errors based on error type information and selecting them in order of the number of times they were identified as the cause. In the example of Figure 5a, it can be seen that Call Dependency Internal Error caused the most errors, with 164, followed by Call Dependency Timeout, with 85. If other errors occur, as described above, server 100 can report the error type with the highest number of errors as the cause to the administrator as an attention error.

其次，可確認圖5b係按照各服務來導出注意錯誤，並根據作為原因之確認次數順序而選擇之情形之示例。圖5b之示例中，可確認服務230引起了最多之147次錯誤，其次為服務ZZZ、繼而為AAA引起了較多之錯誤。如上所述之注意錯誤之資訊可包括於報告資訊中。Next, Figure 5b shows an example of a scenario where attention errors are generated for each service and selected in order of the number of times they were identified as a cause. In the example of Figure 5b, it can be confirmed that service 230 caused the most errors, with 147, followed by service ZZZ, and then AAA. Information about these attention errors can be included in the report information.
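
The two rankings of FIG. 5a and FIG. 5b amount to counting identified causes by a chosen key. A minimal sketch, assuming each identified cause is a record with hypothetical `type` and `service` fields:

```python
from collections import Counter


def top_attention_errors(causes: list, key: str = "type", n: int = 10) -> list:
    """Rank identified causes by how often each value of `key` appears.

    key="type" groups by error type (as in FIG. 5a);
    key="service" groups by originating service (as in FIG. 5b).
    """
    counts = Counter(cause[key] for cause in causes)
    return counts.most_common(n)
```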

如上所述之報告資訊可傳輸至自管理者中選擇之一部分管理者終端。即，根據一實施例，可僅傳輸至成為相應原因之服務伺服器之負責管理者終端，或者，亦可傳輸至所有受影響之服務之所有相應伺服器之負責管理者終端。The report information described above can be transmitted to a portion of administrator terminals selected from the administrators. That is, according to one embodiment, it can be transmitted only to the administrator terminal responsible for the service server that is the cause of the corresponding failure, or it can be transmitted to the administrator terminals responsible for all corresponding servers of all affected services.

圖6係表示一實施例之伺服器之方塊圖。FIG6 is a block diagram showing a server according to an embodiment.

根據一實施例，伺服器100可包括記憶體101及處理器(processor)102。圖6所示之伺服器100僅示出了與本實施例相關之構成要素。因此，與本實施例相關之技術領域內具有常識者可理解為，除圖6所示之構成要素之外，可進而包括其他通用之構成要素。於一實施例中，處理器102可包括控制部(controller)。According to one embodiment, server 100 may include memory 101 and processor 102. The server 100 shown in FIG. 6 illustrates only the components relevant to this embodiment. Therefore, those of ordinary skill in the art related to this embodiment will understand that, in addition to the components shown in FIG. 6, other general-purpose components may be included. In one embodiment, processor 102 may include a controller.

處理器102可控制伺服器100之整體動作並處理資料及信號。處理器102可構成為至少一個硬體單元。又，處理器102可藉由執行儲存於記憶體101中之程式代碼而產生之一個以上之軟體模組來進行動作。處理器102可包括記憶體，處理器102可執行儲存於記憶體中之程式代碼來控制伺服器100之整體動作並處理資料及信號。Processor 102 controls the overall operation of server 100 and processes data and signals. Processor 102 may be configured as at least one hardware unit. Furthermore, processor 102 may operate by executing one or more software modules generated by program code stored in memory 101. Processor 102 may include memory and execute program code stored in the memory to control the overall operation of server 100 and process data and signals.

處理器102可以如下方式設定：確認第1服務伺服器中發生之第1錯誤，該第1服務伺服器係彼此間進行服務呼叫之複數個服務伺服器中之一者；基於上述第1錯誤之類型資訊，判斷上述第1錯誤是否具有呼叫依存性；基於上述第1錯誤是否具有呼叫依存性，選擇上述複數個服務伺服器之錯誤歷史資訊及資源歷史資訊中之至少一部分作為分析對象資訊；基於上述分析對象資訊，確認上述第1錯誤之原因。The processor 102 may be configured as follows: confirming a first error occurring in a first service server, which is one of a plurality of service servers that perform service calls with each other; determining whether the first error is call-dependent based on type information of the first error; selecting at least a portion of error history information and resource history information of the plurality of service servers as analysis target information based on whether the first error is call-dependent; and confirming a cause of the first error based on the analysis target information.

根據實施例，伺服器100可追加包括用以實行有線/無線通訊之收發器(transceiver)。伺服器100可利用收發器來與外部之電子裝置進行通訊。外部之電子裝置可為終端或伺服器。又，收發器所利用之通訊技術可有GSM(global system for mobile communication，全球行動通訊系統)、CDMA(code division multi access，分碼多重存取)、LTE(long term evolution，長期演進)、5G(5th Generation Mobile Communication Technology，第五代行動通訊技術)、WLAN(wireless LAN，無線區域網路)、Wi-Fi(wireless-fidelity，無線保真)、藍牙(Bluetooth™)、RFID(radio frequency identification，無線射頻識別)、紅外線通訊(infrared data association，IrDA)、紫蜂(ZigBee)、NFC(near field communication，近場通訊)等。According to an embodiment, the server 100 may additionally include a transceiver for implementing wired/wireless communication. The server 100 may utilize the transceiver to communicate with an external electronic device. The external electronic device may be a terminal or a server. Furthermore, the communication technologies utilized by the transceiver may include GSM (global system for mobile communication), CDMA (code division multiple access), LTE (long term evolution), 5G (5th Generation Mobile Communication Technology), WLAN (wireless LAN), Wi-Fi (wireless-fidelity), Bluetooth™, RFID (radio frequency identification), infrared data association (IrDA), ZigBee, and NFC (near field communication).

上述實施例之電子裝置或終端可包括處理器、儲存並執行程式資料之記憶體、如磁碟驅動器之永久儲存器(permanent storage)、與外部裝置通訊之通訊埠、如觸控面板、按鍵(key)、按鈕之購買者介面裝置等。藉由軟體模組或演算法實現之方法作為可於上述處理器上執行之電腦可讀代碼或程式命令，可儲存於電腦可讀記錄媒體上。此處，作為電腦可讀記錄媒體，有磁儲存媒體(例如，ROM(read-only memory，唯讀記憶體)、RAM(random-Access memory，隨機存取記憶體)、軟磁碟、硬磁碟等)及光學讀取媒體(例如，光碟唯讀記憶體(CD-ROM)、數位多功能光碟(DVD：Digital Versatile Disc))等。電腦可讀記錄媒體分散於連接於網路之電腦系統，從而能夠以分散方式儲存電腦可讀代碼並執行。媒體可藉由電腦讀取，儲存於記憶體中，可於處理器中執行。The electronic device or terminal of the above-described embodiment may include a processor, a memory for storing and executing program data, a permanent storage such as a disk drive, a communication port for communicating with external devices, and a purchaser interface device such as a touch panel, keys, or buttons. The method implemented by the software module or algorithm may be stored on a computer-readable recording medium as computer-readable code or program commands executable on the processor. Here, computer-readable recording media include magnetic storage media (e.g., ROM (read-only memory), RAM (random-access memory), floppy disks, hard disks, etc.) and optical readable media (e.g., compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)). Computer-readable recording media are distributed across computer systems connected to a network, enabling the storage and execution of computer-readable code in a distributed manner. The media can be read by the computer, stored in memory, and executed by the processor.

本實施例可由功能塊構成及各種處理步驟表示。該等功能塊可藉由執行特定功能之不同個數之硬體或/及軟體構成來實現。例如，實施例可採用能夠藉由一個以上之微處理器之控制或其他控制裝置而執行各種功能之積體電路構成，如記憶體、處理、邏輯(logic)、查找表(look-up table)等。構成要素可藉由軟體程式或軟體元件而執行，與此相似，本實施例包括以資料結構、程序、常式或其他程式構成之組合實現之演算法，因此可藉由如C、C++、Java及組譯程式(assembler)等程式設計或腳本語言來實現。於功能方面而言，可藉由在一個以上之處理器中執行之演算法來實現。又，本實施例可採用先前技術來進行電子環境設定、信號處理及/或資料處理。「機制」、「元件」、「機構」、「構成」等用語可廣泛地使用，並不限定於機械與物理構成。上述用語可與處理器等關聯而包括軟體之一系列處理(routines)之含義。The present embodiment may be represented by functional block configurations and various processing steps. These functional blocks may be implemented by any number of hardware and/or software configurations that perform specific functions. For example, the embodiment may employ integrated circuit configurations, such as memory, processing, logic, and look-up tables, that can perform various functions under the control of one or more microprocessors or other control devices. Just as components may be executed by software programs or software elements, the present embodiment includes algorithms implemented as a combination of data structures, processes, routines, or other programming constructs, and thus may be implemented in programming or scripting languages such as C, C++, Java, and assembler. Functional aspects may be implemented by algorithms executed on one or more processors. Furthermore, this embodiment may employ conventional techniques for electronic environment configuration, signal processing, and/or data processing. Terms such as "mechanism," "element," "means," and "configuration" are used broadly and are not limited to mechanical or physical configurations. These terms may include the meaning of a series of software routines in association with a processor or the like.

上述實施例僅為一示例，可於下文敍述之發明申請專利範圍內實現其他實施例。The above embodiment is merely an example, and other embodiments may be implemented within the scope of the invention application described below.

100: 伺服器 101: 記憶體 102: 處理器 110: 第1屬性資訊 210: 服務伺服器 220: 服務伺服器 230: 服務伺服器 240: 服務DB 300: 錯誤歷史DB 310: 第2屬性資訊 320: 第2屬性資訊 400: 資源歷史DB S210: 步驟 S220: 步驟 S230: 步驟 S240: 步驟 100: Server 101: Memory 102: Processor 110: First attribute information 210: Service server 220: Service server 230: Service server 240: Service database 300: Error history database 310: Second attribute information 320: Second attribute information 400: Resource history database S210: Step S220: Step S230: Step S240: Step

圖1係表示根據一實施例來分析錯誤之伺服器及其繫結關係之圖。圖2係用以說明一實施例之錯誤分析方法之流程圖。圖3係表示一實施例之第1錯誤及與其相關之第2錯誤之呼叫關係之例示圖。圖4係一實施例之報告資訊之例示圖。圖5a及圖5b係各實施例之注意錯誤之資訊之例示圖。圖6係表示一實施例之伺服器之方塊圖。 Figure 1 shows a server and its associated relationships for error analysis according to one embodiment. Figure 2 is a flow chart illustrating an error analysis method according to one embodiment. Figure 3 is an example diagram illustrating the call relationship between a first error and its associated second error according to one embodiment. Figure 4 is an example diagram illustrating report information according to one embodiment. Figures 5a and 5b are examples of error information according to various embodiments. Figure 6 is a block diagram illustrating a server according to one embodiment.

100: 伺服器 210: 服務伺服器 220: 服務伺服器 230: 服務伺服器 240: 服務DB 300: 錯誤歷史DB 400: 資源歷史DB 100: Server 210: Service Server 220: Service Server 230: Service Server 240: Service DB 300: Error History DB 400: Resource History DB

Claims

1. An error analysis method performed by a server, comprising the following steps: identifying a first error occurring in a first service server, the first service server being one of a plurality of service servers that perform service calls with one another; determining whether the first error has call dependency based on type information of the first error; selecting, based on whether the first error has the call dependency, at least a portion of error history information and resource history information of the plurality of service servers as analysis target information; and identifying a cause of the first error based on the analysis target information.

2. The error analysis method of claim 1, wherein the step of selecting the analysis target information comprises, when it is determined that the first error has the call dependency, selecting the following as the analysis target information: (i) second attribute information of at least one second error, selected from among attribute information of a plurality of errors included in the error history information based on at least a portion of first attribute information of the first error; and (ii) second resource information of at least a portion of the second error, which is computing resource-related data of a second service server where the second error occurred, corresponding to the occurrence of the at least a portion of the second error, the computing resource-related data being included in the resource history information.

3. The error analysis method of claim 2, wherein the first attribute information includes at least a portion of the tracking ID, occurrence time, input parameters, API path, server IP address, exception handling method, error message, and the type information of the first error.

4. The error analysis method of claim 2, wherein the second error is selected by calculating the similarity between at least a portion of the attribute information of the plurality of errors and at least a portion of the first attribute information, and identifying an error whose similarity is at or above a threshold.

5. The error analysis method of claim 2, wherein the step of identifying the cause comprises deriving a call relationship between the first error and the second error based on at least a portion of the second attribute information, analyzing the call relationship, and identifying as the cause a specific error, among the second errors, that occurred first in the call relationship and induced the other errors.

6. The error analysis method of claim 5, wherein the step of identifying the cause comprises additionally identifying, as a reference factor, a specific service that used resources at or above a threshold before the specific error occurred, the specific service being identified based on specific resource information corresponding to the specific error, and the specific resource information being at least a portion of the second resource information.

7. The error analysis method of claim 1, wherein the step of selecting the analysis target information comprises, when it is determined that the first error does not have the call dependency, selecting, from the computing resource-related data of the first service server included in the resource history information, first resource information corresponding to the occurrence of the first error as the analysis target information; and the step of identifying the cause comprises identifying the first error as the cause.

8. The error analysis method of claim 7, wherein the step of identifying the cause comprises additionally identifying, as a reference factor, a specific service that used resources at or above a threshold before the first error occurred, the specific service being identified based on the first resource information.

9. The error analysis method of claim 1, further comprising the following step: generating report information on the cause based on at least a portion of the analysis target information and transmitting it to a management terminal.

10. The error analysis method of claim 9, wherein the report information is generated based on attribute information of the cause and resource information of the situation in which the cause occurred, the attribute information and the resource information being identified based on at least a portion of the error history information and the resource history information.

11. The error analysis method of claim 10, wherein, when it is determined based on the type information that the first error has the call dependency, the report information is further generated based on the call relationship between the cause and the first error.

12. The error analysis method of claim 10, wherein, when the type information of the cause indicates a database error, the report information is further generated based on computing resource-related data, at the time the cause occurred, of a database related to the cause.

13. The error analysis method of claim 10, wherein, when a reference factor is additionally identified besides the cause, the report information is further generated based on attribute information of the reference factor and relationship information between the reference factor and the cause.

14. The error analysis method of claim 9, wherein the report information is transmitted to the management terminal only when hourly count information, indicating the number of times per hour that a specific error identified as the cause of the first error has been identified as a cause, is at or above a threshold.

15. The error analysis method of claim 14, wherein the report information further includes information on at least one attention error identified as the most frequent cause within a specified time period.

16. The error analysis method of claim 1, wherein the error history information includes attribute information of errors occurring on each service server in chronological order of the errors, and errors for which a threshold amount of time or more has elapsed since their occurrence are deleted from the error history information.

17. The error analysis method of claim 1, wherein the resource history information includes data obtained by recording at least a portion of the CPU, memory, and network data of each service server at a specified period.

18. The error analysis method of claim 17, wherein the resource history information further includes data obtained by recording, at a specified period, at least a portion of the CPU, memory, and network data of a database and the data requests received by the database, the database being linked to at least a portion of the service servers to operate.

19. A non-transitory computer-readable recording medium on which is recorded a program for executing the method of any one of claims 1 to 18 on a computer.

20. A server for analyzing errors, comprising a memory that stores commands and a processor connected to the memory, the processor being configured to: identify a first error occurring in a first service server, the first service server being one of a plurality of service servers that perform service calls with one another; determine whether the first error has call dependency based on type information of the first error; select, based on whether the first error has the call dependency, at least a portion of error history information and resource history information of the plurality of service servers as analysis target information; and identify a cause of the first error based on the analysis target information.