JP2018156348A

JP2018156348A - Fault monitoring apparatus, fault monitoring system, and program

Info

Publication number: JP2018156348A
Application number: JP2017052127A
Authority: JP
Inventors: 美千子藤井; Michiko Fujii
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2017-03-17
Filing date: 2017-03-17
Publication date: 2018-10-04
Anticipated expiration: 2037-03-17
Also published as: JP6907622B2

Abstract

PROBLEM TO BE SOLVED: To provide a fault monitoring apparatus capable of detecting a fault in an information processing system on the basis of a result of internal monitoring. SOLUTION: A fault monitoring apparatus 100 includes external monitoring means 10 for periodically accessing an information processing system 200 and time-sequentially accumulating the success or failure of each response, internal monitoring means 20 for time-sequentially accumulating the internal states of the respective elements constituting the information processing system, and fault determination means 30 for determining a fault of the information processing system. The fault determination means includes means 32 for converting the time-sequential information on the success or failure of the responses into external metrics data, means 34 for converting the time-sequential information on the internal states into internal metrics data, means 35 for generating teacher data whose output is a value of the external metrics data and whose input is the value of the internal metrics data that temporally corresponds to it, a learning device 36 for machine learning, from the teacher data, a fault determination condition to be used in fault determination of the information processing system, and a determination device 37 in which the fault determination condition is set. SELECTED DRAWING: Figure 2

Description

The present invention relates to a failure monitoring apparatus, a failure monitoring system, and a program.

Failure monitoring systems that remotely monitor, via a network, failures of an information processing system composed of a plurality of servers and modules are conventionally known.

For example, Patent Literature 1 discloses a failure detection device that learns in advance the relationship between the components of an information processing system and the message patterns output to its logs and, during operation, collates the output message patterns against the learned message patterns, so that failures can be appropriately detected in information processing systems that have different components.

There are two approaches to remotely monitoring an information processing system: external monitoring (for example, alive monitoring and service monitoring), which periodically accesses the information processing system from outside and makes a determination based on the response; and internal monitoring (for example, resource monitoring and log monitoring), which acquires the internal state of each element constituting the information processing system and makes a determination from it.

With internal monitoring, the state of each individual element can be grasped by checking its internal state (for example, CPU usage rate, free disk space, or number of processes) against a predetermined threshold. However, whether the information processing system as a whole is operating normally cannot be determined unambiguously from the states of the individual elements. External monitoring, by contrast, can directly detect a failure that has occurred in the information processing system, but the periodic access places a load on the information processing system.

The present invention has been made in view of the above, and its object is to provide a failure monitoring apparatus that can detect a failure in an information processing system based on the results of internal monitoring.

As a result of intensive study of the configuration of a failure monitoring apparatus that can detect a failure in an information processing system based on the results of internal monitoring, the inventor conceived the following configuration and arrived at the present invention.

That is, the present invention provides a failure monitoring apparatus for detecting a failure in an information processing system, comprising: external monitoring means for periodically accessing the information processing system and accumulating the success or failure of each response in time series; internal monitoring means for accumulating, in time series, the internal state of each element constituting the information processing system; and failure determination means for determining a failure of the information processing system. The failure determination means includes: means for converting the time-series information on the success or failure of the responses into external metrics data; means for converting the time-series information on the internal states into internal metrics data; means for generating teacher data whose output is a value of the external metrics data and whose input is the value of the internal metrics data that temporally corresponds to it; a learning device that uses the teacher data to machine-learn a failure determination condition for determining a failure of the information processing system; and a determination device in which the failure determination condition is set, and which receives the internal metrics data as input and outputs a determination result concerning a failure of the information processing system.

As described above, the present invention provides a failure monitoring apparatus that can detect a failure in an information processing system based on the results of internal monitoring.

FIG. 1 is a configuration diagram of the failure monitoring apparatus of the present embodiment.
FIG. 2 is a functional block diagram of the failure monitoring apparatus of the present embodiment.
FIG. 3 is a diagram showing the monitoring scenario of the present embodiment.
FIG. 4 is a diagram showing the internal monitoring settings of the present embodiment.
FIG. 5 is a flowchart showing processing executed by the failure monitoring apparatus of the present embodiment.
FIG. 6 is a diagram showing the external monitoring information of the present embodiment.
FIG. 7 is a flowchart showing processing executed by the failure monitoring apparatus of the present embodiment.
FIG. 8 is a diagram showing the internal state information of the present embodiment.
FIG. 9 is a flowchart showing processing executed by the failure monitoring apparatus of the present embodiment.
FIG. 10 is a diagram showing the external and internal metrics data of the present embodiment.
FIG. 11 is a diagram showing the determination engine (neural network) of the present embodiment.
FIG. 12 is a system configuration diagram of the failure monitoring system of the present embodiment.
FIG. 13 is a hardware configuration diagram of the failure monitoring apparatus (computer) of the present embodiment.

The present invention is described below by way of an embodiment, but the present invention is not limited to the embodiment described below. In the drawings referred to below, common elements are given the same reference numerals, and their description is omitted as appropriate.

FIG. 1 shows a schematic configuration of a failure monitoring apparatus 100 according to an embodiment of the present invention. The failure monitoring apparatus 100 of the present embodiment remotely monitors the state of an information processing system 200 composed of a plurality of servers and modules. The failure monitoring apparatus 100 and the information processing system 200 to be monitored are connected so that they can communicate with each other via a network 50 such as a LAN or VAN.

As shown in FIG. 1, the failure monitoring apparatus 100 of the present embodiment includes external monitoring means 10, internal monitoring means 20, and failure determination means 30.

The external monitoring means 10 performs external monitoring of the information processing system 200: it periodically executes access processing on the information processing system 200 via the network 50 and receives the response. Examples of external monitoring include URL monitoring, PING monitoring, FTP monitoring, POP monitoring, SMTP monitoring, and port monitoring. The external monitoring means 10 generates external monitoring information (described later) from the received responses and sends it to the failure determination means 30.

The internal monitoring means 20 performs internal monitoring of the information processing system 200: it collects, via the network 50, the internal state of each element (server, module) constituting the information processing system 200. Examples of internal monitoring include CPU monitoring, disk monitoring, process monitoring, and log monitoring. Examples of internal states include the CPU usage rate, the free disk space, the presence or absence of a designated process, the number of processes, and the presence or absence of a keyword in a log file. The internal monitoring means 20 generates internal state information (described later) from the collected internal states and sends it to the failure determination means 30.

The failure determination means 30 determines failures of the information processing system 200. It learns a failure determination condition based on the external monitoring information received from the external monitoring means 10 and the internal state information received from the internal monitoring means 20, and determines a failure of the information processing system 200 based on the learned failure determination condition.

The schematic configuration of the failure monitoring apparatus 100 of the present embodiment has been described above. Next, the functional configuration of each of the means described above is explained with reference to FIG. 2.

The external monitoring means 10 includes an external monitoring engine 12 and storage means 14. The storage means 14 stores a monitoring scenario, described later. Based on the monitoring scenario stored in the storage means 14, the external monitoring engine 12 periodically executes access processing on the information processing system 200 to be monitored and receives the response from the information processing system 200. The external monitoring engine 12 then generates external monitoring information based on the response from the information processing system 200 and sends it to the failure determination means 30.

FIG. 3 shows an example of the monitoring scenario 300 stored in the storage means 14. The monitoring scenario 300 describes, in execution order, combinations of the information needed for access processing that simulates a user's access operations and the corresponding expected response value. As shown in FIG. 3, it includes a field 301 for storing the scenario number, a field 302 for storing the processing number, a field 303 for storing the communication protocol used to access the monitoring target, a field 304 for storing the address of the monitoring target, a field 305 for storing option information needed when accessing the monitoring target (user account, file name, and so on), and a field 306 for storing the expected response value in the normal state.
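The record layout of the monitoring scenario 300 can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name `ScenarioStep`, the helper `group_scenarios`, the Python field names, and the example addresses are hypothetical and do not appear in the source; only the mapping to fields 301 to 306 follows the description.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class ScenarioStep:
    """One record of the monitoring scenario 300 (hypothetical field names)."""
    scenario_no: int        # field 301: scenario number
    process_no: int         # field 302: processing number within the scenario
    protocol: str           # field 303: communication protocol
    address: str            # field 304: address of the monitoring target
    options: dict = field(default_factory=dict)  # field 305: option information
    expected: str = ""      # field 306: expected response value when normal


def group_scenarios(steps):
    """Group records by scenario number, each group ordered by processing number."""
    ordered = sorted(steps, key=lambda s: (s.scenario_no, s.process_no))
    return {no: list(group)
            for no, group in groupby(ordered, key=lambda s: s.scenario_no)}


# Illustrative data only; the addresses and expected values are made up.
steps = [
    ScenarioStep(2, 1, "HTTP", "https://example.com/", {}, "200"),
    ScenarioStep(1, 2, "FTP", "ftp://example.com", {"user": "u"}, "226"),
    ScenarioStep(1, 1, "HTTP", "https://example.com/login", {}, "200"),
]
scenarios = group_scenarios(steps)
```

Grouping by scenario number and ordering by processing number mirrors the statement that the combinations are described in execution order.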

The internal monitoring means 20 includes an internal monitoring engine 22 and storage means 24. The storage means 24 stores the internal monitoring settings, a collection of settings for performing internal monitoring. Based on the internal monitoring settings stored in the storage means 24, the internal monitoring engine 22 accesses each element (server, module) constituting the information processing system 200 to be monitored and collects its internal state. When a monitoring agent 202 is resident in the information processing system 200, the internal monitoring engine 22 collects the internal states from the monitoring agent 202. The internal monitoring means 20 then generates internal state information based on the collected internal states and sends it to the failure determination means 30.

FIG. 4 shows an example of the internal monitoring settings 400 stored in the storage means 24. As shown in FIG. 4, the internal monitoring settings 400 describe, for each internal state to be collected (memory usage rate, log output, traffic, and so on), the setting values of items such as "monitoring target", "monitoring timing", "waiting time", "number of retries", "search string", and "output format".

The failure determination means 30 includes an external monitoring information conversion engine 32, an internal state information conversion engine 34, teacher data generation means 35, a learning engine 36, a determination engine 37, and storage means 38.

The external monitoring information conversion engine 32 converts the external monitoring information received from the external monitoring means 10 into external metrics data (described later) and accumulates it in the storage means 38.

The internal state information conversion engine 34 converts the internal state information received from the internal monitoring means 20 into internal metrics data (described later) and accumulates it in the storage means 38.

The teacher data generation means 35 generates teacher data based on the internal metrics data and the external metrics data accumulated in the storage means 38, and accumulates the teacher data in the storage means 38.

The learning engine 36 is a learning device that performs supervised machine learning, and is preferably a multilayer neural network. The learning engine 36 learns the failure determination condition using the teacher data accumulated in the storage means 38, and stores the learned failure determination condition in the storage means 38.

The determination engine 37 is a determination device having the same configuration as the learning engine 36. During operation, the failure determination condition read from the storage means 38 is set in the determination engine 37. The determination engine 37 receives the internal metrics data generated by the internal state information conversion engine 34 as input and outputs a determination result concerning a failure of the information processing system 200.

The functional configuration of the failure monitoring apparatus 100 of the present embodiment has been described above. In the present embodiment, the computer constituting the failure monitoring apparatus 100 functions as each of the means described above by executing a predetermined program.

Next, the processing executed by each of the functional means described above is explained step by step.

First, the processing executed by the external monitoring means 10 (external monitoring engine 12) is described with reference to the flowchart shown in FIG. 5.

First, in step 101, one scenario is read from the monitoring scenario 300. Specifically, from among the records of the monitoring scenario 300 (see FIG. 3), the records bearing the lowest scenario number are read.

In the subsequent step 102, access processing is executed on the information processing system 200 based on the record bearing the lowest processing number among the records read in step 101. Specifically, in accordance with the protocol stored in field 303 of that record, and using the option information stored in field 305 as needed, access processing is executed with the address stored in field 304 as the destination.

After waiting a predetermined time for a response from the information processing system 200, step 103 determines whether a response has been received. If a response has been received (step 103, Yes), the process proceeds to step 104; if not (step 103, No), the process proceeds to step 108.

In step 104, based on the response received from the information processing system 200, the success or failure of the response and the response time are accumulated in the external monitoring information (described later), and the process proceeds to step 105. In step 108, on the other hand, a "timeout error" is likewise accumulated in the external monitoring information, and the process proceeds to step 105.

FIG. 6 shows an example of the external monitoring information 500. The external monitoring information 500 is a data structure for accumulating the results of external monitoring (success or failure of each response and its response time) in time series, and is held in temporary storage. As shown in FIG. 6, the external monitoring information 500 includes a field 501 for storing the "scenario number", a field 502 for storing the "processing number", a field 503 for storing the "success/failure", and a field 504 for storing the "response time".

In step 104, a new record is added to the external monitoring information 500, and the "scenario number" and "processing number" of the record used for the access processing executed in step 102 are stored in fields 501 and 502 of the added record. The "expected response value" stored in that record is then compared with the received response: if the two match, success is stored in field 503; if they do not match, fail is stored in field 503. In addition, the time at which the response was received is stored in field 504 as the response time.

Similarly, in step 108, the "scenario number" and "processing number" of the record used for the access processing executed in step 102 are stored in fields 501 and 502, and fail is stored in field 503. In addition, the time at which the timeout occurred is stored in field 504 as the response time.

In the subsequent step 105, it is determined whether any process of the scenario read in step 101 remains unexecuted. If there is a next process (step 105, Yes), the process returns to step 102, and the same processing as described above is executed based on the record bearing the next lowest processing number among the records read in step 101. Steps 102 to 105 are repeated until all the processes constituting the scenario read in step 101 have been executed.

When step 105 determines that there is no next process (step 105, No), the process proceeds to step 106, which determines whether any scenario described in the monitoring scenario 300 remains unexecuted. If there is a next scenario (step 106, Yes), the process returns to step 101, and the records bearing the next lowest scenario number are read. Steps 101 to 106 are repeated until all the scenarios described in the monitoring scenario 300 have been executed. When step 106 determines that there is no next scenario (step 106, No), the process proceeds to step 107.

In step 107, the external monitoring information 500 is read from temporary storage and sent to the failure determination means 30, and the process ends. The external monitoring engine 12 executes the series of processes described above periodically (for example, every 5 minutes).
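The loop of steps 101 to 108 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `run_external_monitoring`, the dictionary keys, and the injected `access` callable (assumed to return the response, or `None` on timeout) are all hypothetical, and the actual network access is deliberately left out.

```python
def run_external_monitoring(scenarios, access, now):
    """Run every scenario once and return the external monitoring information.

    scenarios: {scenario_no: [ {"process_no": int, "expected": str, ...}, ... ]}
    access:    hypothetical callable (scenario_no, step) -> response or None on timeout
    now:       callable returning the current time
    """
    records = []  # corresponds to the external monitoring information 500
    for scenario_no in sorted(scenarios):                      # steps 101 and 106
        steps = sorted(scenarios[scenario_no], key=lambda s: s["process_no"])
        for step in steps:                                     # steps 102 and 105
            response = access(scenario_no, step)               # access and wait
            if response is None:                               # step 103 No -> step 108
                success = False                                # timeout error
            else:                                              # step 103 Yes -> step 104
                success = (response == step["expected"])
            records.append({
                "scenario_no": scenario_no,                    # field 501
                "process_no": step["process_no"],              # field 502
                "success": success,                            # field 503
                "time": now(),                                 # field 504
            })
    return records                                             # step 107: hand over


# Illustrative run with canned responses instead of real network access.
scenarios = {
    2: [{"process_no": 1, "expected": "ok"}],
    1: [{"process_no": 2, "expected": "ok"}, {"process_no": 1, "expected": "ok"}],
}
replies = {(1, 1): "ok", (1, 2): "ng", (2, 1): None}
records = run_external_monitoring(
    scenarios,
    access=lambda no, step: replies[(no, step["process_no"])],
    now=lambda: 0,
)
```

Passing `access` and `now` in as parameters keeps the sketch testable without a network; a periodic scheduler (for example, every 5 minutes) would call the function repeatedly.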

The processing executed by the external monitoring means 10 has been described above. Next, the processing executed by the internal monitoring means 20 (internal monitoring engine 22) is described with reference to the flowchart shown in FIG. 7.

First, in step 201, the internal monitoring settings 400 (see FIG. 4) are read from the storage means 24.

In the subsequent step 202, internal monitoring processing for acquiring the internal state is performed, at the configured monitoring timing, on each of the monitoring targets (modules) described in the internal monitoring settings 400.

After waiting a predetermined time for a response from each module, step 203 determines whether the internal state has been acquired. If the internal state has been acquired (step 203, Yes), the process proceeds to step 204; if not (step 203, No), the process proceeds to step 206.

In step 204, the internal state acquired from the monitoring target (module) is accumulated in the internal state information (described later), and the process proceeds to step 205. In step 206, on the other hand, a "timeout error" is likewise accumulated in the internal state information, and the process proceeds to step 205.

FIG. 8 shows an example of the internal state information 600. The internal state information 600 is a data structure for accumulating, in time series, the internal states acquired by the internal monitoring processing, and is held in temporary storage. As shown in FIG. 8, the internal state information 600 includes a field 601 for storing the "monitoring target", a field 602 for storing the "type of internal state", a field 603 for storing the "value of internal state", and a field 604 for storing the "acquisition time".

In step 204, a new record is added to the internal state information 600: the monitoring target on which the internal monitoring processing of step 202 was executed is stored in field 601, the type of internal state acquired from that monitoring target is stored in field 602, the value of that internal state is stored in field 603, and the time at which the internal state was acquired is stored in field 604. Similarly, in step 206, the monitoring target on which the internal monitoring processing of step 202 was executed is stored in field 601, the type of internal state that was to be acquired from that monitoring target is stored in field 602, a value designated for each monitoring target to signify an error (a zero value, NULL value, NoData value, Error value, or the like) is stored in field 603, and the time at which the timeout occurred is stored in field 604.

In the subsequent step 205, the internal state information 600 is read from temporary storage and sent to the failure determination means 30. Thereafter, steps 202 to 205 are executed repeatedly.
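One pass of steps 202 to 206 can be sketched as below. The sketch rests on assumptions: `collect_internal_states`, the dictionary keys, the injected `poll` callable (assumed to return the internal state, or `None` on timeout), and the default error value `"Error"` are hypothetical names chosen for illustration; the source only says that an error value is designated per monitoring target.

```python
def collect_internal_states(settings, poll, now, error_values=None):
    """One polling pass; returns records shaped like the internal state information 600.

    settings:     list of {"target": str, "kind": str} taken from the monitoring settings
    poll:         hypothetical callable (target, kind) -> value, or None on timeout
    now:          callable returning the current time
    error_values: optional {target: value} designating the error value per target
    """
    error_values = error_values or {}
    records = []
    for setting in settings:                                   # step 202
        value = poll(setting["target"], setting["kind"])
        if value is None:                                      # step 203 No -> step 206
            # Store the error value designated for this target (assumed default).
            value = error_values.get(setting["target"], "Error")
        records.append({                                       # step 204 or 206
            "target": setting["target"],                       # field 601
            "kind": setting["kind"],                           # field 602
            "value": value,                                    # field 603
            "time": now(),                                     # field 604
        })
    return records                                             # step 205: hand over


# Illustrative run: web01 answers, db01 times out.
settings = [{"target": "web01", "kind": "cpu"}, {"target": "db01", "kind": "disk"}]
states = collect_internal_states(
    settings,
    poll=lambda target, kind: 42.0 if target == "web01" else None,
    now=lambda: 100,
    error_values={"db01": 0},
)
```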

When the monitoring agent 202 is resident in the information processing system 200, the internal monitoring means 20 executes steps 207 to 209 in parallel with steps 202 to 205 described above.

First, in step 207, the internal monitoring means 20 waits for an internal state to be transmitted from the monitoring agent 202 (step 207, No). When an internal state is acquired from the monitoring agent 202 (step 207, Yes), the process proceeds to step 208.

In the subsequent step 208, the internal state acquired from the monitoring agent 202 is accumulated in the internal state information 600 by the same procedure as described above, and the process proceeds to step 209.

In the subsequent step 209, the internal state information 600 is read from temporary storage and sent to the failure determination means 30. Thereafter, steps 207 to 209 are executed repeatedly.

The processing executed by the internal monitoring means 20 has been described above. Next, the processing executed by the failure determination means 30 during machine learning is described with reference to the flowchart shown in FIG. 9(a).

First, in step 301, the external monitoring information conversion engine 32 generates external metrics data by converting the values of each record of the external monitoring information 500 received from the external monitoring means 10 into numerical metrics. Specifically, for each record of the external monitoring information 500, an integer whose tens digit is the value of field 501 (scenario number) and whose ones digit is the value of field 502 (processing number) is taken as "Metric 1", and the binary value corresponding to the value of field 503 (success: 1 / fail: 0) is taken as "Metric 2". The value of field 504 (response time) is then associated with these two metrics. Note that this mapping to digits is merely an example for explanation; in practice, an appropriate mapping is chosen according to the number of scenarios and the number of processes.

図１０（ａ）は、上述した手順で生成される外部メトリクスデータ７００を例示的に示す。図１０（ａ）に示すように、外部メトリクスデータ７００においては、「メトリクス１」および「メトリクス２」が時刻（すなわち、外部監視の応答時刻）に対応付けられている。 FIG. 10(a) shows an example of the external metrics data 700 generated by the above procedure. As shown in FIG. 10(a), in the external metrics data 700, “metric 1” and “metric 2” are associated with a time (that is, the response time of external monitoring).
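The conversion in step 301 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `ExternalRecord` class, its attribute names, and the sample timestamps are assumptions introduced here, and the tens/ones digit mapping is the illustrative mapping described above, which only works while scenario and process numbers stay below 10.

```python
from dataclasses import dataclass


@dataclass
class ExternalRecord:
    """One record of external monitoring information 500 (attribute names are hypothetical)."""
    scenario_no: int      # field 501: scenario number
    process_no: int       # field 502: process number
    result: str           # field 503: "success" or "fail"
    response_time: str    # field 504: response time


def to_external_metrics(record: ExternalRecord) -> dict:
    """Convert one record into a row of external metrics data 700."""
    # Metric 1: scenario number in the tens digit, process number in the ones digit.
    metric1 = record.scenario_no * 10 + record.process_no
    # Metric 2: binary success flag (success: 1 / fail: 0).
    metric2 = 1 if record.result == "success" else 0
    return {"time": record.response_time, "metric1": metric1, "metric2": metric2}


# Sample records invented for illustration only.
records = [
    ExternalRecord(1, 2, "success", "2017-03-17T10:00:00"),
    ExternalRecord(2, 1, "fail", "2017-03-17T10:00:05"),
]
external_metrics = [to_external_metrics(r) for r in records]
```

With more than nine scenarios or processes, a wider encoding (for example `scenario_no * 100 + process_no`) would be needed, in line with the remark that the mapping should follow the number of scenarios and processes.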

続くステップ３０２では、内部状態情報変換エンジン３４が、内部監視手段２０から受領した内部状態情報６００の各レコードの値を数値のメトリクスに変換することにより、内部メトリクスデータを生成する。具体的には、フィールド６０４の値（取得時刻）が一致するＮ個（Ｎは１以上の整数）のレコードのフィールド６０３の値（内部状態の値）を、それぞれ、「メトリクス１」、「メトリクス２」、「メトリクス３」、「メトリクス４」…「メトリクスＮ」とした上で、Ｎ個のメトリクスにフィールド６０４の値（取得時刻）を対応付ける。 In the subsequent step 302, the internal state information conversion engine 34 generates internal metrics data by converting the values of each record of the internal state information 600 received from the internal monitoring means 20 into numerical metrics. Specifically, the values of field 603 (internal state values) of the N records (N is an integer of 1 or more) that share the same value of field 604 (acquisition time) are taken as “metric 1”, “metric 2”, “metric 3”, “metric 4”, ..., “metric N”, respectively, and the value of field 604 (acquisition time) is then associated with these N metrics.

図１０（ｂ）は、上述した手順で生成される内部メトリクスデータ８００を例示的に示す。図１０（ｂ）に示すように、内部メトリクスデータ８００においては、Ｎ個のメトリクスが時刻（すなわち、内部状態の取得時刻）に対応付けられている。 FIG. 10B exemplarily shows internal metrics data 800 generated by the above-described procedure. As shown in FIG. 10B, in the internal metrics data 800, N metrics are associated with time (that is, the acquisition time of the internal state).
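A minimal sketch of the grouping in step 302 follows. The tuple layout is an assumption made for this example: the first two elements stand in for fields 601 and 602, whose meaning is not restated in this section, while the last two correspond to field 603 (internal state value) and field 604 (acquisition time). The sketch also assumes that records for one acquisition time always arrive in the same order, so that position k always denotes the same metric.

```python
from collections import defaultdict


def to_internal_metrics(records):
    """Group internal state records by acquisition time (field 604).

    records: iterable of (field_601, field_602, value, acquired_at) tuples.
    Returns a list of rows, one per acquisition time, each holding the
    N internal state values (field 603) as metric 1 ... metric N.
    """
    grouped = defaultdict(list)
    for _field_601, _field_602, value, acquired_at in records:
        grouped[acquired_at].append(float(value))
    # One row of internal metrics data 800 per acquisition time, in time order.
    return [{"time": t, "metrics": grouped[t]} for t in sorted(grouped)]


# Sample values invented for illustration; times are seconds since some epoch.
sample = [
    ("web01", "cpu", 35, 100),
    ("web01", "mem", 70, 100),
    ("web01", "cpu", 90, 105),
    ("web01", "mem", 95, 105),
]
internal_metrics = to_internal_metrics(sample)
```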

続くステップ３０３では、教師データ生成手段３５が、内部メトリクスデータ８００に含まれる１のレコードの値を入力とし、外部メトリクスデータ７００に含まれる１のレコードの値を出力とする教師データを生成する。 In the subsequent step 303, the teacher data generation means 35 generates teacher data having the value of one record included in the internal metrics data 800 as an input and the value of one record included in the external metrics data 700 as an output.

具体的には、外部メトリクスデータ７００の各レコードに格納された時刻と内部メトリクスデータ８００の各レコードに格納された時刻を比較し、外部メトリクスデータ７００の１のレコードの時刻から見て、直近の時刻が格納された内部メトリクスデータ８００のレコードを選出し、この２つのレコードの値の組を教師データとする。 Specifically, the time stored in each record of the external metrics data 700 is compared with the time stored in each record of the internal metrics data 800. For one record of the external metrics data 700, the record of the internal metrics data 800 whose stored time is nearest to that record's time is selected, and the pair of values from these two records is used as teacher data.
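The pairing in step 303 might look like the sketch below. It assumes numeric timestamps and interprets "nearest" as the smallest absolute time difference; if the intended meaning is the latest internal record at or before the external time, the selection rule would need to change accordingly. The row layouts (`time`, `metrics`, `metric1`, `metric2`) are hypothetical names used only in this example.

```python
def make_teacher_data(external_rows, internal_rows):
    """Pair each external metrics row with the internal metrics row nearest in time.

    Returns a list of (input, output) pairs:
      input  = the N internal metrics of the selected internal row
      output = (metric 1, metric 2) of the external row
    """
    pairs = []
    for ext in external_rows:
        # Assumption: "nearest" means minimal absolute time difference.
        nearest = min(internal_rows, key=lambda row: abs(row["time"] - ext["time"]))
        pairs.append((nearest["metrics"], (ext["metric1"], ext["metric2"])))
    return pairs


# Sample rows invented for illustration.
internal_rows = [
    {"time": 4, "metrics": [30.0, 60.0]},
    {"time": 9, "metrics": [92.0, 97.0]},
    {"time": 12, "metrics": [40.0, 65.0]},
]
external_rows = [{"time": 10, "metric1": 12, "metric2": 0}]
teacher_data = make_teacher_data(external_rows, internal_rows)
```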

なお、本実施形態では、別法として、外部メトリクスデータ７００の１のレコードの時刻を起点とした過去の所定期間内（例えば、数秒内）の時刻が格納された内部メトリクスデータ８００のＭ個（Ｍは２以上の整数）のレコードを選出するようにしてもよい。この場合、選出したＭ個のレコードのそれぞれに含まれるＮ個のメトリクスのそれぞれの値について、適切な代表値（平均値、中央値、最大値、最小値など）を算出し、外部メトリクスデータ７００の１のレコードの値と算出したＮ個の代表値の組を教師データとする。すなわち、本実施形態では、外部メトリクスデータ７００に含まれる１の値を出力とし、当該値に時間的に対応する内部メトリクスデータの値を入力とすればよい。 In this embodiment, as an alternative, M records (M is an integer of 2 or more) of the internal metrics data 800 may be selected whose stored times fall within a predetermined past period (for example, within a few seconds) counted back from the time of one record of the external metrics data 700. In this case, an appropriate representative value (mean, median, maximum, minimum, or the like) is calculated for each of the N metrics across the M selected records, and the pair of the value of the one record of the external metrics data 700 and the N calculated representative values is used as teacher data. In other words, in this embodiment it suffices that one value included in the external metrics data 700 is the output and the value of the internal metrics data that temporally corresponds to that value is the input.
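The alternative with a look-back window can be sketched as follows. The function name, the default window length, and the row layout are assumptions for illustration; the reducer can be any of the representative values mentioned above (mean, median, maximum, minimum).

```python
from statistics import mean


def representative_input(internal_rows, ext_time, window_seconds=5.0, reducer=mean):
    """Reduce the internal rows inside the look-back window to N representative values.

    Selects rows with ext_time - window_seconds <= time <= ext_time and applies
    `reducer` column-wise, i.e. separately to each of the N metrics.
    Returns None when fewer than two rows fall inside the window (M must be >= 2).
    """
    selected = [
        row["metrics"]
        for row in internal_rows
        if ext_time - window_seconds <= row["time"] <= ext_time
    ]
    if len(selected) < 2:
        return None
    # zip(*selected) transposes M rows of N metrics into N columns of M values.
    return [reducer(column) for column in zip(*selected)]


# Sample rows invented for illustration.
rows = [
    {"time": 6.0, "metrics": [10.0, 1.0]},
    {"time": 8.0, "metrics": [20.0, 3.0]},
    {"time": 9.0, "metrics": [30.0, 5.0]},
]
mean_input = representative_input(rows, ext_time=10.0, window_seconds=3.0)
max_input = representative_input(rows, ext_time=10.0, window_seconds=3.0, reducer=max)
```

Using the maximum rather than the mean may be preferable for metrics where a short spike matters more than the average level, but that choice depends on the monitored system.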

続くステップ３０４では、学習エンジン３６が、先のステップ３０３で生成した教師データを使用して機械学習を実行する。図１１は、多層のニューラルネットワークとして構成された学習エンジン３６が、内部メトリクスデータ８００の１の値を入力とし、外部メトリクスデータ７００の１の値を出力とする教師データを使用して機械学習が実行される様子を模式的に示す。この場合、機械学習の実行により、ニューラルネットワークの隠れ層に障害判定条件が取得される。ここで、本実施形態における障害判定条件とは、下記（１）〜（４）の情報のセットを意味する。なお、下記（１）、（２）は、人為的に決定される設計事項であり、上述した教師データを使用して機械学習によって下記（３）、（４）の最適値が自動生成されることになる。 In the subsequent step 304, the learning engine 36 performs machine learning using the teacher data generated in step 303. FIG. 11 schematically shows the learning engine 36, configured as a multilayer neural network, performing machine learning with teacher data whose input is one value of the internal metrics data 800 and whose output is one value of the external metrics data 700. In this case, executing the machine learning causes the failure determination condition to be acquired in the hidden layers of the neural network. Here, the failure determination condition in this embodiment means the set of information items (1) to (4) below. Items (1) and (2) are design matters decided by a human designer, whereas the optimal values of items (3) and (4) are generated automatically by machine learning using the teacher data described above.
（１）ニューラルネットワークのネットワーク構造 (1) Network structure of the neural network
（２）ノードの活性化関数 (2) Activation function of the nodes
（３）重み値 (3) Weight values
（４）バイアス値 (4) Bias values
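To make items (1) to (4) concrete, the sketch below trains a very small network in pure Python. It is an illustration under several assumptions, not the embodiment's learning engine: a single hidden layer with sigmoid activations, a single output that predicts only metric 2 (success: 1 / fail: 0), stochastic gradient descent on a cross-entropy loss, internal metrics already scaled to the range 0 to 1, and invented sample data. The dictionary keys are hypothetical names.

```python
import math
import random


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to keep math.exp within range
    return 1.0 / (1.0 + math.exp(-x))


def forward(cond, x):
    """Return (hidden activations, output) for one input vector."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(cond["w_hidden"], cond["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(cond["w_out"], hidden)) + cond["b_out"])
    return hidden, out


def train(teacher_data, n_hidden=3, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(teacher_data[0][0])
    cond = {
        "structure": [n_in, n_hidden, 1],   # (1) chosen by the designer
        "activation": "sigmoid",            # (2) chosen by the designer
        "w_hidden": [[rng.uniform(-1, 1) for _ in range(n_in)]
                     for _ in range(n_hidden)],            # (3) learned
        "w_out": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,                      # (4) learned
        "b_out": 0.0,
    }
    for _ in range(epochs):
        for x, y in teacher_data:
            hidden, out = forward(cond, x)
            d_out = out - y  # cross-entropy gradient at the output pre-activation
            for j, h in enumerate(hidden):
                # Backpropagate through hidden node j before updating w_out[j].
                d_hid = d_out * cond["w_out"][j] * h * (1.0 - h)
                cond["w_out"][j] -= lr * d_out * h
                for i, xi in enumerate(x):
                    cond["w_hidden"][j][i] -= lr * d_hid * xi
                cond["b_hidden"][j] -= lr * d_hid
            cond["b_out"] -= lr * d_out
    return cond


# Invented teacher data: input = two scaled internal metrics, output = metric 2.
teacher_data = [
    ([0.10, 0.00], 1), ([0.20, 0.10], 1), ([0.30, 0.00], 1),
    ([0.90, 0.80], 0), ([0.80, 0.90], 0), ([0.95, 0.70], 0),
]
condition = train(teacher_data)
```

The returned dictionary plays the role of the failure determination condition: the first two entries are fixed before training, and the weight and bias entries are what training produces.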

続くステップ３０５では、学習エンジン３６が、取得された障害判定条件を記憶手段３８に保存して、処理を終了する。 In the subsequent step 305, the learning engine 36 stores the acquired failure determination condition in the storage unit 38, and ends the process.
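Step 305 only requires that the learned condition be persisted so that the determination engine can load it later. One possible sketch, assuming the condition is a plain dictionary of lists and numbers and that JSON on the local file system stands in for the storage means 38 (both are assumptions of this example, as are the key names and sample values):

```python
import json
import os
import tempfile


def save_condition(condition, path):
    """Write the failure determination condition to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(condition, f)


def load_condition(path):
    """Read a previously saved failure determination condition."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hand-written condition used only to demonstrate the round trip.
condition = {
    "structure": [2, 1, 1],
    "activation": "sigmoid",
    "w_hidden": [[-6.0, -6.0]],
    "b_hidden": [5.0],
    "w_out": [8.0],
    "b_out": -4.0,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "failure_condition.json")
    save_condition(condition, path)
    restored = load_condition(path)
```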

以上、障害判定手段３０が機械学習時に実行する処理の内容を説明してきたが、次に、障害判定手段３０が運用時に実行する処理の内容を図９（ｂ）に示すフローチャートに基づいて説明する。 The processing that the failure determination means 30 executes during machine learning has been described above. Next, the processing that the failure determination means 30 executes during operation is described with reference to the flowchart shown in FIG. 9(b).

運用時においては、学習エンジン３６と同じ多層のニューラルネットワークとして構成された判定エンジン３７に対して、学習によって取得された障害判定条件が設定されていることが前提となる。 At the time of operation, it is assumed that the failure determination condition acquired by learning is set for the determination engine 37 configured as the same multi-layer neural network as the learning engine 36.

まずステップ４０１では、内部状態情報変換エンジン３４が、機械学習時と同様の手順で、内部監視手段２０から受領した内部状態情報６００に基づいて内部メトリクスデータを生成する。具体的には、受領した内部状態情報６００の各レコードのフィールド６０４の値（取得時刻）が一致するＮ個のレコードのフィールド６０３の値（内部状態の値）を、それぞれ、「メトリクス１」、「メトリクス２」、「メトリクス３」、「メトリクス４」…「メトリクスＮ」とする。なお、運用時においては、Ｎ個のメトリクスに対してフィールド６０４の値（取得時刻）を対応付ける必要はない。 First, in step 401, the internal state information conversion engine 34 generates internal metrics data from the internal state information 600 received from the internal monitoring means 20, using the same procedure as during machine learning. Specifically, the values of field 603 (internal state values) of the N records of the received internal state information 600 that share the same value of field 604 (acquisition time) are taken as “metric 1”, “metric 2”, “metric 3”, “metric 4”, ..., “metric N”, respectively. During operation, the value of field 604 (acquisition time) does not need to be associated with the N metrics.

続くステップ４０２では、内部状態情報変換エンジン３４が、先のステップ４０１で生成した内部メトリクスデータを判定エンジン３７に入力する。 In the subsequent step 402, the internal state information conversion engine 34 inputs the internal metrics data generated in the previous step 401 to the determination engine 37.

続くステップ４０３では、判定エンジン３７が判定結果を出力して、処理を終了する。ここで、ステップ４０３では、「メトリクス１（シナリオ番号＋処理番号）」と、「メトリクス２（success：１／fail：０）」が判定結果として出力される。仮に、ステップ４０３で、メトリクス２＝１が出力された場合、監視対象の情報処理システム２００が正常状態にあることが推定される。一方、仮に、ステップ４０３で、メトリクス２＝０が出力された場合、監視対象の情報処理システム２００に障害が発生していることが推定される。 In the subsequent step 403, the determination engine 37 outputs a determination result, and the process ends. In step 403, “metric 1 (scenario number + process number)” and “metric 2 (success: 1 / fail: 0)” are output as the determination result. If metric 2 = 1 is output in step 403, the monitored information processing system 200 is estimated to be in a normal state. Conversely, if metric 2 = 0 is output in step 403, a failure is estimated to have occurred in the monitored information processing system 200.
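Steps 401 to 403 amount to a forward pass through a network configured with the learned condition, followed by an interpretation of metric 2. The sketch below assumes a one-hidden-layer sigmoid network, a single output for metric 2 only, a 0.5 threshold for turning the output into a binary value, and hand-set weights chosen purely for illustration; none of these details come from the embodiment.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_metric2(cond, internal_metrics):
    """Forward pass: internal metrics in, estimated metric 2 (0 or 1) out."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, internal_metrics)) + b)
              for ws, b in zip(cond["w_hidden"], cond["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(cond["w_out"], hidden)) + cond["b_out"])
    return 1 if out >= 0.5 else 0  # assumed threshold


def judge(cond, internal_metrics):
    """Interpret metric 2 as in step 403: 1 means normal, 0 means failure."""
    return "normal" if predict_metric2(cond, internal_metrics) == 1 else "failure"


# Hand-set condition (illustrative values, not learned from real data).
condition = {
    "w_hidden": [[-6.0, -6.0]],
    "b_hidden": [5.0],
    "w_out": [8.0],
    "b_out": -4.0,
}

low_load = judge(condition, [0.1, 0.0])    # small internal metric values
high_load = judge(condition, [0.9, 0.8])   # large internal metric values
```

No external access is made here: the decision rests only on the internal metrics, which is the point of the operation phase.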

以上、説明したように、本実施形態によれば、運用中は、内部監視の結果のみに基づいて情報処理システムの障害検知と総合的な影響度判定を行うことができるようになるので、外部監視に伴うコスト（監視システムの維持コストや監視対象に対するアクセス負荷）の低減が期待できる。また、本実施形態では、障害判定条件が自動的に学習されるので、従来の内部監視における煩雑な手間（各監視対象の内部状態に係る閾値の個別的な設定・調整）を省くことができるようになる。 As described above, according to this embodiment, failure detection and comprehensive impact assessment of the information processing system can be performed during operation based solely on the results of internal monitoring, so a reduction in the costs associated with external monitoring (the maintenance cost of the monitoring system and the access load on the monitored target) can be expected. In addition, because the failure determination condition is learned automatically in this embodiment, the tedious work required in conventional internal monitoring (individually setting and adjusting thresholds for the internal state of each monitored target) can be eliminated.

以上、本実施形態の障害監視装置１００について説明してきたが、本実施形態では、図２に示した各機能手段を１台のコンピュータ上で実現してもよいし、各機能手段を適切な単位でネットワーク上の２以上のコンピュータに分散配置することによって、ネットワークシステムとして実現してもよい。 The fault monitoring apparatus 100 of this embodiment has been described above. In this embodiment, the functional means shown in FIG. 2 may all be implemented on a single computer, or they may be distributed in appropriate units across two or more computers on a network and thereby implemented as a network system.

図１２は、障害監視装置１００と同等の機能を有するネットワークシステムとして構成された障害監視システム１００ｓを例示的に示す。障害監視システム１００ｓは、上述した外部監視手段１０と同等の機能を有する外部監視システム１０ｓと、上述した内部監視手段２０と同等の機能を有する内部監視システム２０ｓと、上述した障害判定手段３０と同等の機能を有する障害判定システム３０ｓとを含み、各システム１０ｓ、２０ｓ、３０ｓは、ネットワーク５０を介して相互通信可能に接続されている。 FIG. 12 shows an example of a failure monitoring system 100s configured as a network system having functions equivalent to those of the fault monitoring apparatus 100. The failure monitoring system 100s includes an external monitoring system 10s having functions equivalent to those of the external monitoring means 10 described above, an internal monitoring system 20s having functions equivalent to those of the internal monitoring means 20 described above, and a failure determination system 30s having functions equivalent to those of the failure determination means 30 described above. The systems 10s, 20s, and 30s are connected via a network 50 so that they can communicate with one another.

最後に、図１３に基づいて本実施形態の障害監視装置１００またはこれと同等の機能を有するネットワークシステムを構成するコンピュータのハードウェア構成について説明する。 Finally, based on FIG. 13, the hardware configuration of the computer constituting the fault monitoring apparatus 100 of the present embodiment or a network system having functions equivalent thereto will be described.

図１３に示すように、本実施形態の障害監視装置１００またはこれと同等の機能を有するネットワークシステムを構成するコンピュータは、装置全体の動作を制御するプロセッサ１０１と、ブートプログラムやファームウェアプログラムなどを保存するＲＯＭ１０２と、プログラムの実行空間を提供するＲＡＭ１０３と、コンピュータを上述した各機能手段として機能させるためのプログラムやオペレーティングシステム（ＯＳ）等を保存するための補助記憶装置１０４と、外部装置を接続するための入出力インタフェース１０５と、ネットワーク５０に接続するためのネットワーク・インターフェース１０６とを備えている。 As shown in FIG. 13, a computer constituting the fault monitoring apparatus 100 of this embodiment, or a network system having equivalent functions, includes: a processor 101 that controls the operation of the entire apparatus; a ROM 102 that stores a boot program, firmware programs, and the like; a RAM 103 that provides execution space for programs; an auxiliary storage device 104 that stores an operating system (OS) and the programs that cause the computer to function as each of the functional means described above; an input/output interface 105 for connecting external devices; and a network interface 106 for connecting to the network 50.

なお、上述した実施形態の各機能は、Ｃ、Ｃ＋＋、Ｃ＃、Ｊａｖａ（登録商標）などで記述されたプログラムにより実現でき、本実施形態のプログラムは、ハードディスク装置、ＣＤ−ＲＯＭ、ＭＯ、ＤＶＤ、フレキシブルディスク、ＥＥＰＲＯＭ、ＥＰＲＯＭなどの記録媒体に格納して頒布することができ、また他の装置が可能な形式でネットワークを介して伝送することができる。 Each function of the embodiment described above can be implemented by a program written in C, C++, C#, Java (registered trademark), or the like. The program of this embodiment can be stored and distributed on a recording medium such as a hard disk device, CD-ROM, MO, DVD, flexible disk, EEPROM, or EPROM, and can also be transmitted over a network in a format usable by other devices.

以上、本発明について実施形態をもって説明してきたが、本発明は上述した実施形態に限定されるものではなく、当業者が推考しうる実施態様の範囲内において、本発明の作用・効果を奏する限り、本発明の範囲に含まれるものである。 As described above, the present invention has been described with the embodiment. However, the present invention is not limited to the above-described embodiment, and as long as the operations and effects of the present invention are exhibited within the scope of embodiments that can be considered by those skilled in the art. It is included in the scope of the present invention.

１０…外部監視手段
１２…外部監視エンジン
１４…記憶手段
２０…内部監視手段
２２…内部監視エンジン
２４…記憶手段
３０…障害判定手段
３２…外部監視情報変換エンジン
３４…内部状態情報変換エンジン
３５…教師データ生成手段
３６…学習エンジン
３７…判定エンジン
３８…記憶手段
５０…ネットワーク
１００…障害監視装置
１０ｓ…外部監視システム
２０ｓ…内部監視システム
３０ｓ…障害判定システム
１００ｓ…障害監視システム
１０１…プロセッサ
１０２…ＲＯＭ
１０３…ＲＡＭ
１０４…補助記憶装置
１０５…入出力インタフェース
１０６…ネットワーク・インターフェース
２００…情報処理システム
２０２…監視エージェント
３００…監視シナリオ
３０１，３０２，３０３，３０４，３０５，３０６…フィールド
４００…内部監視設定
５００…外部監視情報
５０１，５０２，５０３，５０４…フィールド
６００…内部状態情報
６０１，６０２，６０３，６０４…フィールド
７００…外部メトリクスデータ
８００…内部メトリクスデータ

DESCRIPTION OF SYMBOLS
10 ... External monitoring means
12 ... External monitoring engine
14 ... Storage means
20 ... Internal monitoring means
22 ... Internal monitoring engine
24 ... Storage means
30 ... Fault determination means
32 ... External monitoring information conversion engine
34 ... Internal state information conversion engine
35 ... Teacher data generation means
36 ... Learning engine
37 ... Determination engine
38 ... Storage means
50 ... Network
100 ... Failure monitoring device
10s ... External monitoring system
20s ... Internal monitoring system
30s ... Failure determination system
100s ... Failure monitoring system
101 ... Processor
102 ... ROM
103 ... RAM
104 ... Auxiliary storage device
105 ... Input/output interface
106 ... Network interface
200 ... Information processing system
202 ... Monitoring agent
300 ... Monitoring scenario
301, 302, 303, 304, 305, 306 ... Field
400 ... Internal monitoring setting
500 ... External monitoring information
501, 502, 503, 504 ... Field
600 ... Internal state information
601, 602, 603, 604 ... Field
700 ... External metrics data
800 ... Internal metrics data

特開２０１２−１４１８０２号公報 JP 2012-141802 A

Claims

A failure monitoring device for detecting a failure in an information processing system, comprising:
external monitoring means for periodically accessing the information processing system and accumulating, in time series, the success or failure of each response;
internal monitoring means for accumulating, in time series, the internal state of each element constituting the information processing system; and
failure determination means for determining a failure of the information processing system,
wherein the failure determination means includes:
means for converting the time-series information on the success or failure of the response into external metrics data;
means for converting the time-series information on the internal state into internal metrics data;
means for generating teacher data whose output is a value of the external metrics data and whose input is the value of the internal metrics data that temporally corresponds to that value;
a learning device that machine-learns, using the teacher data, a failure determination condition for determining a failure of the information processing system; and
a determination unit in which the failure determination condition is set, and which receives the internal metrics data as input and outputs a determination result regarding a failure of the information processing system.

The failure monitoring device according to claim 1,
wherein the means for generating the teacher data generates teacher data whose output is one value of the external metrics data and whose input is a representative value of two or more values of the internal metrics data that temporally correspond to that value.

The failure monitoring device according to claim 1 or 2,
wherein the failure monitoring device holds a monitoring scenario that describes, in order of execution, pairs of an access simulating a user's access operation and an expected response value, and
the external monitoring means sequentially executes each access on the information processing system and determines the success or failure of the response by comparing the response to the access with the expected response value for that access.

A failure monitoring device for detecting a failure in an information processing system, comprising:
internal monitoring means for accumulating, in time series, the internal state of each element constituting the information processing system; and
failure determination means for determining a failure of the information processing system,
wherein the failure determination means includes:
means for converting the internal state into internal metrics data; and
a determination unit in which a failure determination condition acquired by machine learning using predetermined teacher data is set, and which receives the internal metrics data as input and outputs a determination result regarding a failure of the information processing system,
wherein the predetermined teacher data is teacher data whose output is a value of external metrics data and whose input is the value of internal metrics data that temporally corresponds to that value,
the external metrics data is time-series information of metrics relating to the success or failure of responses to accesses periodically executed on a predetermined information processing system, and
the internal metrics data is time-series information of metrics relating to the internal state of each element constituting the predetermined information processing system.

A failure monitoring system for detecting a failure in an information processing system, comprising:
external monitoring means for periodically accessing the information processing system and accumulating, in time series, the success or failure of each response;
internal monitoring means for accumulating, in time series, the internal state of each element constituting the information processing system; and
failure determination means for determining a failure of the information processing system,
wherein the failure determination means includes:
means for converting the time-series information on the success or failure of the response into external metrics data;
means for converting the time-series information on the internal state into internal metrics data;
means for generating teacher data whose output is a value of the external metrics data and whose input is the value of the internal metrics data that temporally corresponds to that value; and
a learning device that machine-learns, using the teacher data, a failure determination condition for determining a failure of the information processing system.

A failure monitoring system for detecting a failure in an information processing system, comprising:
internal monitoring means for accumulating, in time series, the internal state of each element constituting the information processing system; and
failure determination means for determining a failure of the information processing system,
wherein the failure determination means includes:
means for converting the internal state into internal metrics data; and
a determination unit in which a failure determination condition acquired by machine learning using predetermined teacher data is set, and which receives the internal metrics data as input and outputs a determination result regarding a failure of the information processing system,
wherein the predetermined teacher data is teacher data whose output is a value of external metrics data and whose input is the value of internal metrics data that temporally corresponds to that value,
the external metrics data is time-series information of metrics relating to the success or failure of responses to accesses periodically executed on a predetermined information processing system, and
the internal metrics data is time-series information of metrics relating to the internal state of each element constituting the predetermined information processing system.

A program for causing a computer that detects a failure in an information processing system to function as:
external monitoring means for periodically accessing the information processing system and accumulating, in time series, the success or failure of each response;
internal monitoring means for accumulating, in time series, the internal state of each element constituting the information processing system; and
failure determination means for determining a failure of the information processing system,
wherein the failure determination means includes:
means for converting the time-series information on the success or failure of the response into external metrics data;
means for converting the time-series information on the internal state into internal metrics data;
means for generating teacher data whose output is a value of the external metrics data and whose input is the value of the internal metrics data that temporally corresponds to that value; and
a learning device that machine-learns, using the teacher data, a failure determination condition for determining a failure of the information processing system.

A program for causing a computer that detects a failure in an information processing system to function as:
internal monitoring means for accumulating, in time series, the internal state of each element constituting the information processing system; and
failure determination means for determining a failure of the information processing system,
wherein the failure determination means includes:
means for converting the internal state into internal metrics data; and
a determination unit in which a failure determination condition acquired by machine learning using predetermined teacher data is set, and which receives the internal metrics data as input and outputs a determination result regarding a failure of the information processing system,
wherein the predetermined teacher data is teacher data whose output is a value of external metrics data and whose input is the value of internal metrics data that temporally corresponds to that value,
the external metrics data is time-series information of metrics relating to the success or failure of responses to accesses periodically executed on a predetermined information processing system, and
the internal metrics data is time-series information of metrics relating to the internal state of each element constituting the predetermined information processing system.