JP2002342182A

JP2002342182A - Operation management support system for network systems

Info

Publication number: JP2002342182A
Application number: JP2001150272A
Authority: JP
Inventors: Hirokazu Ikeda; 博和池田; Tsugito Miyahara; 次人宮原; Masaharu Akatsu; 雅晴赤津
Original assignee: Hitachi Ltd; Hitachi INS Software Ltd
Current assignee: Hitachi Ltd; Hitachi INS Software Ltd
Priority date: 2001-05-21
Filing date: 2001-05-21
Publication date: 2002-11-29

Abstract

(57) [Abstract]

[Problem] To report a list of the components that cause a performance bottleneck or failure, so that the cause can be identified early.

[Solution] The system comprises a managed system 200, a network 000, and an operation management server 100. Operation information for each component, collected from the managed system 200 via an operation information collection adapter 120, is stored in an operation information storage unit 140. An analysis operation unit 150 selects one item of operation information, either arbitrarily or because it has exceeded a preset range of values, and quantifies the magnitude of its relationship with the other operation information. During this calculation, the required operation information is extracted from the operation information storage unit 140 as needed. Operation information whose quantified relationship value exceeds a preset range is judged likely to be the cause of the performance bottleneck or failure and is reported to an input/output unit 180.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method for managing the operating status of a network system composed of multiple pieces of hardware and software connected to a network. In particular, it relates to a technique that is effective for identifying the components that form a performance bottleneck and the components that cause a failure.

[0002]

[Prior Art] In relatively large-scale network systems such as corporate information systems and iDCs (Internet Data Centers), the cost of system management is high, so an integrated management system that centrally manages the operating status of the system's components is often introduced. Such an integrated management system acquires, online, information on the operating status of the multiple pieces of hardware or software under management and outputs it to a display device connected to the integrated management system. To detect a failure in the managed system, methods such as setting thresholds on the operation information in advance or evaluating the deviation from the mean are used, and when a failure is detected, its location is reported.

[0003]

When, for example, the error rate of a certain program is identified as the failure location, the cause must be narrowed down before the problem can be solved: is it insufficient memory capacity, CPU load, network load, or something else? In general, finding the cause requires examining the system logs and parameters of the computers that seem likely to be involved, and relying on the experience and intuition of system engineers, so resolution takes time and effort. Japanese Patent Application Laid-Open No. 8-65302 describes a technique in which the magnitude of the relationship is determined in advance for each component and, when a failure occurs, additional operation information is collected for the closely related components. With that technique, however, the magnitude of the relationship between components is fixed in advance, so it does not reflect the relationship under actual operating conditions; as a result, the cause cannot always be identified, or is identified inefficiently. Furthermore, when the performance of a managed system is to be improved, identifying which hardware or software is the performance bottleneck becomes harder with the prior art as the network system grows, which often leads to over-investment or under-investment.

[0004]

[Problem to Be Solved by the Invention] As described above, in a relatively large-scale network system, the operation information of multiple components has had to be examined in order to narrow down the cause of a performance bottleneck or failure. The problem the present invention addresses is to quantify the relationships between the components of the managed system on the basis of operation information, thereby narrowing down the components that cause a performance bottleneck or failure and enabling the cause to be identified early.

[0005]

[Means for Solving the Problem] To achieve the above object, the operation management support system for a network system according to the present invention has the following features.

(1) The system comprises one or more managed devices and a management device. Each managed device has means for communicating with the management device and one or more components whose operating state can be acquired. The management device has communication means for communicating with the managed devices; collection means for collecting operation information on the components from the managed devices through the communication means; storage means for accumulating the collected operation information; calculation means for quantifying, on the basis of the accumulated operation information, the magnitude of the relationship between any one item of operation information and one or more other items; and determination means for identifying related components on the basis of the calculation result.

(2) In the support system of (1), when collected operation information falls outside a threshold range set in advance for that operation information, the calculation is performed with that operation information as the aforementioned one item of operation information.

(3) In the support system of (1), the calculation means performs multiple regression analysis with any one item of operation information as the objective variable and the one or more other items as explanatory variables, and quantifies the magnitude of the relationship by calculating, for each explanatory variable, a statistic that expresses its contribution to the objective variable, such as a partial correlation coefficient or an F value.

(4) In the support system of (1), with any one item of operation information defined as the objective variable, the one or more other items as explanatory variables, and a range of values set arbitrarily in advance for each item of operation information as its abnormal range, the calculation means quantifies the magnitude of the relationship by calculating, for each explanatory variable, the proportion of the time during which the objective variable was in its abnormal range for which the explanatory variable was simultaneously in its abnormal range.

(5) In the support system of (1), with any one item of operation information defined as the objective variable and the one or more other items as explanatory variables, the calculation means excludes explanatory variables from the calculation according to whether a dependency exists, on the basis of known dependencies between the components of the managed devices.

(6) In the support system of (1), a range of values is set in advance for the relationship value quantified by the calculation means, and the determination means regards components whose operation information does not fall outside that range as related and reports them.

(7) In the support system of (1), when the calculation means has quantified multiple relationship values, the determination means ranks the components according to those values and reports them.
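
Features (5) to (7) can be illustrated with a minimal Python sketch. It assumes the relationship values have already been computed, that "within the preset range" means "at or above a single threshold", and that the metric names and the function `rank_related` are hypothetical; none of these details come from the patent text.

```python
from typing import Dict, Iterable, List, Set, Tuple


def rank_related(
    scores: Dict[str, float],
    threshold: float,
    excluded: Iterable[str] = (),
) -> List[Tuple[str, float]]:
    """Return (metric, score) pairs at or above the threshold, highest first."""
    skip: Set[str] = set(excluded)
    kept = [
        (name, score)
        for name, score in scores.items()
        if name not in skip and score >= threshold
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical relationship values (for example, contribution rates in percent).
scores = {
    "computer1.cpu": 25.0,
    "computer2.cpu": 50.0,
    "computer2.memory": 75.0,
    "computer3.cpu": 90.0,
}

# Assumption: computer3 is known to have no dependency on the analysed service,
# so its metric is excluded before ranking, in the spirit of feature (5).
report = rank_related(scores, threshold=40.0, excluded=["computer3.cpu"])
print(report)
```

Excluding metrics before ranking keeps a high but irrelevant value from appearing at the top of the report.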

[0006]

[Embodiments of the Invention] FIGS. 1 to 10 show one example of an embodiment of the invention. Parts given the same reference numerals in the figures denote the same items, and the basic configuration is the same as the conventional one shown in the figures. Embodiments of the present invention are described below with reference to the accompanying drawings. FIG. 1 is a block diagram of a network system showing an embodiment of the present invention. This system comprises a managed system 200, a network 000, and an operation management server 100.

[0007]

The managed system 200 comprises one or more computers or one or more network devices, and its operation information is collected by the operation information collection adapter during operation. Network devices are devices such as network printers and intelligent hubs that have their own IP address but no OS. The managed system 200 connects to the network 000 through a communication line 001. FIG. 1 shows a single communication line 001 for convenience, but the managed computers 210 and 230 may each be connected directly to the network 000 by multiple communication lines. The communication line 001 may be wired or wireless.

[0008]

FIG. 2 is a block diagram showing the information transmission paths in the present invention. Operation information collected from the managed system 200 by the operation information collection adapter 120 is formatted and stored in the operation information storage unit 140 via path 52. Operation information can be collected in either of two ways: the operation information collection adapter 120 collects it via path 50, or the managed system 200 collects it and periodically sends it to the operation information collection adapter via path 51. When performing an analysis, the analysis operation unit 150 requests operation information from the operation information storage unit 140 via path 53 and extracts the required information via path 54. The input/output unit 180 specifies the analysis target via path 56 and displays the analysis result via path 55. The paths in FIG. 2 are an abstract representation of the information flow and do not require any specific equipment.

[0009]

The operation management server 100 has means for collecting operation information from the managed system via the operation information collection adapter 120, means for analyzing the collected operation information in an analysis unit 130, and means for outputting analysis results and setting the analysis details in the input/output unit 180. The analysis unit 130 consists of the operation information storage unit 140, which stores the operation information output by the operation information collection adapter 120, and the analysis operation unit 150, which quantifies the magnitude of the relationships between components. FIG. 1 shows all of these means in a single operation management server 100 for convenience, but the operation information collection adapter 120, operation information storage unit 140, analysis operation unit 150, and input/output unit 180 may each run on a separate server.

[0010]

The network 000 is a communication line for communication between the managed system 200 and the operation management server 100, and may be wired or wireless. When the managed system 200 and the operation management server 100 communicate directly, a configuration without the network 000 is also possible. The network 000 may also contain multiple firewalls, for example when it is the Internet.

[0011]

Each part of the managed system 200 is described in detail below. FIG. 3 shows the managed system of FIG. 1 in detail. The computer 210 consists of a CPU 213 that interprets and executes programs, a memory 214 into which programs and data are loaded, a communication interface 212 for connecting to the network, and an external storage device 216 that stores programs and data. The communication interface 212, CPU 213, external storage device 216, and memory 214 are connected by a bus 211 over which they exchange data. One or more programs that run on the computer 210 are loaded into the memory 214. A single managed computer may have multiple communication interfaces, for example when the computer 210 acts as an intermediary between multiple networks. It may also have no external storage device, or more than one. The computer 230 has the same configuration as the computer 210. The computer 210 may also be a mobile terminal, such as a small notebook PC or a mobile phone, that is used intermittently: it is not permanently connected to the network and connects only when in use.

[0012]

FIG. 4 shows an example of the relationship between services and programs. A service is a function, such as a search service on the Internet or an intranet, that is provided without the user being aware of how the programs operate. A service is built from one or more programs, and the programs that make up one service may run on multiple computers. A program is a component that is loaded into a computer's memory and can be managed on a per-process basis. One program may be part of multiple services. In FIG. 4, the service 311 consists of the program 215 running on the computer 210 and the program 235 running on the computer 230.

[0013]

The managed components 220 in FIG. 3 indicate which of the hardware and software elements making up the computer 210 have their operating state monitored through the collection of operation information. Here, program C, program E, the communication interface, and the external storage device are components for which no operation information is acquired. The managed components may cover every component, and any component may be left out of scope. The managed services 310 in FIG. 4, like the managed components in FIG. 3, indicate which services have their operating state monitored; here the service 311 is managed. The scope of the managed services 310 is arbitrary: it may include all of the services 300 running on the managed system, or none of them.

[0014]

Table 420 in FIG. 5 shows what the operation information collection adapter 120 of FIG. 1 collects, and by what means, when it collects operation information 421. The operation information collection adapter 120 has means for collecting operation information on the components within the scope of the managed components 220 and managed services 310, aggregating it, and storing it in the operation information storage unit 140. The operation of the operation information collection adapter 120 is described in detail below for the CPU usage rate 431 in FIG. 5. The operation information collection adapter 120 collects the CPU usage rate of computer 1 using collection tool A. Collection tool A is started by the operation information collection adapter 120 at the collection interval of every 5 minutes, and each 5-minute CPU usage rate it collects is kept in memory or on an external storage device. The operation information collection adapter 120 averages the twelve memory usage rate values collected at 5-minute intervals over one hour and stores the result in the operation information storage unit 140. In FIG. 5 the operation information collection adapter 120 is described as invoking a collection tool as its collection means, but computer 1 may instead hold the collection tool as a program and run it autonomously. It is also possible to store operation information directly in the operation information storage unit 140 without aggregation, for example when the collection interval and the aggregation interval are the same. FIG. 5 is an example of the operation definition information for the operation information collection adapter 120. The collection means 423, collection interval 424, aggregation interval 425, and aggregation method 426 items need not all be included, and equivalent information may be stored in any external storage device that the operation management server 100 can access. Arbitrary decision conditions, such as thresholds for automatic reporting, can also be defined in table 420.
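
The aggregation step described above, twelve 5-minute samples averaged into one hourly value, might look like the following Python sketch. The function name `aggregate`, the sample values, and the choice to drop an incomplete final hour are assumptions made for illustration only.

```python
from statistics import mean
from typing import List, Sequence


def aggregate(samples: Sequence[float], per_period: int) -> List[float]:
    """Average consecutive groups of `per_period` samples.

    An incomplete group at the end is dropped (an assumption of this sketch).
    """
    full = len(samples) - len(samples) % per_period
    return [mean(samples[i:i + per_period]) for i in range(0, full, per_period)]


# Hypothetical data: two hours of 5-minute usage samples in percent.
five_minute = [20.0] * 12 + [40.0] * 12

# Twelve 5-minute samples make up one hourly value.
hourly = aggregate(five_minute, per_period=12)
print(hourly)
```

When the collection interval equals the aggregation interval, `per_period` is 1 and the samples pass through unchanged, which corresponds to storing the values without aggregation.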

[0015]

The operation information storage unit 140 stores the operation information that the analysis operation unit 150 uses in its calculations. FIG. 6 lists the operation information stored in the operation information storage unit 140. Row 451 of table 440 shows that the CPU usage rate collected from computer 1 at 12:10 on January 17, 2001 was 22%. The longer the managed system is managed, the more operation information accumulates in the operation information storage unit 140. All operation information within the time span covered by an analysis must be stored in the operation information storage unit 140 with the same time width and at the same times. For row 451, this means that operation information for 12:10 on January 17, 2001 must be stored not only for the CPU usage rate and memory usage rate of computer 1 but for every target component. In actual operation, however, some operation information may fail to be collected at a given time; in that case a suitable value, such as the immediately preceding value, 0, or 100, is stored in its place. FIG. 6 is one example of the operation information stored in the operation information storage unit 140; any data structure may be used, and equivalent information may be stored in any external storage device that the operation management server 100 can access.
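
The requirement that every metric has a value at every time point, with gaps filled by the preceding value or a fixed default, can be sketched in Python as follows. The names `fill_gaps` and `align`, the time labels, and the default of 0 are illustrative assumptions rather than anything specified in the text.

```python
from typing import Dict, List, Optional, Sequence


def fill_gaps(values: Sequence[Optional[float]], default: float = 0.0) -> List[float]:
    """Replace each missing value (None) with the previous value.

    If nothing has been observed yet, `default` is used instead.
    """
    filled: List[float] = []
    last = default
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def align(series: Dict[str, Dict[str, float]], times: Sequence[str]) -> Dict[str, List[float]]:
    """Give every metric exactly one value per time point, filling gaps."""
    return {
        name: fill_gaps([points.get(t) for t in times])
        for name, points in series.items()
    }


# Hypothetical hourly observations; some time points were not collected.
series = {
    "computer1.cpu": {"12:10": 22.0, "14:10": 30.0},
    "computer1.memory": {"13:10": 60.0},
}
table = align(series, times=["12:10", "13:10", "14:10"])
print(table)
```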

[0016]

The operation of the analysis operation unit 150 in FIG. 1 is described in detail below. FIG. 7 plots the operation information of FIG. 6 against time. Seven items of operation information in total, including the service response time 461, are collected from the managed system 200. The analysis operation unit 150 has means for quantifying the strength of the relationship between the operation information of interest and the other operation information, and means for selecting the operation information that is strongly related to the operation information of interest. The operation information of interest is of two kinds: that specified by a person at the input/output unit 180, and that which has exceeded a preset threshold. FIG. 8 is a block diagram showing the operation when the input/output unit 180 gives an instruction to analyze the service response time 461. Two ways of quantifying the strength of the relationship are described in detail below: a method using multiple regression analysis, and a method that calculates the contribution rate during abnormal periods. Several quantification methods may also be used in combination.
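
The two ways of choosing the operation information of interest, an explicit request from an operator or a value outside its preset range, can be sketched as below. The function `choose_targets`, the metric names, and the limit values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def choose_targets(
    latest: Dict[str, float],
    limits: Dict[str, Tuple[float, float]],
    requested: Optional[str] = None,
) -> List[str]:
    """Pick the metrics to analyse.

    An explicit request wins; otherwise every metric whose latest value lies
    outside its (low, high) limits is selected.
    """
    if requested is not None:
        return [requested]
    return [
        name
        for name, value in latest.items()
        if name in limits and not (limits[name][0] <= value <= limits[name][1])
    ]


# Hypothetical latest values and allowed ranges.
latest = {"service.response_time": 4.2, "computer1.cpu": 22.0, "computer2.memory": 95.0}
limits = {
    "service.response_time": (0.0, 3.0),
    "computer1.cpu": (0.0, 90.0),
    "computer2.memory": (0.0, 90.0),
}

automatic = choose_targets(latest, limits)
manual = choose_targets(latest, limits, requested="computer1.cpu")
print(automatic, manual)
```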

[0017]

The following describes how multiple regression analysis is used to select, from the other operation information, the operation information that is closely related to the service response time 461. Multiple regression analysis determines the multiple regression equation shown in Equation 1. In Equation 1, y is the objective variable, x is an explanatory variable, α is a regression coefficient, and f(x) is an arbitrary function of x. When f(x) = x, Equation 1 is a linear multiple regression equation, but any function that fits well may be used. At the input/output unit 180, the objective variable and explanatory variables are specified as the analysis conditions 501 for the multiple regression analysis. The explanatory variables may be chosen freely, but at least one must be selected. For the multiple regression analysis, the analysis operation unit 150 interprets the analysis conditions 501 and extracts the specified operation information from the operation information storage unit 140. The extracted operation information 470 is handled as one set of operation information per time point; that is, in Equation 1 it is treated as y at a given time together with every x at that same time.
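
The equation itself is not reproduced in this text. From the symbol definitions given for Equation 1, one plausible general form is shown below; the intercept term, the number of explanatory variables n, and the subscripts are assumptions.

```latex
y = \alpha_0 + \sum_{i=1}^{n} \alpha_i \, f(x_i)
```

With f(x) = x this reduces to the ordinary linear multiple regression equation y = α_0 + α_1 x_1 + ... + α_n x_n.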

[0018]

[Equation 1]

Multiple regression analysis is performed using the values of the objective variable 472 and the explanatory variables 473 in the operation information 470 extracted from the operation information storage unit 140, yielding a multiple regression equation and analysis results. FIG. 9 shows the details of the analysis results obtained with the multiple regression equation in FIG. 8. The multiple regression equation of Equation 2 can be written from the per-variable analysis results 520. Whether each explanatory variable 521 in the multiple regression equation of Equation 2 is needed to predict the objective variable 472 is decided using its F value 522. At significance level α = 0.05, an explanatory variable is adopted only if its F value exceeds 18.513, and the only explanatory variable 521 that meets this condition is the memory usage rate 462 of computer 2. In other words, the analysis of the operation information shows that the operation information closely related to the service response time 461 of the patent search service is the memory usage rate 462 of computer 2. The analysis operation unit 150 outputs to the input/output unit 180 all or part of the multiple regression equation and the analysis results, together with the operation information judged to be sufficiently related, here the memory usage rate of computer 2. In determining the multiple regression equation of Equation 2, a method for selecting a better set of explanatory variables, such as the stepwise method, may also be used. When selecting explanatory variables, criteria other than the F value, such as the t value, the Akaike information criterion (AIC), or the partial correlation coefficient, may be used as the test statistic, and any value may be specified as the selection threshold.
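
Selecting explanatory variables by F value can be sketched in pure Python as follows. The sketch assumes that the "F value" is the partial F statistic of each variable in the full linear model (f(x) = x), which the text does not state explicitly. The data, metric names, and helper functions are hypothetical, and the critical value of about 10.13 applies to this example's 1 and 3 degrees of freedom, not to the 18.513 quoted above.

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(y: Sequence[float], columns: Sequence[Sequence[float]]) -> float:
    """Residual sum of squares of a least-squares fit with an intercept."""
    design = [[1.0] + [col[t] for col in columns] for t in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yt for row, yt in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yt - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yt in zip(design, y))


def partial_f(y: Sequence[float], columns: Sequence[Sequence[float]]) -> List[float]:
    """Partial F value of each explanatory variable in the full model."""
    n, p = len(y), len(columns)
    full = rss(y, columns)
    dof = n - p - 1  # residual degrees of freedom of the full model
    values = []
    for j in range(p):
        reduced = rss(y, [c for i, c in enumerate(columns) if i != j])
        values.append((reduced - full) / (full / dof))
    return values


# Hypothetical hourly data: response time tracks computer2.memory, not computer1.cpu.
names = ["computer2.memory", "computer1.cpu"]
memory = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cpu = [2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
response = [2.1, 3.9, 5.9, 8.1, 10.1, 11.9]

f_values = partial_f(response, [memory, cpu])
critical = 10.128  # F(0.05; 1, 3), i.e. 3.182 squared; assumed for this example
selected = [name for name, f in zip(names, f_values) if f > critical]
print(selected)
```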

[0019]

(Equation 2)

A method of analyzing the operation information of FIG. 6, collected from the managed system 200, by a calculation based on the contribution rate at the time of abnormality is described below. The flow of operation for calculating the abnormal contribution rate basically follows that of the multiple regression analysis. The explanatory variables that are highly related to the service response time of the patent search service, which was designated as the objective variable in the input/output unit 180, are determined on the basis of the abnormal contribution rate. The explanatory variables may cover the operation information, other than the objective variable, of all components of the managed system 200, or they may be selected arbitrarily. Operation information 470 being abnormal can mean that its difference from the average value is large, that the device is close to its performance or capacity limit, or that the device is not operating normally; in any case, an arbitrary threshold is set for determining an abnormality. Equation 3 shows the formula for calculating the abnormal contribution rate. From the operation information 470, the time during which the objective variable was abnormal and the total time during which each explanatory variable was abnormal within that time are calculated, and the abnormal contribution rate is obtained for each explanatory variable. The abnormal contribution rate is the proportion of the time the objective variable is abnormal during which the explanatory variable is abnormal at the same time, and it is an index of the possibility of a relationship.
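The image of Equation 3 is not reproduced in this text. Based only on the verbal definition above, the abnormal contribution rate of an explanatory variable could be written as follows. The symbols are assumptions introduced here and are not taken from the original equation: $R_i$ is the abnormal contribution rate of explanatory variable $x_i$, $T_y$ is the total time during which the objective variable $y$ was abnormal, and $T_{y \wedge x_i}$ is the part of that time during which $x_i$ was also abnormal. The percentage scaling is assumed from the 40% threshold used in the description of FIG. 10.

```latex
\[
  R_i \;=\; \frac{T_{y \wedge x_i}}{T_y} \times 100 \quad [\%]
\]
```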

[0020]

(Equation 3)

FIG. 10 shows the analysis result based on the abnormal contribution rate. Here, abnormality is determined in two ways: the threshold 542 specifies in advance a range of values regarded as abnormal, and the deviation threshold 543 specifies a range of deviation from the average value of the operation information. Any thresholds may, however, be specified in any combination. The thresholds can be specified in the operation definition of the operation information collection adapter in FIG. 5, or directly in the analysis operation unit 150 or the input/output unit 180. In the abnormal time 544 of FIG. 10, the time during which the objective variable was abnormal is 4 hours, and the abnormal contribution rate 545 is the value calculated with Equation 3 from the time, within those 4 hours, during which each explanatory variable was abnormal at the same time. If the threshold of the abnormal contribution rate for judging a relationship to be high is set to 40%, the operation information highly related to the service response time 461 of the patent search service is the memory usage rate of computer 2 and the CPU usage rate of computer 2. The analysis operation unit 150 outputs to the input/output unit 180 all or part of the analysis result together with the operation information judged to be sufficiently related, here the memory usage rate of computer 2 and the CPU usage rate of computer 2. The threshold of the abnormal contribution rate may be left unspecified or set to any value.
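The calculation described in this paragraph can be sketched as follows. This is an illustrative sketch, not the implementation of the analysis operation unit 150: all names are hypothetical, the hourly samples are made up (they are not the values of FIG. 6 or FIG. 10), and only an absolute upper limit is used as the abnormality test, leaving out the deviation-from-average threshold.

```python
def abnormal_flags(samples, upper_limit):
    """Mark each sample as abnormal when it exceeds the given upper limit."""
    return [value > upper_limit for value in samples]


def abnormal_contribution_rate(objective_flags, explanatory_flags):
    """Percentage of the objective variable's abnormal samples in which
    the explanatory variable is abnormal at the same time."""
    abnormal_total = sum(objective_flags)
    if abnormal_total == 0:
        return 0.0  # objective variable never abnormal: nothing to attribute
    both = sum(1 for o, e in zip(objective_flags, explanatory_flags) if o and e)
    return 100.0 * both / abnormal_total


# Hypothetical hourly samples over 8 hours, each paired with an assumed limit.
response_time_s = [1.0, 1.2, 5.0, 6.0, 5.5, 1.1, 7.0, 1.0]
explanatory_samples = {
    "computer2_memory_usage_pct": ([40, 45, 92, 95, 50, 42, 96, 41], 90),
    "computer2_cpu_usage_pct": ([30, 35, 85, 40, 88, 30, 38, 33], 80),
    "computer1_cpu_usage_pct": ([20, 85, 25, 22, 21, 90, 83, 20], 80),
}

# The objective variable is abnormal in 4 of the 8 hours.
objective_flags = abnormal_flags(response_time_s, upper_limit=3.0)
rates = {
    name: abnormal_contribution_rate(objective_flags, abnormal_flags(samples, limit))
    for name, (samples, limit) in explanatory_samples.items()
}

RATE_THRESHOLD_PCT = 40.0  # threshold for "highly related", as in the example above
related = [name for name, rate in rates.items() if rate >= RATE_THRESHOLD_PCT]
```

With this made-up data the rates are 75%, 50% and 25%, so the memory and CPU usage of computer 2 are reported as related, matching the shape of the result described for FIG. 10.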

[0021] The input/output unit 180 includes means for designating the objective variable and the explanatory variables, and means for displaying the analysis result. It may also have additional input/output functions, such as displaying or setting the device configuration information of the managed system 200, displaying or setting the operation definition information of the operation information collection adapter in FIG. 5, and displaying the operation information as a graph as in FIG. 7. When the objective variable and the explanatory variables are not designated directly in the input/output unit 180, the threshold information must be held in the operation management server so that the analysis is performed automatically when a threshold is exceeded. When the analysis result is displayed, the explanatory variables judged to be highly related to the objective variable are listed. Because the objective variable and the explanatory variables are themselves operation information, that is, indices of the operating status of components, the list displayed here is the operation information of the components that are likely to be causing a performance bottleneck or failure of the component to which the objective variable belongs. Taking FIG. 9 as an example, which is the result of a multiple regression analysis with the service response time 461 as the objective variable, the operation information affecting the performance of the service is the memory usage rate of computer 2. When the analysis operation unit 150 uses more than one calculation method for quantifying the relationship, all explanatory variables judged to be highly related are displayed.
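The automatic case, in which no objective variable is designated by hand, can be sketched as a check of the latest values against the stored threshold ranges. This is only an illustration under assumed names and data: the function, the metric names, and the ranges are hypothetical, and the text does not specify how the operation management server stores its thresholds.

```python
def out_of_range_metrics(latest_values, threshold_ranges):
    """Return the names of metrics whose latest value lies outside its (low, high) range.

    Metrics without a stored range are never treated as out of range.
    """
    triggered = []
    for name, value in latest_values.items():
        if name not in threshold_ranges:
            continue
        low, high = threshold_ranges[name]
        if value < low or value > high:
            triggered.append(name)
    return triggered


# Hypothetical threshold ranges held by the operation management server.
threshold_ranges = {
    "service_response_time_s": (0.0, 3.0),
    "computer2_memory_usage_pct": (0.0, 90.0),
}

# Hypothetical latest operation information.
latest_values = {
    "service_response_time_s": 5.2,
    "computer2_memory_usage_pct": 71.0,
    "computer1_cpu_usage_pct": 99.0,  # no stored range, so it cannot trigger
}

# Each triggered metric becomes an objective variable; the rest are candidates
# for explanatory variables in the subsequent analysis.
objective_candidates = out_of_range_metrics(latest_values, threshold_ranges)
explanatory_candidates = [n for n in latest_values if n not in objective_candidates]
```

Here only the service response time leaves its range, so it would be analyzed as the objective variable against the remaining operation information.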

[0022] The operation management support system for a network system according to the present invention is not limited to the embodiment described above, and various modifications can of course be made without departing from the gist of the present invention.

[0023]

[Effects of the Invention] According to the present invention, by quantifying the magnitude of the relationships between the components of a network system, a list of the components that cause a performance bottleneck or failure is reported, so that the cause can be identified at an early stage.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of the configuration of the operation management support system for a network system according to the present invention.

FIG. 2 is a block diagram showing the transmission paths of information in FIG. 1.

FIG. 3 is a detailed diagram of each component of the managed system in FIG. 1.

FIG. 4 is a block diagram showing the relationship between the programs running on the managed computers in FIG. 1 and the services.

FIG. 5 is a list showing an example of the operation definition of the operation information collection adapter in FIG. 1.

FIG. 6 is a list showing an example of the operation information stored in the operation information storage unit in FIG. 1.

FIG. 7 is a diagram showing an example of a graph in which the operation information collected from the managed system in FIG. 1 is displayed with time on the horizontal axis.

FIG. 8 is an explanatory diagram showing the flow of processing in the analysis operation unit of the operation management server in FIG. 1.

FIG. 9 is a diagram showing an example of an analysis result when multiple regression analysis is used as the calculation method in the analysis operation unit in FIG. 1.

FIG. 10 is a diagram showing an example of an analysis result when the method of calculating the abnormal contribution rate is used as the calculation method in the analysis operation unit in FIG. 1.

[Explanation of symbols]

000: network, 001: communication line, 100: operation management server, 120: operation information collection adapter, 130: analysis unit, 140: operation information storage unit, 150: analysis operation unit, 180: input/output unit, 200: managed system, 210: managed computer, 211: bus, 212: communication interface, 213: CPU, 214: memory, 215: program, 220: managed component, 300: group of services provided by the network system, 310: managed service, 311: service α.

Continued from the front page

(51) Int.Cl.7      Identification symbol   FI                      Theme code (reference)
H04L 12/24                                 H04L 12/24
     12/26                                      12/26
H04Q  9/00         301                     H04Q  9/00  301B
                   311                                 311H
                   321                                 321E

(72) Inventor: Tsugito Miyahara, c/o Hitachi INS Software, Ltd., 5-2 Nihon-odori, Naka-ku, Yokohama-shi, Kanagawa

(72) Inventor: Masaharu Akatsu, c/o Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki-shi, Kanagawa

F-terms (reference): 5B042 GA12 HH20 JJ29 MA14 MC30; 5B089 JA35 JA36 JB15 JB16 KA12 KA13 KB04 KC28 MC01; 5K030 GA11 HB19 HC01 KA01 KA02 MA01 MB09; 5K048 BA21 DA02 DB01 DC01 DC03 EB06 EB11 EB12 GB05 HA01 HA02

Claims

[Claims]

1. An operation management support system for a network system, comprising a managed device and a management device, the management device being provided with: communication means for communicating with the managed device; collecting means for collecting operation information of components from the managed device through the communication means; storage means for storing the operation information collected by the collecting means; calculation means for quantifying, based on the operation information accumulated in the storage means, the magnitude of the relationship between any one piece of operation information and one or more other pieces of operation information; and determination means for identifying related components based on the calculation result.

2. The operation management support system for a network system according to claim 1, wherein, when collected operation information deviates from a threshold range set in advance for each piece of operation information, the calculation is performed with that operation information as the arbitrary one piece of operation information.

3. The operation management support system for a network system according to claim 1, wherein the calculation means in the management device performs a multiple regression analysis with any one piece of operation information as an objective variable and one or more other pieces of operation information as explanatory variables, and quantifies the magnitude of the relationship by calculating a statistic representing the contribution to the objective variable, such as a partial correlation coefficient or an F value.

4. The operation management support system for a network system according to claim 1, wherein the calculation means in the management device, with any one piece of operation information as an objective variable, one or more other pieces of operation information as explanatory variables, and a range of values set arbitrarily in advance for each piece of operation information as its abnormal value range, quantifies the magnitude of the relationship by calculating, for each explanatory variable, the ratio of the time during which the explanatory variable is simultaneously in its abnormal value range to the time during which the objective variable is in its abnormal value range.

5. The operation management support system for a network system according to claim 1, wherein the calculation means in the management device, with any one piece of operation information as an objective variable and one or more other pieces of operation information as explanatory variables, excludes explanatory variables from the calculation according to the presence or absence of a dependency, based on known dependencies between the components of the managed device.

6. The operation management support system for a network system according to claim 1, wherein the determination means in the management device sets a value range in advance for the relationship value quantified by the calculation means, and reports as related the components having operation information whose relationship value does not deviate from that value range.

7. The operation management support system for a network system according to claim 1, wherein, when there are a plurality of relationship values quantified by the calculation means, the determination means in the management device ranks the components according to the relationship values and reports them.