JP2004021549A - Network monitoring system and program

Info

Publication number: JP2004021549A
Application number: JP2002174833A
Authority: JP
Inventors: Sohei Yoshino; 芳野　壮平; Shinji Shinno; 新野　真司; Junichi Hosokawa; 細川　淳一; Yosuke Itasaka; 板坂　洋介; Kenzo Horie; 堀江　健三; Shinichi Okamoto; 岡本　真一
Original assignee: Hitachi Ltd; Hitachi Information Systems Ltd
Current assignee: Hitachi Ltd; Hitachi Systems Ltd
Priority date: 2002-06-14
Filing date: 2002-06-14
Publication date: 2004-01-22

Abstract

【課題】大規模なマルチベンダ環境の分散コンピュータネットワークの運用管理者の負担の軽減とＴＣＯの削減を可能とする。
【解決手段】監視対象装置４，５に情報収集エージェント４４，５４を、また、監視装置にネットワーク監視マネージャ１を組み込み、監視対象装置５においては、情報収集エージェント５４により、別系統の監視装置２専用に組込まれた情報収集エージェント５Ａと共存させ、複数台の監視装置と情報共用を実現することにより、マルチベンダ環境の分散コンピュータネットワークシステムにおける各監視サポートを統合的して行う。この際、監視装置（ネットワーク監視マネージャ１）と監視対象装置４，５間にワンタイムパスワードによる認証を行う機能（１１ａ、４４ａ，５４ａ）を設けることで、監視装置になりすましてのユーザ側の監視対象装置への不正侵入を防止する。
【選択図】　図１
[Problem] To reduce the burden on operation managers of a distributed computer network in a large-scale multi-vendor environment and to reduce TCO.
[Solution] Information collection agents 44 and 54 are installed in monitored devices 4 and 5, and a network monitoring manager 1 is installed in the monitoring device. In monitored device 5, information collection agent 54 coexists with an information collection agent 5A installed exclusively for a separate monitoring device 2, so that information is shared among multiple monitoring devices and monitoring support across a distributed computer network system in a multi-vendor environment is performed in an integrated manner. In addition, functions (11a, 44a, 54a) that perform one-time-password authentication between the monitoring device (network monitoring manager 1) and monitored devices 4 and 5 prevent unauthorized intrusion into the user-side monitored devices by a party impersonating the monitoring device.
[Selection diagram] Fig. 1

Description

【０００１】
【発明の属する技術分野】
本発明は、ネットワークの監視・管理技術に係わり、特に、大規模なネットワーク上に分散したサーバ装置やデータ伝送装置等を対象とした障害監視および性能監視を効率的に行い、ネットワークの運用管理者の負担を軽減するのに好適なネットワーク監視技術に関するものである。
【０００２】
【従来の技術】
コンピュータシステムにおけるクライアント・サーバ環境の進歩によりコンピュータネットワーク上でのリソースの分散化が進み、分散したオフィス先のコンピュータ装置に対する監視・管理も必要となっている。
【０００３】
このような分散ネットワークでの監視では、監視装置一台でネットワークを挟んだ分散オフィスや同一ＬＡＮ（Ｌｏｃａｌ　Ａｒｅａ　Ｎｅｔｗｏｒｋ）上にあるデータ伝送装置やサーバ装置のリソースを監視することが望まれるが、ネットワークを挟んだ監視ではリモート系コマンドによる制御が必要となる。
【０００４】
しかし、このようなリモート系コマンドによる制御では、不正アクセスが可能となり、他人から覗かれる恐れがあり、セキュリティ上の問題がある。そのため、現状では、分散オフィス毎、例えば、同じＬＡＮに監視装置を設置し、当該ＬＡＮ内の監視対象装置のみを監視している。
【０００５】
さらに、リモートでの監視はネットワーク上のトラヒックが増加するので、このような問題に対処するために、監視対象装置（サーバ装置など）にエージェント（問題発覚時のみトラップを行う）を組み込み、このエージェントで収集した監視情報や管理情報を、監視装置は、監視の基本であるＳＮＭＰ（Ｓｉｍｐｌｅ　ＮｅｔｗｏｒｋＭａｎａｇｅｍｅｎｔ　Ｐｒｏｔｏｃｏｌ）により採取する技術が用いられている。
【０００６】
しかし、これらエージェントで収集した監視情報や管理情報には、問題発覚時に障害としてトラップせずに異常状態のメッセージをログ情報として出力するものや、独自プロトコルを持った管理情報もあり、これらの情報に関しては、ＳＮＭＰによる情報収集ができない。このようなＳＮＭＰによる情報収集ができない監視対象装置や管理情報が増える傾向にある。例えば、グループウェーア系Ｍａｉｌ、ファイヤーウォールやディレクトリ（Ｌｉｇｈｔｗｅｉｇｈｔ　Ｄｉｒｅｃｔｏｒｙ　Ａｃｃｅｓｓ　Ｐｒｏｔｏｃｏｌ、以下「ＬＤＡＰ」と省略）などの共通アプリケーションソフトがある。
【０００７】
また、大規模な分散ネットワークでは、マルチベンダ環境が一般的である。すなわち、大規模なコンピュータネットワークシステムにおいては、ネットワーク構築に当たり一社だけでは満足のいくシステム体系にはならない為、他社製品と連携できることが重要なファクターでもある。
【０００８】
同一ＬＡＮにおいて、障害検知から復旧、そして通報までのプロセスを一元管理した従来技術はあるが、マルチベンダ環境下や複数の監視装置を統合した環境での監視制御はできない。そのため、現状では、各社シリーズ製品内での連携が大半である。
【０００９】
また、ＴＣＯ（Ｔｏｔａｌ　Ｃｏｓｔ　ｏｆ　Ｏｗｎｅｒｓｈｉｐ、トータル運用コスト）削減を目的に、障害発生の通報を効率化する従来技術がある。すなわち、障害発生と同時にユーザへ、障害通知のための電子メールを自動送信するものであり、この技術では、電子メール宛先を監視担当者として通知したり、同報でメーリングリスト対象ユーザ全員に通知する。
【００１０】
しかし、実際の通報の流れはもう少し複雑である。すなわち、監視センタからネットワーク管理者へ通報し、ネットワーク管理者は関係するサーバ担当者を探し、電話やメールで連絡、または担当者の席まで呼びに行くのが実態である。このように、サイト先の顕在する問題を考慮しないと、実際のＴＣＯ削減が図れない。
【００１１】
また、ネットワーク監視者は、監視装置で提供する画面の状態確認が必要だがマルチベンダ環境下で複数の監視装置が分散されている為、業務は煩雑となり対応が遅れる。例えば、従来のマルチベンダ環境下での各監視装置の連携技術では、マスタ監視装置の画面に、その監視下になる各監視装置のアイコンを表示し、各アイコン間にリンクを張るだけである。そのため、監視情報画面や性能監視情報やログ情報は、各監視装置固有の画面表示のままであり、統合されていない。
【００１２】
また、電子メールのような共通アプリケーションの異常発生を監視する場合には、ＵＮＩＸ（登録商標）／ＯＳや、それ以外のＯＳ等が備えているメッセージ（リソース状況）と、アプリケーションが出力するメッセージログ情報とを連携し、どこまでリカバリ処理が必要か検知する。これを自動的に実行して障害を復旧させたり、システム担当者に警告を出すことが要求される。しかし、サーバ別や障害別にその対応が異なる点を考慮した監視システムは無く、ユーザ自身で開発しなければならない。
【００１３】
例えば、ＵＮＩＸ（登録商標）／ＯＳ系でもＦｒｅｅＢＳＤ（登録商標）、Ｌｉｎｕｘ（登録商標）や商用ＵＮＩＸ（登録商標）およびそれ以外のＯＳと多種にわたる。そして、監視対象の電子メールでも、ＳＭＴＰメールやグループウェーア系Ｍａｉｌなどがあり、その代表的な監視技術にメッセージ滞留キューチェックがあるが、グループウェーア系Ｍａｉｌの状態監視では、「ｘ．４００」、「ｓｍｔｐｇｗ」、「ｓｍｔｐ（Ｓｅｎｄｍａｉｌ）」それぞれの情報を収集し、グループウェーア系Ｍａｉｌサーバ内に滞留しているメールキュー数を予め設定したしきい値と比較し、障害を判定する必要がある。
【００１４】
以上のように、従来のネットワーク監視ソフト製品は同一ＬＡＮ（そのビルにクローズした利用）上での監視を前提にした仕様が大半であるが、クライアント／サーバ技術の進歩によりリソースの分散化が進んでおり、このような分散したオフィスの監視管理（ネットワークや広域ＬＡＮを挟んだ監視）や、アプリケーション層までを含めた”Ｅｎｄ　ｔｏ　Ｅｎｄ”の観点に立った監視、そしてその一元化によるネットワークシステム全体の信頼性確保が必要とされている。
【００１５】
しかし、従来の分散オフィス間の監視はセキュア通信が不十分であり、また分散先毎の情報セキュリテイ確保が困難であることから、分散オフィス毎に監視システムを構築・運用するのが現状である。
【００１６】
このように、近年のコンピュータネットワークの普及に伴い（１）ネットワークの広域化、（２）ネットワークの分散化、（３）マルチベンダ環境、（４）管理の効率化とＴＣＯ削減の要求があり、さらにはユーザ側からも監視状況が把握できる機能とサーバ装置のリブートをリモートで操作して復旧させることも要求される。また、各システム管理部門でもイントラネットや電子メール等の急激な利用増加により、サーバ装置の常時監視と障害の早期復旧が重要な課題となっている。しかし、どこにボトルネックが生じているのか予測し難いという現状である。
【００１７】
その対応として、複数の市販監視製品を導入し、良い所を集めたマルチベンダ監視システム環境の導入が図られているが、業務アプリケーション毎に特化した「障害情報表示と連携した連絡、異常処理のリモート復旧」等の連携運用機能面が不十分であった。この結果、監視・運用管理に要する費用（ＴＣＯ）の増大を招くと共にネットワーク監視の一元化に反し全体ネットワークシステムの信頼性維持が困難となっている。
【００１８】
【発明が解決しようとする課題】
解決しようとする問題点は、従来のマルチベンダ環境の大規模な分散ネットワークの監視技術では、別系統の監視装置専用に組み込まれた情報収集エージェントと共存させる配慮がなく、一つの監視システムだけではサポートが不可能であり、また、リモート監視により監視装置になりすましてユーザのサーバ装置等へ不正侵入することを防止することができず、また、従来は障害の発生をネットワーク管理者に電子メール等で通知するだけであり、通知を確認したネットワーク管理者が障害サーバ装置等の担当者を特定して連絡するまでに時間と手間がかかってしまい、また、従来は監視対象サーバが増えた場合に監視性能確保のために監視装置を分散させると全体の監視情報を一元管理できずＴＣＯ削減ができず、また、一台の監視装置で全ての監視情報を集約して管理すると当該管理装置への負荷が増大してしまい、また、従来の監視システムではＰＩＮＧコマンドでの動作状態監視やＳＮＭＰによる性能監視の抱き合わせ機能までであり、障害検知から通報そして復旧までの障害監視運用過程を自動化することができず、また、従来は複数の監視装置で分散して監視しているので、例えば、それぞれの監視装置を連携しマスタ監視装置の画面にその監視下になる監視装置アイコンを表示し、リンクを張るだけであり、監視情報画面や性能監視情報、ログ情報などは、各監視装置に固有の画面表示のままで統合されておらず、また、従来は監視装置で提供する画面の状態確認が監視装置に分散され業務は煩雑となり対応が遅れてしまい、さらに、従来の稼働統計は計画停止時間などの情報がなくこの時間を除いた稼働率が提供されていたため、大規模なマルチベンダ環境の分散コンピュータネットワークシステムにおける監視を効率良くかつ安全にサポートすることができない点である。
【００１９】
本発明の目的は、これら従来技術の課題を解決し、大規模なマルチベンダ環境の分散コンピュータネットワークの運用管理者の負担の軽減とＴＣＯの削減を可能とすることである。
【００２０】
【課題を解決するための手段】
上記目的を達成するため、本発明では、マルチベンダ環境のコンピュータネットワークシステムにおける各監視対象装置のリモート監視を行うシステムとして、各監視対象装置（データ伝送装置やサーバ装置）に情報収集エージェント（プログラム）を、また、監視装置にネットワーク監視マネージャ（プログラム）を組み込み、監視対象装置において、情報収集エージェントにより、別系統の監視装置専用に組込まれた情報収集エージェントと共存させ、複数台の監視装置と情報共用を実現することにより、マルチベンダ環境のコンピュータネットワークシステムにおける各監視サポートを統合的して行う。また、監視装置と監視対象装置間にワンタイムパスワードによる認証を行う機能を設けることで、監視装置になりすましてのユーザ側の監視対象装置への不正侵入を防止する。また、障害単位で担当者リスト、電話連絡の有無、重要度を示すメッセージを表示する機能を設けることで、障害を誰に伝えればよいかの検索を容易とする。また、ＮＦＳ（Ｎｅｔｗｏｒｋ　Ｆｉｌｅ　Ｓｙｓｔｅｍ）技術を利用して、監視情報が保存されるそれぞれの監視装置間をネットワーク結合する機能を設けることにより、サーバ負荷軽減を図り、かつ、複数の監視装置間の監視情報を同期させ、一元管理する。また、情報収集エージェントにおいて、ログ情報に用いられるアドレスや識別子、文字の配列などを登録し、ログ情報を検索して、同じパターンを検出した場合、予め登録したアクション動作をさせるパタンマッチ処理機能を設けることにより、ＳＮＭＰなどのように障害情報をログ上に出力するだけでは不可能な監視も可能とし、さらに、障害検知から通報そして復旧までの障害監視運用過程の自動実行を可能とする。また、ユーザ側に提供する監視情報は、監視状態を一元的に把握できる構成でＷｅｂ画面で提供し、かつ、階層が深くなるほど詳細情報を提供する表示構成とすることにより、ユーザと監視センタの双方向での遠隔監視を実現し、迅速な障害体制の確立を可能とする。また、ネットワーク監視マネージャにおいて、監視情報結果から監視対象装置ごとの計画停止時間を含めた月間サービス稼動率とリソース使用率を算出し、サービス稼働率表（稼働率、稼働時間、停止回数、停止時間、警告回数、計画停止回数と時間）と重要障害発生頻度管理（レベルを４区分して色分けして警告）およびリソース使用率推移グラフ（閾値との比較表示、週単位比較表示）の稼動月次レポートを自動作成してＷｅｂ画面で提供する機能を設けることにより、データ伝送装置やサーバ装置等の監視対象装置のシステム障害を事前に予測する情報を提供する。
【００２１】
【発明の実施の形態】
以下、本発明の実施の形態を、図面により詳細に説明する。
【００２２】
図１は、本発明に係わるネットワーク監視システムの構成例を示すブロック図であり、図２は、図１におけるネットワーク監視システムの第１の動作例を示す説明図、図３は、図１におけるネットワーク監視システムの第２の動作例を示す説明図、図４は、図１におけるネットワーク監視システムの詳細構成例を示すブロック図である。
【００２３】
図１において、１は監視装置に読み込まれたネットワーク監視マネージャ（図中「ネットワーク監視マネージャプログラム」と記載）、２は別系統の監視装置、３〜５はデータ伝送装置やサーバ装置等の監視対象装置であり、６〜９は広域ＬＡＮ等のネットワークの通信回線である。
【００２４】
各装置１〜５は、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）や主メモリ、表示装置、入力装置、外部記憶装置等を具備したコンピュータ構成からなり、光ディスク駆動装置等を介してＣＤ−ＲＯＭ等の記憶媒体に記録されたプログラムやデータを外部記憶装置内にインストールした後、この外部記憶装置から主メモリに読み込みＣＰＵで処理することにより各機能を実現する。
【００２５】
監視対象装置３は、ＴＣＰ（Ｔｒａｎｓｍｉｓｓｉｏｎ　Ｃｏｎｔｒｏｌ　Ｐｒｏｔｏｃｏｌ）処理を行うＴＣＰ処理部（図中「ＴＣＰポート」と記載）３１と、装置保有リソース情報や稼働情報を格納する情報記憶部（図中「装置保有リソース情報や稼働情報」と記載）３２、および、例えばＳＭＴＰ（Ｓｉｍｐｌｅ　Ｍａｉｌ　Ｔｒａｎｓｆｅｒ　Ｐｒｏｔｏｃｏｌ）やＷＷＷ（Ｗｏｒｌｄ　Ｗｉｄｅ　Ｗｅｂ）等のアプリケーションプログラムに基づく処理を行うアプリケーション処理部（図中「アプリケーションプログラム」と記載）３３を有する。
【００２６】
また、監視対象装置４は、同ＴＣＰ処理部４１、情報記憶部４２、アプリケーション処理部４３と共に、ワンタイムパスワード認証プログラム４４ａ、性能監視用エージェントプログラム４４ｂ、リモート復旧プログラム４４ｃをサブプログラムとして持ち本発明に係わる情報収集処理を行う情報収集エージェント（図中「情報収集エージェントプログラム」と記載）４４を有する。
【００２７】
さらに、監視対象装置５は、同ＴＣＰ処理部５１、情報記憶部５２、アプリケーション処理部５３、および、ワンタイムパスワード認証プログラム５４ａ、性能監視用エージェントプログラム５４ｂ、リモート復旧プログラム５４ｃをサブプログラムとして持ち情報収集処理を行う情報収集エージェント５４と共に、別系統の監視装置用の専用情報収集エージェントに基づく処理を行う別系統情報収集エージェント（図中「別系統の監視装置用の専用情報収集エージェントプログラム」と記載）５Ａを有する。
【００２８】
監視装置に組み込まれるネットワーク監視マネージャ１は、プロセス／性能監視プログラム１１、中継サーバプログラム１２、稼働月次レポート自動作成プログラム１３からなり、さらに、プロセス／性能監視プログラム１１は、ワンタイムパスワード発生プログラム１１ａ、ＴＣＰポート番号設定変更プログラム１１ｂ、プロセス／ステータス確認プログラム１１ｃ、監視タイミング時間調整変更プログラム１１ｄ、監視一時休止状態表示プログラム１１ｅ、障害管理用Ｗｅｂプログラム１１ｆ、リモート復旧判断プログラム１１ｇからなり、中継サーバプログラム１２は、統合監視情報管理プログラム１２ａ、ＨＴＭＬ生成プログラム（図中「ＨＴＭＬ生成」と記載）１２ｂ、ソケットプログラム１２ｃ、別系統の監視装置の専用情報収集エージェントプログラム１２Ａからなり、稼働月次レポート自動作成プログラム１３は、状態履歴情報Ｗｅｂコンテンツ生成プログラム（図中「状態履歴情報Ｗｅｂコンテンツ生成」と記載）１３ａからなる。
【００２９】
別系統の監視装置２は、障害復旧テンプレート２１とソケットプログラム２１ａを有し、仮想通信経路１０を介して、ネットワーク監視マネージャ１内に取り込まれた別系統の監視装置の専用情報収集エージェントプログラム１２Ａと接続される。
【００３０】
各監視対象装置３〜５はそれぞれ異なるベンダから提供されたものとし、本例のネットワーク監視システムでは、このようなマルチベンダ環境において、各監視対象装置３〜５に対するリモート監視を行う。
【００３１】
大規模なネットワークを構成するにはこのようなマルチベンダ環境となるのが一般的であり、このような大規模なネットワークを運用するにあたり、ネットワーク監視の自動化と標準化が要求される。また、ネットワーク機器やトラヒック管理のみでは万全ではなく、さらにはアプリケーションをも連携した監視も含め、迅速な復旧処理をする必要がある。
【００３２】
これらの要件を考慮したネットワーク監視を行うためには、次に例示するようなポイントが重要である。
【００３３】
ポイント（１）：監視オペレータの仕事は常時緊張を強いられる。すなわち、オペレータは、監視画面を常時チェックし、障害が発生するとユーザに連絡しなければならない。その際、連絡先担当者を調べて連絡・操作指示を待つ。役割分担の通りに障害に対応するには迅速な連絡を実現する必要があり、そのためには、オペレータが障害を誰に伝えればよいかを容易に検索できるようにすることが重要なポイントとなる。
【００３４】
そのために、本例では、ネットワーク監視マネージャ１（のプロセス／性能監視プログラム１１）に障害管理用Ｗｅｂプログラム１１ｆを設け、見やすいＷｅｂ画面構成で、ユーザ側および監視センタ側の双方に同時に、障害単位で担当者リスト、電話連絡の有無、重要度を示すメッセージを表示する。
【００３５】
ポイント（２）：業務アプリケーションの異常発生を監視する場合、ＯＳ（オペレーションシステム）が備えているメッセージ（リソース状況）とアプリケーションが出力するメッセージログを連携し、どこまでリカバリ処理が必要かを検知し、さらに、障害から自動復旧させたり、システム担当者に警告を出すことが要求される。本例では、プロセス／性能監視プログラム１１を設け、このような処理を行う。
【００３６】
ポイント（３）：リモートでの監視ソフトウエアはネットワーク上のトラヒックが増加する為、監視対象サーバ等にエージェントを組み込み、問題発覚時のみトラップすることで情報採取する技術があるが、監視の基本であるＳＮＭＰだけでは管理できない機器や管理情報が増える傾向にある。本例では、このような問題に対処するために、情報収集エージェント４４に性能監視用エージェントプログラム４４ｂを設ける。
【００３７】
ポイント（４）：大規模なコンピュータネットワークシステムの運用管理をする際、アプリケーション管理機能やソフトウエア配布／イベントリ管理機能等の個々の運用管理機能同士を連携させることが要求されるが、構築に当たり一社だけでは満足のいく製品体系にはならない。本例では、ネットワーク監視マネージャ１に中継サーバプログラム１２を設けて他社製品との連携を行う。これにより、分散先毎に監視マネージャを設置する必要がなくなり、設備面、運用人員面でのコスト削減を図ることができる。
【００３８】
ポイント（５）：分散したオフィス先の各リソース管理をＷＡＮ等を挟んでリモート系コマンドによる監視制御を行う場合、不正アクセスが可能であるとの問題に対処するため、本例では、ネットワーク監視マネージャ１側にワンタイムパスワード発生プログラム１１ａを、情報収集エージェント４４側にワンタイムパスワード認証プログラム４４ａを設け、監視装置と監視対象装置４，５間のセキュア通信をサポートする。
【００３９】
従来は、監視サーバ（監視マネージャとも呼ばれている）１台でＷＡＮを挟んだ分散オフィスや同一ＬＡＮ上にあるサーバのリソースを監視する場合、他人から覗かれる恐れがあるので、分散毎に監視サーバを設置し監視している。
【００４０】
その他、例えば、動作状態を監視するのに、「Ｐｉｎｇ」コマンドが用いられるが、従来は、このコマンド（Ｐｉｎｇ）の発行間隔時間を監視状態に応じて変更できない。その結果、実際には障害復旧しているが、監視間隔時間ズレにより監視マネージャの監視状態は異常表示となったままの状態が発生する。このような問題に対処するために、本例では、ネットワーク監視マネージャ１におけるプロセス／性能監視プログラム１１に監視タイミング時間調整変更プログラム１１ｄを設けている。
【００４１】
また、従来技術では、工事等で停止している状態も、障害として検知されるので、障害情報の精度が劣化する。このような問題に対処するために、本例では、監視一時休止状態表示プログラム１１ｅを設け、工事管理情報データベース１４に基づき、工事等での停止状態を障害状態と区別して管理する。
【００４２】
このように、本例では、監視対象装置４，５に情報収集エージェント４４，５４を、また、監視装置にネットワーク監視マネージャ１を組み込み、監視対象装置４，５では、情報収集エージェント４４，５４により、当該監視対象装置４，５の稼働情報や性能情報および保有するリソースの状態等の情報を収集し、情報記憶部４２，５２に格納して管理する。
【００４３】
監視対象装置４，５の情報収集エージェント４４，５４は、ログ情報に用いられるアドレスや識別子、文字の配列などを登録し、ログ情報を検索して、同じパターンを検出した場合、予め登録したアクション動作をさせるパタンマッチ処理機能を有し、ＳＮＭＰなどのように障害情報をログ上に出力するだけでは不可能な監視も行い、さらに、障害検知から通報そして復旧までの障害監視運用過程を自動的に実行する。
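Paragraph 【００４３】 describes a pattern-match function: patterns are registered in advance, the log is searched, and a pre-registered action runs when a pattern is found. A minimal sketch of that idea follows. The `Rule` type, the `scan_log` function, and the two example rules are hypothetical names chosen for illustration; the patent does not specify how patterns or actions are stored.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A registered log pattern and the action to run when it matches."""
    pattern: "re.Pattern[str]"
    action: Callable[[str], str]


def scan_log(lines: List[str], rules: List[Rule]) -> List[str]:
    """Check every log line against every rule and collect the action results."""
    results = []
    for line in lines:
        for rule in rules:
            if rule.pattern.search(line):
                results.append(rule.action(line))
    return results


# Hypothetical rules: the actions only build a string here, where a real
# agent might send a notification or start a recovery script.
RULES = [
    Rule(re.compile(r"dns \(port 53\) error"), lambda line: "notify: " + line),
    Rule(re.compile(r"disk warning over \d+%"), lambda line: "cleanup: " + line),
]
```

Keeping the rules in a plain list means new patterns can be registered without touching the scanning loop, which seems to match the "register, then search" wording of the paragraph.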
【００４４】
ネットワーク監視マネージャ１は、通信回線７，８を介して、監視対象装置４，５の情報収集エージェント４４，５４に定期的にアクセスして、情報記憶部４２，５２に格納した各種情報を取得し、障害発生の検知、および、障害復旧の検知等を行い、検知した障害情報および復旧情報を一元的に管理して、監視センタ側と共にユーザ装置側にもＷｅｂブラウザを介して通知する。これにより、リアルタイムで遠隔性と同時可視化に優れたネットワーク管理を行うことができる。
【００４５】
また、情報収集エージェント４４，５４とネットワーク監視マネージャ１間では、ワンタイムパスワード発生プログラム１１ａとワンタイムパスワード認証プログラム４４ａ，５４ａにより、ワンタイムパスワードによる情報収集単位毎の認証を行うことにより、リモート不正アクセスを可否でき安全性を確保できる。
【００４６】
また、ネットワーク監視マネージャ１は、監視対象装置４，５での障害情報を取得すると、監視タイミング時間調整変更プログラム１１ｄにより、当該障害に対する復旧情報を取得するまで、情報収集時間間隔を短くして、復旧情報を取得するタイミングを早め、監視精度を向上させる。
【００４７】
また、ネットワーク監視マネージャ１は、中継サーバプログラム１２により、別系統の監視装置の制御下で収集・管理している監視情報を、ＮＦＳ（Ｎｅｔｗｏｒｋ　Ｆｉｌｅ　Ｓｙｓｔｅｍ）技術により統合管理し、さらに、統合管理した情報に基づき障害発生を検知し、検知した障害に対応する復旧処理を別系統の監視装置に指示し、この別系統の監視装置にアドオンしたプログラムからリモートコマンドを発行し、当該監視対象装置の情報収集エージェントに自動復旧させる。これにより、既存の監視マネージャをそのまま導入しただけで、例えばイントラネット環境下でのベンダ毎の監視や管理ツールを共存させることができる。
【００４８】
以下、図１におけるシステムの動作説明を行う。
【００４９】
図１においては、データ伝送装置やサーバ装置などからなる複数の監視対象装置３〜５の情報収集エージェント４４，５４は、予め監視センタからリモート処理で組み込まれる。
【００５０】
この情報収集エージェント４４，５４が収集して情報記憶部３２，４２，５２に格納した各種情報（障害情報や装置保有リソース情報、稼動情報等の性能情報など）、および、アプリケーションプログラム３３，４３，５３の動作状況などが、ネットワークを介してネットワーク監視マネージャ１において、監視情報（システムログ情報）として収集される。この際、ネットワーク監視マネージャ１が監視対象装置であるか否かをワンタイムパスワード認証により認証し、正当性を確認する。
【００５１】
ネットワーク監視マネージャ１は、障害管理用Ｗｅｂプログラム１１ｆにより、監視情報と同時に収集している性能情報を基に、図２に示す手順で、監視センタとユーザの双方に、同時に、障害検知や、監視情報および性能情報の分析結果などを自動通知する。
【００５２】
すなわち、図２に示すように、従来は、監視対象装置における障害発生を検知した監視装置が、まず、監視センタに通知し、監視センタにおいて、情報収集、分析／調査を行い、ユーザ側に警告／通知を障害当該ユーザに行っており、ユーザ側と監視センタにおいて通知を受けるまでに大きな時間差が発生していたが、本例では、ネットワーク監視マネージャ１が、障害発生を検知すると、監視センタとユーザの双方に、同時に、通知するので、監視センタとユーザとの通知時間差がほとんどゼロになる。
【００５３】
また、本例では、ネットワーク監視マネージャ１と情報収集エージェント４４，５４において、障害情報の自動収集と、分析、調査、および、リモート復旧処理を行うことにより、監視センタおよびユーザ側では、直接の情報収集が不要となり、重度障害のみの分析／調査のみを行うだけでよくなる。
【００５４】
さらに、ネットワーク監視マネージャ１から監視センタおよびユーザ側への障害発生などの通知は、障害管理用Ｗｅｂプログラム１１ｆによりＷｅｂ技術を利用して、瞬時に異常を見つけ易いように監視項目や性能項目を任意の観測時間で、数値や○×で、視覚的に表示かつ具体的変化を数値で判断しやすいチェックシート形式でＷｅｂ画面に表示する。
【００５５】
例えば、この障害管理用Ｗｅｂプログラム１１ｆによるＷｅｂ画面表示において、各監視対象装置３〜５がイントラネット系のサーバ装置であれば、障害が発生した装置の担当者および連絡先と条件等が記載されたポップアップメモが自動的に現れ、同時に、障害管理用Ｗｅｂプログラム１１ｆは、担当者へ電子メールを自動発信する。
【００５６】
監視センタ側に対するＷｅｂ画面では、障害サーバ名や時刻等の情報メッセージをポップアップ表示して警告する。障害が復旧すると自動的に裏画面の障害履歴画面に内容が移動される仕組みとする。
【００５７】
監視対象が電子メールサーバであれば、監視画面に障害サーバを表示してブザーを鳴らし、オペレータが、該当する障害サーバ表示部分をクリックすると連絡先情報がポップアップする仕組みとする。
【００５８】
ユーザ側に対するＷｅｂ画面では、階層画面構成とし、最初の階層画面では、事業所毎にサービス別ノードをアイコンで稼動状況をリアルタイム表示する。この際、正常／注意／異常の３段階評価で色分けして表示する。さらに次の階層画面では、パスワード入力を必要とし、障害発生ログによる詳細状態を把握可能な内容を表示し、この画面で警告音を出す仕組みとする。
【００５９】
また、監視対象装置３〜５がイントラネットやインターネット系のサーバであれば、そのプロセス・性能の監視に関してのＷｅｂ画面では、最初の画面においては、各監視対象装置３〜５の状態をアイコンで、正常／警告／異常の３段階評価で表示する。次の階層画面ではチェックシート方式による性能情報を提供し、次の階層画面で、詳細性能情報をテキストベースで提供する仕組みとする。
【００６０】
電子メールサーバの障害に関しては、監視画面に障害箇所を表示してブザーを鳴らし、オペレータが該当する障害箇所をクリックすると、稼動状況一覧画面にリンクし、リンク先では各障害箇所での滞留メッセージ数やＳＭＴＰのレスポンス状態等の情報をユーザが瞬時に異常判断できる最小項目をビジュアルにサーバ毎にブロック表示する。
【００６１】
尚、小規模な事業所側では、夜間バッチ処理によるサーバ停止が毎晩発生することがある。このような場合に対処するため、本例では、監視一時休止状態表示プログラム１１ｅにより、監視対象から任意な時間帯に解除する。
【００６２】
さらに、本例では、障害の発生から復旧、稼動月次統計報告作成に至る障害監視運用全過程を、途中、人的操作を介入せず、Ｗｅｂ管理画面のみで総合的に一括管理することができる。
【００６３】
以下、図１におけるネットワーク監視システムの動作について説明する。図１において、監視対象装置３は、情報収集エージェントが組み込まれておらず、ＴＣＰポートのみで監視される装置である。ＴＣＰポートでの監視としては、例えばサーバ装置の各サービスプロセスの生死状態確認がある。
【００６４】
また、監視対象装置４は、ＴＣＰポートでの監視を含み、さらに、ワンタイムパスワード認証プログラム４４ａと、性能監視用エージェントプログラム４４ｂ、リモート復旧プログラム４４ｃからなる情報収集エージェント４４が組み込まれ、これらのプログラムに基づく監視が行われる。
【００６５】
そして、監視対象装置５は、監視対象装置４の構成に、さらに、既に別系統の監視装置２の監視下にある専用情報収集エージェント５Ａが組み込まれており、情報収集エージェント４４と専用情報収集エージェント５Ａとが共存し、両監視が行われる。
【００６６】
これらの監視対象装置３〜５は、ネットワークや広域ＬＡＮを介して監視装置（ネットワーク監視マネージャ１）に接続され、監視装置において、各監視対象装置３〜５の監視情報が収集され管理される。
【００６７】
まず、監視対象装置３に対する監視動作について説明する。
【００６８】
監視対象装置３の監視は、ネットワーク監視マネージャ１のプロセス／ステータス確認プログラム１１ｃから、状態確認コマンド（ＰＩＮＧ）を、ＴＣＰポート番号設定変更プログラム１１ｂ経由（予め該当のＴＣＰポート番号変更指示設定がない場合はデフォルト）で、通信回線６に接続した監視対象装置３のＴＣＰ処理部３１を介して各ＴＣＰポートに接続し、監視対象のＴＣＰポートのプロセス状態を５分間隔（任意設定可）で監視する。
【００６９】
状態確認コマンド（ＰＩＮＧ）の無応答を検知すると、「正常／警告／異常」の３区分のうち「警告」に設定する。
【００７０】
このように、「警告」を設定すると、図３で示すように、監視タイミング時間調整変更プログラム１１ｄにより、ＰＩＮＧの発行タイミング時間を、５分間隔から１分間隔（任意設定可）に自動的に短縮し、以降、約１０分間、１分間隔で、そのＴＣＰポートに対してＴＣＰセッション確立を試みる。
【００７１】
そこで、確立できない場合のみエラーのメッセージ（Ｃｏｎｎｅｃｔｉｏｎ　ｒｅｆｕｓｅｄ）を返す。そのメッセージの存在有無により、プロセス／ステータス確認プログラム１１ｃは、障害を検知し、「異常」区分とする。
【００７２】
尚、ＰＩＮＧのレスポンスがあると、プロセス／ステータス確認プログラム１１ｃは、自動的にデフォルトに戻し「正常」区分となる。
【００７３】
このように、ネットワーク監視マネージャ１では、「警告」を設定すると、ＰＩＮＧ発行タイミング時間を、５分間隔から１分間隔に自動的に短縮して、そのＴＣＰポートに対するＴＣＰセッション確立を試みることにより、復旧検知時間を早くでき、監視精度向上を図ることが可能である。
【００７４】
また、監視タイミング時間調整変更プログラム１１ｄは、監視対象装置３が固有に持っているシステムログ情報で管理している復旧時刻と、ネットワーク監視マネージャ１の復旧時刻にズレが発生した場合、ネットワーク監視マネージャ１が障害通知のため自動発行する通知メール上に記載される障害発生時刻や復旧時刻およびＷｅｂ表示の警告時刻などに時刻差が生じるので、ネットワーク監視マネージャ１が参照する時刻を、監視対象装置３がシステムログ情報の管理に用いている時刻に補正する。これにより、複数の監視装置間にまたがった監視情報や性能管理情報の収集時刻などが同期されるので、複数のログ情報を突き合わせて障害を分析する原因追跡（時間経緯）に有効となる。
【００７５】
尚、障害管理Ｗｅｂプログラム１１ｆでは、ＴＣＰポートに応答がない場合（「Ｃｏｎｎｅｃｔｉｏｎ」が「ｒｅｆｕｓｅｄ」される場合）は「警告」とし、ユーザ側装置や監視センタ装置に提供するＷｅｂ画面で表示するアイコンを緑色（正常）から黄色（警告）に変える。そして、監視間隔が１分間隔に切り替わり、さらに、１０回連続で応答がない場合（約１０分間）に障害として判断し、アイコンを黄色から赤色（異常）に変えアラームを鳴動する。
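Paragraphs 【００６８】 to 【００７５】 describe a three-state check: a missed response moves the port from "normal" (green) to "warning" (yellow) and shortens the polling interval from 5 minutes to 1 minute; ten further consecutive misses move it to "abnormal" (red); any response restores the defaults. The sketch below models only that state logic. The class and attribute names are assumptions, and the actual network probe is left out.

```python
from dataclasses import dataclass

NORMAL_INTERVAL_S = 300   # default 5-minute polling interval
WARNING_INTERVAL_S = 60   # shortened 1-minute interval while in trouble
MAX_RETRIES = 10          # consecutive failed retries before "abnormal"

ICON_COLOURS = {"normal": "green", "warning": "yellow", "abnormal": "red"}


@dataclass
class PortMonitor:
    """Tracks the state of one monitored TCP port (hypothetical helper)."""
    state: str = "normal"
    interval_s: int = NORMAL_INTERVAL_S
    failed_retries: int = 0

    def record(self, responded: bool) -> str:
        """Update the state from one probe result and return the new state."""
        if responded:
            # Any response restores the default interval and the normal state.
            self.state = "normal"
            self.interval_s = NORMAL_INTERVAL_S
            self.failed_retries = 0
        elif self.state == "normal":
            # First missed response: warn and start polling faster.
            self.state = "warning"
            self.interval_s = WARNING_INTERVAL_S
            self.failed_retries = 0
        elif self.state == "warning":
            # Count retries at the short interval; escalate after ten misses.
            self.failed_retries += 1
            if self.failed_retries >= MAX_RETRIES:
                self.state = "abnormal"
        return self.state
```

Because the interval stays short until a response arrives, recovery is noticed within about a minute instead of up to five, which is the accuracy gain the description claims.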
【００７６】
また、監視対象装置の「障害」、「復旧」を検知した場合は、監視条件メッセージ管理データベース１５において予め指定されたサーバ管理者に、電子メールを自動発送して通報する。この監視条件メッセージ管理データベース１５におけるユーザ別・サーバ別の通知先や、時間他の指定や担当者のエスカレーション等は任意に設定が可能である。
【００７７】
自動発送する通知メールの例を下記に示す。

【００７８】
また、監視結果は下記のようにＷｅｂ画面上にロギングされる。これらログは、常時、過去５日間のログを表示する。また、サーバの稼動状態が良好の場合は何も表示されない。
【００７９】

【００８０】
ここで、「Ａｐｒ／２４／２００１　０２：１３：１０　ｎｍａｐｐ１　ｄｉｓｋ　ｏｋ」は、「正常」であり、色識別区分は「緑色」で、緑色に表示され、また「Ａｐｒ／２３／２００１　００：１３：１０　ｎｍａｐｐ１　ｄｉｓｋ　ｗａｒｎｉｎｇ　ｏｖｅｒ　９０％」は、「警告」であり、色識別区分は「黄色」で黄色に表示され、そして、「Ａｐｒ／２２／２００１　１０：０６：３９　ｎｍａｐｐ２　ｄｎｓ　（ｐｏｒｔ　５３）　ｅｒｒｏｒ」は「異常」であり、色識別区分は「赤色」で、赤色に表示される。
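Paragraph 【００８０】 maps each log line to a status and a display colour by its keyword: `ok` is normal (green), `warning` is warning (yellow), and `error` is abnormal (red). A small sketch of that mapping, using the three log lines quoted in the paragraph (written in ASCII), might look like this. The function name and the grey fallback for unrecognised lines are assumptions.

```python
from typing import Tuple

# keyword -> (status, colour), most severe keyword first
KEYWORDS = (
    ("error", ("abnormal", "red")),
    ("warning", ("warning", "yellow")),
    ("ok", ("normal", "green")),
)


def classify(line: str) -> Tuple[str, str]:
    """Return (status, colour) for one log line based on its status keyword."""
    words = line.split()
    for keyword, result in KEYWORDS:
        if keyword in words:
            return result
    # Fallback for lines without a known keyword (an assumption, not from the source).
    return ("unknown", "grey")
```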
【００８１】
障害管理用Ｗｅｂプログラム１１ｆでは、監視対象のＴＣＰポートに応答がない場合、監視一時休止表示プログラム１１ｅからの情報を参照する。すなわち、監視一時休止表示プログラム１１ｅは、工事管理情報データベース１４を参照し、監視対象装置３の工事停止情報を検索し、障害か工事による停止かを判断し、その結果を障害管理用Ｗｅｂプログラム１１ｆに指示する。
【００８２】
障害外、例えば工事による停止であれば、障害管理用Ｗｅｂプログラム１１ｆは、その時間帯を監視対象外扱いとする。このように、監視対象外時の場合は、Ｗｅｂ画面上に青色のアイコンを表示する。このアイコンは通常は使用しないが、計画的停止などによる監視の一時停止時などに表示する。
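Paragraphs 【００８１】 and 【００８２】 say that an unresponsive port is checked against the construction management database 14, and that a stop inside a planned window is shown with a blue icon instead of being reported as a fault. The sketch below shows one way such a lookup could work, assuming the planned stops are available as a list of start/end times; the data layout and function name are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of one planned stop


def is_planned_stop(at: datetime, planned: List[Window]) -> bool:
    """True if the moment `at` falls inside any planned stop window."""
    return any(start <= at < end for start, end in planned)


def icon_for_no_response(at: datetime, planned: List[Window], fault_colour: str) -> str:
    """Blue for a planned stop; otherwise the colour of the current fault state."""
    return "blue" if is_planned_stop(at, planned) else fault_colour
```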
【００８３】
また、この停止時間情報は、稼動月次レポート自動作成プログラム１３に蓄積される。稼動月次レポート自動作成プログラム１３は、蓄積した情報結果から監視対象装置（サーバ装置等）ごとの月間サービス稼動率とリソース使用率を算出し、サービス稼働率表（稼働率、稼働時間、停止回数、停止時間、警告回数、計画停止回数と時間）と重要障害発生頻度管理（レベル４で区分して色で警告）、および、リソース使用率推移グラフ（閾値との比較表示、週単位比較表示）等からなる稼動月次レポートを自動作成し、状態履歴情報Ｗｅｂコンテンツ生成１３ａでデータ伝送装置やサーバ装置のシステム障害を事前に予測する情報に加工する。
【００８４】
次に、監視対象装置４に対する監視動作を説明するが、監視対象装置４の「状態監視」に関しては、ネットワーク監視マネージャ１のプロセス／ステータス確認プログラム１１ｃから状態確認コマンド（ＰＩＮＧ）をＴＣＰポート番号設定変更プログラム１１ｂ経由（予め該当のＴＣＰポート番号変更指示設定がない場合はデフォルト）で通信回線７に接続した監視対象装置４のＴＣＰ処理部４１を介して各ＴＣＰポートに接続するもので、監視対象装置３と同様の監視過程であり、以下「性能監視」のみをポイントに説明する。
【００８５】
監視対象装置４において、性能監視用エージェントプログラム４４ｂは、情報収集エージェント４４に組み込んだサブプログラムであるが、性能監視用エージェントプログラム４４ｂ単体でも機能するものであり、ＣＰＵ負荷情報の収集、ディスク使用率情報の収集、メモリ使用率情報の採取、メールキュー情報の採取、プロセス数の収集等を行う。
【００８６】
また、情報収集エージェント４４は、ログ情報とのパターンマッチによるアクション動作機能の他に、ネットワーク監視マネージャ１との監視専用ＴＣＰポート（例えばポート番号「８８８８」）での通信機能、ならびに、別系統の監視装置専用に組み込まれた情報収集エージェントと共存を可能とする機能を有し、さらに、ワンタイムパスワード認証プログラム４４ａ、性能監視用エージェントプログラム４４ｂ、リモート復旧プログラム４４ｃのそれぞれを連携する機能を有する。
【００８７】
ネットワーク監視マネージャ１は、プロセス／ステータス確認プログラム１１ｃからＴＣＰポート番号設定変更プログラム１１ｂ経由（予め該当のＴＣＰポート番号変更指示設定がない場合はデフォルト）で、監視対象装置４の情報収集エージェント４４の性能監視用エージェントプログラム４４ｂを起動させる為のリモートコマンドを、通信回線７に接続した監視対象装置４のＴＣＰ処理部４１が情報収集エージェント４４に専用に割当てたＴＣＰポート（「８８８８」）を介して発行し、性能監視用エージェントプログラム４４ｂにおいて予め登録されている各種スクリプト（ＣＰＵ負荷情報収集用、ディスク使用率情報収集用、メモリ使用率情報採取用、メールキュー情報採取用、プロセス数収集用など）を起動させる。
【００８８】
尚、この際、リモートコマンドには、ワンタイムパスワード発生プログラム１１ａで生成した、監視対象装置４の情報収集エージェント４４の性能監視用エージェントプログラム４４ｂを起動させる為のワンタイムパスワードを付与し、ワンタイムパスワード認証プログラム４４ａにおいてワンタイムパスワードに基づく認証を行った後に、性能監視用エージェントプログラム４４ｂに発行し起動させる。
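Paragraph 【００８８】 states that each remote command carries a one-time password that the agent verifies before starting the requested program, but it does not say how the password is generated or checked. The sketch below therefore substitutes a well-known approach, an HMAC over a shared secret and an increasing counter (similar in spirit to HOTP), purely as an illustration. The shared secret, the counter, and all names are assumptions and not the patent's method.

```python
import hashlib
import hmac


def make_otp(secret: bytes, counter: int) -> str:
    """Manager side: derive a password for one request from secret and counter."""
    message = counter.to_bytes(8, "big")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


class AgentAuthenticator:
    """Agent side: accepts each counter value at most once."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._last_counter = -1

    def verify(self, counter: int, otp: str) -> bool:
        """Return True only for a fresh counter with a matching password."""
        if counter <= self._last_counter:
            return False  # replayed or stale request
        expected = make_otp(self._secret, counter)
        if not hmac.compare_digest(expected, otp):
            return False
        self._last_counter = counter
        return True
```

Rejecting any counter that is not strictly newer than the last accepted one is what makes a captured password useless for a second request.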
【００８９】
このように、ワンタイムパスワード認証後に、性能監視用エージェントプログラム４４ｂは、リモートコマンドに対応する性能数値をチェックシート形式で性能監視情報として編集し、プロセス／性能監視プログラム１１に送信する。
【００９０】
プロセス／性能監視プログラム１１では、障害管理用Ｗｅｂプログラム１１ｆにより、性能監視用エージェントプログラム４４ｂから送られてきた性能数値を予め設定した「しきい値」と比較し、しきい値を超えた（下回った）場合には障害として検知し通報対象とする。尚、性能監視用エージェントプログラム４４ｂでは、性能監視情報は貯めず、アクセスログ情報のみを残す。
【００９１】
性能評価における「ロードアベレージの監視（ＣＰＵ負荷情報収集）」は、基本的に「ｕｐｔｉｍｅ」コマンド　の結果をもとにＣＰＵの負荷状況を把握し、過去１分平均の値をもとに監視を行う。例えば、ＦｒｅｅＢＳＤ（登録商標）の場合、「ｕｐｔｉｍｅ」　の実行結果は以下のように示される。
【００９２】

【００９３】
上記「ｌｏａｄ　ａｖｅｒａｇｅ」以下の項目（０．１０，　０．０９，　０．０８）を取得し、しきい値と比較させ、それを上回った場合に警告とする。この状態がしばらく続くと障害として検知する。
【００９４】
このように、しきい値を超えた時すぐに障害を検知するのではなく、しきい値を超えた状態が続くようなら障害と認識する。尚、警告期間は任意に設定可能である。
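Paragraphs 【００９１】 to 【００９４】 describe load monitoring: the 1-minute value is taken from the `load average` field of `uptime`, a value above the threshold is a warning, and the warning becomes a fault only if it persists. The sketch below follows that logic. The sample `uptime` line is hypothetical apart from the three load values quoted in the text, and the "persists for N consecutive samples" rule is an assumed reading of "continues for a while".

```python
import re
from typing import List

# Hypothetical FreeBSD-style output; only the load values come from the text.
SAMPLE_UPTIME = " 2:13AM  up 12 days,  3:04, 1 user, load averages: 0.10, 0.09, 0.08"


def one_minute_load(uptime_output: str) -> float:
    """Extract the 1-minute load average from `uptime` output."""
    match = re.search(r"load averages?:\s*([\d.]+)", uptime_output)
    if match is None:
        raise ValueError("no load average found")
    return float(match.group(1))


def load_state(samples: List[float], threshold: float, fault_after: int) -> str:
    """Return 'normal', 'warning', or 'fault' for the latest sample.

    'fault' requires the last `fault_after` samples to all exceed the threshold.
    """
    if not samples or samples[-1] <= threshold:
        return "normal"
    recent = samples[-fault_after:]
    if len(recent) == fault_after and all(s > threshold for s in recent):
        return "fault"
    return "warning"
```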
【００９５】
また、性能評価における「ディスク使用率の監視」は「ｄｆ」コマンド　の結果をもとにディスクの使用状況を把握し、ファイルシステム単位での監視を行う。例えばＦｒｅｅＢＳＤ（登録商標）の場合、「ｄｆ」の実行結果は以下のようになる。
【００９６】

【００９７】
ファイルシステム（「Ｆｉｌｅｓｙｓｔｅｍ」）に対応する「Ｃａｐａｃｉｔｙ」の値（５２％、４８％、０％）を取得し、しきい値と比較し、それを超えた場合に障害として検知する。ファイルシステムは同時に複数監視可能であるが、しきい値は同一のものとする。尚、しきい値の指定は２つまで可能とする。
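Paragraph 【００９７】 describes reading the `Capacity` column of `df` for each file system and comparing it with a single shared threshold. A sketch of that parsing step follows. The sample output is hypothetical (device names, block counts, and mount points are invented; only the percentages 52%, 48%, and 0% appear in the text), and the column positions assume the FreeBSD `df` layout.

```python
from typing import Dict, List

# Hypothetical FreeBSD-style `df` output used only for illustration.
SAMPLE_DF = """\
Filesystem  1K-blocks    Used   Avail Capacity  Mounted on
/dev/ad0s1a    500000  240000  220000    52%    /
/dev/ad0s1f   1000000  440000  480000    48%    /usr
/dev/ad0s1e    200000       0  184000     0%    /var
"""


def parse_capacity(df_output: str) -> Dict[str, int]:
    """Map each mount point to its Capacity percentage."""
    usage = {}
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: Filesystem, blocks, Used, Avail, Capacity, Mounted on
        usage[fields[5]] = int(fields[4].rstrip("%"))
    return usage


def over_threshold(usage: Dict[str, int], threshold_pct: int) -> List[str]:
    """Mount points whose usage exceeds the shared threshold, sorted by name."""
    return sorted(mount for mount, pct in usage.items() if pct > threshold_pct)
```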
【００９８】
また、性能評価における「メモリ使用率の監視」は、基本的に「ｔｏｐ」コマンド　の結果をもとにメモリの使用状況を把握し、フリーメモリの値をもとに監視を行う。例えば、ＦｒｅｅＢＳＤ（登録商標）の場合、「ｔｏｐ」の実行結果は以下に示すようになる。
【００９９】

【０１００】
このうち、「Ｍｅｍｏｒｙ：」の行（「Ｍｅｍｏｒｙ：　Ｒｅａｌ：　３６２８Ｋ／２２Ｍ　Ｖｉｒｔ：　８７５２Ｋ／１９９Ｍ　Ｆｒｅｅ：　２９Ｍ」）のみを選定する。さらに、「Ｍｅｍｏｒｙ：」に関する「Ｆｒｅｅ：」の項目（「２９Ｍ」）を取得し、しきい値と比較させ、それを下回った場合に障害として検知する。但し、「ｔｏｐ」コマンドが標準でインストールされていない場合があるので、その場合は別途インストールするか、監視できないということになる。
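Paragraph 【０１００】 selects the `Memory:` line of the `top` output, reads its `Free:` field, and reports a fault when the value falls below the threshold. The sketch below parses the exact line quoted in the paragraph (written in ASCII). The function names and the conversion of the K/M/G suffix to kilobytes are assumptions.

```python
import re

# The "Memory:" line quoted in the description.
SAMPLE_TOP_LINE = "Memory: Real: 3628K/22M Virt: 8752K/199M Free: 29M"

UNIT_TO_KB = {"K": 1, "M": 1024, "G": 1024 * 1024}


def free_memory_kb(top_output: str) -> int:
    """Return the Free: value of the Memory: line, converted to kilobytes."""
    for line in top_output.splitlines():
        if line.startswith("Memory:"):
            match = re.search(r"Free:\s*(\d+)([KMG])", line)
            if match:
                return int(match.group(1)) * UNIT_TO_KB[match.group(2)]
    raise ValueError("no 'Memory:' line with a 'Free:' field")


def memory_fault(top_output: str, min_free_kb: int) -> bool:
    """True when free memory has dropped below the threshold."""
    return free_memory_kb(top_output) < min_free_kb
```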
【０１０１】
例えば、オペレーティングシステムがＬｉｎｕｘ（登録商標）の場合、Ｌｉｎｕｘ（登録商標）にはメモリ使用状況を表示する専用の「ｆｒｅｅ」コマンドなるものが存在するので、Ｌｉｎｕｘ（登録商標）の場合はこの「ｆｒｅｅ」コマンドを使用する。この「ｆｒｅｅ」コマンドの実行例を下記に示す。
【０１０２】

【０１０３】
この時は、「Ｍｅｍ：」行に対する「ｆｒｅｅ」の値（「２３３４８」）を取得する。
【０１０４】
次に、性能評価における「メールキュー監視」について「Ｓｅｎｄｍａｉｌ」を例に説明する。
【０１０５】
「Ｓｅｎｄｍａｉｌ」のメールキュー監視は、「ｍａｉｌｑ」コマンド　の結果をもとにメールの滞留状況を把握し、この滞留数をもとに監視を行う。例えば、「Ｓｅｎｄｍａｉｌ」の場合の「ｍａｉｌｑ」の実行結果は以下のようになる。
【０１０６】

【０１０７】
このような実行結果から、メールの滞留数を取得し、しきい値と比較させ、それを超えた場合に障害として検知する。尚、メールキューがない場合はメッセージとして「ｅｍｐｔｙ」を返すので、これを「０（数値）」として扱う。
【０１０８】
また、他の事例として、グループウェーア系Ｍａｉｌについて説明する。このグループウェーア系Ｍａｉｌのメールキュー監視は、上記「Ｓｅｎｄｍａｉｌ」のメールキュー監視に加え、グループウェーア系Ｍａｉｌのローカルで使われている「ｘ．４００」及び、この「ｘ．４００」とＳＭＴＰとの掛け橋となる「ＳＭＴＰ　Ｇａｔｅｗａｙ」の持つそれぞれのファイル数をカウントし、それを滞留数として扱うようにする。尚、「ＳＭＴＰ　Ｇａｔｅｗａｙ」は「ｘ．４００」向けと「Ｓｅｎｄｍａｉｌ」向けの２つをカウントする。
【０１０９】
グループウェーア系Ｍａｉｌ特有のメール滞留数は、ある特定のディレクトリ上のファイル数をカウントすることで求めることができるので、ファイルをカウントするスクリプトを準備しておき、これを実行することで各々滞留数を取得することができる。このようにして取得した滞留数としきい値を比較させ、それを超えた場合に障害として検知する。尚、監視は「Ｍａｉｌ　ｑｕｅｕｅ」、「ｘ４００　ｑｕｅｕｅ」、「ｓｍｔｐ　ｔｏ　ｘ４００　ｑｕｅｕｅ」、「ｓｍｔｐ　ｔｏ　Ｓｅｎｄｍａｉｌ　ｑｕｅｕｅ」の４項目それぞれについて可能である。
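Paragraphs 【０１０８】 and 【０１０９】 obtain the groupware mail backlog by counting the files in specific spool directories and comparing each count with a threshold. The sketch below shows that counting step. The directory paths are supplied by the caller because the patent does not name them, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def queue_depth(directory: Path) -> int:
    """Number of regular files in a spool directory, taken as the backlog."""
    return sum(1 for entry in directory.iterdir() if entry.is_file())


def backlogged(queues: Dict[str, Path], threshold: int) -> List[str]:
    """Names of the queues whose backlog exceeds the threshold."""
    return [name for name, path in queues.items() if queue_depth(path) > threshold]
```

A caller would pass one entry per monitored item, for example the four items listed in the text ("Mail queue", "x400 queue", "smtp to x400 queue", "smtp to Sendmail queue"), each mapped to its own directory.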
【０１１０】
さらに、他の事例として、ウイルスチェックサーバの監視は、搭載されたウイルスチェックソフト製品を用いてのメールウィルスチェック専用のメールキューを監視する。このメール滞留数もグループウェーア系Ｍａｉｌと同様に、特定のディレクトリ上のファイル数をカウントすることで求めることができる。
【０１１１】
次に、性能評価における「プロセス数監視」は、特定のプロセス数をカウントして、そのカウント数を元に監視するものである。代表的なもので言えば、「ＳｅｎｄＭａｉｌ」、「Ｄｅｌｅｇａｔｅ」、「Ｓｑｕｉｄ」等である。対象プロセスを限定するものではないので、カウント可能ものであれば種別は問題ではない。
【０１１２】
例として、「ＳｅｎｄＭａｉｌ」のプロセス数を監視する際、以下に示すように、「ｐｓ」コマンド　にてプロセス一覧を表示させ、その中で　「ｓｅｎｄｍａｉｌ」　の文字列を有するものを抜き出す。その抜き出した行数をカウントすることでプロセス数を取得する。
【０１１３】

【０１１４】
特定プロセス数は、上記のようなプロセス数をカウントするためのスクリプトを準備しておき、これを実行することでプロセス数を取得することができる。取得したプロセス数としきい値を比較し、それを超えた場合に障害として検知する。
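Paragraphs 【０１１２】 to 【０１１４】 count processes by listing them with `ps`, keeping the lines that contain the target string, and comparing the number of lines with a threshold. A sketch of the counting step is shown below. It takes the `ps` text as input so that it can be tried without running the command; the sample output is hypothetical.

```python
# Hypothetical `ps` output used only for illustration.
SAMPLE_PS = """\
  PID  TT  STAT      TIME COMMAND
  120  ??  Ss     0:01.10 sendmail: accepting connections (sendmail)
  377  ??  S      0:00.02 sendmail: server relay (sendmail)
  410  ??  Is     0:00.50 /usr/sbin/inetd
"""


def count_processes(ps_output: str, name: str) -> int:
    """Count the `ps` output lines that contain the given process name."""
    return sum(1 for line in ps_output.splitlines() if name in line)


def process_fault(ps_output: str, name: str, max_count: int) -> bool:
    """True when the process count exceeds the threshold."""
    return count_processes(ps_output, name) > max_count
```

Plain substring matching mirrors the description, but it also counts any unrelated command line that happens to contain the same string, so a stricter match may be preferable in practice.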
【０１１５】
次に、監視対象装置４に対するリモート復旧動作を説明する。
【０１１６】
まず、監視対象装置４上で生じるイベント（ＨＴＴＰ：Ｈｙｐｅｒ　Ｔｅｘｔ　Ｔｒａｎｓｆｅｒ　Ｐｒｏｔｏｃｏｌ、ＳＭＴＰ：Ｓｉｍｐｌｅ　Ｍａｉｌ　Ｔｒａｎｓｆｅｒ　Ｐｒｏｔｏｃｏｌの異常終了など）をトリガとして、リモート復旧プログラム４４ｃに予め組み込んだ障害に応じた復旧オペレーションを実行するプログラムやシェルスクリプトを登録しておく。
【０１１７】
監視対象装置４では、情報収集エージェント４４の性能監視用エージェントプログラム４４ｂが、情報記憶部４２に格納した装置保有リソースや稼動情報（各種ログファイル含む）を参照し、各種ログファイルでのパタンマッチやコマンド実行結果での監視を行う。
【０１１８】
情報収集エージェント４４は、この性能監視用エージェントプログラム４４ｂによる監視結果を、監視装置に組み込まれたネットワーク監視マネージャ１のプロセス／性能監視プログラム１１に、障害検知としてトラップを上げる。
【０１１９】
監視装置に組み込まれたネットワーク監視マネージャ１のプロセス／性能監視プログラム１１は、この情報を基に、リモート復旧判断プログラム１１ｇから、ＴＣＰポート番号設定変更プログラム１１ｂとワンタイムパスワード発生プログラム１１ａ経由で（予め該当のＴＣＰポート番号変更指示設定がない場合はデフォルト）、監視対象装置４の情報収集エージェント４４のリモート復旧プログラム４４ｃを起動させる為のワンタイムパスワード付きのリモートコマンドを、通信回線７を介して監視対象装置４に送る。
【０１２０】
監視対象装置４は、ＴＣＰ処理部４１を介して情報収集エージェント４４に専用に割当てたＴＣＰポート番号で、プロセス／性能監視プログラム１１と情報収集エージェント４４を接続する。
【０１２１】
情報収集エージェント４４は、プロセス／性能監視プログラム１１からのリモートコマンドに付与されたワンタイムパスワードを、ワンタイムパスワード認証プログラム４４ａで認証させた後に、リモートコマンドに対応して、リモート復旧プログラム４４ｃに対して、予め登録されている障害に応じた復旧オペレーションを実行するプログラムやシェルスクリプトを起動する。
【０１２２】
次に、第３の例として、監視対象装置５に対する監視動作について説明する。
【０１２３】
この監視対象装置５は、ネットワーク監視マネージャ１を設けた監視装置と、この監視装置とは別系統の監視装置２から同時に監視されるものであり、それぞれ（監視装置）に監視用通信回線８と監視用通信回線９で接続されている。
【０１２４】
そして、監視対象装置５には、別系統の監視装置２用の情報収集のための別系統の監視装置用の専用情報収集エージェント５Ａが設けられ、また、ネットワーク監視マネージャ１側には、中継サーバプログラム１２のサブシステムとして、別系統の監視装置の専用情報収集エージェントプログラム１２Ａが設けられている。尚、別系統の監視装置用の専用情報収集エージェント５Ａと別系統の監視装置の専用情報収集エージェントプログラム１２Ａとは同じ機能を有する。
【０１２５】
中継サーバプログラム１２の統合監視情報管理プログラム１２ａにより、ネットワーク監視マネージャ１の持つ監視情報と、別系統の監視装置２が持つ監視情報を仮想的に一体化させ、これにより、ネットワーク監視マネージャ１と別系統の監視装置２の監視機能を連携させる。
【０１２６】
以下、例として、別系統の監視装置２には、商用ＵＮＩＸ（登録商標）系のリモート復旧機能があるが、ＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）系（ＦｒｅｅＢＳＤ（登録商標）、Ｌｉｎｕｘ（登録商標）など）に対しては監視機能が無くリモート復旧対象外であるとし、また、ネットワーク監視マネージャ１には、ＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）のプロセス監視と性能監視および復旧機能を有するがリモートでの復旧機能が無いものと想定し、このような環境において、監視対象装置５でＨＴＴＰの障害が発生する際の動作処理を、図４を用いて説明する。
【０１２７】
このような監視対象装置５でＨＴＴＰの障害が発生すると（▲１▼）、監視対象装置５に設けた情報収集エージェント５４におけるサブプログラムの性能監視用エージェントプログラム５４ｂで検出し、ログ情報に記録する（▲２▼）。
【０１２８】
ネットワーク監視マネージャ１は、プロセス／ステータス確認プログラム１１ｃにより、所定の時間間隔でサブプログラムの性能監視用エージェントプログラム５４ｂからログ情報を取得し、監視対象装置５でのＨＴＴＰ障害を検知する（▲３▼）。
【０１２９】
このように、監視対象装置５でのＨＴＴＰ障害を検知すると、監視タイミング時間調整変更プログラム１１ｄにより、プロセス／ステータス確認プログラム１１ｃによる性能監視用エージェントプログラム５４ｂからのログ情報の取得時間間隔を短く、例えば、５分間隔から１分間隔にする。
【０１３０】
また、この際の障害状況により、障害管理用Ｗｅｂプログラム１１ｆにおいて、警告、障害、アラーム鳴動等、段階的にレベル分けしたＷｅｂ通報情報を生成し、ユーザ側および監視センタに送出する。
【０１３１】
また、プロセス／ステータス確認プログラム１１ｃで監視対象装置５のＨＴＴＰ障害を検知すると、リモート復旧判断プログラム１１ｇが、当該障害に対するリモートでの復旧機能の有無を判別する。ここでは、当該障害に対する復旧機能は有するがリモートでの復旧機能は無いとの判別結果となり、リモート復旧判断プログラム１１ｇから障害管理用Ｗｅｂプログラム１１ｆに復旧指示が出力される。
【０１３２】
このリモート復旧判断プログラム１１ｇから出力される復旧指示およびプロセス／ステータス確認プログラム１１ｃで取得した性能監視ログ情報を、障害管理用Ｗｅｂプログラム１１ｆは、障害復旧情報リスト生成機能１１ｆ_１により、チェックシート情報１１ｆ_２に編集する。このチェックシート情報１１ｆ_２は、別系統の監視装置２との共通化を図るようチェックシート形式となっている。
【０１３３】
この編集結果情報は、監視情報同期プログラム（ＮＦＳ）１１ｆ_３により、ＮＦＳを利用して、中継サーバプログラム１２の監視情報同期プログラム（ＮＦＳ）１２ｄに渡され、統合監視情報プログラム１２ａに伝達される（▲４▼、▲５▼）。
【０１３４】
このように、統合監視情報プログラム１２ａにおいては、別系統の監視装置２で登録されている障害ステータス情報をチェックシート（監視対象名称、障害ステータス情報、性能監視ログ情報、障害と同じ扱いで警報する情報）形式で登録し、このチェックシート情報１２ａ_１に基づき、別系統の監視装置用の専用情報収集エージェントプログラム１２Ａが、監視対象装置５のＨＴＴＰ障害を検知する。
【０１３５】
別系統の監視装置用の専用情報収集エージェントプログラム１２Ａによる監視対象装置５のＨＴＴＰ障害の検知動作に基づき、統合監視情報プログラム１２ａは、チェックシート情報１２ａ_１における「ＨＴＴＰ復旧指示」を読み出し、ソケットプログラム１２ｃを介して別系統の監視装置２に伝送し、別系統の監視装置２に対してリモート復旧指示のトラップをあげる（▲６▼）。
【０１３６】
この別系統の監視装置２は、通常は、障害検知機能２３により障害を検知すると、障害復旧用テンプレート２１に従いリモート復旧処理を行うが、ここでは、ＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）対応の復旧オペレーションを実行するプログラムやシェルスクリプトが無いので、監視対象装置５の情報収集エージェント５４の当該リモート復旧プログラム５４ｃを起動するために、ソケットプログラム２１ａを、予めリモート復旧機能２２のアドオンソフトとして、障害復旧用テンプレート２１に登録しておく。
【０１３７】
これにより、別系統の監視装置２は、ソケットプログラム２１ｂを介して接続された監視対象装置５に、情報収集エージェント５４のリモート復旧プログラム５４ｃの復旧オペレーションを実行するプログラムやシェルスクリプトをリモートコマンド発行し（▲７▼）、監視対象装置５においてＨＴＴＰ復旧オペレーションプログラム５４ｃ_１により復旧処理を行う（▲８▼）。
【０１３８】
尚、監視対象装置５において、別系統の監視装置２の配下用に組み込まれている別系統監視装置用専用の情報収集エージェント５Ａと、ネットワーク監視マネージャ１配下の情報収集エージェント５４とは、情報記憶部５２に格納されている装置保有リソース情報や稼動情報から共通に情報を収集するので、情報の同期ズレなどは発生しない。
【０１３９】
このように、監視対象装置５の監視機能を連携して利用することにより、このマルチベンダ環境下での複数の監視装置の監視運用の統合化が実現できる。
【０１４０】
次に、このようなマルチベンダ環境下での複数監視装置の監視機能の連携（トレース）動作に関して、監視対象装置５におけるディスク使用率やログ情報などの性能を監視する動作を例に説明する。
【０１４１】
ネットワーク監視マネージャ１から監視対象装置５に組込みこまれた情報収集エージェント５４の性能監視用エージェントプログラム５４ｂに性能監視情報を収集するためにポーリングを実施する。
【０１４２】
情報収集する内容は、例えば、「Ｌｏａｄ　ａｖｅｒａｇｅ　０．１３」、「Ｆｒｅｅ　Ｍｅｍｏｒｙ　１７５Ｍ」、「ｆｉｌｅ　ｓｙｓｔｅｍ　（／）　２９％」、「ｆｉｌｅ　ｓｙｓｔｅｍ　（／ｕｓｒ）　６２％」、「ｆｉｌｅ　ｓｙｓｔｅｍ　（／ｖａｒ）　１００％」、「ｆｉｌｅ　ｓｙｓｔｅｍ　（／ｖａｒ／ｍａｉｌ）　０％」、「ｆｉｌｅ　ｓｙｓｔｅｍ　（／ｖａｒ／ｓｐｏｏｌ／ｍｑｕｅｕｅ）　０％」、「ｉｎｅｔｄ　ｏ」、「ｓｙｓｌｏｇｄ　ｕｐｄａｔｅ　ｏ」、「ｎａｍｅｄ　ｏ」、「ｓｅｎｄｍａｉｌ　ｐｒｏｃｅｓｓ　１」、「Ｍａｉｌ　ｑｕｅｕｅ　０」、「ｄｅｌｅｇａｔｅ　ｐｒｏｃｅｓｓ　１」、「ｍｅｓｓａｇｅｓ　Ａｐｒ　２０　０５：３６：０３　監視対象装置５　ｋｅｒｎｅｌ：　／ｖａｒ：　ｏｐｔｉｍｉｚａｔｉｏｎ　ｃｈａｎｇｅｄ　ｆｒｏｍ　ＳＰＡＣＥ　ｔｏ　ＴＩＭＥ　　Ａｐｒ　２２　０３：１０：０４　監視対象装置５　ｋｅｒｎｅｌ：　／ｖａｒ：　ｏｐｔｉｍｉｚａｔｉｏｎ　ｃｈａｎｇｅｄ　ｆｒｏｍ　ＴＩＭＥ　ｔｏ　ＳＰＡＣＥ」等となる。
【０１４３】
ネットワーク監視マネージャ１は、上記データをテンポラリファイルとして保存し、予め監視設定ファイルに設定されたしきい値と比較し、監視対象装置５の障害発生の判定を行う。比較終了後、このテンポラリファイルは削除される。
【０１４４】
例えば、ディスク使用率がしきい値を越えて障害状態となった場合、障害管理用Ｗｅｂプログラム１１ｆにおいて、チェックシート情報１１ｆ_１を生成して、このチェックシート情報１１ｆ_１をＮＦＳでファイルシェアをしている中継サーバプログラム１２から、別系統の監視装置用の専用情報収集エージェントプログラム１２Ａを経由し、別系統の監視装置２に、この障害情報を通知する。
【０１４５】
障害管理用Ｗｅｂプログラム１１ｆでは、別系統の監視装置の監視下の障害を検知したことをオペレータコンソール画面などに警告等する。また、リモート復旧判断プログラム１１ｇにおいて、別系統の監視装置２内のリモート復旧機能２２の復旧対象か否か判定する。
【０１４６】
復旧対象の場合、ソケットプログラム１２ｃにより、別系統の監視装置２にアドオンソフトとして組み込まれたソケットプログラム２１ａを介して、リモート復旧機能２２にある障害復旧用テンプレート２１（ディスク障害復旧手順）を動作させ、監視対象装置５に組み込んだネットワーク監視マネージャ１の監視下にある情報収集エージェント５４のリモート復旧プログラム５４ｃ内のディスク障害復旧プログラムに起動をかける。
【０１４７】
このようにして、別系統の監視装置２からの上記アクセスを受け付けた監視対象装置５はネットワーク監視マネージャ１の監視下にある情報収集エージェント５４の専用ディレクトリ下のｂｉｎディレクトリ下に予め用意された復旧オペレーションプログラム（「ｄｉｓｋ＿ｒｅｃｏｖｅｒ．ｓｈ」）を実行する。
【０１４８】
次に、図１における監視装置に組み込まれたネットワーク監視マネージャ１の稼働月次レポート自動作成プログラム１３の動作を説明する。
【０１４９】
稼動月次レポート自動作成プログラム１３は、図４で示す統合監視情報管理プログラム１２ａのチェックシート情報１２ａ_１から，監視対象装置の月間のサービス稼働率とリソース使用率を算出し、「サービス稼働率表」と「リソース使用率推移グラフ」の月次レポートを作成する機能である。作成するレポートの詳細と画面を、図５および図６に示す。
【０１５０】
図５は、図１における稼動月次レポート自動作成プログラムで作成されるサービス稼働率表の構成項目内容例を示す説明図であり、図６は、図１における稼動月次レポート自動作成プログラムで作成されるリソース使用率推移グラフの構成項目内容例を示す説明図である。
【０１５１】
図５に示すように、月間の「サービス稼働率」は、「項目」と「単位」および「説明」欄からなり、例えば、「稼働率」は、「％」を単位とした、計画停止時間を除いた、稼働時間の割合であり、「（稼働率）＝（稼働時間）／（（全対象時間）−（計画停止時間））」の式で求められる。
【０１５２】
また、「稼働時間」は、「分」を単位としたサービス稼動時間であり、「（稼働時間）　＝　（全対象時間）−（計画停止時間）−（停止時間）」の式で求められ、「停止回数」は「回」を単位に、サービスが停止した回数で計画停止は除いた値となり、「停止時間」は「分」を単位に、サービスが停止した時間で計画停止は除いた値となり、「警告（応答遅延）回数」は「回」を単位に、サービス停止までには至らないが，応答遅延を検出した回数が記録され、「計画停止回数」は「回」を単位に、計画停止した回数が記録され、「計画停止時間」は「分」を単位に、計画停止した時間が記録される。
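Paragraphs 【０１５１】 and 【０１５２】 give two formulas: operating time = total time − planned stop time − stop time, and availability = operating time / (total time − planned stop time). The sketch below evaluates them in minutes. The function names and the example figures (a 30-day month with 240 minutes of planned stops and 90 minutes of unplanned stops) are illustrative assumptions.

```python
def operating_minutes(total_min: float, planned_stop_min: float, stop_min: float) -> float:
    """Operating time = total time - planned stop time - stop time."""
    return total_min - planned_stop_min - stop_min


def availability_pct(total_min: float, planned_stop_min: float, stop_min: float) -> float:
    """Availability (%) = operating time / (total time - planned stop time)."""
    window = total_min - planned_stop_min
    if window <= 0:
        raise ValueError("no scheduled service time in the period")
    return 100.0 * operating_minutes(total_min, planned_stop_min, stop_min) / window


# Example: 30 days = 43200 minutes, 240 planned, 90 unplanned.
EXAMPLE = availability_pct(43200, 240, 90)
```

Since planned stops are removed from both the numerator and the denominator, a month with long maintenance windows is not penalised in the availability figure.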
【０１５３】
そして、「停止時間レベル別停止回数」は「回」を単位に、サービスが停止した時間の長さ別の停止回数で計画停止は除く値が記録される。また、この「停止時間レベル別停止回数」においては、デフォルトの停止レベルは，「レベルＡ：２時間以上」、「レベルＢ：１時間以上２時間未満」、「レベルＣ：３０分以上１時間未満」、「レベルＤ：３０分未満」で、停止レベルを規定する停止時間は，設定変更可能である。
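Paragraph 【０１５３】 groups unplanned stops by duration into four default levels: A (2 hours or more), B (1 hour to under 2 hours), C (30 minutes to under 1 hour), and D (under 30 minutes), with configurable boundaries. A sketch of the classification and the per-level tally follows; the names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Default lower bounds in minutes for levels A to C; anything shorter is D.
DEFAULT_LEVELS: Tuple[Tuple[str, float], ...] = (("A", 120), ("B", 60), ("C", 30))


def stop_level(duration_min: float, levels=DEFAULT_LEVELS) -> str:
    """Return the level (A-D) for one stop of the given duration."""
    for name, lower_bound in levels:
        if duration_min >= lower_bound:
            return name
    return "D"


def stops_per_level(durations_min: Iterable[float]) -> Dict[str, int]:
    """Count stops in each level, including levels with no stops."""
    counts = Counter(stop_level(d) for d in durations_min)
    return {name: counts.get(name, 0) for name in "ABCD"}
```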
【０１５４】
尚、月をまたがる停止／警告／計画停止は，前後の月でそれぞれ停止／警告／計画停止回数にカウントする。また，停止時間レベルも，前後の月でそれぞれの停止時間により計算する。また、停止／警告時間に引き続いて計画停止に入った場合，計画停止前で，１回の停止／警告とカウントする。さらに、ｐｉｎｇ監視で停止／警告と判定された時間は，全てのサービスも停止／警告と判定された時間とする。
【０１５５】
「リソース使用率推移グラフ」は、ディスクやメモリなどのサーバリソースについて，対象月間中の使用率の推移を示すグラフであり、その構成項目内容は、図６に示すように、「項目」と「単位」および「説明」欄からなる。
【０１５６】
例えば、「ディスク使用量」は、「％」を単位として、各パーティションの日毎の最大使用率をプロットした推移グラフとなり、「空メモリ量」は、「Ｍｂｙｔｅ」を単位に、空メモリ量の日毎の最小量をプロットした推移グラフとなり、「ＣＰＵ負荷平均」は、ＣＰＵ負荷平均の日毎の最大値と平均値をプロットした推移グラフとなる。
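Paragraph 【０１５６】 defines the series plotted in the resource usage graph: the daily maximum disk usage per partition, the daily minimum free memory, and the daily maximum and mean CPU load. The sketch below derives those per-day values from raw samples. The `(day, value)` sample layout and the function names are assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, float]  # (day such as "2001-04-22", measured value)


def by_day(samples: Iterable[Sample]) -> Dict[str, List[float]]:
    """Group the measured values by day."""
    grouped = defaultdict(list)
    for day, value in samples:
        grouped[day].append(value)
    return dict(grouped)


def daily_max(samples: Iterable[Sample]) -> Dict[str, float]:
    """Daily maximum, e.g. disk usage (%) of one partition."""
    return {day: max(values) for day, values in by_day(samples).items()}


def daily_min(samples: Iterable[Sample]) -> Dict[str, float]:
    """Daily minimum, e.g. free memory (Mbyte)."""
    return {day: min(values) for day, values in by_day(samples).items()}


def daily_max_and_mean(samples: Iterable[Sample]) -> Dict[str, Tuple[float, float]]:
    """Daily (maximum, mean), e.g. CPU load average."""
    return {day: (max(values), mean(values)) for day, values in by_day(samples).items()}
```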
【０１５７】
以上、図１〜図６を用いて説明したように、本例では、マルチベンダ環境の分散コンピュータネットワークシステムにおける各監視対象装置のリモート監視を行うシステムとして、各監視対象装置（データ伝送装置やサーバ装置）に情報収集エージェントを、また、監視装置にネットワーク監視マネージャを組み込み、監視対象装置において、情報収集エージェントにより、別系統の監視装置専用に組込まれた情報収集エージェントと共存させ、複数台の監視装置と情報共用を実現することにより、マルチベンダ環境のコンピュータネットワークシステムにおける各監視サポートを統合して行う。
【０１５８】
また、監視装置と監視対象装置間にワンタイムパスワードによる認証を行う機能を設けることで、監視装置になりすましてのユーザ側の監視対象装置への不正侵入を防止することが可能となる。
【０１５９】
また、障害単位で担当者リスト、電話連絡の有無、重要度を示すメッセージを監視センタ装置やユーザ装置に表示する機能を設けることで、センタおよびユーザ側において、障害を誰に伝えればよいかの検索が容易となり、迅速な通報等が可能となる。
【０１６０】
また、ＮＦＳ技術を利用して、監視情報が保存されるそれぞれの監視装置間をネットワーク結合する機能を設けることにより、サーバ負荷軽減を図り、かつ、複数の監視装置間の監視情報を同期させ、一元管理することができ、ＴＣＯの削減が可能となる。
【０１６１】
また、ログ情報に用いられるアドレスや識別子、文字の配列などを登録し、ログ情報を検索して、同じパターンを検出した場合、予め登録したアクション動作をさせるパタンマッチ処理機能を設けることにより、ＳＮＭＰなどのように障害情報をログ上に出力するだけでは不可能な監視も可能となり、さらに、障害検知から通報そして復旧までの障害監視運用過程を自動的に実行することができる。
【０１６２】
また、ユーザ側に提供する監視情報は、監視状態を一元的に把握できる構成でＷｅｂ画面で提供し、かつ、階層が深くなるほど詳細情報を提供する表示構成とすることにより、ユーザと監視センタの双方向での遠隔監視を実現し、迅速な障害体制の確立が可能となる。
【０１６３】
また、監視情報結果から監視対象装置ごとの計画停止時間を含めた月間サービス稼動率とリソース使用率を算出し、サービス稼働率表（稼働率、稼働時間、停止回数、停止時間、警告回数、計画停止回数と時間）と重要障害発生頻度管理（レベル４区分し色で警告）およびリソース使用率推移グラフ（閾値との比較表示、週単位比較表示）の稼動月次レポートを自動作成してＷｅｂ画面で提供する機能を設けることにより、データ伝送装置やサーバ装置等の監視対象装置のシステム障害を事前に予測する情報を提供することが可能となる。
【０１６４】
このように、本例では、マルチベンダ環境下での物理的ネットワークからアプリケーション層までを対象とした「監視から復旧及び運用管理」のシームレス化を実現させた。そして、新しい監視技術とＷｅｂ技術を活用し、監視センタで障害発生を検知したと同時にユーザ側にもＷｅｂ画面で通知する双方向監視による迅速な対応（情報収集／分析から障害検知及びリモート復旧）を可能とした。また、ＣＳ（クライアント・サーバ）技術思想を十分に考慮した運用管理の効率化と省力化を図り、ＴＣＯ（トータル運用コスト）の削減・信頼性面からの先手管理（データ伝送装置やサーバ装置のシステム障害を事前に予測する）を可能とした。
【０１６５】
このことにより、監視センタは、いつ障害が発生するか、また発生したら障害内容に応じてその担当者の連絡先を調べて連絡と、その対応指示を待つと言った行為の連続で監視装置画面をたえずチェックするなど常時緊張を強いられていたことから開放される。
【０１６６】
また、ユーザ側においては、マルチベンダ環境下の監視制限により個別に監視しなければならなかったグループウェーア系Ｍａｉｌ、ファイヤーウォールやディレクトリ（ＬＤＡＰ）などの共通アプリケーションソフト監視とその対象ＯＳ（ＦｒｅｅＢＳＤ（登録商標）、Ｌｉｎｕｘ（登録商標）等のＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）系、商用ＵＮＩＸ（登録商標）系やそれ以外のＯＳなど多岐にわたる）での監視運用と月次稼動報告業務の煩雑さから開放される。
【０１６７】
この結果、リモート型運用監視・管理サービスへのノウハウ適用範囲が広がった。例えば、他社製品と連携する中継サーバ機能により、監視装置のマルチベンダ化による監視業務分散等の問題を解決でき、また、監視システムからサーバ管理担当者毎に障害・復旧状況を自動的に通知する機能により、監視業務の工数を低減でき、また、従来の監視技術では未サポートであるＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）のリモート自動復旧プログラム開発で専門分野の人材確保対応など運用の実務にとっての効果が得られる。
【０１６８】
従って、本例のネットワーク監視システムは、インターネット時代には必須なネットワーク監視技術となる。本例の技術を用いないでネットワークを挟んだ分散オフィス先の運用監視をした場合、セキュリティが問題となるので、分散先に監視装置を設置した分散監視運用の体制となり、設備面・運用人員等のコスト面で増大する。
【０１６９】
尚、本発明は、図１〜図６を用いて説明した例に限定されるものではなく、その要旨を逸脱しない範囲において種々変更可能である。例えば、図２の説明において、本例では、監視センタとは別の箇所に設置された監視装置が、ネットワーク監視マネージャ１内の障害管理用Ｗｅｂプログラム１１ｆと連動して、ユーザと監視センタへの同時通知を行うものとしているが、情報収集エージェント単体で、ユーザと監視センタへの自動同時通知を行うことでも良い。本例では、複数ユーザへの通知や、障害区分に応じた通知、性能情報やしきい値管理および障害復旧指示などのためのデータベースが必要となるので、エージェントの負荷軽減させるために障害管理用Ｗｅｂプログラム１１ｆと連動させ、この部分の情報を付加しユーザと監視センタへの同時通知をする仕組みとしている。
【０１７０】
また、図４での説明として本例では、ネットワーク監視マネージャ１に、ＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）のリモートでの復旧機能が無いものとしたが、ネットワーク監視マネージャ１に、ＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）のリモート復旧機能も持たせることでも良い。この場合、監視対象装置５におけるＨＴＴＰ障害の復旧は、別系統の監視装置２を介することなく、ネットワーク監視マネージャ１を設けた監視装置から直接、リモート復旧させることができる。
【０１７１】
また、図４に示す例では、ネットワーク監視マネージャ１が、プロセス／ステータス確認プログラム１１ｃにより監視対象装置５における情報収集エージェント５４における性能監視用エージェントプログラム５４ｂのログ情報を読みとることで、監視対象装置５におけるＨＴＴＰプログラム５３ａの障害を検出しているが、ネットワーク監視マネージャ１（プロセス／性能監視プログラム１１）から監視対象装置５に対してＨＴＴＰ監視ポーリングを行い、ＨＴＴＰ監視ポーリングの無応答を検知することで、監視対象装置５におけるＨＴＴＰプログラム５３ａの障害を検出することでも良い。
【０１７２】
また、本例では、ＯＳ（オペレーティングシステム）としてＵＮＩＸ（登録商標）／ＯＳを用いた構成で説明しているが、他のＯＳであっても良い。また、ＮＦＳを別系統の監視装置との連携に用いているが、他のネットワークファイルプロトコルを用いることでも良い。
【０１７３】
また、本例のコンピュータ構成例として、光ディスクをプログラムやデータの記録媒体として用いているが、ＦＤ（Ｆｌｅｘｉｂｌｅ　Ｄｉｓｋ）等を記録媒体として用いることでも良い。また、プログラムのインストールに関しても、通信装置を介してネットワーク経由でプログラムをダウンロードしてインストールすることでも良い。
【０１７４】
【発明の効果】
本発明によれば、ネットワーク監視マネージャからネットワークを介した監視対象装置の情報収集エージェント（シェルスクリプト）へ起動をかけるとき、不正利用者から監視対象サーバのシェルスクリプトを実行をできないようにネットワーク監視マネージャと監視対象装置（サーバ装置）間の通信に認証機能を設けたので、ネットワークを利用してもセキュア通信を確保した安全な監視が可能である。また、障害発生を検知したと同時にユーザ側もＷｅｂ画面で障害を認識できる双方向監視を行うことにより、迅速な対応（情報収集／分析から障害検知及びリモート復旧）が可能である。さらには、中継サーバプログラムにより他社製品との連携が可能となり、監視装置のマルチベンダ化による監視業務分散等の問題を解決することができ、例えば、従来の監視技術では未サポートであるＰＣ−ＵＮＩＸ（ＵＮＩＸ：登録商標）等のリモート自動復旧が可能となり、専門分野の人材確保対応など運用の実務にとって効果的である。また、各システム管理部門のネットワーク運用者が最も頭を悩ます稼動統計月報作成を高信頼に自動的に作成でき、システム障害の事前予測を高精度に行う情報を提供でき、ユーザと監視センタの双方の運用実務を効率化できる。
【図面の簡単な説明】
【図１】本発明に係わるネットワーク監視システムの構成例を示すブロック図である。
【図２】図１におけるネットワーク監視システムの第１の動作例を示す説明図である。
【図３】図１におけるネットワーク監視システムの第２の動作例を示す説明図である。
【図４】図１におけるネットワーク監視システムの詳細構成例を示すブロック図である。
【図５】図１における稼動月次レポート自動作成プログラムで作成されるサービス稼働率表の構成項目内容例を示す説明図である。
【図６】図１における稼動月次レポート自動作成プログラムで作成されるリソース使用率推移グラフの構成項目内容例を示す説明図である。
【符号の説明】
１：ネットワーク監視マネージャ、１ａ：ソケットプログラム、２：別系統の監視装置、３〜５：監視対象装置、５Ａ：別系統の監視装置用の専用情報収集エージェント、６〜８：通信回線、９：別系統の監視装置用の通信回線、１０：仮想通信経路、１１：プロセス／性能監視プログラム、１１ａ：ワンタイムパスワード発生プログラム、１１ｂ：ＴＣＰポート番号設定変更プログラム、１１ｃ：プロセス／ステータス確認プログラム、１１ｄ：監視タイミング時間調整変更プログラム、１１ｅ：監視一時休止状態表示プログラム、１１ｆ：障害管理用Ｗｅｂプログラム、１１ｆ_１：障害復旧情報リスト作成機能、１１ｆ_２：チェックシート情報、１１ｆ_３：監視情報同期プログラム（ＮＦＳ）、１１ｇ：リモート復旧判断プログラム、１２：中継サーバプログラム、１２ａ：統合監視情報管理プログラム、１２ａ_１：チェックシート情報、１２ｂ：ＨＴＭＬ生成プログラム（「ＨＴＭＬ生成」）、１２ｃ：ソケットプログラム、１２ｄ：監視情報同期プログラム（ＮＦＳ）、１２Ａ：別系統の監視装置の専用情報収集エージェント、１３：稼働月次レポート自動作成プログラム、１３ａ：状態履歴情報Ｗｅｂコンテンツ生成プログラム（「状態履歴情報Ｗｅｂコンテンツ生成」）、１４：工事管理情報データベース、１５：監視条件メッセージ管理データベース、２１：障害復旧テンプレート、２１ａ，２１ｂ：ソケットプログラム、２２：リモート復旧機能、２３：障害検知機能、３１，４１，５１：ＴＣＰ処理部（「ＴＣＰポート」）、３２，４２，５２：情報記憶部（「装置保有リソース情報や稼働情報」）、３３，４３，５３：アプリケーション処理部（「アプリケーションプログラム」）、４４，５４：情報収集エージェント、４４ａ，５４ａ：ワンタイムパスワード認証プログラム、４４ｂ，５４ｂ：性能監視用エージェントプログラム、４４ｃ，５４ｃ：リモート復旧プログラム、５３：アプリケーションプログラム、５３ａ：ＨＴＴＰ、５４ｃ_１：ＨＴＴＰ復旧オペレーションプログラム、５４ｄ：ソケットプログラム。[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to network monitoring and management technology, and in particular to a network monitoring technique suitable for efficiently performing fault monitoring and performance monitoring of server devices, data transmission devices, and the like distributed over a large-scale network, thereby reducing the burden on the network's operation managers.
[0002]
[Prior art]
With the advancement of the client-server environment in computer systems, the distribution of resources on a computer network has progressed, and monitoring and management of computer devices at distributed offices have become necessary.
[0003]
In monitoring such a distributed network, it is desirable for a single monitoring device to monitor, across the network, the resources of data transmission devices and server devices in distributed offices or on the same LAN (Local Area Network). Monitoring with a single monitoring device in this way requires control by remote commands.
[0004]
However, control using remote commands makes unauthorized access possible and exposes the resources to being seen by others, which poses a security problem. Therefore, at present, a monitoring device is installed at each distributed office, for example on the same LAN, and monitors only the monitoring target devices within that LAN.
[0005]
Furthermore, because remote monitoring increases traffic on the network, a common countermeasure is to install in the monitored device (such as a server device) an agent that sends a trap only when a problem is discovered, and to have the monitoring device collect the monitoring information and management information gathered by this agent using SNMP (Simple Network Management Protocol), the basic protocol for monitoring.
[0006]
However, some of the monitoring information and management information involved is only written to a log as an abnormal message when a problem is discovered, without being trapped as a failure, and some management information uses a proprietary protocol. Such information cannot be collected by SNMP. The number of monitoring target devices and management information items that cannot be collected by SNMP is tending to increase. Examples include common application software such as groupware mail, firewalls, and directories (Lightweight Directory Access Protocol, hereinafter abbreviated as "LDAP").
[0007]
A large-scale distributed network is generally a multi-vendor environment. In other words, in a large-scale computer network system, no single company's products form a satisfactory system on their own, so cooperation with the products of other companies is also an important factor.
[0008]
Within a single LAN, there is a conventional technique that centrally manages the processes from failure detection to recovery and notification, but it cannot perform monitoring and control in a multi-vendor environment or in an environment where a plurality of monitoring devices are integrated. Therefore, at present, most cooperation is limited to products within each company's own product line.
[0009]
In addition, there is a conventional technique for reporting failures efficiently in order to reduce TCO (Total Cost of Ownership): an e-mail notifying the failure is sent automatically at the same time the failure occurs. With this technique, the e-mail is addressed to the monitoring staff, or is broadcast to all users on a mailing list.
[0010]
However, the actual reporting flow is somewhat more complicated. The monitoring center informs the network administrator, and the network administrator then finds out which server staff are involved and contacts them by telephone or e-mail, or summons the person in charge. TCO cannot actually be reduced unless problems that arise on site, such as this one, are taken into account.
[0011]
In addition, the person monitoring the network must check status on the screens provided by the monitoring devices, but because multiple monitoring devices are spread across a multi-vendor environment, the work becomes complicated and responses are delayed. For example, in conventional techniques for coordinating monitoring devices in a multi-vendor environment, icons for the subordinate monitoring devices are displayed on the screen of the master monitoring device and are merely linked to those devices. The monitoring information screens, performance monitoring information, and log information therefore remain in the screen format specific to each monitoring device and are not integrated.
[0012]
In addition, when monitoring errors in a common application such as e-mail, the messages (resource status) provided by UNIX (registered trademark)/OS and other OSs must be linked with the message logs output by the application to determine how much recovery processing is required. This must then be executed automatically, either to recover from the failure or to alert the person in charge of the system. However, no monitoring system takes into account that the required response differs by server and by failure, so users have had to develop such handling themselves.
[0013]
For example, there are various types of UNIX (registered trademark)/OS, such as FreeBSD (registered trademark), Linux (registered trademark), and commercial UNIX (registered trademark), as well as other OSs. The e-mail systems to be monitored include SMTP mail and groupware mail, and a typical monitoring technique is a check of the queue of retained messages. In status monitoring of groupware mail, one technique collects the states of "x.400", "smtpgw", and "smtp (Sendmail)" and compares the number of messages queued on the groupware mail server with a preset threshold value to determine whether a failure has occurred.
[0014]
As described above, most conventional network monitoring software products are premised on monitoring within the same LAN (use confined to one building). However, as client/server technology has advanced, resources have become decentralized. This creates a need to monitor and manage such dispersed offices (monitoring across networks and wide area LANs), to monitor from an "End to End" viewpoint that extends up to the application layer, and to ensure the reliability of the entire network system in a unified manner.
[0015]
However, in the conventional monitoring between distributed offices, secure communication is insufficient, and it is difficult to secure information security at each distribution destination. Therefore, at present, a monitoring system is constructed and operated for each distributed office.
[0016]
Thus, with the recent spread of computer networks, there are demands for (1) wide area networks, (2) decentralized networks, (3) multi-vendor environments, and (4) more efficient management and lower TCO. Users also demand a function for viewing the monitoring status and a function for recovering a server device by rebooting it remotely. In addition, because the use of intranets and e-mail in each system management department has increased sharply, constant monitoring of server devices and early recovery from failures have become important issues. However, it is difficult to predict where a bottleneck is occurring.
[0017]
In response, multi-vendor monitoring environments have been built by introducing several commercially available monitoring products and combining their strengths, but remote operation and remote control have remained inadequate. As a result, the cost (TCO) of monitoring and operation management increases, and it is difficult to unify network monitoring and maintain the reliability of the entire network system.
[0018]
[Problems to be solved by the invention]
The problems to be solved are as follows. Conventional monitoring technology for large-scale distributed networks in a multi-vendor environment gives no consideration to coexistence with an information collection agent dedicated to a monitoring device of another system, so monitoring support cannot be provided in an integrated manner. It cannot prevent a party from impersonating a monitoring device in remote monitoring and intruding into the user's server devices and the like. The network administrator who receives a failure notification must spend time and effort identifying and contacting the person in charge of the failed server. If monitoring devices are distributed to secure monitoring performance, the monitoring information as a whole cannot be managed centrally and the TCO cannot be reduced, whereas if all monitoring information is aggregated and managed in one place, the load on the management device increases. Conventional monitoring systems offer only operation-state monitoring by the PING command and performance monitoring by SNMP, and cannot automate the fault monitoring operation process from failure detection through notification and recovery. Because monitoring is distributed across multiple monitoring devices, coordination amounts to displaying, on the screen of the master monitoring device, icons for the monitoring devices under it and linking to them; the monitoring information screens, performance monitoring information, log information, and the like remain in the screen format unique to each monitoring device and are not integrated, so status checking is spread across the monitoring devices, the work becomes complicated, and responses are delayed.
Finally, utilization rate information has been provided without excluding time such as planned stops. For these reasons, monitoring in a distributed computer network system in a large-scale, multi-vendor environment cannot be supported efficiently and safely.
[0019]
SUMMARY OF THE INVENTION An object of the present invention is to solve the problems of the prior art and reduce the burden on an operation manager of a distributed computer network in a large-scale multi-vendor environment and reduce the TCO.
[0020]
[Means for Solving the Problems]
In order to achieve the above object, the present invention provides a system for remotely monitoring each monitoring target device in a computer network system in a multi-vendor environment, in which an information collection agent (program) is incorporated in each monitoring target device (data transmission device or server device) and a network monitoring manager (program) is incorporated in the monitoring device. In a monitoring target device, the information collection agent coexists with an information collection agent incorporated exclusively for a monitoring device of another system, so that information is shared with a plurality of monitoring devices and monitoring support in the multi-vendor computer network system is provided in an integrated manner. A function for performing authentication with a one-time password between the monitoring device and the monitoring target device prevents a party impersonating the monitoring device from intruding into the monitoring target device on the user side. A function for displaying, for each failure, a list of persons in charge, a message indicating the presence or absence of telephone contact, and the degree of importance makes it easy to find out who should be notified of the failure. A function that uses Network File System (NFS) technology to connect, over the network, the monitoring devices in which monitoring information is stored reduces server load, synchronizes monitoring information among the monitoring devices, and allows the information to be managed centrally. The information collection agent also has a pattern match processing function that registers addresses, identifiers, character strings, and the like used in log information, searches the log information, and performs a pre-registered action when a registered pattern is detected.
This makes it possible to monitor faults that are only written to a log, which SNMP and similar means cannot cover, and to execute automatically the fault monitoring operation process from failure detection through notification and recovery. The monitoring information provided to the user is presented on a Web screen laid out so that the monitoring state can be grasped in a unified manner, with more detailed information at each deeper level of the hierarchy; this realizes two-way remote monitoring by the user and the monitoring center and allows a failure-response structure to be established quickly. In addition, from the monitoring results the network monitoring manager calculates, for each monitoring target device, the monthly service utilization rate and resource utilization rate including planned suspension time, and automatically creates a monthly operation report comprising a service utilization rate table (utilization rate, operating time, number of stops, stop time, number of warnings, and number and duration of planned stops), frequency management of critical failures (warnings classified into four levels and color-coded), and a resource utilization rate transition graph (comparison with threshold values, week-by-week comparison). Providing this report on a Web screen supplies information for predicting, in advance, system failures of monitoring target devices such as data transmission devices and server devices.
[0021]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0022]
FIG. 1 is a block diagram showing a configuration example of a network monitoring system according to the present invention, FIG. 2 is an explanatory diagram showing a first operation example of the network monitoring system in FIG. 1, FIG. 3 is an explanatory diagram showing a second operation example of the network monitoring system in FIG. 1, and FIG. 4 is a block diagram showing a detailed configuration example of the network monitoring system in FIG. 1.
[0023]
In FIG. 1, reference numeral 1 denotes a network monitoring manager loaded into a monitoring device (described as a “network monitoring manager program” in the figure), 2 denotes a monitoring device of another system, 3 to 5 denote monitoring target devices such as data transmission devices and server devices, and 6 to 9 denote communication lines of a network such as a wide area LAN.
[0024]
Each of the devices 1 to 5 has a computer configuration including a CPU (Central Processing Unit), a main memory, a display device, an input device, an external storage device, and the like. Programs and data recorded on a storage medium such as a CD-ROM are installed in the external storage device via an optical disk drive device or the like, then read from the external storage device into the main memory and processed by the CPU to realize each function.
[0025]
The monitoring target device 3 includes a TCP processing unit 31 (described as “TCP port” in the figure) that performs TCP (Transmission Control Protocol) processing, an information storage unit 32 (described as “device-owned resource information and operation information” in the figure) that stores device-owned resource information and operation information, and an application processing unit 33 (described as “application program” in the figure) that performs processing based on an application program such as SMTP (Simple Mail Transfer Protocol) or WWW (World Wide Web).
[0026]
The monitoring target device 4 has, together with a TCP processing unit 41, an information storage unit 42, and an application processing unit 43, an information collection agent 44 (described as an “information collection agent program” in the figure) that performs information collection processing and has as subprograms a one-time password authentication program 44a, a performance monitoring agent program 44b, and a remote recovery program 44c.
[0027]
Further, the monitoring target device 5 likewise has a TCP processing unit 51, an information storage unit 52, an application processing unit 53, and an information collection agent 54 that performs information collection processing and has as subprograms a one-time password authentication program 54a, a performance monitoring agent program 54b, and a remote recovery program 54c. In addition, it has an information collection agent 5A of another system (described as a “dedicated information collection agent program for another system of monitoring devices” in the figure) that performs processing as the dedicated information collection agent for the monitoring device of another system.
[0028]
The network monitoring manager 1 incorporated in the monitoring device consists of a process/performance monitoring program 11, a relay server program 12, and an operation monthly report automatic creation program 13. The process/performance monitoring program 11 consists of a one-time password generation program 11a, a TCP port number setting change program 11b, a process/status check program 11c, a monitoring timing time adjustment change program 11d, a monitoring temporary suspension state display program 11e, a fault management Web program 11f, and a remote recovery determination program 11g. The relay server program 12 consists of an integrated monitoring information management program 12a, an HTML generation program 12b (described as "HTML generation" in the figure), a socket program 12c, and a dedicated information collection agent program 12A of the monitoring device of another system. The operation monthly report automatic creation program 13 consists of a state history information Web content generation program 13a (described as "state history information Web content generation" in the figure).
[0029]
The monitoring device 2 of another system has a failure recovery template 21 and a socket program 21a, and is connected via the virtual communication path 10 to the dedicated information collection agent program 12A of the monitoring device of another system, which is incorporated in the network monitoring manager 1.
[0030]
It is assumed that the monitored devices 3 to 5 are provided by different vendors, and the network monitoring system of the present embodiment performs remote monitoring of the monitored devices 3 to 5 in such a multi-vendor environment.
[0031]
Generally, such a multi-vendor environment is used to configure a large-scale network. In order to operate such a large-scale network, automation and standardization of network monitoring are required. In addition, network management and traffic management alone are not sufficient, and it is necessary to perform quick recovery processing, including monitoring in cooperation with applications.
[0032]
In order to monitor the network in consideration of these requirements, the following points are important.
[0033]
Point (1): The work of the monitoring operator is constantly tense. The operator must watch the monitoring screen at all times and notify the user when a failure occurs, then check who the contact person is and wait for contact or operating instructions. To respond to failures accurately according to the division of roles, quick contact is essential, and for that purpose it is important to make it easy for the operator to find out who should be notified of the failure.
[0034]
For this purpose, in the present example, a fault management Web program 11f is provided in the network monitoring manager 1 (in its process/performance monitoring program 11). On a Web screen organized per failure, it displays to both the user side and the monitoring center side at the same time a list of contact persons, a message indicating the presence or absence of telephone contact, and the degree of importance.
[0035]
Point (2): When monitoring errors in a business application, the messages (resource status) provided by the OS (operating system) must be linked with the message logs output by the application to determine how much recovery processing is necessary, and the system must then recover from the failure automatically or issue a warning to the person in charge of the system. In this example, a process/performance monitoring program 11 is provided to perform such processing.
[0036]
Point (3): Because remote monitoring software increases traffic on the network, there is a technique that collects information by incorporating an agent in the monitored server and sending a trap only when a problem is discovered. However, the number of devices and management information items that cannot be managed by SNMP alone is tending to increase. In this example, in order to deal with such a problem, the information collection agent 44 is provided with a performance monitoring agent program 44b.
[0037]
Point (4): When operating a large-scale computer network system, it is necessary to link individual operation management functions such as application management functions, software distribution / event management functions, etc. A single company does not provide a satisfactory product system. In this example, the relay server program 12 is provided in the network monitoring manager 1 to cooperate with other companies' products. This eliminates the need to install a monitoring manager for each distribution destination, and can reduce costs in terms of facilities and operating personnel.
[0038]
Point (5): Monitoring and controlling the resources of distributed offices with remote commands across a WAN or the like makes unauthorized access possible. To cope with this problem, in this example a one-time password generation program 11a is provided on the network monitoring manager 1 side and a one-time password authentication program 44a is provided on the information collection agent 44 side, supporting secure communication between the monitoring device and the monitoring target devices 4 and 5.
[0039]
Conventionally, when a single monitoring server (also referred to as a monitoring manager) monitors, across a WAN, the resources of a distributed office or of servers on the same LAN, there is a risk that the resources will be seen by others, so a monitoring server has been installed and operated at each location.
[0040]
In addition, a “Ping” command, for example, is used to monitor the operation state. Conventionally, however, the timing at which this command (Ping) is issued cannot be changed according to the monitoring state. As a result, even though a fault has actually been recovered, the monitoring manager continues to display an abnormal status because of the lag introduced by the monitoring interval. In order to cope with such a problem, in the present example, a monitoring timing time adjustment change program 11d is provided in the process/performance monitoring program 11 of the network monitoring manager 1.
[0041]
Further, in the related art, a state where the apparatus is stopped due to construction or the like is also detected as a failure, so that the accuracy of the failure information deteriorates. In order to cope with such a problem, in this example, a monitoring temporary suspension state display program 11e is provided, and based on the construction management information database 14, a stop state in construction or the like is managed separately from a failure state.
[0042]
As described above, in this example, the information collection agents 44 and 54 are incorporated in the monitoring target devices 4 and 5, and the network monitoring manager 1 is incorporated in the monitoring device. Information such as operation information and performance information of the monitoring target devices 4 and 5 and the status of the resources they hold is then collected and stored in the information storage units 42 and 52 for management.
[0043]
The information collection agents 44 and 54 of the monitoring target devices 4 and 5 have a pattern match processing function that registers addresses, identifiers, character strings, and the like used in log information, searches the log information, and runs a pre-registered action when a registered pattern is detected. This allows them to monitor faults that are only written to a log, which SNMP and similar means cannot cover, and to execute automatically the fault monitoring operation process from failure detection through notification and recovery.
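The pattern match processing described above might be sketched as follows. This is only an illustration under stated assumptions: the rule names, the regular expressions, and the two action functions are hypothetical and do not come from the specification, which describes the agent only as a shell script that registers patterns and runs pre-registered actions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Rule:
    name: str                     # hypothetical label for the registered pattern
    pattern: "re.Pattern[str]"    # registered address, identifier, or character string
    action: Callable[[str], str]  # pre-registered action run on a match


def notify(line: str) -> str:
    # Stand-in for a notification step; a real agent would alert the manager.
    return "notify: " + line


def restart_http(line: str) -> str:
    # Stand-in for a recovery step; a real agent would run a recovery command.
    return "restart-http: " + line


RULES: List[Rule] = [
    Rule("mail-queue", re.compile(r"queue overflow"), notify),
    Rule("http-down", re.compile(r"httpd.*(died|not responding)"), restart_http),
]


def scan_log(lines: Iterable[str], rules: List[Rule] = RULES) -> List[str]:
    """Search each log line and run the action of the first rule that matches."""
    results = []
    for line in lines:
        for rule in rules:
            if rule.pattern.search(line):
                results.append(rule.action(line))
                break  # one action per line in this sketch
    return results
```

Lines that match no registered pattern are simply skipped, which mirrors the idea that only pre-registered conditions trigger notification or recovery.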
[0044]
The network monitoring manager 1 periodically accesses the information collection agents 44 and 54 of the monitoring target devices 4 and 5 via the communication lines 7 and 8, acquires the various information stored in the information storage units 42 and 52, and performs failure detection, failure recovery detection, and the like. It manages the detected failure information and recovery information in an integrated manner and notifies the user's device as well as the monitoring center via a Web browser. This allows network management with excellent remote operability and simultaneous visibility to be performed in real time.
[0045]
In addition, the one-time password generation program 11a and the one-time password authentication programs 44a and 54a perform authentication between the information collection agents 44 and 54 and the network monitoring manager 1 for each unit of information collection, so that unauthorized remote access can be refused and security ensured.
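The specification does not say how the one-time password is derived or checked. As a minimal sketch, assuming a secret shared in advance between manager and agent and a counter that increases with every collection request (both assumptions, as are all names below), generation and verification could look like this:

```python
import hashlib
import hmac


def generate_otp(secret: bytes, counter: int) -> str:
    """Manager side: derive a password from the shared secret and a counter.

    HMAC-SHA256 truncated to 16 hex characters is an assumption of this
    sketch, not the algorithm used by the one-time password generation program.
    """
    msg = counter.to_bytes(8, "big")
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()[:16]


class AgentAuthenticator:
    """Agent side: accept each counter value at most once."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._last_counter = -1

    def verify(self, counter: int, otp: str) -> bool:
        if counter <= self._last_counter:
            return False  # replayed or stale password
        expected = generate_otp(self._secret, counter)
        if not hmac.compare_digest(expected.encode(), otp.encode()):
            return False  # wrong secret or tampered password
        self._last_counter = counter
        return True
```

Because a password is rejected once its counter has been used, a captured request cannot simply be replayed by a party impersonating the monitoring device.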
[0046]
Further, when the network monitoring manager 1 obtains failure information on the monitoring target devices 4 and 5, the monitoring timing time adjustment change program 11d shortens the information collection interval until recovery information for the failure is obtained. This brings forward the time at which the recovery information is obtained and improves monitoring accuracy.
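The interval adjustment can be illustrated with a small sketch. The 300-second normal interval borrows the 5-minute figure the text gives for TCP port checks; the 30-second shortened interval and the function names are assumptions made only for this example.

```python
NORMAL_INTERVAL_S = 300   # assumed normal collection interval (5 minutes)
FAILURE_INTERVAL_S = 30   # hypothetical shortened interval while a failure is outstanding


def next_interval(failure_outstanding: bool) -> int:
    """Return how long to wait before the next information collection."""
    return FAILURE_INTERVAL_S if failure_outstanding else NORMAL_INTERVAL_S


def schedule(statuses):
    """For successive check results ("ok" or "fail"), list the wait chosen after each."""
    return [next_interval(status == "fail") for status in statuses]
```

For the sequence ok, fail, fail, ok this yields waits of 300, 30, 30, and 300 seconds, so recovery is noticed within the shortened interval rather than up to five minutes later.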
[0047]
Further, using the relay server program 12, the network monitoring manager 1 manages in an integrated manner, by means of NFS (Network File System) technology, the monitoring information collected and managed under a monitoring device of another system. It detects the occurrence of a failure from this integrated information, instructs the monitoring device of the other system to perform the recovery process corresponding to the detected failure, and has a program added to that monitoring device issue a remote command so that the information collection agent of the monitoring target device performs the recovery automatically. This makes it possible, for example, for the monitoring and management tools of each vendor in an intranet environment to coexist simply by introducing an existing monitoring manager as it is.
[0048]
Hereinafter, the operation of the system in FIG. 1 will be described.
[0049]
In FIG. 1, the information collection agents 44 and 54 of the plurality of monitoring target devices 3 to 5, which include data transmission devices, server devices, and the like, are incorporated in advance from the monitoring center by remote processing.
[0050]
Various information (performance information such as fault information, device-owned resource information, operation information, and the like) collected by the information collection agents 44 and 54 and stored in the information storage units 32, 42, and 52, together with the operation status of the application programs 33, 43, and 53, is collected as monitoring information (system log information) by the network monitoring manager 1 via the network. At this time, the network monitoring manager 1 authenticates whether the device is a monitoring target device by one-time password authentication, and confirms its validity.
[0051]
Using the fault management Web program 11f, the network monitoring manager 1 performs fault detection and monitoring for the monitoring center and the user at the same time, following the procedure shown in the figure, and automatically reports the analysis results of the information and performance information.
[0052]
That is, as shown in FIG. 2, conventionally a monitoring device that detected a failure in a monitoring target device first notified the monitoring center, which collected information, analyzed and investigated it, and only then alerted and notified the user concerned, so there was a large time difference between when the monitoring center and when the user received the notification. In this example, when the network monitoring manager 1 detects a failure, it notifies both the monitoring center and the user at the same time, so the notification time difference between them becomes almost zero.
[0053]
In this example, the network monitoring manager 1 and the information collection agents 44 and 54 automatically collect failure information, analyze and investigate it, and perform remote recovery processing, so the monitoring center and the user need not collect information directly and need only analyze and investigate serious failures.
[0054]
Further, the fault management Web program 11f uses Web technology to notify the monitoring center and the user side of a failure from the network monitoring manager 1. So that an abnormality can be spotted instantly, the monitoring items and performance items are displayed on the Web screen in a check sheet format at arbitrarily set observation times, each as a numerical value or "××", which presents the state visually and lets specific changes be judged easily from the numbers.
[0055]
For example, in the Web screen display by the fault management Web program 11f, if the monitoring target devices 3 to 5 are intranet server devices, a pop-up memo describing the person in charge of the failed device, contact information, conditions, and the like appears automatically, and at the same time the fault management Web program 11f automatically sends an e-mail to the person in charge.
[0056]
On the Web screen for the monitoring center, an information message such as a faulty server name and time is displayed in a pop-up to warn the user. When the failure is recovered, the contents are automatically moved to the failure history screen on the back screen.
[0057]
If the monitoring target is an e-mail server, the failed server is displayed on the monitoring screen, a buzzer sounds, and the contact information pops up when the operator clicks the display of the failed server.
[0058]
The Web screen for the user side has a hierarchical structure. The first-level screen displays the operating status in real time, using icons for the nodes of each service at each office, colored according to a three-stage evaluation of normal / attention / abnormal. The next-level screen requires a password and displays the detailed information available from the failure occurrence log; a warning sound is also emitted on this screen.
[0059]
If the monitored devices 3 to 5 are intranet or Internet servers, the status of each of the monitored devices 3 to 5 is indicated by an icon on the first screen on the web screen for monitoring the process and performance. Displayed in three steps of normal / warning / abnormal. The next layer screen provides performance information in a check sheet format, and the next layer screen provides detailed performance information on a text basis.
[0060]
Regarding e-mail server failures, the failure location is displayed on the monitoring screen and a buzzer sounds, and when the operator clicks the failure location, a link to the operation status list screen is displayed. At the link destination, the minimum items needed to judge an abnormality instantly, such as the number of messages retained at each failure location and the SMTP response status, are displayed visually in blocks for each server.
[0061]
Incidentally, at small businesses the server may be stopped every night for nightly batch processing. To handle such cases, in this example the monitoring temporary stop state display program 11e can exclude a device from monitoring during an arbitrary time period.
[0062]
Further, in this example, the entire process of fault monitoring and operation, from the occurrence of a fault through recovery to the creation of the monthly operation statistics report, can be managed comprehensively from the Web management screen alone, without human intervention along the way.
[0063]
Hereinafter, the operation of the network monitoring system in FIG. 1 will be described. In FIG. 1, a monitoring target device 3 is a device that does not incorporate an information collection agent and is monitored only by a TCP port. As the monitoring at the TCP port, for example, there is a live / dead state check of each service process of the server device.
[0064]
The monitoring target device 4 is monitored at the TCP port and, in addition, incorporates an information collection agent 44 that includes a one-time password authentication program 44a, a performance monitoring agent program 44b, and a remote recovery program 44c; monitoring based on these programs is also performed.
[0065]
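The monitoring target device 5 has the configuration of the monitoring target device 4 and, in addition, a dedicated information collection agent 5A through which it is already monitored by the monitoring device 2 of another system. The information collection agent 54 and the dedicated information collection agent 5A coexist, and both kinds of monitoring are performed.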
[0066]
These monitored devices 3 to 5 are connected to a monitoring device (network monitoring manager 1) via a network or a wide area LAN, and the monitoring device collects and manages monitoring information of each of the monitored devices 3 to 5.
[0067]
First, a monitoring operation for the monitoring target device 3 will be described.
[0068]
The monitoring target device 3 is monitored as follows. The process / status confirmation program 11c of the network monitoring manager 1 issues the status confirmation command (PING) via the TCP port number setting change program 11b (the default port numbers are used when no TCP port number change has been set in advance). The command connects to each TCP port via the TCP processing unit 31 of the monitoring target device 3 on the communication line 6, and the process status of each monitored TCP port is checked at 5-minute intervals (the interval can be set arbitrarily).
[0069]
When a non-response of the status confirmation command (PING) is detected, the status is set to “warning” among three categories of “normal / warning / abnormal”.
[0070]
When "warning" is set in this way, as shown in FIG. 3, the monitoring timing time changing program 11b automatically shortens the PING issue interval from 5 minutes to 1 minute (the interval can be set arbitrarily). Thereafter, establishment of a TCP session to the TCP port is attempted at one-minute intervals for about 10 minutes.
[0071]
Therefore, an error message (Connection refused) is returned only when the connection cannot be established. The process / status confirmation program 11c detects a failure based on the presence / absence of the message and classifies the failure as “abnormal”.
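The distinction drawn above between an established TCP session and a refused one can be illustrated with a short Python sketch. It is not taken from the patent: the function name `check_tcp_port` and the three result labels are assumptions chosen for illustration, and only the standard `socket` module is used.

```python
import socket


def check_tcp_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP session and report the outcome.

    Returns 'open' if the session was established, 'refused' if the peer
    answered with "Connection refused", and 'no_response' for any other
    failure (timeout, unreachable host, and so on).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "no_response"
```

A caller in the role of the process / status confirmation program would treat 'open' as a response and the other two results as a missed check.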
[0072]
When a PING response is received, the process / status confirmation program 11c automatically returns to the default and enters the “normal” category.
[0073]
As described above, when "warning" is set, the network monitoring manager 1 automatically shortens the PING issue interval from 5 minutes to 1 minute and attempts to establish a TCP session to the TCP port, so that the recovery detection time is shortened and monitoring accuracy is improved.
[0074]
Further, when there is a time difference between the times managed in the system log information of the monitoring target device 3 (such as the recovery time) and the times recorded by the network monitoring manager 1 (the failure occurrence time, recovery time, Web display warning time, and so on written in the notification mail that is issued automatically for failure notification), the monitoring timing time change program 11b corrects the time referenced by the network monitoring manager 1 to the time that the monitoring target device 3 uses to manage its system log information. Because the collection times of monitoring information and performance management information spanning a plurality of monitoring devices are thereby synchronized, this is effective in failure analysis for tracing a cause by matching a plurality of logs along the time history.
[0075]
In the fault management Web program 11f, when there is no response from the TCP port (when "Connection refused" is returned), the status is set to "warning", and the icon displayed on the Web screen provided to the user device or the monitoring center device is changed from green (normal) to yellow (warning). Then, after the monitoring interval has been switched to one minute, if there is no response ten consecutive times (about 10 minutes), a failure is determined, the icon is changed from yellow to red (abnormal), and an alarm is sounded.
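The three-level classification and the interval switching described in paragraphs [0069] to [0075] can be summarized as a small state machine. The following Python sketch is illustrative only: the class and constant names are assumptions, and it assumes that the first missed response sets "warning" and that ten further consecutive misses at the one-minute interval set "abnormal".

```python
from dataclasses import dataclass

NORMAL_INTERVAL_S = 5 * 60    # default check interval (5 minutes)
WARNING_INTERVAL_S = 60       # shortened interval while not normal (1 minute)
FAILURES_FOR_ABNORMAL = 10    # consecutive misses at the short interval
ICON_COLOR = {"normal": "green", "warning": "yellow", "abnormal": "red"}


@dataclass
class PortMonitorState:
    status: str = "normal"
    consecutive_failures: int = 0

    def record(self, responded: bool) -> str:
        """Update the status from one check result and return it."""
        if responded:
            # Any response returns the target to the default state.
            self.status = "normal"
            self.consecutive_failures = 0
        elif self.status == "normal":
            # First missed response: warn and start the short interval.
            self.status = "warning"
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURES_FOR_ABNORMAL:
                self.status = "abnormal"
        return self.status

    @property
    def interval_seconds(self) -> int:
        """Interval to wait before the next check."""
        if self.status == "normal":
            return NORMAL_INTERVAL_S
        return WARNING_INTERVAL_S
```

Both intervals and the failure count are constants here because the text notes that they can be set arbitrarily.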
[0076]
In addition, when "failure" or "recovery" of the monitored device is detected, an e-mail is automatically sent to the server administrator designated in advance in the monitoring condition message management database 15. The notification destination for each user / server, the notification time and the like, and escalation to a person in charge can be set arbitrarily in the monitoring condition message management database 15.
[0077]
An example of a notification mail to be automatically sent is shown below.

[0078]
The monitoring result is logged on the Web screen as described below. These logs always display the logs of the past 5 days. Nothing is displayed when the operating state of the server is good.
[0079]

[0080]
Here, “Apr/24/2001 02:13:10 nmapp1 disk ok” is classified as “normal” and is displayed in green; “Apr/23/2001 00:13:10 nmapp1 disk warning over 90%” is classified as “warning” and is displayed in yellow; and “Apr/22/2001 10:06:39 nmapp2 dns (port 53) error” is classified as “abnormal” and is displayed in red.
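The mapping from a log line to its classification and display color can be sketched as follows. The keyword rule (the words "ok", "warning", and "error") is an assumption inferred from the three example lines; the patent does not state how the classification is encoded, and the function name is hypothetical.

```python
def classify_log_line(line: str) -> tuple[str, str]:
    """Return (classification, display color) for one monitoring log line."""
    words = line.lower().split()
    if "error" in words:
        return ("abnormal", "red")
    if "warning" in words:
        return ("warning", "yellow")
    if words and words[-1] == "ok":
        return ("normal", "green")
    raise ValueError(f"unrecognized log line: {line!r}")
```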
[0081]
In the failure management Web program 11f, when there is no response from the monitoring target TCP port, the information from the monitoring temporary stop display program 11e is referred to. That is, the monitoring temporary stop display program 11e searches the construction management information database 14 for construction stop information on the monitoring target device 3, determines whether the lack of response is a failure or a stop due to construction, and reports the result to the failure management Web program 11f.
[0082]
If the stop is not a failure, for example a stop due to construction, the fault management Web program 11f treats that time period as not monitored. When a monitoring target is not being monitored, a blue icon is displayed on the Web screen. This icon is not normally used; it is displayed only when monitoring is temporarily suspended because of a planned stop or the like.
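The planned-stop check can be pictured as a lookup of the current time against registered stop windows. In the Python sketch below, the list of (start, end) pairs stands in for the construction stop information in the construction management information database 14; the function name and the data layout are assumptions for illustration.

```python
from datetime import datetime

STATUS_COLOR = {"normal": "green", "warning": "yellow", "abnormal": "red"}


def icon_color(status: str, now: datetime,
               planned_stops: list[tuple[datetime, datetime]]) -> str:
    """Return the icon color, using blue while monitoring is suspended."""
    for start, end in planned_stops:
        if start <= now < end:
            return "blue"  # planned stop: the target is treated as not monitored
    return STATUS_COLOR[status]
```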
[0083]
The stop time information is accumulated in the operation monthly report automatic creation program 13. From the accumulated results, this program calculates the monthly service operation rate and resource usage rate for each monitoring target device (server device or the like) and automatically creates an operation monthly report consisting of a service operation rate table (operation rate, operation time, number of stops, stop time, number of warnings, number and duration of planned stops), frequency management of critical failures (classified into four levels and highlighted in color), and a resource usage transition graph (comparison with threshold values, week-by-week comparison). The state history information Web content generation 13a then processes the report into information for predicting system failures of data transmission devices and server devices in advance.
[0084]
Next, the monitoring operation for the monitoring target device 4 will be described. For the “status monitoring” of the monitoring target device 4, the process / status confirmation program 11c of the network monitoring manager 1 issues the status confirmation command (PING) via the TCP port number setting change program 11b (the default port numbers are used when no TCP port number change has been set in advance) and connects to each TCP port via the TCP processing unit 41 of the monitoring target device 4 on the communication line 7. This is the same monitoring process as for the monitoring target device 3, so only the “performance monitoring” is described below.
[0085]
In the monitoring target device 4, the performance monitoring agent program 44b is a subprogram incorporated in the information collection agent 44, but it can also operate independently. It collects CPU load information, disk usage information, memory usage information, mail queue information, and the number of processes.
[0086]
In addition to an action function based on pattern matching against log information, the information collection agent 44 has a function for communicating with the network monitoring manager 1 via a TCP port dedicated to monitoring (for example, port number “8888”), a function enabling coexistence with an information collection agent dedicated to a monitoring device of another system, and a function for cooperating with each of the one-time password authentication program 44a, the performance monitoring agent program 44b, and the remote recovery program 44c.
[0087]
From the process / status confirmation program 11c, via the TCP port number setting change program 11b (the default port numbers are used when no TCP port number change has been set in advance), the network monitoring manager 1 issues a remote command for activating the performance monitoring agent program 44b of the information collection agent 44 of the monitored device 4. The command is delivered through the TCP processing unit 41 of the monitoring target device 4 on the communication line 7, over the TCP port (“8888”) dedicated to the information collection agent 44. It starts the various scripts registered in advance in the performance monitoring agent program 44b (for collecting CPU load information, disk usage information, memory usage information, mail queue information, the number of processes, and so on).
[0088]
At this time, a one-time password generated by the one-time password generation program 11a is attached to the remote command that starts the performance monitoring agent program 44b of the information collection agent 44 of the monitored device 4. The one-time password authentication program 44a authenticates the command on the basis of this one-time password, after which the command is passed to the performance monitoring agent program 44b, which is then activated.
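The patent does not disclose how the one-time password is derived or checked. Purely as an illustration of the generate-then-authenticate flow, the Python sketch below assumes a secret shared between manager and agent, a random nonce per command, and an HMAC-SHA256 tag; the class names and this scheme are assumptions, not the patented method.

```python
import hashlib
import hmac
import secrets


def _tag(secret: bytes, nonce: str, command: str) -> str:
    message = f"{nonce}:{command}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class OneTimePasswordGenerator:
    """Manager side (the role played by program 11a)."""

    def __init__(self, shared_secret: bytes):
        self._secret = shared_secret

    def issue(self, command: str) -> tuple[str, str]:
        nonce = secrets.token_hex(16)  # fresh value for every remote command
        return nonce, _tag(self._secret, nonce, command)


class OneTimePasswordAuthenticator:
    """Agent side (the role played by programs 44a / 54a)."""

    def __init__(self, shared_secret: bytes):
        self._secret = shared_secret
        self._used_nonces: set[str] = set()

    def verify(self, command: str, nonce: str, password: str) -> bool:
        if nonce in self._used_nonces:
            return False  # each password is accepted only once
        expected = _tag(self._secret, nonce, command)
        if not hmac.compare_digest(expected, password):
            return False
        self._used_nonces.add(nonce)
        return True
```

Binding the password to the command text means that a captured password cannot be replayed, nor reused to start a different program.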
[0089]
As described above, after the one-time password authentication, the performance monitoring agent program 44b edits the performance numerical value corresponding to the remote command in the form of a check sheet as performance monitoring information, and transmits it to the process / performance monitoring program 11.
[0090]
In the process / performance monitoring program 11, the fault management Web program 11f compares the performance value sent from the performance monitoring agent program 44b with a preset “threshold”; when the value exceeds the threshold (or, depending on the item, falls below it), this is detected as a failure and reported. Note that the performance monitoring agent program 44b does not store performance monitoring information; it leaves only access log information.
[0091]
“Load average monitoring (collecting CPU load information)” in the performance evaluation basically grasps the CPU load status from the result of the “uptime” command and monitors on the basis of the average over the past minute. For example, in the case of FreeBSD (registered trademark), the execution result of “uptime” is as follows.
[0092]

[0093]
The items (0.10, 0.09, 0.08) under the above-mentioned "load average" are acquired, compared with the threshold value, and a warning is issued when the value exceeds the threshold value. If this state continues for a while, it is detected as a failure.
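Extracting the load averages from the “uptime” result can be sketched in Python as follows. The sample line is a hypothetical example written for this sketch (only the three values 0.10, 0.09, 0.08 come from the text), and the function name is an assumption.

```python
import re

# Hypothetical uptime output; only the three load values come from the text.
SAMPLE_UPTIME = " 3:27PM  up 12 days,  4:02, 2 users, load averages: 0.10, 0.09, 0.08"


def parse_load_averages(uptime_output: str) -> tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_output
    )
    if match is None:
        raise ValueError("no load average found in uptime output")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen
```

The first element is the past-minute average on which the monitoring is based.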
[0094]
In this way, the failure is not detected immediately when the threshold value is exceeded, but is recognized as a failure if the threshold value continues to be exceeded. The warning period can be set arbitrarily.
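The rule that a value above the threshold first raises a warning, and becomes a failure only if it persists, can be sketched as below. The class name and the use of timestamps in seconds are assumptions; the warning period is a parameter because the text says it can be set arbitrarily.

```python
from typing import Optional


class SustainedThresholdDetector:
    """Warn when a value exceeds the threshold; fail if it stays above it."""

    def __init__(self, threshold: float, warning_period_s: float):
        self.threshold = threshold
        self.warning_period_s = warning_period_s
        self._exceeded_since: Optional[float] = None

    def observe(self, value: float, timestamp_s: float) -> str:
        if value <= self.threshold:
            self._exceeded_since = None  # back under the threshold
            return "normal"
        if self._exceeded_since is None:
            self._exceeded_since = timestamp_s  # start of the warning period
        if timestamp_s - self._exceeded_since >= self.warning_period_s:
            return "abnormal"
        return "warning"
```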
[0095]
“Monitoring of disk usage rate” in the performance evaluation involves monitoring the disk usage status based on the result of the “df” command, and performs monitoring in file system units. For example, in the case of FreeBSD (registered trademark), the execution result of “df” is as follows.
[0096]

[0097]
The value of “Capacity” (52%, 48%, 0%) corresponding to the file system (“Filesystem”) is obtained, compared with a threshold value, and if it exceeds the threshold value, detected as a failure. A plurality of file systems can be monitored at the same time, but the threshold value is the same. Note that up to two thresholds can be specified.
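Reading the “Capacity” column per file system can be sketched in Python as follows. The sample “df” output is hypothetical (only the percentages 52%, 48%, and 0% come from the text), the function names are assumptions, and the parser assumes that file system names and mount points contain no spaces.

```python
# Hypothetical df output; device names and block counts are made up.
SAMPLE_DF_OUTPUT = """\
Filesystem  1K-blocks    Used   Avail Capacity  Mounted on
/dev/ad0s1a     99183   47251   43998    52%    /
/dev/ad0s1f   3360437 1478592 1613011    48%    /usr
/dev/ad0s1e     19815       8   18222     0%    /var
"""


def parse_df_capacity(output: str) -> dict[str, int]:
    """Map each mount point to its Capacity value in percent."""
    capacities = {}
    for line in output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        capacities[fields[5]] = int(fields[4].rstrip("%"))
    return capacities


def file_systems_over(capacities: dict[str, int], threshold_percent: int) -> list[str]:
    """Return the mount points whose usage exceeds the shared threshold."""
    return [mount for mount, used in capacities.items() if used > threshold_percent]
```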
[0098]
“Monitoring of memory usage rate” in the performance evaluation basically grasps the memory usage status based on the result of the “top” command, and performs monitoring based on the value of the free memory. For example, in the case of FreeBSD (registered trademark), the execution result of “top” is as follows.
[0099]

[0100]
Among them, only the line of “Memory:” (“Memory: Real: 3628K / 22M Virt: 8752K / 199M Free: 29M”) is selected. Furthermore, the “Free:” item (“29M”) of the “Memory:” line is acquired and compared with a threshold value, and if it falls below the threshold value, it is detected as a failure. However, the “top” command may not be installed as standard; in that case it must be installed separately, and otherwise this monitoring is not possible.
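Selecting the “Memory:” line and reading its “Free:” item can be sketched as follows. The function names and the conversion to kilobytes are assumptions for illustration; the input format follows the line quoted above.

```python
import re

_UNIT_TO_KB = {"K": 1, "M": 1024, "G": 1024 * 1024}


def parse_top_free_kb(top_output: str) -> int:
    """Return the Free: value of the Memory: line, converted to kilobytes."""
    for line in top_output.splitlines():
        if line.startswith("Memory:"):
            match = re.search(r"Free:\s*(\d+)([KMG])", line)
            if match:
                return int(match.group(1)) * _UNIT_TO_KB[match.group(2)]
    raise ValueError("no Memory: line with a Free: item found")


def is_memory_failure(free_kb: int, threshold_kb: int) -> bool:
    """Free memory is a failure when it falls below the threshold."""
    return free_kb < threshold_kb
```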
[0101]
For example, when the operating system is Linux (registered trademark), a dedicated “free” command is available for displaying the memory usage status, so the “free” command is used instead. An execution example of the “free” command is shown below.
[0102]

[0103]
At this time, the value of “free” (“23348”) for the “Mem:” line is acquired.
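Reading the “free” column of the “Mem:” line can be sketched as follows. The sample output is hypothetical and follows the layout of older Linux “free” versions; only the value 23348 comes from the text, and the function name is an assumption.

```python
# Hypothetical `free` output; only the value 23348 is taken from the text.
SAMPLE_FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:        127812     104464      23348      34420      61868      19452
-/+ buffers/cache:      23144     104668
Swap:       128516          0     128516
"""


def parse_free_mem(output: str) -> int:
    """Return the `free` column of the Mem: line."""
    lines = output.splitlines()
    column = lines[0].split().index("free")  # position of the free column
    for line in lines[1:]:
        if line.startswith("Mem:"):
            return int(line.split()[1:][column])
    raise ValueError("no Mem: line found")
```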
[0104]
Next, “mail queue monitoring” in the performance evaluation will be described using “Sendmail” as an example.
[0105]
In the “Sendmail” mail queue monitoring, the state of staying mail is grasped based on the result of the “mailq” command, and monitoring is performed based on the number of stays. For example, the execution result of “mailq” in the case of “Sendmail” is as follows.
[0106]

[0107]
From the execution result, the number of staying mails is obtained and compared with a threshold value. If the number exceeds the threshold value, it is detected as a failure. If there is no mail queue, "empty" is returned as a message, and this is treated as "0 (numerical value)".
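Obtaining the number of staying mails, with an empty queue treated as 0, can be sketched as below. The sketch assumes that the queue summary contains either the word "empty" or a count in the form "(N requests)", as printed by common Sendmail versions; other versions may differ, and the function name is hypothetical.

```python
import re


def parse_mailq_count(output: str) -> int:
    """Return the number of staying mails reported by mailq."""
    if "empty" in output.lower():
        return 0  # an empty queue is treated as the numerical value 0
    match = re.search(r"\((\d+)\s+requests?\)", output)
    if match is None:
        raise ValueError("unrecognized mailq output")
    return int(match.group(1))
```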
[0108]
As another example, groupware mail will be described. For a group mail system, in addition to the “Sendmail” mail queue, the queue of “x.400”, which the group mail system uses locally, and the queues of the “SMTP Gateway”, which bridges “x.400” and SMTP, are monitored; the number of files in each is counted and treated as the number of staying mails. For the “SMTP Gateway”, two counts are taken: one toward “x.400” and one toward “Sendmail”.
[0109]
Since the number of staying mails specific to the group mail system can be obtained by counting the files in a specific directory, a script that counts the files is prepared, and executing it yields each number of staying mails. The acquired number is compared with the threshold value, and if it exceeds the threshold value, it is detected as a failure. Monitoring is possible for each of the four items “Mail queue”, “x400 queue”, “smtp to x400 queue”, and “smtp to Sendmail queue”.
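A file-counting script of the kind described can be sketched in Python as follows. The function names are assumptions, and the queue directories are left to the caller because the patent does not name them.

```python
from pathlib import Path


def count_queued_files(directory: str) -> int:
    """Count regular files directly inside a queue directory."""
    return sum(1 for entry in Path(directory).iterdir() if entry.is_file())


def staying_counts(queue_dirs: dict[str, str]) -> dict[str, int]:
    """Return the number of staying mails for each named queue directory."""
    return {name: count_queued_files(path) for name, path in queue_dirs.items()}
```

Passing a mapping with the four item names as keys yields one count per monitored queue.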
[0110]
Further, as another example, the virus check server monitors a mail queue dedicated to mail virus check using a built-in virus check software product. The number of staying mails can also be obtained by counting the number of files in a specific directory, as in the case of group mail.
[0111]
Next, "process number monitoring" in the performance evaluation counts the processes of a specific kind and monitors on the basis of the count. Typical examples include “SendMail”, “Delegate”, and “Squid”. The target process is not restricted; any kind of process can be monitored as long as it can be counted.
[0112]
As an example, when monitoring the number of “SendMail” processes, a process list is displayed by the “ps” command as shown below, and the lines containing the character string “sendmail” are extracted from the list. The number of processes is obtained by counting the extracted lines.
[0113]

[0114]
For process number monitoring, a script that counts processes as described above is prepared, and the number of processes is obtained by executing the script. The acquired number of processes is compared with a threshold value, and if it exceeds the threshold value, it is detected as a failure.
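Counting the lines of a process list that contain a given string can be sketched as follows. The sample “ps” output is hypothetical and the function name is an assumption.

```python
# Hypothetical ps output used only to exercise the counter.
SAMPLE_PS_OUTPUT = """\
  PID  TT  STAT      TIME COMMAND
  101  ??  Ss     0:01.20 sendmail: accepting connections (sendmail)
  102  ??  Is     0:00.03 sendmail: Queue runner (sendmail)
  230  ??  Ss     0:00.50 /usr/sbin/inetd
"""


def count_processes(ps_output: str, name: str) -> int:
    """Count process-list lines (header excluded) that contain `name`."""
    return sum(1 for line in ps_output.splitlines()[1:] if name in line)
```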
[0115]
Next, a remote recovery operation for the monitored device 4 will be described.
[0116]
First, programs and shell scripts that execute the recovery operation for each failure are registered in advance in the remote recovery program 44c; they are triggered by events occurring on the monitored device 4 (for example, abnormal termination of HTTP (Hyper Text Transfer Protocol) or SMTP (Simple Mail Transfer Protocol)).
[0117]
In the monitoring target device 4, the performance monitoring agent program 44b of the information collection agent 44 refers to the device-owned resources and operation information (including various log files) stored in the information storage unit 42, and monitors them by pattern matching against the various log files and by checking command execution results.
[0118]
The information collecting agent 44 sends a trap as a failure detection to the process / performance monitoring program 11 of the network monitoring manager 1 incorporated in the monitoring device, based on the result of monitoring by the performance monitoring agent program 44b.
[0119]
Based on this information, the process / performance monitoring program 11 of the network monitoring manager 1 incorporated in the monitoring device has the remote recovery determination program 11g, via the TCP port number setting change program 11b (the default port numbers are used when no TCP port number change has been set in advance) and the one-time password generation program 11a, send a remote command with a one-time password attached to the monitoring target device 4 over the communication line 7. This command starts the remote recovery program 44c of the information collection agent 44 of the monitored device 4.
[0120]
The monitoring target device 4 connects the process / performance monitoring program 11 and the information collection agent 44 with a TCP port number exclusively assigned to the information collection agent 44 via the TCP processing unit 41.
[0121]
The information collection agent 44 uses the one-time password authentication program 44a to authenticate the one-time password attached to the remote command from the process / performance monitoring program 11. Then, in response to the remote command, it has the remote recovery program 44c start the pre-registered program or shell script that executes the recovery operation corresponding to the failure.
[0122]
Next, as a third example, a monitoring operation on the monitoring target device 5 will be described.
[0123]
The monitoring target device 5 is simultaneously monitored by a monitoring device provided with the network monitoring manager 1 and a monitoring device 2 of a different system from the monitoring device. It is connected by a monitoring communication line 9.
[0124]
The monitoring target device 5 is provided with a dedicated information collection agent 5A that collects information for the monitoring device 2 of another system. On the network monitoring manager 1 side, a dedicated information collection agent program 12A for the monitoring device of another system is provided as a subsystem of the relay server program 12. The dedicated information collection agent 5A and the dedicated information collection agent program 12A have the same function.
[0125]
With the integrated monitoring information management program 12a of the relay server program 12, the monitoring information of the network monitoring manager 1 and the monitoring information of the monitoring device 2 of another system are virtually integrated, and the monitoring function of the monitoring device 2 of another system is thereby linked.
[0126]
The following example assumes that the monitoring device 2 of another system has a remote recovery function for commercial UNIX (registered trademark) systems but has no monitoring function for PC-UNIX (UNIX: registered trademark) systems (FreeBSD (registered trademark), Linux (registered trademark), etc.), which are therefore not targets of its remote recovery, and that the network monitoring manager 1 has process monitoring, performance monitoring, and recovery functions for PC-UNIX (UNIX: registered trademark) but no recovery function for the commercial UNIX (registered trademark) systems mentioned above. The operation when an HTTP failure occurs in the monitored device 5 in such an environment will be described with reference to the figure.
[0127]
When an HTTP failure occurs in the monitoring target device 5 ((1)), it is detected by the performance monitoring agent program 54b, a subprogram of the information collection agent 54 provided in the monitoring target device 5, and recorded in the log information ((2)).
[0128]
Using the process / status confirmation program 11c, the network monitoring manager 1 acquires the log information from the performance monitoring agent program 54b at predetermined time intervals and detects the HTTP failure in the monitoring target device 5 ((3)).
[0129]
When an HTTP failure is detected in the monitoring target device 5 in this way, the monitoring timing time adjustment changing program 11d shortens the interval at which the process / status confirmation program 11c acquires log information from the performance monitoring agent program 54b, changing it from 5 minutes to 1 minute.
[0130]
In addition, according to the failure status at this time, the failure management Web program 11f generates Web report information that is divided into levels, such as warnings, failures, and alarm sounds, and sends the information to the user side and the monitoring center.
[0131]
When the process / status confirmation program 11c detects an HTTP failure of the monitoring target device 5, the remote recovery determination program 11g determines whether or not there is a remote recovery function for the failure. Here, it is determined that the remote recovery determination program 11g has a recovery function for the failure but does not have a remote recovery function, and a recovery instruction is output from the remote recovery determination program 11g to the failure management Web program 11f.
[0132]
The failure management Web program 11f uses its failure recovery information list generation function 11f₁ to edit the recovery instruction output from the remote recovery determination program 11g and the performance monitoring log information acquired by the process / status confirmation program 11c into the check sheet information 11f₂. This check sheet information 11f₂ is in a check sheet format so that it can be shared with the monitoring device 2 of another system.
[0133]
This editing result is transmitted by the monitoring information synchronization program (NFS) 11f₃, using NFS, to the monitoring information synchronization program (NFS) 12d of the relay server program 12 and passed to the integrated monitoring information program 12a ((4), (5)).
[0134]
In the integrated monitoring information program 12a, this information is held as the check sheet information 12a₁, in the same check sheet format (name of monitoring target, failure status information, performance monitoring log information, failure information) as the failure status information registered for alerts in the monitoring device 2 of another system. From the check sheet information 12a₁, the dedicated information collection agent program 12A for the monitoring device of another system detects the HTTP failure of the monitoring target device 5.
[0135]
Triggered by this detection of the HTTP failure of the monitoring target device 5 by the dedicated information collection agent program 12A for the monitoring device of another system, the integrated monitoring information program 12a reads out the check sheet information 12a₁ and transmits it to the monitoring device 2 of another system via the socket program 12c, sending a trap of the remote recovery instruction to the monitoring device 2 of another system ((6)).
[0136]
Normally, when a failure is detected by the failure detection function 23, the monitoring device 2 of another system performs a remote recovery process in accordance with the failure recovery template 21. Here, however, it has no program or shell script for executing a recovery operation for PC-UNIX (UNIX: registered trademark). Therefore, in order to start the remote recovery program 54c of the information collection agent 54 of the monitored device 5, the socket program 21a is registered in advance in the failure recovery template 21 as add-on software of the remote recovery function 22.
[0137]
Thus, the monitoring device 2 of another system issues, to the monitoring target device 5 connected via the socket program 21b, a remote command for the program or shell script that executes the recovery operation of the remote recovery program 54c of the information collection agent 54 ((7)), and starts the HTTP recovery operation program 54c₁ in the monitored device 5 ((8)) to perform the recovery process.
[0138]
In the monitoring target device 5, the dedicated information collection agent 5A, which is incorporated under the control of the monitoring device 2 of another system, and the information collection agent 54, which is under the network monitoring manager 1, both collect information from the same device-owned resource information and operation information stored in the information storage unit 52, so no information synchronization deviation occurs.
[0139]
As described above, by using the monitoring function of the monitoring target device 5 in cooperation, it is possible to realize the integration of the monitoring operation of the plurality of monitoring devices in the multi-vendor environment.
[0140]
Next, with respect to the cooperation (trace) operation of the monitoring functions of a plurality of monitoring apparatuses in such a multi-vendor environment, an operation of monitoring performance such as disk usage and log information in the monitoring target apparatus 5 will be described as an example.
[0141]
Polling is performed from the network monitoring manager 1 to the performance monitoring agent program 54b of the information collection agent 54 incorporated in the monitoring target device 5 in order to collect performance monitoring information.
[0142]
The contents to be collected include, for example, “Load average 0.13”, “Free Memory 175M”, “file system (/) 29%”, “file system (/usr) 62%”, “file system (/var) 100%”, “file system (/var/mail) 0%”, “file system (/var/spool/mqueue) 0%”, “inetd o”, “syslogd update o”, “named o”, “named o”, “named o”, “named o”, “process 1”, “Mail queue 0”, “delete process 1”, and “messages Apr 200 05:36:03 Monitored device 5 kernel: /var: optimization changed from SPACE to TIME Apr 22 03:10:04 monitored device 5 kernel: /var: optimization changed from TIME to SPACE”.
[0143]
The network monitoring manager 1 stores the data as a temporary file, compares the data with a threshold value set in a monitoring setting file in advance, and determines whether a failure has occurred in the monitoring target device 5. After the comparison is completed, the temporary file is deleted.
[0144]
For example, when the disk usage rate exceeds the threshold value and a failure is determined, the failure management Web program 11f generates the check sheet information 11f₁, and this fault information is notified from the relay server program 12, which shares files by NFS, to the monitoring device 2 of another system via the dedicated information collection agent program 12A for the monitoring device of another system.
[0145]
The fault management Web program 11f gives a warning or the like on an operator console screen or the like that a fault under monitoring of a monitoring device of another system is detected. In the remote recovery determination program 11g, it is determined whether or not the remote recovery function 22 in the monitoring device 2 of another system is a recovery target.
[0146]
In the case of a recovery target, the failure recovery template 21 (disk failure recovery procedure) in the remote recovery function 22 is operated by the socket program 12c via the socket program 21a incorporated as add-on software in the monitoring device 2 of another system. Then, the disk failure recovery program in the remote recovery program 54c of the information collection agent 54 under the monitoring of the network monitoring manager 1 incorporated in the monitoring target device 5 is started.
[0147]
In this way, the monitoring target device 5, having received the above access from the monitoring device 2 of another system, executes the recovery operation program (“disk_recover.sh”) prepared beforehand in the bin directory under the dedicated directory of the information collection agent 54 monitored by the network monitoring manager 1.
[0148]
Next, the operation of the operating monthly report automatic creation program 13 of the network monitoring manager 1 incorporated in the monitoring device in FIG. 1 will be described.
[0149]
The operation monthly report automatic creation program 13 is a function that, from the check sheet information 12a₁ of the integrated monitoring information management program 12a shown in the figure, calculates the monthly service utilization rate and resource utilization rate of each monitored device and creates a monthly report consisting of a “service utilization rate table” and a “resource utilization rate transition graph”. The details and screens of the reports to be created are shown in the figures described next, beginning with FIG. 5.
[0150]
FIG. 5 is an explanatory diagram showing an example of the configuration items of the service availability table created by the operation monthly report automatic creation program in FIG. 1, and FIG. 6 is an explanatory diagram showing an example of the configuration items of the resource usage rate transition graph created by the same program.
[0151]
As shown in FIG. 5, the “service utilization rate” for the month is composed of “item”, “unit”, and “description” columns. For example, the “operation rate”, in units of “%”, is the ratio of operating time with planned stop time excluded, and is obtained by the formula “(operating rate) = (operating time) / ((all target times) − (planned stop time))”.
[0152]
The “operating time” is the service operating time in units of “minutes”, obtained by the formula “(operating time) = (all target times) − (planned stop time) − (stop time)”. The “stop count”, in units of “times”, is the number of times the service stopped, excluding planned stops, and the “stop time”, in units of “minutes”, is the length of time the service was stopped, excluding planned stops. The “warning (response delay) count”, in units of “times”, records the number of times a response delay was detected that did not lead to a service stop. The “planned stop count”, in units of “times”, records the number of planned stops, and the “planned stop time”, in units of “minutes”, records the length of the planned stops.
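The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name, the argument names, and the sample figures are assumptions, and the multiplication by 100 is an assumption made so that the result matches the “%” unit given for the operation rate.

```python
def service_metrics(all_target_min, planned_stop_min, stop_min):
    """Return (operating time in minutes, operation rate in percent).

    Follows the formulas quoted in the text:
      operating time = all target times - planned stop time - stop time
      operation rate = operating time / (all target times - planned stop time)
    """
    scheduled_min = all_target_min - planned_stop_min
    if scheduled_min <= 0:
        raise ValueError("no scheduled service time in the target period")
    operating_min = all_target_min - planned_stop_min - stop_min
    # Assumption: the ratio is scaled by 100 to express it in "%".
    rate_percent = 100.0 * operating_min / scheduled_min
    return operating_min, rate_percent


# Hypothetical month: 30 days, 120 minutes of planned stops,
# 54 minutes of unplanned stops.
operating, rate = service_metrics(30 * 24 * 60, 120, 54)
print(operating, round(rate, 2))
```

Because planned stop time is removed from the denominator, planned maintenance does not lower the operation rate; only unplanned stop time does.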
[0153]
The “stop times by stop time level”, in units of “times”, records the number of stops classified by the length of time the service was stopped, excluding planned stops. The default stop levels are “level A: 2 hours or more”, “level B: 1 hour or more and less than 2 hours”, “level C: 30 minutes or more and less than 1 hour”, and “level D: less than 30 minutes”. The stop times that define the stop levels can be changed.
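The level classification described above could be implemented roughly as follows. The default thresholds are taken from the text; the function names and the representation of the thresholds as a list of lower bounds in minutes are assumptions made for this sketch.

```python
from collections import Counter

# Default lower bounds in minutes, ordered from the longest level down.
# The text states that these thresholds can be changed.
DEFAULT_LEVELS = [("A", 120), ("B", 60), ("C", 30), ("D", 0)]


def stop_level(duration_min, levels=DEFAULT_LEVELS):
    """Return the level name for one unplanned stop of the given length."""
    for name, lower_bound in levels:
        if duration_min >= lower_bound:
            return name
    raise ValueError("stop duration must not be negative")


def count_by_level(durations_min, levels=DEFAULT_LEVELS):
    """Count unplanned stops per level, including levels with zero stops."""
    counts = Counter({name: 0 for name, _ in levels})
    counts.update(stop_level(d, levels) for d in durations_min)
    return dict(counts)


# Hypothetical stop durations for one month, in minutes.
print(count_by_level([150, 60, 45, 29, 120]))
```

Passing a different `levels` list is one way to model the changeable thresholds without touching the classification logic.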
[0154]
A stop, warning, or planned stop that extends across a month boundary is counted in the stop, warning, or planned stop count of both the preceding month and the following month. The stop time level is likewise determined from the stop time that falls within each of those months. When a planned stop begins after a stop or warning has started, one stop or warning is counted before the planned stop. A period judged to be a stop or warning by ping monitoring is treated as a period in which all services are also judged to be a stop or warning.
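One way to read the month-boundary rule is that a stop spanning two months is split at the boundary, and each piece is counted and levelled separately. The sketch below follows that reading; the function name and the use of naive `datetime` values are assumptions, and the text does not specify how time zones are handled.

```python
from datetime import datetime


def split_by_month(start, end):
    """Split the interval [start, end) at month boundaries.

    Returns a list of ((year, month), minutes) pairs, one per month touched.
    """
    pieces = []
    current = start
    while current < end:
        # First instant of the month following `current`.
        if current.month == 12:
            next_month = datetime(current.year + 1, 1, 1)
        else:
            next_month = datetime(current.year, current.month + 1, 1)
        piece_end = min(next_month, end)
        minutes = (piece_end - current).total_seconds() / 60
        pieces.append(((current.year, current.month), minutes))
        current = piece_end
    return pieces


# Hypothetical stop from 23:30 on June 30 to 01:00 on July 1.
print(split_by_month(datetime(2002, 6, 30, 23, 30), datetime(2002, 7, 1, 1, 0)))
```

Under this reading the example contributes one stop of 30 minutes to June and one stop of 60 minutes to July, and each month's stop time level is determined from its own piece.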
[0155]
The “resource utilization rate transition graph” is a graph showing the transition of the utilization rate of a server resource such as a disk or memory during the target month. As shown in FIG. 6, its configuration items consist of “item”, “unit”, and “description” columns.
[0156]
For example, “disk usage” is a transition graph in which the daily maximum usage rate of each partition is plotted in units of “%”, “free memory” is a transition graph in which the daily minimum amount is plotted in units of “Mbyte”, and “CPU load average” is a transition graph in which the daily maximum value and average value of the CPU load average are plotted.
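The daily reduction behind these graphs (maximum for disk usage, minimum for free memory, maximum and average for CPU load average) can be sketched as below. The sample format of `(day, value)` pairs, the function names, and the sample values are assumptions for illustration; the text does not describe how the samples are stored.

```python
from collections import defaultdict
from statistics import mean


def group_by_day(samples):
    """Group (day, value) samples into {day: [values]}, sorted by day."""
    by_day = defaultdict(list)
    for day, value in samples:
        by_day[day].append(value)
    return dict(sorted(by_day.items()))


def daily_series(samples, reducer):
    """Reduce each day's samples to one plotted point (e.g. max, min, mean)."""
    return {day: reducer(values) for day, values in group_by_day(samples).items()}


# Hypothetical samples for one partition, free memory, and load average.
disk_pct = [("2002-06-01", 71.0), ("2002-06-01", 83.5), ("2002-06-02", 64.2)]
free_mb = [("2002-06-01", 212), ("2002-06-01", 180), ("2002-06-02", 240)]
load = [("2002-06-01", 0.5), ("2002-06-01", 1.5), ("2002-06-02", 0.8)]

disk_plot = daily_series(disk_pct, max)   # daily maximum usage rate (%)
mem_plot = daily_series(free_mb, min)     # daily minimum free memory (Mbyte)
load_max = daily_series(load, max)        # daily maximum load average
load_avg = daily_series(load, mean)       # daily average load average
print(disk_plot, mem_plot, load_max, load_avg)
```

Plotting the daily worst case (maximum usage, minimum free memory) rather than the average keeps short resource shortages visible in a month-long graph.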
[0157]
As described above with reference to FIGS. 1 to 6, in this example, as a system for remotely monitoring each monitored device in a distributed computer network system in a multi-vendor environment, an information collection agent is incorporated in each monitored device (data transmission device or server device), and a network monitoring manager is incorporated in the monitoring device. In the monitored device, the information collection agent coexists with an information collection agent dedicated to a monitoring device of another system. By realizing information sharing among a plurality of monitoring devices in this way, monitoring support in a distributed computer network system in a multi-vendor environment is performed in an integrated manner.
[0158]
In addition, by providing a function for performing authentication using a one-time password between the monitoring device and the monitoring target device, it is possible to prevent unauthorized intrusion into the user-side monitoring target device by a party impersonating the monitoring device.
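This passage does not say how the one-time password is generated or checked. Purely as an illustration, the sketch below uses a counter-based HMAC over a shared secret, in the spirit of common time-based one-time password schemes; this is a swapped-in technique, not the method of the patent, and every name in it (`one_time_password`, `verify`, the 30-second step, the secret) is an assumption.

```python
import hashlib
import hmac
import struct


def one_time_password(secret: bytes, counter: int) -> str:
    """Manager side: derive a password for one time step from a shared secret."""
    message = struct.pack(">Q", counter)  # counter as 8-byte big-endian integer
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, counter: int, presented: str, window: int = 1) -> bool:
    """Agent side: accept the password if it matches a nearby time step.

    `window` tolerates small clock differences between the two devices.
    """
    for candidate in range(counter - window, counter + window + 1):
        if candidate < 0:
            continue
        expected = one_time_password(secret, candidate)
        if hmac.compare_digest(expected, presented):
            return True
    return False


# Hypothetical usage: both sides would derive the counter from the clock,
# e.g. int(time.time()) // 30 for a 30-second step.
shared_secret = b"example-shared-secret"
password = one_time_password(shared_secret, 100)
print(verify(shared_secret, 100, password), verify(shared_secret, 105, password))
```

A sender that does not hold the shared secret cannot produce a valid password, and a captured password stops being accepted once the counter moves outside the window. A real deployment would also need to reject reuse of a password within the window, which this sketch omits.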
[0159]
Also, by providing a function that displays, for each failure, a list of persons in charge, an indication of whether a telephone call is required, and a degree of importance on the monitoring center device or the user device, both the center and the user can easily find out whom to report the failure to, which makes prompt notification possible.
[0160]
In addition, by providing a function that uses NFS technology to connect, over the network, the monitoring devices in which monitoring information is stored, the server load is reduced, the monitoring information held by the plurality of monitoring devices can be synchronized and centrally managed, and the TCO can be reduced.
[0161]
In addition, a pattern match processing function is provided that registers addresses, identifiers, character strings, and the like used in log information, searches the log information, and, on detecting a matching pattern, performs a pre-registered action. This makes possible monitoring that cannot be achieved merely by outputting failure information to a log, and allows the failure monitoring operation process, from failure detection to notification and recovery, to be executed automatically.
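A minimal sketch of such pattern match processing is shown below, assuming the registered patterns are regular expressions and that each pattern maps to the name of a pre-registered action. The rule contents, the action names, and the log lines are hypothetical.

```python
import re


def match_actions(log_lines, rules):
    """Return (action, line) pairs for every log line matching a registered rule.

    rules: list of (compiled regular expression, action name) pairs.
    """
    hits = []
    for line in log_lines:
        for pattern, action in rules:
            if pattern.search(line):
                hits.append((action, line))
    return hits


# Hypothetical registered patterns and the actions tied to them.
RULES = [
    (re.compile(r"disk\d+: .*I/O error"), "run_disk_recovery"),
    (re.compile(r"httpd.*not responding"), "notify_administrator"),
]

LOG = [
    "Jun 14 10:02:11 host5 kernel: disk0: hard I/O error",
    "Jun 14 10:02:15 host5 cron: job finished",
    "Jun 14 10:03:40 host5 monitor: httpd on 192.0.2.5 not responding",
]

for action, line in match_actions(LOG, RULES):
    print(action, "<-", line)
```

Keeping the rules as data rather than code means new failure signatures and actions can be registered without changing the matching routine.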
[0162]
In addition, the monitoring information provided to the user is presented on a Web screen organized so that the monitoring state can be grasped in a unified manner, with more detailed information shown at deeper levels of the hierarchy. This realizes two-way remote monitoring by the user and the monitoring center and enables a failure response system to be established quickly.
[0163]
It also calculates, from the monitoring results, the monthly service utilization rate and resource utilization rate of each monitored device, taking planned stop time into account, and automatically generates an operation monthly report consisting of a service utilization table (operation rate, operating time, number of stops, stop time, number of warnings, and number and time of planned stops), frequency management of critical failures (classification into four levels and color-coded warnings), and a resource utilization transition graph (comparison with threshold values and week-by-week comparison). By providing this report on a Web screen, it is possible to supply information for predicting in advance a system failure of a monitoring target device such as a data transmission device or a server device.
[0164]
As described above, this example realizes seamless “monitoring to recovery and operation management” from the physical network to the application layer in a multi-vendor environment. Using new monitoring technology and Web technology, it enables prompt response (from information collection and analysis to failure detection and remote recovery) through two-way monitoring, in which a failure is detected at the monitoring center and the user is notified of it on a Web screen at the same time. In addition, with full consideration of the CS (client server) technical concept, it improves the efficiency of and saves labor in operation management, reduces the TCO (total operation cost), and makes proactive reliability management possible (predicting system failures of data transmission devices and server devices in advance).
[0165]
As a result, the monitoring center is released from the constant tension of work such as looking up the contact information of the person in charge according to the content of each failure, contacting that person, waiting for instructions on how to respond, and continually watching the monitor.
[0166]
On the user side, the user is released from having to monitor individually, because of monitoring restrictions in a multi-vendor environment, common application software such as groupware mail, firewalls, and directories (LDAP), as well as the target OSs: PC-UNIX (UNIX: registered trademark) systems such as FreeBSD (trademark) and Linux (registered trademark), commercial UNIX (registered trademark) systems, and various other operating systems (OSs).
[0167]
As a result, the scope of know-how applicable to remote operation monitoring and management services has expanded. For example, the relay server function linked with other companies' products can solve problems such as the dispersion of monitoring work caused by multi-vendor monitoring devices, and the function by which the monitoring system automatically notifies the server administrator of failure and recovery status can reduce the man-hours of monitoring work. In addition, developing a remote automatic recovery program for PC-UNIX (UNIX: registered trademark), which conventional monitoring technology does not support, yields practical operational benefits, for example in securing personnel in specialized fields.
[0168]
Therefore, the network monitoring system of this example is an essential network monitoring technology in the Internet age. If operation monitoring of distributed offices across a network is performed without the technology of this example, security becomes an issue, so a distributed monitoring and operation system with monitoring equipment installed at each distributed site has to be established, which incurs equipment and operation personnel costs.
[0169]
The present invention is not limited to the examples described with reference to FIGS. 1 to 6 and can be variously modified without departing from the gist thereof. For example, in the description of FIG. 2, a monitoring device installed at a location different from the monitoring center operates in cooperation with the fault management Web program 11f in the network monitoring manager 1 to notify the user and the monitoring center simultaneously; however, the information collection agent alone may automatically notify the user and the monitoring center simultaneously. In this example, notification to multiple users, notification according to failure classification, and the handling of performance information, threshold management, and failure recovery instructions require a database, so the mechanism is linked to the Web program 11f, and this information is added when the user and the monitoring center are notified simultaneously.
[0170]
Also, although the network monitoring manager 1 in this example does not have a remote recovery function for PC-UNIX (UNIX: registered trademark), the network monitoring manager 1 may be given such a function. In this case, an HTTP failure in the monitoring target device 5 can be remotely recovered directly from the monitoring device provided with the network monitoring manager 1, without passing through the monitoring device 2 of another system.
[0171]
In the example shown in FIG. 4, the network monitoring manager 1 detects a failure of the HTTP program 53a in the monitoring target device 5 by having the process / status confirmation program 11c read the log information of the performance monitoring agent program 54b in the information collection agent 54 of the monitoring target device 5. Alternatively, the network monitoring manager 1 (process / performance monitoring program 11) may perform HTTP monitoring polling on the monitoring target device 5 and detect a failure of the HTTP program 53a by detecting a non-response to that polling.
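The polling alternative can be sketched with Python's standard `http.client` module: one request is sent, and a refused connection, a timeout, or a malformed reply is treated as a non-response. The function name, the use of a `HEAD` request on `/`, and the timeout value are assumptions; the text does not specify the request that the polling sends.

```python
import http.client


def http_poll(host, port=80, timeout=3.0):
    """Return True if the HTTP server answers one request, False otherwise.

    Any HTTP status counts as a response; only a refused connection,
    a timeout, or a malformed reply is treated as non-response.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        conn.getresponse()
        return True
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical target: an HTTP server on the local machine.
    print(http_poll("127.0.0.1", 80, timeout=1.0))
```

Treating any status code as a response keeps this check focused on whether the HTTP program is alive; whether a particular page is served correctly would be a separate check.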
[0172]
Further, in this example, the configuration is described in which UNIX (registered trademark) / OS is used as an OS (operating system), but another OS may be used. Further, although NFS is used for coordination with a monitoring device of another system, another network file protocol may be used.
[0173]
In the computer configuration example of this example, an optical disk is used as a recording medium for programs and data, but an FD (Flexible Disk) or the like may be used as a recording medium. As for the installation of the program, the program may be downloaded and installed via a network via a communication device.
[0174]
[Effects of the invention]
According to the present invention, when the network monitoring manager activates the information collection agent (shell script) of the monitored device via the network, an authentication function is provided for communication between the network monitoring manager and the monitoring target device (server device) so that an unauthorized user cannot execute the shell script on the monitored server. Safe monitoring with secure communication is therefore possible even over a network. In addition, two-way monitoring, in which the user can recognize a failure on the Web screen at the same time as the failure is detected, enables quick response (from information collection and analysis to failure detection and remote recovery). Further, the relay server program enables cooperation with other companies' products and can solve problems such as the dispersion of monitoring work caused by multi-vendor monitoring devices. For example, automatic recovery becomes possible for PC-UNIX (UNIX: registered trademark) and the like, which conventional monitoring technology does not support, and this is effective in practical operation, for example in securing personnel in specialized fields. In addition, the operation statistics monthly report, whose creation is the most troublesome task for the network operator in each system management department, can be created automatically and reliably, and information for predicting system failures in advance can be provided with high accuracy, so that operation can be made more efficient.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration example of a network monitoring system according to the present invention.
FIG. 2 is an explanatory diagram showing a first operation example of the network monitoring system in FIG. 1;
FIG. 3 is an explanatory diagram showing a second operation example of the network monitoring system in FIG. 1;
FIG. 4 is a block diagram illustrating a detailed configuration example of a network monitoring system in FIG. 1;
FIG. 5 is an explanatory diagram showing an example of configuration item contents of a service availability table created by an operating monthly report automatic creation program in FIG. 1;
FIG. 6 is an explanatory diagram showing an example of configuration item contents of a resource usage rate transition graph created by an operating monthly report automatic creation program in FIG. 1;
[Explanation of symbols]
1: network monitoring manager, 1a: socket program, 2: monitoring device of another system, 3 to 5: monitoring target device, 5A: dedicated information collection agent for monitoring device of another system, 6 to 8: communication line, 9: communication line for monitoring system of another system, 10: virtual communication path, 11: process / performance monitoring program, 11a: one-time password generation program, 11b: TCP port number setting change program, 11c: process / status confirmation program, 11d: monitoring timing time adjustment change program, 11e: monitoring temporary stop state display program, 11f: fault management Web program, 11f₁: failure recovery information list creation function, 11f₂: check sheet information, 11f₃: monitoring information synchronization program (NFS), 11g: remote recovery judgment program, 12: relay server program, 12a: integrated monitoring information management program, 12a₁: check sheet information, 12b: HTML generation program (“HTML generation”), 12c: socket program, 12d: monitoring information synchronization program (NFS), 12A: dedicated information collection agent for monitoring device of another system, 13: operation monthly report automatic creation program, 13a: status history information Web content generation program (“status history information Web content generation”), 14: construction management information database, 15: monitoring condition message management database, 21: failure recovery template, 21a, 21b: socket program, 22: remote recovery function, 23: failure detection function, 31, 41, 51: TCP processing unit (“TCP port”), 32, 42, 52: information storage unit (“device owned resource information and operation information”), 33, 43, 53: application management unit (“application program”), 44, 54: information collection agent, 44a, 54a: one-time password authentication program, 44b, 54b: performance monitoring agent program, 44c, 54c: remote recovery program, 53: application program, 53a: HTTP, 54c₁: HTTP recovery operation program, 54d: socket program.

Claims

A network monitoring system for monitoring a monitoring target device by a monitoring device across a network,
Information collection agent means provided in the monitored device for collecting monitoring information including at least performance information of the monitored device;
Monitoring manager means provided in the monitoring device for acquiring operation information collected by the information collection agent means, and detecting occurrence of a failure in the monitored device based on the operation information;
A network monitoring system comprising: an authentication unit for authenticating the monitoring device when the monitoring manager unit collects the operation information in the information collection agent unit.

A network monitoring system for monitoring a monitoring target device by a monitoring device across a network,
Information collection agent means provided in the monitored device for collecting monitoring information including at least performance information of the monitored device;
Monitoring manager means provided in the monitoring device for acquiring monitoring information collected by the information collecting agent means, and detecting occurrence of a failure in the monitored device based on the monitoring information;
A network monitoring system, comprising: an authentication unit that authenticates the validity of the monitoring device using a one-time password when the monitoring manager unit collects the monitoring information in the information collection agent unit.

A network monitoring system according to any one of claims 1 and 2,
A network monitoring system, wherein the information collection agent means includes operation means for determining, for the collected monitoring information, a process stored in a storage device in advance and executing it, thereby performing at least one of a cause analysis process and a recovery process for a fault that has occurred.

The network monitoring system according to any one of claims 1 to 3, wherein
The monitoring manager means,
Monitoring timing time adjustment changing means for setting a time interval for acquiring monitoring information collected by the information collecting agent means,
A network monitoring system, wherein the monitoring timing time adjustment changing means changes the time interval to be shorter in response to the detection of an abnormality in the monitored device, and restores the time interval in response to the detection of a normal state.

The network monitoring system according to any one of claims 1 to 4, wherein
The monitoring manager means,
A TCP port number setting change unit configured to change a TCP port number for recognizing an application program operating on the monitored device,
A network monitoring system for detecting an abnormality of the application program by trying to establish a TCP session for a TCP port of a number set by the TCP port number setting change means.

The network monitoring system according to claim 5, wherein
The monitoring manager means,
Monitoring timing time adjustment changing means for setting a time interval for attempting the TCP session establishment,
The network monitoring system, wherein the monitoring timing time adjustment changing means changes the time interval to be shorter when the abnormality of the application program is detected, and returns to the original time when a normal state is detected.

A network monitoring system according to any one of claims 5 and 6,
A network monitoring system, comprising: means for correcting the time referenced by the monitoring manager means in accordance with the time of the monitored device.

The network monitoring system according to any one of claims 1 to 7, wherein:
The monitoring manager means,
A network monitoring system comprising a Web unit for displaying monitoring result information on the monitoring target device on a Web screen.

The network monitoring system according to claim 8, wherein
The monitoring manager means,
A network monitoring system comprising means for transmitting and displaying the Web screen of the monitoring result information to a terminal device of a user who uses the monitoring target device and a predetermined monitoring center device.

A network monitoring system according to any one of claims 8 and 9,
A network monitoring system, wherein the Web screen of the monitoring result information has a multi-layer configuration, the screen of the first layer includes information for notifying the occurrence of a failure, and the screens of the other layers include information indicating at least a notification destination and a procedure for responding to the failure, including a report destination.

The network monitoring system according to claim 10, wherein password protection is provided for the screen of the other layer.

The network monitoring system according to any one of claims 1 to 11, wherein:
A network monitoring system, wherein the monitoring manager means comprises a hibernation state display unit that refers to the construction plan information of the monitoring target device stored in the storage device in advance, determines a failure due to the construction of the monitoring target device, and displays the failure due to the construction so as to be distinguishable from a normal failure.

The network monitoring system according to any one of claims 1 to 12, wherein
The monitoring manager means has remote recovery instruction means for transmitting recovery instruction information corresponding to the failure to the monitored device in which the failure has occurred,
A network monitoring system, characterized in that the information collection agent means includes a recovery means for performing a predetermined recovery process for the failure based on recovery instruction information from the monitoring manager means.

The network monitoring system according to any one of claims 1 to 13, wherein:
A network monitoring system, wherein the monitoring target device incorporates the information collecting agent means and information collecting agent means for a monitoring device of a different system, and stores the monitoring information collected by each information collecting agent means in a common storage device so that the monitoring information is shared by the monitoring device and the monitoring device of the different system.

The network monitoring system according to claim 14, wherein the monitoring manager means comprises:
Generating means for converting the common monitoring information collected from the information collecting agent into check sheet information that can be handled in common with the monitoring device of the different system;
Means for detecting, based on the check sheet information, a failure of a monitoring target of the monitoring target device in the monitoring device of the different system.

The network monitoring system according to claim 15, wherein said monitoring manager means comprises:
Determining means for determining whether the detected fault of the monitoring target in the monitoring device of the different system is a recovery target of the own device,
If it is not a target for recovery, it has means for sending recovery request information to the monitoring device of the another system,
The network monitoring system, wherein the monitoring device of another system has means for instructing an information collection agent of the monitoring target device to recover the failure based on the recovery request information.

The network monitoring system according to any one of claims 1 to 16, wherein
A network monitoring system, comprising: means for connecting a storage device that stores monitoring information acquired by each of a plurality of monitoring devices to a network using one of network file protocols including NFS.

The network monitoring system according to any one of claims 1 to 17, wherein
A network monitoring system, characterized in that the monitoring manager means includes a report creation means for generating information indicating an operation status of the monitoring target device in the period based on a monitoring result in a predetermined period.

The network monitoring system according to claim 18, wherein said report creating means includes:
A network monitoring system, comprising: means for correcting operation status information for a monitored device based on construction information of the monitored device stored in a storage device in advance.

A program for causing a computer to function as each unit in the network monitoring system according to any one of claims 1 to 19.