JP2010009293A

JP2010009293A - Computer system and system switching method

Info

Publication number: JP2010009293A
Application number: JP2008167443A
Authority: JP
Inventors: Haruhiko Nakamura; Masahiko Yamauchi; Taro Nakamura
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-06-26
Filing date: 2008-06-26
Publication date: 2010-01-14

Abstract

[Problem] To enable high-speed system switching in a computer system composed of an active computer and a standby computer.
[Solution] The active and standby computers 101 and 102 share a single disk 104 that stores an OS 116. A monitoring computer 103, which monitors the states of these computers and controls system switching, is connected to the active and standby computers. At system startup, the monitoring computer 103 powers on the active and standby computers and has them perform hardware initialization and a self-test. It then has the active computer execute the processing that provides the service and has the standby computer wait in that state. When it detects a failure of the active computer, it has the standby computer take over the processing that the active computer was performing.
[Selected drawing] FIG. 1

Description

The present invention relates to a computer system and a system switching method, and more particularly to a computer system and a system switching method in which a plurality of computers are multiplexed as an active system and a standby system so that system switching is possible.

Modern society requires computer systems to provide various services over the Internet and similar networks without interruption, 24 hours a day, 365 days a year. A computer system therefore needs to be highly reliable: even when a failure occurs, it must recover in as short a time as possible and continue providing the service.

One way to build a highly reliable computer system is to multiplex a plurality of computers. In this method, the computer system is configured with a standby computer that takes over the processing of the active computer. Even when a failure occurs in the active computer, the system can be switched to the standby computer, and the standby computer can continue the processing.

A method called cold standby is known as a way of switching systems. In cold standby, a computer system is configured from an active computer and a standby computer that share a system disk storing the OS and applications needed to provide the service. When a failure occurs in the active computer, the standby computer is started and the system is switched so that processing continues. Switching systems by cold standby improves the reliability of the computer system. However, because the standby system is started only after the failure occurs, the method has the drawback that it takes about 30 minutes to 1 hour for the standby system to finish starting and begin providing the service.

High-speed boot is a conventional technique for starting a computer quickly. It speeds up startup by omitting the hardware self-test performed when the computer starts. However, if the hardware of the computer being started has a fault, the OS and applications cannot be started. The hardware must then be reset, the hardware self-test run, and the faulty part isolated, so fast startup is no longer possible.

As prior art relating to the high-speed boot technique described above, the technique described in, for example, Patent Document 1 is known.
Patent Document 1: Japanese Unexamined Patent Publication No. 2007-4787

As described above, a computer system can be configured from an active computer and a standby computer that share a system disk storing the OS and applications needed to provide the service, so that the system can be switched by cold standby. In such a system, however, the standby computer is started only after a failure occurs in the active computer. It therefore takes a long time, about 30 minutes to 1 hour, for the standby computer to finish starting and provide the service, and the service is stopped during that time.

This problem can be solved to some extent by applying the high-speed boot technique described above to the standby computer. Even then, however, if the hardware of the standby computer has a fault, the OS and applications cannot be started. The hardware must be reset, the hardware self-test run, and the faulty part isolated before startup, so the time needed for the standby system to finish starting and provide the service cannot be shortened.

In view of the above, an object of the present invention is to provide a computer system and a system switching method that enable high-speed system switching in a computer system composed of an active computer and a standby computer.

According to the present invention, this object is achieved in a computer system in which an active computer and a standby computer share a disk storing an operating system, as follows: when the computer system is started up, the standby computer waits in a state in which hardware initialization and a self-test have been performed, and when it is notified of a failure of the active computer, it takes over the processing that the active computer was performing.

The object is also achieved in a computer system in which an active computer and a standby computer share a disk storing an operating system, as follows. A monitoring computer that monitors the states of the active computer and the standby computer and controls system switching is connected to the active computer and the standby computer. When the computer system is started up, the monitoring computer powers on the active and standby computers and has them perform hardware initialization and a self-test. It then has the active computer execute the processing that provides the service and has the standby computer wait in that state. When it detects a failure of the active computer, it has the standby computer take over the processing that the active computer was performing.

According to the present invention, the standby computer can be brought into operation quickly, even when its hardware has a fault of a degree that still permits startup, so system switching can be performed at high speed. As a result, the processing that was being performed on the active computer can be promptly continued on the standby computer.

Embodiments of a computer system and a system switching method according to the present invention are described below in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a computer system according to an embodiment of the present invention.

In the computer system according to the embodiment of the present invention shown in FIG. 1, an active computer 101, a standby computer 102, and a monitoring computer 103 are connected via the network control device 114 provided in each of the active and standby computers 101 and 102 and the network control device 126 provided in the monitoring computer 103.

The active computer 101 includes a CPU 111, a memory 105, a disk control device 112, a power supply control device 113, and a network control device 114. The internal configuration of the standby computer 102 is not shown in FIG. 1, but it is identical to that of the active computer 101. The active computer 101 and the standby computer 102 share one disk 104, which stores an OS boot loader 115, an OS 116, and an application 117.

When the active computer 101 or the standby computer 102 is started, the contents of the disk 104 are read into its memory 105. Firmware stored in a ROM (not shown) in the computer is also read into the memory 105. As a result, an application 106, an OS 108, and firmware 109 are loaded in the memory 105. The application 106 includes a heartbeat instruction sequence 107, and the firmware 109 includes a startup-resume instruction sequence 110.

The monitoring computer 103 includes a CPU 125, a memory 118, a network control device 126, a disk control device 127, a disk 128, and a console control device 129. When the monitoring computer 103 is started, an application 119 and an OS 123 are read into the memory 118 from the disk 128, and firmware 124 is read from a ROM (not shown). The application 119 includes an active/standby management table 120, a startup instruction sequence 121, and a return instruction sequence 122.

The embodiment described above configures the computer system with one active computer, one standby computer, and one monitoring computer, but the present invention can also be configured with one standby computer provided for a plurality of active computers.

FIG. 2 shows a configuration example of the active/standby management table 120 included in the application 119 of the monitoring computer 103. The management table 120 has a computer identifier column 202, an active/standby column 203 indicating whether a computer is the active system or the standby system, and a state column 204 indicating the state of the computer.

The computer identifier column 202 stores the IP addresses 205 and 206 of the computers that make up the computer system of the embodiment. The active/standby column 203 stores, in the rows of the IP addresses 205 and 206, the entries active 207 and standby 208, which indicate whether each computer is the active system or the standby system. The state column 204 stores the state of each active or standby computer as one of three states: stopped, standby, or running. In the illustrated example, stopped (209, 210) is stored for both computers.
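The table layout described above can be sketched as a small data structure. This is only an illustrative sketch, not the patent's implementation: the names `Role`, `State`, `Row`, and `initial_table` are assumptions, and the IP addresses are placeholder documentation addresses.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Values of the active/standby column 203."""
    ACTIVE = "active"
    STANDBY = "standby"


class State(Enum):
    """The three states held in the state column 204."""
    STOPPED = "stopped"
    STANDBY = "standby"
    RUNNING = "running"


@dataclass
class Row:
    role: Role    # active/standby column 203
    state: State  # state column 204


def initial_table(active_ip: str, standby_ip: str) -> dict:
    """Build the table as in the illustrated example: both computers stopped.

    The dictionary key plays the part of the computer identifier column 202.
    """
    return {
        active_ip: Row(Role.ACTIVE, State.STOPPED),
        standby_ip: Row(Role.STANDBY, State.STOPPED),
    }


# Placeholder addresses, chosen only for this sketch.
table = initial_table("192.0.2.1", "192.0.2.2")
```

Keying the rows by IP address mirrors the use of the IP address as the computer identifier in column 202.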

FIG. 3 is a flowchart explaining the processing of the heartbeat instruction sequence 107 in the application 106 of the active computer 101 and the standby computer 102, which is described next.

(1) The heartbeat instruction sequence determines whether a heartbeat confirmation packet sent from the monitoring computer 103 has been received. If not, it repeats the check until one is received (step 302).

(2) If the determination in step 302 is that a heartbeat confirmation packet has been received, a heartbeat response packet is transmitted to the monitoring computer 103 (step 303).

By repeating this series of steps, the heartbeat instruction sequence 107 informs the monitoring computer 103 that its own computer is alive.
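The two-step loop of FIG. 3 might look like the following in-memory sketch. The packet labels, the use of `queue.Queue` in place of a network, and the `max_polls` bound (the flowchart itself loops indefinitely) are all assumptions made so that the example is self-contained and terminates.

```python
from queue import Queue, Empty

# Hypothetical packet labels; the patent does not define a wire format.
HEARTBEAT_CONFIRM = "heartbeat-confirm"
HEARTBEAT_RESPONSE = "heartbeat-response"


def heartbeat_responder(inbox: Queue, outbox: Queue, max_polls: int) -> int:
    """Answer each confirmation packet with a response packet.

    Returns the number of responses sent. max_polls bounds the loop
    only so that this sketch terminates.
    """
    sent = 0
    for _ in range(max_polls):
        try:
            packet = inbox.get_nowait()     # step 302: has a packet arrived?
        except Empty:
            continue                        # nothing received: check again
        if packet == HEARTBEAT_CONFIRM:
            outbox.put(HEARTBEAT_RESPONSE)  # step 303: reply to the monitor
            sent += 1
    return sent


inbox, outbox = Queue(), Queue()
inbox.put(HEARTBEAT_CONFIRM)
inbox.put(HEARTBEAT_CONFIRM)
responses = heartbeat_responder(inbox, outbox, max_polls=5)
```

Here the two queued confirmation packets each produce one response, and the remaining polls find the inbox empty.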

FIG. 4 is a flowchart explaining the processing of the firmware 109 of each of the active and standby computers, which is described next.

(1) When the power of an active or standby computer is turned on, the computer starts the firmware 109. The firmware 109 executes hardware initialization and a self-test and, based on the result, isolates any faulty hardware (step 402).

(2) Next, the firmware 109 calls the startup-resume instruction sequence 110, then passes control to the OS boot loader 115 and ends its processing (steps 403 to 405).

The OS boot loader 115 then loads the OS 116 on the disk 104 into the memory 105 and brings the active or standby computer into an operable state.

FIG. 5 is a flowchart explaining the processing of the startup-resume instruction sequence 110 in the firmware 109 of each of the active and standby computers, which is described next.

(1) When the hardware initialization and self-test of step 402 in FIG. 4, triggered by powering on the computer, have completed, the startup-resume instruction sequence 110 transmits a self-test completion packet to the monitoring computer 103. Because the active and standby computers 101 and 102 hold no information identifying the monitoring computer 103, the self-test completion packet is transmitted by broadcast (step 502).

(2) The startup-resume instruction sequence 110 then determines whether a startup-resume packet has been received from the monitoring computer 103. If so, it resumes the startup processing, that is, processing continues from step 404 in FIG. 4. Until a startup-resume packet is received, startup of the computer remains suspended in the self-test-completed state (step 503).
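The pause between the self-test and the OS boot loader can be illustrated with a short sketch. It assumes a `threading.Event` as a stand-in for the arrival of the startup-resume packet and a timeout so that the example terminates, whereas step 503 waits indefinitely. The function name and log strings are hypothetical.

```python
import threading
from typing import List


def firmware_boot(resume: threading.Event, timeout: float, log: List[str]) -> bool:
    """Run the firmware steps; return True if control reached the OS boot loader."""
    log.append("init+selftest")            # step 402: initialize, self-test, isolate faults
    log.append("broadcast:selftest-done")  # step 502: monitor address unknown, so broadcast
    if not resume.wait(timeout):           # step 503: hold here until the packet arrives
        log.append("paused")               # still in the self-test-completed state
        return False
    log.append("bootloader")               # step 404 onward: hand over to the OS boot loader
    return True


# Standby computer: no startup-resume packet arrives, so it stays paused.
standby_log: List[str] = []
standby_booted = firmware_boot(threading.Event(), 0.01, standby_log)

# Active computer: the packet has already arrived, so startup continues.
active_resume = threading.Event()
active_resume.set()
active_log: List[str] = []
active_booted = firmware_boot(active_resume, 0.01, active_log)
```

Both computers run the same code path; only the arrival of the startup-resume packet decides which one proceeds to load the OS.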

FIGS. 6A, 6B, and 6C are flowcharts explaining the processing of the startup instruction sequence 121 of the application 119 of the monitoring computer 103, which is described next. Because FIGS. 6A, 6B, and 6C show one continuous series of operations, the description below treats them as a single sequence and describes it as processing executed by the monitoring computer.

(1) The monitoring computer 103 transmits a packet for turning on the power to each of the active and standby computers 101 and 102 (step 602).

(2) Each of the active and standby computers 101 and 102 that receives the power-on packet turns on its power, performs the processing described with reference to FIG. 4, and transmits a completion packet once the self-test has completed. The monitoring computer 103 therefore determines whether it has received the self-test completion packets from the active and standby computers 101 and 102 and, if so, changes the active state 209 and the standby state 210 in the state column 204 of the management table 120 from the initial state of stopped to standby (steps 603 and 604).

(3) If the determination in step 603 is that the self-test completion packets have not been received, the monitoring computer 103 suspends its operation until they are received.

(4) After changing the state 204 of the management table 120, the monitoring computer 103 transmits a startup-resume packet and then a heartbeat confirmation packet to the active computer 101 (steps 605 and 606).

(5) The monitoring computer 103 determines whether a heartbeat response packet has been received from the active computer 101. If so, it regards the active computer 101 as running and changes the active state 209 of the management table 120 from standby to running (steps 607 and 608).

(6) If the determination in step 607 is that no heartbeat response packet has been received, processing returns to step 606, the heartbeat confirmation packet is transmitted again, and this is repeated until a heartbeat response packet is received.

(7) After changing the state 204 of the management table 120 in step 608, the monitoring computer 103 sets a timer and transmits a heartbeat confirmation packet to the active computer 101 (steps 609 and 610).

(8) The monitoring computer 103 determines whether a heartbeat response packet has been received from the active computer 101. If so, processing returns to step 609 and continues from setting the timer again (step 611).

(9) If the determination in step 611 is that no heartbeat response packet has been received, the timer is advanced and the monitoring computer 103 determines whether the specified time has elapsed. If it has not, processing returns to step 611 and continues from the check for a heartbeat response packet (steps 612 and 613).

(10) If the determination in step 613 is that the specified time has elapsed, the monitoring computer 103 regards this as a heartbeat timeout caused by a failure of the active computer 101, turns off the power of the active computer 101, and changes the active state 209 in the state column 204 of the management table 120 from running to stopped (steps 614 and 615).

(11) The monitoring computer 103 then transmits a startup-resume packet and a heartbeat confirmation packet to the standby computer 102 (steps 616 and 617).

(12) The monitoring computer 103 determines whether a heartbeat response packet has been received from the standby computer 102. If so, it regards the standby computer 102 as running and changes the standby state 210 of the management table 120 from standby to running (steps 618 and 619).

(13) If the determination in step 618 is that no heartbeat response packet has been received, processing returns to step 617, the heartbeat confirmation packet is transmitted again, and this is repeated until a heartbeat response packet is received.

(14) After changing the standby state 210 of the management table 120 from standby to running, the monitoring computer 103 changes the active entry 207 in the active/standby column 203 of the management table 120 to standby and the standby entry 208 to active (step 620).

(15) Next, the monitoring computer 103 displays, via the console control device 129, a message stating that the computer now shown as standby in the management table 120 should be replaced (step 621).

(16) After the message is displayed in step 621, maintenance personnel are assumed to replace the computer promptly and then to execute the return instruction sequence 122 in the application 119 of the monitoring computer 103. The standby computer 102 and the active computer 101 are then both usable, and processing returns to step 609 and continues from setting the timer.
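The monitoring flow of steps 605 to 621 can be approximated by the simulation below. It is a simplified sketch under several assumptions: `FakeNode` is a hypothetical stand-in for a real computer, the timer is modeled as a count of missed polls rather than elapsed time, the retry loops of steps 606 to 607 and 617 to 618 are reduced to a single attempt, and the function stops after the first failover instead of waiting for maintenance.

```python
from dataclasses import dataclass


@dataclass
class FakeNode:
    """Hypothetical stand-in for a computer; answers `beats_left` heartbeats once resumed."""
    name: str
    beats_left: int
    resumed: bool = False  # True once the startup-resume packet has been "received"
    powered: bool = True

    def heartbeat(self) -> bool:
        """Model one confirmation/response exchange; False means no response."""
        if not (self.powered and self.resumed) or self.beats_left <= 0:
            return False
        self.beats_left -= 1
        return True


def run_monitor(active: FakeNode, standby: FakeNode, timeout_polls: int, max_rounds: int):
    # State after step 604: both computers have reported self-test completion.
    table = {active.name: {"role": "active", "state": "standby"},
             standby.name: {"role": "standby", "state": "standby"}}
    console = []
    active.resumed = True                                 # step 605
    if active.heartbeat():                                # steps 606-607 (one attempt here)
        table[active.name]["state"] = "running"           # step 608
    for _ in range(max_rounds):                           # bounded so the sketch terminates
        missed = 0                                        # step 609: (re)set the timer
        while missed < timeout_polls and not active.heartbeat():  # steps 610-613
            missed += 1
        if missed < timeout_polls:
            continue                                      # response arrived in time
        active.powered = False                            # step 614: power off failed node
        table[active.name]["state"] = "stopped"           # step 615
        standby.resumed = True                            # step 616
        if standby.heartbeat():                           # steps 617-618 (one attempt here)
            table[standby.name]["state"] = "running"      # step 619
        table[active.name]["role"] = "standby"            # step 620: swap the role entries
        table[standby.name]["role"] = "active"
        console.append(f"replace {active.name}")          # step 621: console message
        break
    return table, console


table, console = run_monitor(FakeNode("101", beats_left=3), FakeNode("102", beats_left=10),
                             timeout_polls=2, max_rounds=10)
```

In this run, computer 101 answers three heartbeats and then falls silent. After two missed polls the monitor powers it off, resumes computer 102, swaps the role entries, and queues the replacement message.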

FIG. 7 is a flowchart explaining the processing of the return instruction sequence 122 of the application 119 of the monitoring computer 103, which is described next.

(1) When maintenance personnel launch the return instruction sequence 122 after replacing the computer, the monitoring computer 103 transmits a packet for turning on the power to the standby computer 102 (in the example of this embodiment, the computer that replaces the one that failed and stopped) (step 702).

(2) The standby computer 102 turns on its power in response to the instruction from the monitoring computer 103 and, when the hardware self-test has completed, transmits a self-test completion packet to the monitoring computer 103. The monitoring computer 103 waits until it receives that packet (step 703).

(3) On receiving the self-test completion packet, the monitoring computer 103 changes the standby state 209 in the state column 204 of the management table 120 from stopped to standby and ends the processing (step 704). Because the return instruction sequence is executed after the standby and active systems have been switched, the standby state is held at location 209 at this point.
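As a rough illustration of steps 702 to 704, the sketch below updates a dictionary-based table once the replaced computer reports self-test completion. The table layout, the function name, and the `selftest_done` flag (standing in for receipt of the self-test completion packet) are assumptions for this example; sending the power-on packet is not modeled.

```python
def run_return_sequence(table: dict, replaced: str, selftest_done: bool) -> dict:
    """Mark the replaced computer as standby once its self-test has completed."""
    # Step 702 (sending the power-on packet) is outside the scope of this sketch.
    if not selftest_done:
        return table                  # step 703: still waiting for the completion packet
    row = table[replaced]
    if row["role"] == "standby" and row["state"] == "stopped":
        row["state"] = "standby"      # step 704: stopped -> standby
    return table


# Table contents after a failover: the failed computer now holds the standby role.
after_failover = {
    "101": {"role": "standby", "state": "stopped"},
    "102": {"role": "active", "state": "running"},
}
restored = run_return_sequence(after_failover, "101", selftest_done=True)
```

The row that is updated belongs to the computer that was originally active, which matches the note that the standby state is found at location 209 after the switch.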

FIG. 8 is a sequence chart that shows the processing of the embodiment, described above with reference to FIGS. 6A to 6C and FIG. 7 as processing of the monitoring computer, as processing among the active, standby, and monitoring computers. It is described next.

(1) The monitoring computer 103 instructs each of the active and standby computers 101 and 102 to turn on its power (steps 801 and 802).

(2) On receiving the instructions from the monitoring computer 103 in steps 801 and 802, each of the active and standby computers 101 and 102 turns on its power and executes hardware initialization and a self-test (steps 803 and 804).

(3) After the hardware self-test has completed, each of the active and standby computers 101 and 102 transmits a completion packet to the monitoring computer 103 (steps 805 and 806).

(4) On receiving the self-test completion packets from the active and standby computers 101 and 102, the monitoring computer 103 transmits a startup-resume packet to the active computer 101 and does nothing for the standby computer 102. The standby computer 102 is therefore on standby from this point (step 807).

(5) The active computer 101, having received the startup-resume packet, initializes the OS and the application, starts the service, and enters the service-providing state (steps 808 to 810).

(6) While the active computer 101 is providing the service, the monitoring computer 103 repeatedly transmits a heartbeat confirmation packet to the active computer 101 and receives a heartbeat response packet from it at fixed intervals, confirming that the active computer 101 is providing the service normally (steps 811 and 812).

（７）現用系コンピュータ１０１がサービスの提供中に障害となりサービスを停止すると、現用系コンピュータ１０１は、監視系コンピュータ１０３からのハートビート確認パケットに対するハートビート応答パケットの送信を行うことができなくなる。これにより、監視系コンピュータ１０３は、現用系コンピュータ１０１の障害を検出する（ステップ８１３〜８１６）。 (7) If the active computer 101 fails during service provision and stops the service, the active computer 101 cannot transmit a heartbeat response packet to the heartbeat confirmation packet from the monitoring computer 103. As a result, the monitoring computer 103 detects a failure of the active computer 101 (steps 813 to 816).

（８）現用系コンピュータ１０１の障害を検出した監視系コンピュータ１０３は、現用系コンピュータ１０１に対して電源オフを指示すると共に、待機系コンピュータ１０２に対して起動再開パケットを送信する（ステップ８１７、８１８）。 (8) The monitoring computer 103 that has detected the failure of the active computer 101 instructs the active computer 101 to turn off the power, and transmits an activation restart packet to the standby computer 102 (steps 817 and 818). ).

（９）起動再開パケットを受信した待機系コンピュータ１０２は、ＯＳの初期化、アプリケーションの初期化を行って、現用系コンピュータとしてサービスを開始してサービス提供中の状態となる（ステップ８１９〜８２１）。 (9) The standby computer 102 that has received the activation restart packet initializes the OS and initializes the application, starts the service as the active computer, and enters the service providing state (steps 819 to 821). .

（10）待機系コンピュータ１０２がサービスの提供中、監視系コンピュータ１０３は、一定時間毎に、待機系コンピュータ１０２へのハートビート確認パケットの送信と待機系コンピュータ１０２からのハートビート応答パケットの受信を繰り返して、待機系コンピュータ１０２が正常にサービスを提供していることを確認する（ステップ８２２、８２３）。 (10) While the standby computer 102 is providing the service, the monitoring computer 103 transmits a heartbeat confirmation packet to the standby computer 102 and receives a heartbeat response packet from the standby computer 102 at regular intervals. Repeatingly, it is confirmed that the standby computer 102 normally provides a service (steps 822 and 823).

(11) Meanwhile, maintenance personnel deal with the active computer 101, which failed and stopped the service, for example by replacing the computer, and then start the return instruction sequence of the monitoring computer 103 (steps 824 and 825).

(12) Because the return instruction sequence has been started, the monitoring computer 103 instructs the stopped active computer 101 to power on (step 826).

(13) On receiving the instruction from the monitoring computer 103, the active computer 101 powers itself on, performs hardware initialization and the self-test, and, after the hardware self-test completes, sends a completion packet to the monitoring computer 103 and enters the waiting state (steps 827 and 828).
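
The monitoring-side behaviour in steps (6) to (13) can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in this publication: the nodes are simulated in memory instead of being reached by packets, the names `Node`, `Monitor`, `tick`, `fail_over`, and `restore` are hypothetical, the threshold `missed_limit` is an assumed parameter (the text says only that the failure is detected when responses stop), and swapping the two roles after a switchover is one possible way to record which computer is currently serving.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    POWERED_OFF = "powered off"
    WAITING = "waiting"          # hardware initialized and self-tested, OS not yet booted
    SERVING = "providing service"
    FAILED = "failed"


@dataclass
class Node:
    """Simulated computer 101 or 102 (a real node would be reached over the network)."""
    name: str
    state: State = State.POWERED_OFF

    def power_on(self) -> None:
        # Hardware initialization and self-test, then wait (steps 827 and 828).
        self.state = State.WAITING

    def power_off(self) -> None:
        self.state = State.POWERED_OFF

    def resume_startup(self) -> None:
        # OS initialization, application initialization, service start (steps 819 to 821).
        if self.state is not State.WAITING:
            raise RuntimeError(f"{self.name} is not in the waiting state")
        self.state = State.SERVING

    def heartbeat(self) -> bool:
        # Only a node that is providing the service answers the confirmation packet.
        return self.state is State.SERVING


@dataclass
class Monitor:
    """Simulated monitoring computer 103."""
    active: Node
    standby: Node
    missed_limit: int = 3        # assumed threshold, not taken from the publication
    missed: int = 0
    log: list = field(default_factory=list)

    def tick(self) -> None:
        """One heartbeat interval (steps 811 to 816)."""
        if self.active.heartbeat():
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.missed_limit:
            self.fail_over()

    def fail_over(self) -> None:
        failed, successor = self.active, self.standby
        failed.power_off()            # step 817: instruct power-off
        successor.resume_startup()    # step 818: send the activation restart packet
        self.active, self.standby = successor, failed
        self.missed = 0
        self.log.append(f"{failed.name} -> {successor.name}")

    def restore(self) -> None:
        """Run after maintenance (steps 824 to 828): the repaired node returns to waiting."""
        self.standby.power_on()
```

In this sketch a switchover costs only the OS and application start on the waiting node, because `power_on` has already covered hardware initialization and the self-test; that is the time saving the steps above describe.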

Each process in the embodiment of the present invention described above can be implemented as a program and executed by the CPU provided in the present invention. These programs can be provided stored on a recording medium such as an FD, CD-ROM, or DVD, or provided as digital information over a network.
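
As one illustration of such a program, the node-side flow of steps (9) and (13) can be sketched in Python as follows. This is a simplified model under assumptions, not the firmware of the embodiment: the packets are plain strings passed through in-memory queues, and the names `firmware_boot`, `COMPLETION`, and `RESUME` are hypothetical.

```python
import queue

# Hypothetical packet labels; the publication does not define a wire format.
COMPLETION = "self-test complete"
RESUME = "activation restart"


def firmware_boot(inbox: queue.Queue, outbox: queue.Queue, trace: list) -> None:
    """Model of a node from power-on until it starts providing the service."""
    trace.append("hardware initialization")
    trace.append("self-test")
    outbox.put(COMPLETION)            # report completion to the monitoring computer

    # Waiting state: block here, before the OS is started,
    # until an activation restart packet arrives.
    while inbox.get() != RESUME:
        pass                          # ignore any other packet

    trace.append("OS initialization")
    trace.append("application initialization")
    trace.append("service started")


if __name__ == "__main__":
    inbox, outbox, trace = queue.Queue(), queue.Queue(), []
    inbox.put(RESUME)                 # the monitor's packet, queued in advance for the demo
    firmware_boot(inbox, outbox, trace)
    print(outbox.get(), trace)
```

The point of the split is that everything before the wait loop runs once at system startup, while only the three steps after it run at switchover time.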

FIG. 1 is a block diagram showing the configuration of a computer system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the active/standby management table contained in the application of the monitoring computer.
FIG. 3 is a flowchart explaining the processing of the heartbeat instruction sequence in the applications of the active computer and the standby computer.
FIG. 4 is a flowchart explaining the processing of the firmware of each of the active and standby computers.
FIG. 5 is a flowchart explaining the processing of the activation restart instruction sequence in the firmware of each of the active and standby computers.
FIG. 6A is a flowchart (part 1) explaining the processing of the startup instruction sequence of the monitoring computer's application.
FIG. 6B is a flowchart (part 2) explaining the processing of the startup instruction sequence of the monitoring computer's application.
FIG. 6C is a flowchart (part 3) explaining the processing of the startup instruction sequence of the monitoring computer's application.
FIG. 7 is a flowchart explaining the processing of the return instruction sequence of the monitoring computer's application.
FIG. 8 is a sequence chart showing the processing described with reference to FIGS. 6A to 6C and FIG. 7 as processing among the active, standby, and monitoring computers.

Explanation of symbols

101 Active computer
102 Standby computer
103 Monitoring computer
104, 128 Disk
105, 118 Memory
106, 117, 119 Application
107 Heartbeat instruction sequence
108, 116, 123 OS
109, 124 Firmware
110 Activation restart instruction sequence
111, 125 CPU
112, 127 Disk control device
113 Power supply control device
114, 126 Network control device
115 OS boot loader
120 Active/standby management table
121 Startup instruction sequence
122 Return instruction sequence
129 Console control device

Claims

A computer system in which an active computer and a standby computer share a disk storing an operating system,
characterized in that the standby computer, when the computer system is started up, waits in a state in which hardware initialization and a self-test have been performed and, when notified of a failure of the active computer, takes over the processing that the active computer was performing.

A computer system in which an active computer and a standby computer share a disk storing an operating system,
wherein a monitoring computer that monitors the status of the active computer and the standby computer and controls system switching is connected to the active computer and the standby computer,
characterized in that, when the computer system is started up, the monitoring computer powers on the active and standby computers and causes them to perform hardware initialization and a self-test, causes the active computer to execute processing for providing a service, causes the standby computer to wait in that state, and, upon detecting a failure of the active computer, causes the standby computer to take over the processing that the active computer was performing.

A system switching method in a computer system in which an active computer and a standby computer share a disk storing an operating system,
wherein a monitoring computer that monitors the status of the active computer and the standby computer and controls system switching is connected to the active computer and the standby computer,
characterized in that, when the computer system is started up, the monitoring computer powers on the active and standby computers and causes them to perform hardware initialization and a self-test, causes the active computer to execute processing for providing a service, causes the standby computer to wait in that state, and, upon detecting a failure of the active computer, causes the standby computer to take over the processing that the active computer was performing.