JP2008097164A

JP2008097164A - Fault monitoring method for a system composed of a plurality of functional elements

Info

Publication number: JP2008097164A
Application number: JP2006275940A
Authority: JP
Inventors: Fujio Yokoyama; 不二夫横山
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2006-10-10
Filing date: 2006-10-10
Publication date: 2008-04-24

Abstract

[Problem]
In a system in which a large number of functional elements are connected via a network, to improve the reliability of component failure detection and backup, and thereby improve system availability.
[Solution]
The problem is solved by having a plurality of components, or a majority vote among them, identify the failed element and decide whether backup is permitted. In addition, the sub-elements inside each component monitor one another, and when a sub-element fails, the other sub-elements in the same component stop the component's operation and disconnect its external interface; this further improves the reliability of the system's backup operation. Finally, failures inside a component are detected jointly by a plurality of sub-elements, or by majority vote among them, before the component stops its own operation, which ensures that a failed element is reliably stopped.
[Selected drawing] FIG. 1

Description

The present invention relates to a failure monitoring method for computer systems and communication systems composed of a plurality of functional elements.

In multi-CPU systems, massively parallel computers, and communication networks composed of many stations and relay stations, known failure monitoring and failure handling methods include having a single standby system monitor the active system for failures and perform backup processing when a failure occurs (Patent Document 1). It is also known to monitor the processing control unit within a unit and, when a failure is detected, to reset and restart that processing control unit (Patent Document 2).

Patent Document 1: Japanese Patent Laid-Open No. 06-231095
Patent Document 2: JP 2000-148709 A

However, the prior art does not consider the case in which a failure of the failure monitoring circuit on the monitoring side causes the monitored element to be wrongly recognized as failed. Consequently, if the monitoring mechanism on the side that performs backup or failure processing fails, the entire system may be seriously affected. In addition, stopping a failed functional element so that it does not adversely affect other parts has required either a forced-stop means driven from other elements or manual intervention, which increased the cost of cables, wiring area, and manual work.

An object of the present invention is to solve the problem that, in the prior art, failure processing is erroneously performed when the failure monitoring circuit itself fails, and to provide a low-cost means for stopping the operation of a failed element.

The present invention is characterized in that (1) the failure location and whether backup is permitted are determined by agreement or majority vote among a plurality of components; (2) to keep a failed element from adversely affecting the system, the sub-elements constituting a functional element monitor one another and, when the failure lies in their own element, stop that element's operation and disconnect its interfaces with other elements; and (3) whether to stop the own element's operation is decided by agreement or majority vote among a plurality of sub-elements within that element. Furthermore, in a system in which a large number of functional elements are interconnected via a network, one way of realizing feature (1) is to group a plurality of functional elements with two or more network nodes and to perform failure location, the backup decision, and backup processing within that group.

In the present invention, the failed element is identified by a plurality of elements, so false indications and unwarranted backup operations caused by a failure of a failure monitoring circuit can be suppressed. In addition, mutual monitoring by sub-elements and the self-stop function inside each functional element disconnect the element's interfaces with other parts so that its abnormal operation does not affect the system; the adverse effect of a failed element on the system can therefore be prevented without any circuit or network for forced stopping from other parts. Because a CPU can be stopped and backed up without manual intervention, system operation and maintenance costs can be reduced. The mutual checks between sub-elements normally reuse check circuits already present on the interfaces between sub-elements, such as timer monitoring and data checks, so no large additional circuit is required.

Furthermore, in a system in which a large number of functional elements are interconnected via a network, functional elements and network nodes can be grouped so that mutual monitoring and backup are performed within each group. A backup mechanism can therefore be built as easily as in a system in which only a small number of functional elements are connected by a network.

The best mode is a system that combines a mechanism in which a plurality of functional elements identify the failed element with a mutual monitoring and self-stop function performed by the sub-elements within each element. This combination prevents a monitoring circuit failure in a monitoring element from being mistaken for a backup trigger, and prevents the backed-up element from disturbing system operation after backup. Furthermore, for a system in which a large number of functional elements are connected via a network, the best mode is to group several functional elements and network nodes, connect each functional element to a plurality of network nodes in its group, and perform mutual monitoring and backup within the group.

FIG. 1 is a schematic configuration diagram of a massively parallel computer system according to the first embodiment.

From this embodiment onward, stopping the operation of the own element and disconnecting its external interfaces is referred to as "self-FREEZ".

The processor units 101 are interconnected through network nodes 102. The processor units 101 and the network nodes 102 are grouped two by two. Each processor unit 101 is connected to the network node corresponding to it by a signal line group 103. To cope with loss of communication caused by a break in the signal line group 103 or a failure of the network node 102, it is also connected to the neighboring network node in the group by a separate signal line group 104. Each network node 102 is also interconnected with network nodes outside the group by network signals 105.

FIG. 2 is a schematic configuration diagram of the processor unit 101. The main CPU 201 is connected to communication ports 202 and 203 via switching logic 204. The mutual operation monitoring table 205 records the states of the own unit P(2p)j, the pair unit P(2p+1)j, the network node N(2p)j, and the pair node N(2p+1)j (p = 0 to k). Latch 206 holds the self-FREEZ state. It is set either by the set signal 211 from the main CPU 201 or when the FREEZ assert signals 212 and 213 from both communication ports 202 and 203 are ON at the same time. The latch is reset at power-on and at system reset; this reset path is not shown in the figure.
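The set condition for latch 206 can be sketched as follows. This is only an illustrative model of the behavior described above: the class, method, and parameter names are hypothetical, and the reset path, which the figure does not show, is modelled as a plain method.

```python
class SelfFreezLatch:
    """Illustrative model of latch 206 (all names are hypothetical)."""

    def __init__(self) -> None:
        self.frozen = False  # cleared at power-on

    def update(self, cpu_set_211: bool, port_assert_212: bool,
               port_assert_213: bool) -> bool:
        # Set on the main CPU's set signal, or when BOTH ports assert
        # FREEZ at the same time. A single port alone cannot set it.
        if cpu_set_211 or (port_assert_212 and port_assert_213):
            self.frozen = True  # a latch holds its state once set
        return self.frozen

    def reset(self) -> None:
        # Power-on or system reset clears the latch.
        self.frozen = False
```

Requiring both ports to agree means that a single port with a faulty monitoring circuit cannot freeze a healthy processor unit on its own.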

FIG. 3 is a schematic configuration diagram of the network node 102 in FIG. 1. Reference numeral 301 denotes a communication port, which has the same configuration as the communication ports in the processor unit 101 of FIG. 2. Four ports are required for communication between network nodes 102 and two for communication with the processor units 101; one of the latter two is a backup port. The SWC 302 is a switching controller that performs network route control. The SWC 302 could be duplicated to raise the availability of each network node, but in this embodiment it is a single unit. This is because, although the network node must stop operating when the SWC 302 is abnormal, the processor units can still communicate through another network node over a detour path. Circuit 303 is the FREEZ majority circuit, whose logical configuration is shown in FIG. 4. Whether a self-FREEZ is needed is decided by a majority vote of the FREEZ approval signals 306 from seven components in total: the communication ports 301 and the SWC 302. Table 305 is the mutual operation monitoring table and, as in FIG. 2, records the states of the processor unit P(2p)j, the pair unit P(2p+1)j, the own network node N(2p)j, and its pair node N(2p+1)j.
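A majority vote over the seven FREEZ approval signals might look like the sketch below. FIG. 4 is not reproduced in this text, so the threshold is an assumption: a strict majority, that is, at least four of seven. The function name is hypothetical.

```python
def freez_majority(approvals: list[bool]) -> bool:
    """Return True when more than half of the approval signals are ON.

    Assumption: a strict majority is required (4 of 7 for the six
    communication ports 301 plus the SWC 302).
    """
    return sum(approvals) > len(approvals) // 2
```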

FIG. 5 is a schematic flow of failure monitoring and failure processing by the main CPU 201 in FIG. 2. At power-on, NC0 (202) is used as the communication port, but the spare port 203 is also polled so that the two sides check each other's normality. If an abnormality of the active port 202 is detected during communication, the spare port 203 is tested. If the spare port 203 is normal, communication is switched to port 203; if port 203 cannot operate normally either, the main CPU logs the error, performs a self-FREEZ, and stops operating.
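The failover decision in this flow reduces to a small branch. The sketch below is illustrative only; the function name, the action strings, and the log format are hypothetical.

```python
def on_active_port_fault(spare_port_ok: bool, error_log: list[str]) -> str:
    """Action taken by the main CPU after active port 202 is found abnormal."""
    if spare_port_ok:
        # Spare port 203 passed its test: continue on the spare port.
        return "switch_to_port_203"
    # Neither port is usable: log the error first, then freeze and stop.
    error_log.append("ports 202 and 203 both unusable")
    return "self_freez"
```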

FIG. 6 is a schematic flow of failure monitoring and processing by the communication ports 202 and 203 in FIG. 2. The left side shows the processing of the active port and the right side that of the spare port. Both ports monitor each other's operation (605, 615) and also monitor the operation of the main CPU (603, 613). A port that detects an abnormality of the main CPU 201 turns ON its FREEZ assert signal 212 or 213. However, the processor unit 101 does not FREEZ unless the other communication port also detects the main CPU abnormality and turns ON its FREEZ signal. Likewise, a port that detects an abnormality of the partner port turns ON its own FREEZ assert signal 212 or 213 (607, 617), but the processor unit 101 is FREEZed only once the partner port's FREEZ signal is also ON. When the own-port test (606, 616) detects an abnormality, the port FREEZes itself; the details are omitted here. Basically, the port tests its normality through mutual monitoring between its internal subunits, self-tests, and the like, and then self-FREEZes.

FIG. 7 shows the schematic operation of the SWC (switching controller) 302 in FIG. 3. During normal communication processing, the SWC 302 either exchanges traffic with each communication port 301 or polls it at fixed intervals, so that the SWC and each port confirm each other's normality (702, 703). When an abnormality is detected in one of the communication ports (NC) 301, the SWC 302 first confirms its own normality by a self-test (705) and then decides, from the number of abnormal ports, whether to self-FREEZ. There are four ports for communication between network nodes, and communication with other nodes remains possible as long as two of them operate normally, so the self-FREEZ signal is turned ON when three or more of these ports are judged abnormal. For the two ports that communicate with the processor units, the self-FREEZ signal is asserted only when both are abnormal. If step 705 finds that the SWC 302 itself is abnormal, the network node can no longer communicate correctly, so the self-FREEZ signal is turned ON. The self-tests of the SWC 302 include tests such as 711 to 713.
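The three conditions under which the SWC asserts its self-FREEZ signal can be collected in one function. This is a minimal sketch of the rules stated above; the function and parameter names are hypothetical, and each list holds one boolean per port (True means the port is normal).

```python
def swc_should_assert_freez(swc_self_test_ok: bool,
                            inter_node_ports_ok: list[bool],
                            processor_ports_ok: list[bool]) -> bool:
    """Decide whether the SWC 302 turns ON its self-FREEZ signal."""
    if not swc_self_test_ok:
        # The SWC itself is abnormal: the node cannot communicate correctly.
        return True
    if inter_node_ports_ok.count(False) >= 3:
        # Fewer than two of the four inter-node ports remain usable.
        return True
    # Processor-side ports: assert only when every one of them is abnormal.
    return not any(processor_ports_ok)
```

For example, with a healthy SWC, two failed inter-node ports, and one failed processor-side port, the node keeps running because none of the three conditions is met.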

FIG. 8 shows the schematic operation of the communication port 301 in FIG. 3. The ports communicate with one another through the SWC 302, so whenever there is communication data, the normality of the SWC 302 can be confirmed together with that of the partner port. When there is no communication data, the ports monitor one another by polling. If the partner does not respond for a fixed time or longer, the port first runs a self-test and, after confirming its own normality (804), distinguishes whether the fault lies in another communication port or in the SWC 302 (806). The method is not described in detail, but the distinction can be made from, for example, the routing data of the communication data and whether communication responses are returned normally from the SWC 302. Whether the fault is in the SWC 302 or in a communication port, the port turns ON the self-FREEZ signal 306 that it outputs (810). If the own-port normality check (805) finds the own port abnormal, the port notifies the other ports 301 and the SWC 302 of the abnormality (807) and then FREEZes itself (809). When the own port is FREEZed, its FREEZ assert signal 306 for the network node 102 is also turned ON. The self-test 804 includes items such as 821 to 824, which are not described in detail. When a program in the SWC runs away, polling may still be answered periodically, so reliability is improved by also checking communication status transitions and the validity of commands and responses.

FIG. 9 outlines the mutual monitoring and backup processing of the processor units 101 and the network nodes 102 in FIG. 1. At power-on and at reset, the mutual monitoring table entries for all elements are set to "starting" and "normal". During normal communication processing (902), data exchange or polling is performed at fixed intervals. When a data abnormality or polling abnormality is detected, the element first runs a self-test to check its own normality (804, 805). If it confirms that it is normal, it sets the state of the partner element in its mutual monitoring table to "abnormal" (906). It then exchanges mutual monitoring tables 205 and 305 with the other normal elements and checks whether the abnormal element in the received tables matches the one in its own table (908). If they match, it determines whether the abnormal element is of the same type as itself (909): if the abnormal element is a processor unit 101, the other processor unit 101 starts the backup operation, and if it is a network node 102, the other network node 102 does so (910). By this point the abnormal element has detected its own abnormality and self-FREEZed, so the backup operation does not adversely affect the system. If the own-element normality check 905 detects an abnormality in the own element, the element marks itself "abnormal" in the mutual monitoring tables 205 and 305 (921), notifies the other elements of its abnormality (925), and FREEZes itself. The self-test 904 consists of items 931 to 934. If the main processing unit, one of the sub-elements within the element, runs away, a polling response check alone may not be able to confirm normality, so reliability is improved by also checking communication status and the validity of commands and responses.
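The table comparison and backup decision (steps 908 to 910) can be sketched as below. The element labels P0, P1, N0, N1, the state strings, and the function names are hypothetical; the sketch assumes one group of two processor units and two network nodes, and compares the own table with a single received table for brevity.

```python
from typing import Optional

ABNORMAL = "abnormal"

# Hypothetical labels for one group; each element is paired with the
# other element of the same type, which acts as its backup.
PAIR = {"P0": "P1", "P1": "P0", "N0": "N1", "N1": "N0"}


def abnormal_elements(table: dict[str, str]) -> set[str]:
    """Names of the elements that a mutual monitoring table marks abnormal."""
    return {name for name, state in table.items() if state == ABNORMAL}


def backup_target(own_name: str, own_table: dict[str, str],
                  peer_table: dict[str, str]) -> Optional[str]:
    """Return the failed element that own_name should back up, or None."""
    own_view = abnormal_elements(own_table)
    if not own_view or own_view != abnormal_elements(peer_table):
        return None  # no fault, or the two views disagree (step 908)
    for failed in own_view:
        if PAIR[failed] == own_name:  # same type, pair partner (909, 910)
            return failed
    return None
```

Because a backup starts only when two independent tables name the same element, a single element with a faulty monitoring circuit cannot trigger a backup by itself.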

With the configuration described above, Embodiment 1 shown in FIGS. 1 to 9 identifies the abnormal element and decides whether to start the backup operation by majority vote among a plurality of components, which eliminates the possibility that a wrongly started backup operation disturbs the system. The sub-elements constituting each element monitor one another, detect abnormalities of their own element, and stop the element's operation while simultaneously cutting off its interfaces with other elements, so disturbance of the external interfaces, and hence of the entire system, by a runaway internal CPU or the like is prevented. Because mutual monitoring is performed within a pair of two network nodes and two processor units, a backup system can be built with a configuration equivalent to that of a two-node system even though the whole is a massively parallel computer system composed of many elements. As a result, a failure of one CPU does not require stopping or replacing the LSI, MCC, or the like, manual intervention for stopping the failed CPU and for backup is eliminated, and system operation and maintenance costs are reduced.

FIG. 10 is a configuration diagram of a shared-bus-connected computing unit according to the second embodiment of the present invention. The configuration of FIG. 1 is realized with printed circuit boards, MCCs, and some highly integrated LSIs. In FIG. 10, N CPUs 1001 exchange data and control bus usage rights over a shared bus 1009. The CPU mutual monitoring table 1002 shows the state of each CPU 1001 as seen from the CPU that holds the table. The data in the table 1002 is exchanged among the CPUs over the shared bus 1009, and the table also records the state of the own CPU. Each CPU checks the contents of the table 1002 and is programmed to process them according to the flow shown in FIG. 12. The CPU that performs the backup in steps 1205 and 1207 of FIG. 12 is, for example, the CPU whose number is (Z + 1) mod N, that is, the remainder of (Z + 1) divided by N, where Z is the number of the failed CPU. Other selection methods are possible, such as choosing a lightly loaded CPU. The FREEZ majority circuit 1003 receives the monitoring status data output by the execution control unit 1005, the CPU-BUS control unit 1006, the memory access control unit 1007, and the IO-BUS control unit 1008, and uses this data to decide, under the conditions shown in Table 1, whether to self-FREEZ (stop its own operation and disconnect from the shared bus 1009). The specific circuit configuration is not shown, but it is similar to the FREEZ decision logic of FIGS. 2 and 4 in Embodiment 1. The details of the monitoring status data of each part are also not shown, but it has the same structure as the CPU mutual monitoring table 1002.
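The example selection rule for the backup CPU is a one-line modular calculation. The sketch below assumes the CPUs are numbered 0 to N - 1, which the text does not state; the function name is hypothetical.

```python
def backup_cpu(failed_cpu: int, cpu_count: int) -> int:
    """Return the CPU that backs up failed_cpu: (Z + 1) mod N.

    Assumption: CPUs are numbered 0 .. cpu_count - 1.
    """
    if not 0 <= failed_cpu < cpu_count:
        raise ValueError("failed_cpu must be in range(cpu_count)")
    return (failed_cpu + 1) % cpu_count
```

With four CPUs, a failure of CPU 2 is backed up by CPU 3, and a failure of CPU 3 wraps around to CPU 0.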

As Table 1 shows, for the execution control unit 1005 the own CPU is FREEZed only when all of the other parts judge it abnormal. For an abnormality of any sub-element other than the execution control unit 1005, the CPU proceeds to the FREEZ operation if two of the other three parts judge it abnormal.
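Table 1 itself is not reproduced in this text, so the following sketch encodes only the two rules stated in the paragraph above. The part labels and the function name are hypothetical.

```python
PARTS = ("exec_1005", "cpu_bus_1006", "mem_1007", "io_bus_1008")


def cpu_should_freez(reports: dict[str, set[str]]) -> bool:
    """reports maps each observing part to the parts it judges abnormal."""
    for suspect in PARTS:
        observers = [part for part in PARTS if part != suspect]
        votes = sum(suspect in reports.get(obs, set()) for obs in observers)
        # Execution control needs all three other parts to agree;
        # any other sub-element needs two of the other three.
        needed = 3 if suspect == "exec_1005" else 2
        if votes >= needed:
            return True
    return False
```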

The mutual monitoring method for the parts 1005 to 1008 depends on the interface each part has with the others, but the following methods are available. Method (3) is used for mutual checks between parts that have no interface in normal processing. In cases such as a program runaway, methods (1) to (3) may fail to detect the abnormality, but it can be detected by checking the validity of interface responses as in method (4).
(1) Checking the data output by the monitored part (parity check, ECC, etc.)
(2) Response time monitoring
(3) Monitoring a live signal from the monitored part
(4) Command/response validity check
Node 1101 in FIG. 11 corresponds to the computing unit of FIG. 10 and carries N CPUs 1001. Overall, the computer system consists of M Nodes 1101 connected by a shared bus.

The Nodes 1101 are connected by an inter-Node bus 1109, over which they send and receive transactions such as data exchange and control of bus usage rights. The FREEZ majority logic 1103 is a circuit that decides, from the own-Node FREEZ signals 1105 sent by each CPU 1001 in the Node, whether the own Node should FREEZ (stop its own operation and turn OFF its outputs to the inter-Node bus). The details of the logic are not shown, but it is similar to the FREEZ decision logic of FIGS. 2 and 4 in Embodiment 1. The self-FREEZ circuit 1104 holds the self-FREEZ state. When the self-FREEZ latch in this circuit is ON, the Node is in the self-FREEZ state: the CPUs and other parts stop operating, and the outputs to the external bus are logically OFF. The circuit 1104 also includes a circuit that resets the self-FREEZ latch at system reset. To reduce the part count, the circuits 1103 and 1104 are mounted in the same component as the BUS control circuit 1106.

FIG. 12 is a schematic processing flow of the self-FREEZ circuit. In the normal state, steps 1201 to 1203 loop and monitor the state. When the CPU mutual monitoring data reports an abnormal CPU, processing moves to step 1204: the CPUs 1001 exchange their mutual monitoring data 1002 and check whether the abnormal CPU recognized by each of them is the same. If it is, backup processing 1205 for that CPU is performed unconditionally. If the abnormal CPUs do not match in step 1204, the number of other CPUs that flag each CPU as abnormal is counted (1206). If a CPU is flagged as abnormal by two or more CPUs and only one CPU is flagged in this way, the indication is judged valid and processing moves to step 1207, which performs abnormality processing and backup processing for the CPU flagged by multiple CPUs. If two or more CPUs are each flagged by multiple CPUs, the situation is treated as a multi-point failure or a shared BSU failure: system abnormality processing 1208 to 1209 is performed, and the own CPU turns ON the self-FREEZ signal 1105 that it outputs for its own Node.

Suppose the bus interface circuit inside one CPU fails. That CPU 1001 can no longer communicate with any of the other CPUs 1001, so it may record all the other CPUs as abnormal in its mutual monitoring data. For the other CPUs, however, the failed CPU is the only abnormal CPU, so they register that CPU as abnormal. Processing therefore proceeds from 1206 to 1207 and the backup processing starts.
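The decision in steps 1204 to 1208 can be sketched as a small classifier over the exchanged monitoring data. The function name and result labels are hypothetical. The source does not say what happens when no CPU is flagged by two or more others, so the "undecided" result is an assumption.

```python
from collections import Counter


def classify_cpu_faults(views: dict[int, set[int]]) -> tuple[str, list[int]]:
    """views maps each reporting CPU to the set of CPUs it marks abnormal."""
    distinct = {frozenset(view) for view in views.values()}
    if len(distinct) == 1:
        # Step 1204: every report names the same CPUs.
        agreed = sorted(next(iter(distinct)))
        return ("backup", agreed) if agreed else ("no_fault", [])
    # Step 1206: count how many OTHER CPUs flag each CPU.
    accusers = Counter(cpu for observer, view in views.items()
                       for cpu in view if cpu != observer)
    flagged = sorted(cpu for cpu, count in accusers.items() if count >= 2)
    if len(flagged) == 1:
        return ("backup", flagged)        # step 1207
    if len(flagged) >= 2:
        return ("system_fault", flagged)  # steps 1208 to 1209
    return ("undecided", [])              # not covered by the source
```

In the bus interface scenario above with four CPUs, CPU 0 reports {1, 2, 3} while CPUs 1 to 3 each report {0}. The reports disagree, but only CPU 0 is flagged by two or more other CPUs, so the result is a backup of CPU 0.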

FIG. 13 shows the operation flow of the mutual monitoring between Nodes and the abnormal-Node majority decision logic. Steps 1301 to 1307 are the same as in the CPU mutual monitoring and majority decision flow of FIG. 12, but step 1308 does not self-FREEZ; it performs only system abnormality processing and notifies the operator.

In Embodiment 2 (FIGS. 10 to 13), a system composed of multiple CPUs and Nodes decides whether backup is needed by mutual majority vote. This prevents an active CPU or Node from being wrongly judged abnormal, prevents a normally operating CPU or Node from being wrongly disabled and a backup operation from being wrongly started, and so improves system reliability and availability. In addition, each element monitors itself for abnormalities through mutual monitoring among its internal sub-elements and performs a self-FREEZ (suppressing its own operation and disconnecting its external interfaces) when an abnormality occurs. A function by which other CPUs or Nodes disable an abnormal CPU or Node is therefore unnecessary, the disable-related logic and signal lines among many CPUs and Nodes can be reduced, backup operation is easy to implement, and parts such as LSIs that were previously discarded after a partial failure can remain in use without replacement. Because Embodiment 2 uses a shared-bus connection, each Node 1101 naturally forms a grouping unit of CPUs and a network Node without any special design effort, so claim 4 is realized.

FIG. 1: Schematic configuration diagram of the massively parallel computer system
FIG. 2: Schematic configuration diagram of the processor unit
FIG. 3: Schematic configuration diagram of the network node
FIG. 4: Logical configuration diagram of the majority logic
FIG. 5: Schematic operation flow of the main CPU in the processor unit
FIG. 6: Schematic operation flow of the communication port in the processor unit
FIG. 7: Schematic operation flow of the switching control unit in the network node
FIG. 8: Schematic operation flow of the communication port in the network node
FIG. 9: Schematic operation flow of failure monitoring and processing for the processor unit and the network node
FIG. 10: Schematic configuration diagram of a multi-CPU Node with shared bus connection
FIG. 11: Schematic configuration diagram of a shared bus connection system of multiple Nodes
FIG. 12: Schematic operation flow of failure monitoring and processing in the multi-CPU Node
FIG. 13: Schematic operation flow of failure monitoring and processing in the shared bus connection system of multiple Nodes

Explanation of symbols

106: Mutual monitoring / backup unit in the massively parallel computer
205: Mutual monitoring table in processor unit 101
206: Self-FREEZ latch in processor unit 101
305: Mutual monitoring table in network node 102
303: Self-FREEZ majority logic
304: Self-FREEZ latch in network node 102
1002: Mutual monitoring table between CPUs
1003: Self-FREEZ majority logic of CPU 1001
1004: Self-FREEZ latch in CPU 1001
1102: Mutual monitoring table between Nodes
1103: Self-FREEZ majority logic of Node 1101
1104: Self-FREEZ latch in Node 1101

Claims

1. A fault monitoring method for a system composed of a plurality of functional elements, wherein, in a system in which a plurality of functional elements are connected via a network and a failed functional element is stopped so that the system continues in degraded operation, the identification of the failed functional element is determined by agreement among the judgments of a plurality of other functional elements or by majority vote among the functional elements, and the functional element identified by that determination is backed up.

2. The fault monitoring method for a system composed of a plurality of functional elements according to claim 1, wherein the sub-functional parts within each functional element mutually monitor one another, and, when an abnormality of a sub-functional part makes the element unable to operate as a single functional element, the operation of the own element is stopped and its interfaces with the other functional elements are disconnected, whereby the own element is removed from the system configuration.

3. The fault monitoring and processing method for a system composed of a plurality of functional elements according to claim 2, wherein the functional element stops its own operation when a failure is detected and recognized by two or more sub-functional parts, or by majority vote among the sub-functional parts.

4. A fault monitoring method for a system composed of a plurality of functional elements, wherein, in a system in which four or more functional elements are connected to one another via a network, a plurality of functional elements are grouped together with two or more network nodes, mutual monitoring is performed within the group, and, when a functional element or network node in the group fails, backup is performed within the group.

5. The fault monitoring method for a system composed of a plurality of functional elements according to claim 4, wherein the identification of the functional element or network node in which a failure has occurred is determined by agreement among a plurality of other functional elements or network nodes in the group, or by majority decision within the group, and the functional element or network node identified by that determination is backed up.