JPH11296311A

JPH11296311A - Fault-tolerant control method for storage devices

Info

Publication number: JPH11296311A
Application number: JP10095689A
Authority: JP
Inventors: Takeo Fujimoto; 健雄藤本; Hisao Honma; 久雄本間; Osamu Sakaguchi; 治阪口
Original assignee: Hitachi Software Engineering Co Ltd; Hitachi Ltd
Current assignee: Hitachi Software Engineering Co Ltd; Hitachi Ltd
Priority date: 1998-04-08
Filing date: 1998-04-08
Publication date: 1999-10-29

Abstract

PROBLEM TO BE SOLVED: To identify a faulty part with high accuracy, separate it from the system, and suppress the recurrence of faults, by keeping per-part statistics on the faults that occur in the system and using them to automatically carry out a series of operations, including identification of the faulty part, while the system is running. SOLUTION: To count fault occurrences in each part, one kind of counter is assigned, for each kind of fault, to each kind of part constituting the system, and fault counts are accumulated separately for each element within a kind. For example, when a fault occurs in the data transfer function of a particular channel adapter (CHA) 3, only the data transfer fault counter of that CHA 3 shows a high value. A base threshold is set for the fault count of each part, and an overall judgment is made when the fault counter of some part exceeds the base threshold. If the judgment concludes that an individual part has failed, that part is separated from the system.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] Meanwhile, computer system configurations increasingly provide multiple logical access paths (hereinafter, logical paths) while still sharing physically common parts. For example, the bus-coupled architecture, in which every component is connected to a common bus, is widely used because it makes the system easy to reconfigure and expand. However, common parts also make it harder to identify the faulty part: an individual fault can affect the whole system, and one fault can trigger others.

[0002] The present invention proposes one solution to this problem of identifying the faulty part. Improving the accuracy of faulty-part identification allows the failed part to be separated accurately, which in turn improves the reliability of the system as a whole.

[0003]

[Prior art] In recent years, demands on the reliability of storage devices have been increasing. Highly reliable storage devices have become indispensable, particularly for building non-stop systems. For this reason, every component of the system, including the control units, is made redundant, so that as long as the faulty part can be identified and separated, the system can keep operating on the substitute components.

[0004] A widely used approach identifies the faulty part immediately from a single fault detection, based on the content of the detected fault, and blocks that part. However, this requires a complex and sophisticated fault detection mechanism, and it also blocks parts immediately in response to transient noise that does not recur, which can reduce the redundancy of the system.

[0005] Another approach accumulates the number of fault detections for each hardware part and, when the count exceeds a preset fixed threshold, judges that part to be faulty and separates it from the system. Because transient noise that stays below the threshold does not lead to blocking, excessive blocking can be avoided to some extent. However, in a system that has common parts and whose parts are intricately interrelated, a fault at one location can affect multiple logical paths, so fault counts may be incremented for several parts. In that case it is difficult to identify the faulty part accurately with a simple threshold check alone.

[0006]

[Problems to be solved by the invention] What matters most in fault recovery for a storage device is extracting and separating the faulty part while the system keeps operating. However, in a system with common parts, such as a bus-coupled architecture in which every component is connected to a common bus, a single failure can propagate faults throughout the system, and parts that have not failed may temporarily be unable to operate normally. Even then, correctly determining the faulty part that is the source of the faults and cutting off their recurrence is a key issue in ensuring system reliability.

[0007] When a single failure occurs in a system with multiple logical paths, applying the method of the present invention makes it possible to point out the faulty part with high probability, and separating that part from the system ensures continued operation. Furthermore, for parts such as the common bus that are intricately tied to every logical path, fault occurrence can continue to be monitored after the faulty part is blocked, and if the faults do not subside, recovery is performed, for example by restoring the bus that was blocked.

[0008]

[Means for solving the problems] To achieve the above object, the present invention divides the system into a plurality of separable parts, counts the fault occurrences of each, and identifies the faulty part by a comprehensive judgment of the fault counts. Comprehensive judgment here means that the failure of a part is judged not only from that part's own fault count but also by comparing it with the fault counts of other parts and of the system as a whole. When a common part such as a bus fails, faults are expected to be detected on every logical path that uses it. Therefore, even if the fault detection count of a single part is the first to exceed the threshold, a common-part failure is suspected unless that count stands out markedly from those of the other parts, and the logic does not block the single part hastily.

[0009] Specifically, suppose for example that several paths share a certain common part and are used at roughly the same frequency. When the common part fails, the fault count of each path should be close to the arithmetic mean over all paths. When a single part that affects only one path fails, only that path should show a large fault count, with almost none on the other paths. By calculating the similarity between the fault counts actually recorded for each path and these expected results, it can be determined whether the failure lies in the common part or in the single part.

[0010] Using the method of the present invention in a storage controller with multiple logical paths makes it possible to distinguish a failure of a part common to several paths from a failure of an individual part that affects only a single path. For an individual-part failure, the logical path concerned is blocked and separated from the system, which suppresses the effect on the system as a whole. When a common-part failure is determined and the system has redundancy, partially degrading the common part should allow the system to continue operating without unnecessary path blocking.

[0011] The method of the present invention also keeps fault statistics after partial degradation of a common part has been carried out. If the faults do not subside, the degraded common part can be restored and another portion of the common part, or an individual part, can be blocked instead. Even when several common parts are always used together and cannot be told apart by fault counts, blocking one of them first and then monitoring subsequent fault occurrence makes it possible to verify whether the faulty part was correctly excluded.

[0012]

[Embodiments of the invention] An embodiment of the method of the present invention is described in detail below with reference to the drawing.

[0013] FIG. 1 is a block diagram of a storage controller to which the method of the present invention is applied. It consists of a channel connection system 1 connected to a host computer; a cache memory 2 that temporarily stores input/output data; channel adapters (hereinafter, CHA) 3 that control data transfer between the channel connection system 1 and the cache memory 2; a disk array 4, the storage medium that holds the data; disk adapters (hereinafter, DKA) 5 that control data transfer between the cache memory 2 and the disk array 4; a shared memory 6 that stores system management information, communication information, and the like; and a common bus 7 that connects each CHA 3 and DKA 5 to the cache memory 2 or the shared memory 6.

[0014] To prevent a single-part failure from stopping the system, each component is made redundant. That is, the system contains a plurality of CHAs 3 and a plurality of DKAs 5.

[0015] The cache memory 2 and the shared memory 6 each have a duplexed configuration whose two sides can be separated independently. They are referred to here as cache memory side A 21, cache memory side B 22, shared memory side A 61, and shared memory side B 62. If either side fails, the system can operate on the remaining normal side alone.

[0016] The disk array 4 includes a parity disk and can continue operating with any one disk blocked.

[0017] The common bus 7 consists of three buses: H bus 71, L bus 72, and M bus 73. The H bus 71 is used for access to the cache memory 2, and the M bus 73 is used for access to the shared memory 6. Depending on preset system option information, the L bus 72 is normally used either together with the H bus 71 for high-speed access to the cache memory 2 (simultaneous use doubles the bus width) or independently for access to the shared memory 6. When the H bus 71 fails, the L bus 72 is used for cache memory 2 access; when the M bus 73 fails, the L bus 72 is used for shared memory 6 access. The system can therefore continue operating even if one bus fails.

[0018] While the system of this embodiment is running, many accesses using a plurality of logical access paths (hereinafter, logical paths) operate at the same time. Examples include a logical path that transfers data from the channel connection system 1 through one CHA 3 to cache memory side A 21, a logical path that transfers data from another CHA 3 to shared memory side B 62, and a path that transfers data from cache memory side B 22 through a DKA 5 to the disk array 4.

[0019] If, during system operation, a particular hardware part fails and the logical paths that use it become permanently inaccessible, a fault isolation test can identify the faulty part and separate it from the system. The fault isolation test referred to here is logic that, for example, switches just one part on the path where the fault was detected to an alternative, retries the access, and judges from the result whether that part is faulty. As an example, suppose a data parity error is detected while a CHA 3 is transferring data to the cache memory 2 in the bus mode that uses the H bus 71 and the L bus 72 simultaneously. In the fault isolation process, access tests are run from that CHA to cache side A 21 and cache side B 22 via the H bus 71 and via the L bus 72 respectively. If, for example, both accesses over the H bus 71 detect faults and neither access over the L bus 72 does, the H bus 71 can be judged to have failed. In that case the H bus 71 is blocked, data transfer is switched to the L bus 72 alone, and the system can continue operating.
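The fault isolation test of paragraph [0019] can be sketched as a small decision routine. This is only an illustration under stated assumptions: the name `isolate_fault`, the `access_ok(bus, side)` callback that stands in for one access test from the CHA, and the string labels are hypothetical, and the cache-side branch is an assumed extension of the bus example given in the text.

```python
from typing import Callable, Optional

BUSES = ("H", "L")   # H bus 71 and L bus 72
SIDES = ("A", "B")   # cache side A 21 and cache side B 22


def isolate_fault(access_ok: Callable[[str, str], bool]) -> Optional[str]:
    """Return the part judged faulty, or None if the test is inconclusive.

    access_ok(bus, side) is a hypothetical callback that performs one
    access test over the given bus to the given cache side and returns
    True when no fault is detected.
    """
    # Run the access test for every bus/side combination.
    results = {(b, s): access_ok(b, s) for b in BUSES for s in SIDES}

    # Case from the text: every access over one bus fails while every
    # access over the other bus succeeds, so that bus is judged faulty.
    for bus in BUSES:
        other = "L" if bus == "H" else "H"
        if all(not results[(bus, s)] for s in SIDES) and all(
            results[(other, s)] for s in SIDES
        ):
            return f"{bus} bus"

    # Assumed extension: the same reasoning applied to the cache sides.
    for side in SIDES:
        other = "B" if side == "A" else "A"
        if all(not results[(b, side)] for b in BUSES) and all(
            results[(b, other)] for b in BUSES
        ):
            return f"cache side {side}"

    return None
```

A result of `None` corresponds to the situation described next in the text, where the access tests complete normally and the faulty part has to be found from fault statistics instead.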

[0020] Actual hardware failures, however, are not necessarily permanent; intermittent faults may occur frequently, or faults may occur only at specific timings or with specific access patterns. In such cases the access tests in the fault isolation test often complete normally even when they use the faulty part, and the faulty part cannot be identified.

[0021] Therefore, for cases such as intermittent faults, where the faulty part cannot be identified by the fault isolation test, fault occurrences are counted and the faulty part is identified by a comprehensive threshold judgment.

[0022] The scheme for counting fault occurrences is described first, followed by the logic of the comprehensive threshold judgment.

[0023] For convenience of explanation, faults in the storage controller of this embodiment are limited to two kinds: data transfer faults and shared memory information access faults. To count fault occurrences in each part, one kind of counter is assigned, for each fault kind, to each kind of part constituting the system, and fault counts are accumulated separately for each element within a kind. For example, the number of shared memory information access faults in a given CHA 3, the number of data transfer faults in cache memory side A 21, and the number of data transfer faults on the L bus 72 are each counted separately. A feature of this scheme is that it does not merely accumulate a fault count for the whole system: when a fault is detected, the logical path in use is analyzed and the counter of every part on that path is incremented.
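The counting scheme of paragraph [0023] can be sketched as a three-level table keyed by fault kind, part kind, and element. The class name `FaultCounters`, its method names, and the string labels used for fault kinds and parts are assumptions made for illustration; the patent does not specify a data layout.

```python
from collections import defaultdict


class FaultCounters:
    """Fault counts kept per fault kind, per part kind, per element."""

    def __init__(self):
        # counts[fault_kind][part_kind][element_id] -> number of faults
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def record(self, fault_kind, path):
        """Count one detected fault against every part on the logical path.

        path is an iterable of (part_kind, element_id) pairs describing
        the logical path that was in use when the fault was detected.
        """
        for part_kind, element_id in path:
            self.counts[fault_kind][part_kind][element_id] += 1

    def same_kind(self, fault_kind, part_kind):
        """Return a plain dict of element_id -> count for one part kind."""
        return dict(self.counts[fault_kind][part_kind])
```

For instance, two data transfer faults seen from the same CHA over different buses and cache sides raise that CHA's counter to 2 while each bus and cache side counter only reaches 1:

```python
fc = FaultCounters()
fc.record("data_transfer", [("CHA", 0), ("BUS", "H"), ("CACHE", "A")])
fc.record("data_transfer", [("CHA", 0), ("BUS", "L"), ("CACHE", "B")])
```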

[0024] Consider, for example, a case in which the data transfer function of a particular CHA 3 fails and intermittent faults occur frequently. Only the data transfer fault counter of that CHA 3 then shows a high value; the counters of the other CHAs 3 and of the DKAs 5 are not incremented. Because the two sides of the cache memory 2 are normally accessed with roughly equal probability, data transfers from the failed CHA 3 are spread across both sides, and the fault counters of cache memory side A 21 and cache memory side B 22 are expected to take roughly the same value.

[0025] Next, consider a system in which the H bus 71 and the L bus 72 are used for data transfer and the M bus 73 is assigned to shared memory information access, and suppose the M bus fails. Here the M bus 73 is a part common to the entire system: every operating CHA 3 and DKA 5 accesses the shared memory 6 via the M bus 73. Faults are therefore detected evenly across the CHAs 3 and DKAs 5, and the fault counts of shared memory side A 61 and shared memory side B 62 are also roughly equal.

[0026] Given these two tendencies of the fault counters, the judgment logic that identifies the faulty part from the counter values in the system is described next.

[0027] First, a base threshold is set for the fault count of each part. While a fault count has not reached this threshold, the faults may be noise of the kind that can occur even in a healthy system. The comprehensive judgment is performed when the fault counter of some part exceeds the base threshold.

[0028] For the judgment, all counters of the same kind as the counter that exceeded the base threshold are extracted and used as input. Counters of the same kind here means the counters of the other parts that have an equivalent function and operate independently. For example, when the data transfer fault count of one CHA 3 exceeds the base threshold, the data transfer fault counter values of the other operating CHAs 3 are also retrieved. From these values it is judged whether the count of the part where the fault was first detected is far larger than the other counts, or whether the counters all show similar values.
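The trigger described in paragraphs [0027] and [0028] can be sketched as follows: nothing happens while all counters stay at or below the base threshold, and once one exceeds it, the counters of all operating parts of the same kind are gathered as input for the judgment. The function name `judgement_inputs` and its parameters are hypothetical, and choosing the largest counter as the triggering part is an assumption for the case where several exceed the threshold at once.

```python
def judgement_inputs(counts, base_threshold, active):
    """Decide whether the comprehensive judgment should run.

    counts: element_id -> fault count for one part kind and fault kind.
    active: ids of the parts of that kind currently in operation.
    Returns None while no counter exceeds the base threshold, otherwise
    (trigger, peers) where peers maps every active part to its count.
    """
    # Parts that have never seen a fault still take part in the
    # comparison, with a count of 0.
    peers = {part: counts.get(part, 0) for part in active}

    over = [part for part, n in peers.items() if n > base_threshold]
    if not over:
        return None  # still within the range attributed to noise

    # Assumption: if several counters exceed the threshold, use the largest.
    trigger = max(over, key=lambda part: peers[part])
    return trigger, peers
```

Including the zero counts matters here, because the later comparison depends on how the faults are distributed over all operating parts, not only over those that reported faults.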

[0029] One way to make this judgment is to compute a similarity measure that shows how close the actual distribution of fault counts is to the ideal distribution for each of the two cases above. In the individual-part failure case, ideally the fault count of the failed part alone equals the total of the extracted counters and all other counters are 0. The sum of the distances (squared differences) between the actual count of each part and this ideal value is computed; the smaller the result, the closer the distribution is to the ideal for an individual-part failure. If the distribution is at least as close to the ideal as a preset reference value requires, the part is judged to have failed. Similarly, in the common-part failure case, ideally every count equals the mean count over the parts. Computing the sum of the distances (squared differences) between the actual counts and this ideal value in the same way gives the similarity to the ideal distribution for a common-part failure. If it is at least as close as the reference value requires, a common-part failure is judged.
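The similarity calculation of paragraph [0029] can be sketched as below. The function name `classify`, the return labels, and the single shared `reference` value are hypothetical. Dividing each sum of squared differences by the square of the total count is an added assumption, made so that one reference value works regardless of how many faults have accumulated; the text itself only specifies the sum of squared differences.

```python
def classify(peers, trigger, reference):
    """Compare actual fault counts with the two ideal distributions.

    peers: part -> fault count for all operating parts of one kind.
    trigger: the part whose counter exceeded the base threshold.
    reference: largest normalized distance still accepted as a match.
    """
    total = sum(peers.values())
    if total == 0:
        return "undetermined"
    mean = total / len(peers)

    # Ideal for an individual-part failure: the trigger holds every
    # fault and all other parts hold none.
    d_individual = sum(
        (n - (total if part == trigger else 0)) ** 2 for part, n in peers.items()
    ) / total ** 2

    # Ideal for a common-part failure: every part holds the mean count.
    d_common = sum((n - mean) ** 2 for n in peers.values()) / total ** 2

    if d_individual <= reference:
        return "individual"
    if d_common <= reference:
        return "common"
    return "undetermined"
```

With counts of 12, 1, and 0 and the first part as trigger, the normalized distance to the individual-part ideal is 2/169, so a reference of 0.05 yields "individual". With counts of 11, 12, and 10 the distance to the common-part ideal is 2/1089, which yields "common". Counts of 12, 6, and 0 match neither ideal at that reference value.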

[0030] This embodiment compares against the two ideal distributions above, but hardware configurations with other fault distributions are also conceivable. For example, if a failure of a common part is detected at one particular part with twice the probability of the other parts, the same similarity calculation can be applied by preparing a corresponding ideal distribution. A further straightforward variation counts the actual number of accesses to each part and dynamically reflects the expected number of faults for that access count in the ideal distribution.
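The access-weighted variation mentioned in paragraph [0030] can be sketched by making the common-part ideal proportional to each part's access count instead of a flat mean. The names `weighted_common_ideal` and `distance` are hypothetical, and strict proportionality between accesses and expected faults is an assumption; the text only says the expected fault count should reflect the number of accesses.

```python
def weighted_common_ideal(total_faults, accesses):
    """Expected faults per part if a common part failed.

    accesses: part -> number of accesses counted for that part.
    Each part is expected to see a share of the faults proportional
    to its share of the accesses (assumed model).
    """
    total_accesses = sum(accesses.values())
    if total_accesses == 0:
        raise ValueError("no accesses recorded")
    return {part: total_faults * a / total_accesses for part, a in accesses.items()}


def distance(peers, ideal):
    """Sum of squared differences between actual and ideal counts."""
    return sum((peers[part] - ideal[part]) ** 2 for part in peers)
```

A part that handles twice the accesses of another is then expected to show twice the faults, which covers the "twice the probability" example in the text: with 30 faults and access counts of 200 and 100, the ideal counts are 20 and 10.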

[0031] If the above judgment concludes that an individual part has failed, that part is separated from the system. For example, if cache memory side A 21 is identified as faulty, the system management information is updated so that cache memory side A 21 is no longer used, and the system operates on the normal cache memory side B 22 alone.

[0032] When a common-part failure is judged, the common part is likewise degraded wherever possible so that system operation continues without affecting the whole. For example, when a failure of the M bus 73 is judged from shared memory information access faults, the L bus 72 is switched over to shared memory access and access via the M bus 73 is stopped.

[0033] After the faulty part has been determined and blocked, the fault occurrence counters are cleared.

[0034] There may also be cases in which common parts are used simultaneously and the faulty part cannot be determined uniquely. For example, as mentioned earlier, the system of this embodiment can be set to a high-speed transfer bus mode that combines the H bus 71 and the L bus 72. When that system option is specified, a failure of the H bus 71 alone or the L bus 72 alone results in the same fault count on both buses, because the two are used together, and it is not possible to tell which bus has failed. To handle this case, the logic blocks the H bus 71 first. Fault occurrence is then monitored after the blocking, and if the base threshold is exceeded again and a common-part failure is judged once more, the previous judgment result is carried over: the H bus 71 is restored and the L bus 72 is blocked instead.
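The block-then-swap behavior of paragraph [0034] can be sketched as a small state machine. The class name `DualBusRecovery`, the method name, and the action strings are hypothetical. The text does not say what happens if faults continue after the L bus has also been tried, so the third branch below is only a placeholder.

```python
class DualBusRecovery:
    """Tracks which of the paired buses (H or L) is currently blocked."""

    def __init__(self):
        self.blocked = None  # None, "H" or "L"

    def on_common_fault_judged(self):
        """Call each time the judgment reports a common-part failure.

        Per paragraph [0033], the caller is expected to clear the fault
        counters after each blocking action.
        """
        if self.blocked is None:
            # First judgment: the buses cannot be told apart, block H first.
            self.blocked = "H"
            return "block H"
        if self.blocked == "H":
            # Faults continued with H blocked, so the earlier choice is
            # treated as wrong: restore H and block L instead.
            self.blocked = "L"
            return "restore H, block L"
        # Not covered by the text: faults persist after both were tried.
        return "unresolved"
```

Keeping the previous choice in `blocked` is what the text calls carrying over the previous judgment result; without it, a second common-part judgment would have no way to know that H had already been tried.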

[0035] This logic of monitoring fault occurrence after blocking and restoring a part that was blocked is clearly useful not only for telling apart several parts that are used simultaneously, but also for correcting a faulty-part judgment that turned out to be wrong for whatever reason.

[0036]

[Effects of the invention] A storage controller to which the method of the present invention is applied keeps per-part statistics on the faults that occur in the system and uses them to carry out a series of operations, including identification of the faulty part, automatically while the system is running. As a result, even in a system that has common parts such as a common bus connection and in which faulty locations are difficult to identify, the faulty part can be determined with high accuracy and separated from the system, suppressing the recurrence of faults.

[Brief description of the drawings]

[FIG. 1] Block diagram of the storage controller of the embodiment.

[Explanation of symbols]

1: channel connection system; 2: cache memory; 3: CHA (channel adapter); 4: disk array; 5: DKA (disk adapter); 6: shared memory; 7: common bus; 21: cache memory side A; 22: cache memory side B; 61: shared memory side A; 62: shared memory side B; 71: H bus (dedicated data transfer bus); 72: L bus (switchable between data transfer and shared memory access); 73: M bus (dedicated shared memory access bus).

Continuation of front page: (72) Inventor Hisao Honma, 2880 Kozu, Odawara-shi, Kanagawa, c/o Storage Systems Division, Hitachi, Ltd. (72) Inventor Osamu Sakaguchi, 6-81 Onoe-cho, Naka-ku, Yokohama-shi, Kanagawa, c/o Hitachi Software Engineering Co., Ltd.

Claims

[Claims]

1. A fault-tolerant control method for a storage device control unit composed of a plurality of components, in which the number of fault occurrences is counted for each part that can be separated in a bus-connected configuration, the faulty part is determined by statistical analysis of the count results, and the faulty part is automatically disconnected from the system.

2. A fault-tolerant control method in which, in the faulty-part determination method of claim 1, the fault occurrence status is continuously monitored after the faulty part has been separated, enabling continuous feedback such as recovery.

3. A storage device that uses the control method of claim 1 or 2 to prevent erroneous determination of the faulty part and to avoid system down in cases where it is difficult to identify the faulty part from a single fault detection.