JP2005085114A

JP2005085114A - Cluster system, upgrade method, and program

Info

Publication number: JP2005085114A
Application number: JP2003318370A
Authority: JP
Inventors: Nobuyuki Morimoto; 展行森本; Kotaro Endo; 浩太郎遠藤; Tetsuya Iinuma; 哲也飯沼
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-09-10
Filing date: 2003-09-10
Publication date: 2005-03-31

Abstract

PROBLEM TO BE SOLVED: To upgrade cluster software without stopping services.

SOLUTION: Cluster control units are upgraded in sequence by stopping all cluster control units except at least one, which keeps operating (step S16); upgrading the software of each stopped cluster control unit (step S18); and changing in turn which cluster control unit is stopped and upgraded (step S24).

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a cluster system composed of a plurality of computers, and more particularly to a cluster system, an upgrade method, and a program in which the system comprises cluster control units that execute cluster control and agents that are provided independently of the cluster control units and operate under the control of a kernel, a virtual machine realized by the plurality of cluster control units operating in synchronization.

In recent years, systems that provide services (business operations) to users (client terminals) by executing application programs on computers have come into operation. In this type of system, continuous provision of the service is essential. Accordingly, high availability (server uptime, service uptime) is also required of the computers (server computers) that execute the service. For this reason, so-called cluster systems have been developed, in which a plurality of computers are arranged in a cluster configuration so that, even if a failure occurs in some of the computers, another computer takes over the service and the system as a whole is prevented from stopping (see, for example, Non-Patent Document 1).

Configuring a cluster system requires a cluster manager on each computer. The cluster manager performs cluster control and the control that starts and stops applications (service control). In a cluster system, cluster control by the cluster managers is distributed among a plurality of computers (nodes). Cluster control must therefore be performed from the viewpoint of the cluster as a whole; that is, the cluster control distributed across the computers must be consistent overall. To this end, cluster control on each computer is performed in synchronization (in cooperation) with the others through mutual communication. In other words, cluster control is performed in a multiplexed manner across the computers. This realizes high availability (hereinafter abbreviated as HA).

Conventionally, upgrading the cluster software in a system operated in a cluster configuration has required stopping the services running under cluster control. Because the cluster control unit that configures and controls the cluster system is tightly coupled with the agent, the cluster software must be stopped in order to be upgraded, and the running services must be stopped along with it. Specifically, the upgrade is performed by (1) stopping the active service, (2) stopping the cluster software on the active and standby systems, (3) upgrading the cluster software, (4) starting the cluster software, and (5) starting the service (including automatic service start).

In recent years, a proposal has been made to shorten the service downtime (see, for example, Patent Document 1). The upgrade proposed there is performed by (1) stopping the active service, (2) starting the service on the standby system (switchover or failover processing), (3) stopping the cluster software on the active system, (4) upgrading the cluster software on the active system, (5) starting the cluster software on the active system (and stopping the service on the standby system), (6) starting the service on the active system (switchback or failback processing), (7) stopping the cluster software on the standby system, (8) upgrading the cluster software on the standby system, and (9) starting the cluster software on the standby system.

In this case, the service is down during the period from (1) to (2) and the period from (5) to (6). Although these periods are short, the service does stop temporarily.

The service must be switched over from the active system to the standby system, and later switched back, because of the failure detection mechanism of the cluster software. Conventionally, cluster software uses heartbeats as the mechanism for detecting a failure in another system. When a system detects that heartbeats from another system have ceased, it judges that a failure has occurred in that system and switches the service over from the active system to the standby system, thereby increasing the availability of the service.
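As an illustration of heartbeat-based failure detection, the following Python sketch tracks the time of the last heartbeat received from a peer and reports the peer as suspected once a timeout has elapsed. The class name `HeartbeatMonitor`, its methods, and the timeout value are hypothetical and chosen only for this example; the text does not specify how the conventional cluster software implements detection.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Hypothetical monitor: remembers the last heartbeat from one peer."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Optional[float] = None

    def record_heartbeat(self, now: Optional[float] = None) -> None:
        # Store the arrival time of the latest heartbeat.
        self.last_seen = time.monotonic() if now is None else now

    def peer_suspected(self, now: Optional[float] = None) -> bool:
        # Before the first heartbeat there is nothing to compare against.
        if self.last_seen is None:
            return False
        current = time.monotonic() if now is None else now
        # The peer is suspected once heartbeats have been silent too long.
        return current - self.last_seen > self.timeout_s
```

A monitor of this kind cannot tell a crashed peer from one whose cluster software was stopped deliberately, since in both cases the heartbeats simply cease.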

If the cluster software on the active system were stopped without switching the service over to the standby system, the standby system would falsely detect a loss of heartbeats and a takeover would occur, so that the service would be started on both systems at once. The service is switched over from the active system to the standby system to prevent this situation.

Patent Document 1: JP 2001-67331 (paragraph 0005)
Non-Patent Document 1: Tetsuo Kaneko, Yoshiya Mori, "Cluster Software", Toshiba Review, Vol. 54, No. 12 (1999), pp. 18-21

As described above, in a conventional cluster system, upgrading the cluster software requires stopping the services running under cluster control.

The present invention has been made in view of the above circumstances, and its object is to provide a cluster system, an upgrade method, and a program that make it possible to upgrade cluster software with only a short service stop.

According to one aspect of the present invention, a cluster system composed of a plurality of computers is provided. The cluster system comprises: a plurality of agents, each operating independently on at least some of the computers constituting the cluster system and performing service control for providing a service requested by a client terminal; and a plurality of cluster control units, each operating independently on at least some of the computers constituting the cluster system and controlling the agents in synchronization with the other cluster control units while communicating with them, so that the cluster control units together perform consistent cluster control as a single kernel. The cluster system stops the remaining cluster control units while keeping at least one cluster control unit operating, upgrades the software of the stopped cluster control units, and changes in turn which cluster control units are stopped and upgraded, thereby upgrading the cluster control units in sequence.

According to the present invention, service control and cluster control are performed by mutually independent agents and cluster control units, respectively, so only the cluster control units need to be stopped when the cluster software is upgraded. Because the agents are not stopped, the service can continue to be provided.

Embodiments of the present invention will be described below with reference to the drawings.

[First Embodiment]
FIG. 1 is a block diagram showing the configuration of a cluster system according to the first embodiment of the present invention. The cluster system of FIG. 1 is composed of a plurality of server computers, for example four: 10-1 (#1), 10-2 (#2), 10-3 (#3), and 10-4 (#4). The server computers 10-1, 10-2, 10-3, and 10-4 are interconnected by two networks 21 and 22.

Cluster control units 11-1 (#1), 11-2 (#2), 11-3 (#3), and 11-4 (#4), which perform cluster control, run on the server computers 10-1, 10-2, 10-3, and 10-4, respectively. The cluster control units 11-1 to 11-4 are realized by the server computers 10-1 to 10-4 executing the corresponding server program (the cluster control unit program for cluster control).

In addition, agents 12-1 (#1) and 12-2 (#2), which perform service control (that is, the control that starts and stops applications), run on the server computers 10-2 and 10-3 among the server computers 10-1 to 10-4. The agents 12-1 and 12-2 are realized by the server computers 10-2 and 10-3 executing the corresponding server program (the agent program for service control).

Each of the agents 12-1 and 12-2 consists of a communication path switching control unit 122, a heartbeat transmission unit 124, and a service control unit 126, and handles control specific to its server computer. The communication path switching control unit 122 switches the agent's communication to any of the cluster control units 11-1 to 11-4. The heartbeat transmission unit 124 is used to detect a loss of heartbeats with other agents. The service control unit 126 controls execution of the service 128.

A client terminal 23, which requests the agents 12-1 and 12-2 to execute services, is connected to the network 22. Communication between the agents 12-1 and 12-2 and the client terminal 23 takes place over the network 22. For convenience of illustration, FIG. 1 shows an example in which a single client terminal 23 is connected to the network 22; in general, however, a plurality of client terminals are connected to the network 22.

Each of the cluster control units 11-1 to 11-4 on the server computers 10-1 to 10-4 has a cluster control function similar to that of the cluster manager running on each computer in a conventional cluster system, and executes control of the cluster system together with the other cluster control units (operating in a multiplexed manner). The cluster control units 11-1 to 11-4 handle control of the cluster as a whole, such as service start requests, service stop requests, switchover processing, and priority-based takeover processing. Acting as one, the cluster control units 11-1 to 11-4 form a single virtual machine (a virtual execution environment) called the kernel 110. The kernel 110 is formed by the cooperation of the cluster control units 11-1 to 11-4 running on the individual server computers 10-1 to 10-4, and can therefore be regarded as spanning the server computers 10-1 to 10-4. In other words, no individual cluster control unit is itself a kernel; the cluster control units 11-1 to 11-4 together constitute the kernel 110, so exactly one kernel exists in the cluster system.

The network 21 is used for the communication among the cluster control units 11-1 to 11-4 that is needed for them to operate as one in synchronization (in cooperation), that is, in a multiplexed manner. To speed up this multiplexed operation, the network 21 is faster than the network 22, which is used for communication between the agents 12-1 and 12-2 and the client terminal 23. If high speed is not required, the network 22 may also be used for communication among the cluster control units 11-1 to 11-4.

The agents 12-1 and 12-2 on the server computers 10-2 and 10-3 operate under the control of the kernel 110 and control the starting and stopping of the service (application) 128. That is, the service 128 is executed on the agents 12-1 and 12-2. The service 128 is described here using a web service as an example. Running a web service first requires a web server, whose role is to deliver content in response to requests for the service from the client terminal 23. Delivering a request from a client terminal to the web server requires an address, for example an IP (Internet Protocol) address. A file system for storing the content to be provided is also required. Providing a web service thus requires physical or logical entities such as a web server, an IP address, and a file system. These entities are called "resources". Each of the agents 12-1 and 12-2 runs a combination of resources, and the service is executed as a result. The agents 12-1 and 12-2 are the means that actually start and stop the resources: they receive control instructions from the kernel 110, control the resources accordingly, and return the results to the kernel 110.
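The exchange between the kernel and an agent (instruction in, resource control, result out) can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `Agent` class, the `handle` method, the command strings, and the shape of the returned result are hypothetical, and the text does not describe the actual message format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Agent:
    """Hypothetical agent that starts and stops the resources of one service."""

    name: str
    resources: List[str]
    running: Dict[str, bool] = field(default_factory=dict)

    def handle(self, command: str) -> dict:
        # Reject anything other than the two instructions modelled here.
        if command not in ("start", "stop"):
            return {"agent": self.name, "ok": False,
                    "error": f"unknown command: {command}"}
        # Apply the instruction to every resource that makes up the service.
        for resource in self.resources:
            self.running[resource] = (command == "start")
        # Report the outcome back to the caller (the kernel in the text).
        active = [r for r in self.resources if self.running.get(r)]
        return {"agent": self.name, "ok": True, "running": active}
```

For the web service example, an agent would be created with resources such as `["file_system", "ip_address", "web_server"]`, and a `start` instruction would report all three as running.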

As described above, the cluster system of FIG. 1 contains two agents 12-1 and 12-2, each of which operates independently. How the kernel 110 controls and synchronizes the agents 12-1 and 12-2 is therefore important. In this embodiment, control of the cluster system is performed centrally by the kernel 110. Because the kernel 110 is a virtual machine executed in a multiplexed manner by the cluster control units 11-1 to 11-4, control of the cluster system can continue even if some of the cluster control units 11-1 to 11-4 fail. Moreover, because the cluster software on the server computers is separated into cluster control units 11 and agents 12 that operate independently of each other, the software of the cluster control units 11 can be upgraded without stopping the service 128. This makes it possible to keep providing the service to the client terminal 23.

The 2/3 quorum algorithm described in Patent Document 1, which does not cause split brain, is applied as the algorithm for multiplexed execution in the cluster control units 11-1 to 11-4. The cluster system of this embodiment has n = 4 and f = 1, and consists of the smallest number of computers to which the 2/3 quorum algorithm can be applied; that is, it takes the minimum cluster configuration for that algorithm. As a result, the system tolerates not only stop failures of f (= 1) cluster control units but also Byzantine failures. The kernel 110 is thus an extremely reliable virtual machine and realizes highly reliable control of the cluster system.
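The figures n = 4 and f = 1 are consistent with the bound n >= 3f + 1 that is commonly required for Byzantine fault tolerance. The Python sketch below works through that arithmetic. It rests on two assumptions that the text does not state: that the 2/3 quorum algorithm obeys this bound, and that a quorum means strictly more than two thirds of the n units. The function names are hypothetical.

```python
def min_nodes(f: int) -> int:
    """Smallest n with n >= 3f + 1 (assumed bound, see lead-in)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest f with n >= 3f + 1 for a given number of units n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Smallest integer strictly greater than 2n/3 (assumed quorum rule)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (2 * n) // 3 + 1
```

Under these assumptions, tolerating one faulty unit needs at least four units, and three of the four must agree, which matches the minimum configuration described for this embodiment.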

In this embodiment, the cluster control units 11-1 to 11-4 run on all server computers in the cluster system, that is, on the four server computers 10-1 to 10-4, and together realize the kernel 110. As described above, the cluster system is thereby built with the minimum number of computers (n = 4) needed to apply the 2/3 quorum algorithm. The agents 12-1 and 12-2 run on two of these four server computers, 10-2 and 10-3, and execute the services of the cluster system. The server computers 10-2 and 10-3 therefore each need the resources required by the service, together with suitable hardware and software. The server computers 10-1 and 10-4, on the other hand, run only the cluster control units 11-1 and 11-4, so relatively small computers suffice for them.

Consequently, in the cluster system of FIG. 1, availability of service control can be expected to improve through failover between the two agents 12-1 and 12-2, while availability of cluster control can be expected to improve through the 2/3 quorum algorithm executed by the four cluster control units 11-1 to 11-4.

Next, software upgrades in this embodiment will be described. As described above, the cluster software is separated into the cluster control units 11 and the agents 12, which operate independently of each other, so the cluster control units 11 can be upgraded without stopping the service 128 provided by the agents 12. This makes it possible to keep providing the service 128 to the client terminal 23. An upgrade of the agents 12 is handled as in the conventional method: because the agents 12 control the provision of the service 128, the active and standby systems are swapped by a service switchover.

Examples of upgrades for specific service configurations (configuration examples of the cluster system) are described below.

FIG. 2 shows a case in which each machine (server computer) 10 contains both a cluster control unit 11 and an agent 12, and one or more services run across a plurality of machines. The cluster control units 11-1 to 11-N cooperate to control the services 128-1 to 128-N. The number of services need not equal the number of cluster control units.

The upgrade procedure for this case is shown in the flowchart of FIG. 3. As shown in step S12, one or more services 128-1 to 128-N are running on the servers 10-1 to 10-N. In step S14, the variable i is set to 1. In step S16, the cluster control unit 11-i of the server 10-i is stopped; the other cluster control units keep running. In step S18, the cluster software of the stopped cluster control unit 11-i is upgraded. In step S20, the upgraded cluster control unit 11-i is restarted. In step S22, it is determined whether all cluster control units 11 have been upgraded. If not, the variable i is incremented by 1 in step S24 and the processing from step S16 is repeated for the new value of i.
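The loop of steps S14 to S24 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the `ClusterControlUnit` class, its `stop`, `upgrade`, and `start` methods, and the version strings are hypothetical stand-ins, and the services are not modelled because the procedure never touches them.

```python
from typing import List


class ClusterControlUnit:
    """Hypothetical stand-in for one cluster control unit (11-i)."""

    def __init__(self, name: str, version: str) -> None:
        self.name = name
        self.version = version
        self.running = True

    def stop(self) -> None:
        self.running = False

    def upgrade(self, new_version: str) -> None:
        # Software is replaced only while the unit is stopped.
        if self.running:
            raise RuntimeError(f"{self.name} must be stopped before upgrading")
        self.version = new_version

    def start(self) -> None:
        self.running = True


def rolling_upgrade(units: List[ClusterControlUnit], new_version: str) -> None:
    """Upgrade the units one at a time while the others keep running."""
    for unit in units:  # S14 and S24: i = 1, 2, ..., N
        # Refuse to stop the last running unit, so at least one stays up.
        if not any(u.running for u in units if u is not unit):
            raise RuntimeError("no other cluster control unit is running")
        unit.stop()                # S16
        unit.upgrade(new_version)  # S18
        unit.start()               # S20
    # S22: the loop ends once every unit has been upgraded.
```

The guard before `unit.stop()` is an addition for the sketch; it makes explicit the condition that the other cluster control units keep running while one is stopped.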

By upgrading the cluster control units 11-1 to 11-N of the server computers 10-1 to 10-N one at a time in this way, the cluster software can be upgraded without stopping the services 128-1 to 128-N running on the agents 12-1 to 12-N. In addition, because the other cluster control units among 11-1 to 11-N keep running during the upgrade, a takeover to another server (cluster control unit) remains possible even if a failure occurs in one of the services 128-1 to 128-N, and the service is prevented from stopping. Although the flowchart of FIG. 3 stops one cluster control unit 11 at a time for upgrading, it suffices that at least one cluster control unit is running during the upgrade, so several units may be upgraded together.
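Upgrading several units together amounts to splitting the cluster control units into batches, with the one constraint that a batch never contains every unit. The Python sketch below plans such batches; the function name `plan_batches` and the use of unit names as plain strings are assumptions made for illustration only.

```python
from typing import List


def plan_batches(unit_names: List[str], batch_size: int) -> List[List[str]]:
    """Split the units into upgrade batches, never stopping all at once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # A batch covering every unit would leave no cluster control unit running.
    if batch_size >= len(unit_names):
        raise ValueError("at least one cluster control unit must keep running")
    return [unit_names[i:i + batch_size]
            for i in range(0, len(unit_names), batch_size)]
```

With four units and a batch size of two, the plan is two batches of two; a batch size of four is rejected because it would stop every unit at once.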

FIG. 4 shows a case in which the cluster control units 11 and the agents 12 are on separate machines (server computers) 10-1 to 10-4, and a single service 128 runs on one machine 10-1 only (the other machine 10-2 is not running the service 128). The cluster control units 11-1 and 11-2 cooperate to control the one service 128.

The upgrade procedure for this case is shown in the flowchart of FIG. 5. As shown in step S32, the service 128-1 is running on the machine 10-1. In step S34, the cluster control unit 11-1 of the machine 10-3 is stopped; the other cluster control unit 11-2 keeps running. In step S36, the cluster software of the stopped cluster control unit 11-1 is upgraded. In step S38, the upgraded cluster control unit 11-1 is restarted. Next, in step S40, the cluster control unit 11-2 of the machine 10-4 is stopped; the cluster control unit 11-1 keeps running. In step S42, the cluster software of the stopped cluster control unit 11-2 is upgraded. In step S44, the upgraded cluster control unit 11-2 is restarted.

By upgrading the cluster control units 11-1 and 11-2 one at a time in this way, the cluster software can be upgraded without stopping the service 128-1 running on the agent 12-1. In addition, because the other cluster control unit keeps running during the upgrade, a takeover to another server (cluster control unit) remains possible even if a failure occurs in the service 128, and the service is prevented from stopping. Furthermore, because the agent 12-2 waits in a hot standby state for the service 128-1, the agents can be switched immediately even if a failure occurs in the agent 12-1, which increases availability.

FIG. 6 shows a case in which the cluster control units 11 and the agents 12 are on separate machines (server computers) 10, and one or more services run across a plurality of machines. The cluster control units 11-(N+1) to 11-(N+M) cooperate to control the services 128-1 to 128-N. N and M may be equal or different.

The upgrade procedure for this case is shown in the flowchart of FIG. 7. As shown in step S62, one or more services 128-1 to 128-N are running on the machines 10-(N+1) to 10-(N+M). In step S64, the variable i is set to 1. In step S66, the cluster control unit 11-(N+i) of the machine 10-(N+i) is stopped; the other cluster control units keep running. In step S68, the cluster software of the stopped cluster control unit 11-(N+i) is upgraded. In step S70, the upgraded cluster control unit 11-(N+i) is restarted. In step S72, it is determined whether all cluster control units 11 have been upgraded. If not, the variable i is incremented by 1 in step S74 and the processing from step S66 is repeated for the new value of i.

By upgrading the cluster control units 11-(N+1) to 11-(N+M) of the machines 10-(N+1) to 10-(N+M) one at a time in this way, the cluster software can be upgraded without stopping the services 128-1 to 128-N running on the agents 12-1 to 12-N. In addition, because the other cluster control units among 11-(N+1) to 11-(N+M) keep running during the upgrade, a takeover to another server (cluster control unit) remains possible even if a failure occurs in one of the services 128-1 to 128-N, and the service is prevented from stopping. As in the example of FIG. 3, several units may be upgraded together.

The present invention is not limited to the embodiment exactly as described above; at the implementation stage, the constituent elements can be modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiment. For example, some constituent elements may be removed from the full set shown in the embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

FIG. 1 is a block diagram showing the configuration of a cluster system according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing one configuration example of a cluster system, used to explain one specific example of an upgrade.
FIG. 3 is a flowchart explaining the upgrade processing for the configuration example of FIG. 2.
FIG. 4 is a block diagram showing another configuration example of a cluster system, used to explain another specific example of an upgrade.
FIG. 5 is a flowchart explaining the upgrade processing for the configuration example of FIG. 4.
FIG. 6 is a block diagram showing yet another configuration example of a cluster system, used to explain yet another specific example of an upgrade.
FIG. 7 is a flowchart explaining the upgrade processing for the configuration example of FIG. 6.

Explanation of symbols

10-1 to 10-4 ... server computers, 11-1 to 11-4 ... cluster control units, 12-1 to 12-2 ... agents, 21, 22 ... networks, 23 ... client terminal, 110 ... kernel, 122 ... communication path switching control unit, 124 ... heartbeat transmission unit, 126 ... service control unit.

Claims

A cluster system composed of a plurality of computers, comprising:
a plurality of agents that operate independently on at least some of the plurality of computers constituting the cluster system and perform service control for providing a service requested from a client terminal;
a plurality of cluster control units that operate independently on at least some of the plurality of computers constituting the cluster system, each cluster control unit controlling the agents synchronously while communicating with the other cluster control units, so that the cluster control units together perform consistent cluster control as a single kernel; and
upgrade means for sequentially upgrading the cluster control units by stopping the remaining cluster control units while at least one cluster control unit is kept operating, upgrading the software of the stopped cluster control units, and sequentially changing which cluster control units are stopped and upgraded.

The cluster system according to claim 1, wherein
each of the plurality of computers includes an agent and a cluster control unit,
the plurality of agents respectively control a plurality of services, and
the upgrade means stops and upgrades the cluster control units one at a time.

The cluster system according to claim 1, wherein
the plurality of computers include a first computer that includes an agent and no cluster control unit, and a second computer that includes a cluster control unit and no agent,
the plurality of agents include a first agent that controls a service and a second agent that is in a standby state and does not control the service, and
the upgrade means stops and upgrades the cluster control units one at a time.

The cluster system according to claim 1, wherein
the plurality of computers include a first computer that includes an agent and no cluster control unit, and a second computer that includes a cluster control unit and no agent,
the plurality of agents respectively control a plurality of services, and
the upgrade means stops and upgrades the cluster control units one at a time.

The cluster system according to claim 1, wherein only one of the cluster control unit and the agent operates on each computer.

The cluster system according to claim 1, wherein each of the plurality of agents comprises:
a communication path switching control unit that switches the communication path to the cluster control units so as to communicate with any one of the cluster control units;
a heartbeat transmission unit used to detect interruption of the heartbeat with other agents; and
a service control unit that controls execution of the service.

An upgrade method for a cluster system composed of a plurality of computers, the cluster system comprising:
a plurality of agents that operate independently on at least some of the plurality of computers constituting the cluster system and perform service control for providing a service requested from a client terminal; and
a plurality of cluster control units that operate independently on at least some of the plurality of computers constituting the cluster system, each cluster control unit controlling the agents synchronously while communicating with the other cluster control units, so that the cluster control units together perform consistent cluster control as a single kernel,
the method comprising sequentially upgrading the cluster control units by stopping the remaining cluster control units while at least one cluster control unit is kept operating, upgrading the software of the stopped cluster control units, and sequentially changing which cluster control units are stopped and upgraded.

A program for a cluster system composed of a plurality of computers, the cluster system comprising:
a plurality of agents that operate independently on at least some of the plurality of computers constituting the cluster system and perform service control for providing a service requested from a client terminal; and
a plurality of cluster control units that operate independently on at least some of the plurality of computers constituting the cluster system, each cluster control unit controlling the agents synchronously while communicating with the other cluster control units, so that the cluster control units together perform consistent cluster control as a single kernel,
the program causing the computers to sequentially upgrade the cluster control units by stopping the remaining cluster control units while at least one cluster control unit is kept operating, upgrading the software of the stopped cluster control units, and sequentially changing which cluster control units are stopped and upgraded.