CN1251103C

CN1251103C - Method for improving serviceability of business machine group

Info

Publication number: CN1251103C
Application number: CN 02159492
Authority: CN
Inventors: 李电森; 冯锐; 黄平; 肖利民
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2002-12-31
Filing date: 2002-12-31
Publication date: 2006-04-12
Anticipated expiration: 2022-12-31
Also published as: CN1512363A

Abstract

The present invention relates to a method for improving the serviceability of a business cluster. In the method, a client communicates with a preposition node and sends its service requests to that node; each service node transmits its load information to the preposition node over a heartbeat ring; the preposition node determines the current processing capacity of the service nodes from their configuration information and load information and ranks the service nodes accordingly; when a client request arrives, load is distributed according to this ranking. In addition, each service node is provided with modules that monitor each other, so that the node runs with high reliability. The invention gives the service nodes double protection and self-recovery capabilities at the process level, the application level and the system level; transmitting load information over the heartbeat ring achieves dynamic load balancing and avoids service unavailability caused by the collapse of one (or some) of the service nodes; the processing capacity of the service nodes is thus fully utilized to provide highly available service to users.

Description

Method for improving the serviceability of a business cluster

Technical field

The present invention relates to a method for improving the serviceability of a business cluster, and in particular to a system and method for monitoring the state of the service nodes in a computer cluster system and achieving dynamic load balancing; it belongs to the technical field of computer networks.

Background technology

In a computer cluster environment, a monitoring node is responsible for collecting the state of the resources, processes and services of each node, and for starting or stopping processes or services according to specific instructions. Usually, the monitoring node simply collects the state of the processes on each node in the system and starts or stops the corresponding processes according to instructions from the control node.

In fact, the main goal is to use all the service nodes to provide highly available service. The availability of a service means the ability to provide the service to users using software and hardware resources; a highly available service must be reliable and stable, with a response time that users can accept. Process state, however, is not the only factor that determines the availability of a system's services: a service node that is running normally may be unable to provide a given service because some resource is scarce. To avoid such resource scarcity, load balancing must be applied to the service nodes.

Owing to various constraints, existing load balancing strategies are all fairly simple. For example, the round-robin method forwards client requests to each service node in turn; the least-connections method preferentially sends client requests to the service node with the fewest connections; and so on. In fact, these methods cannot achieve real load balancing, mainly because: 1) the connection requests of different services differ, some of them may consume a large amount of resources, and the number of connections does not truly reflect the load of a service node; 2) the configurations of service nodes differ: because a cluster is easily extended, service nodes added later may be far better equipped than the original ones; 3) the tasks running on each service node differ: some nodes may provide several services or run a large number of processes, so resource consumption varies from node to node and changes at any time. Therefore, to achieve real load balancing, the state of the service nodes must be monitored in real time and load must be distributed according to their current processing capacity. In addition, a service is often provided jointly by a group of related processes, which usually depend on one another; once one of these processes fails, the whole service is affected.

In some environments, the processes, services and resources on each service node are monitored by a central control node; the advantage of this approach is centralized control. However, a failed service node often has more than a single faulty process; the corresponding resources are often scarce as well, so the node may well be unable to handle the commands of the central control node correctly, and in the end its fault is never recovered.

Summary of the invention

The main object of the present invention is to provide a method for improving the serviceability of a business cluster that ensures stable and reliable operation of the computer cluster system and high availability of its services, with the ability to recover from faults by itself.

Another object of the present invention is to provide a method for improving the serviceability of a business cluster that achieves load balancing among all nodes.

A further object of the present invention is to provide a method for improving the serviceability of a business cluster that monitors system resources, and issues an early warning and takes timely action when the load on a resource reaches a specified waterline, thereby minimizing the risk of a system crash.

The objects of the present invention are achieved as follows:

A client communicates with a preposition node that is provided with a node service monitor module, and sends its service requests to the preposition node. Each service node, likewise provided with the node service monitor module, passes its load information to the preposition node over a heartbeat ring formed by the service nodes and the preposition node. The preposition node determines the current processing capacity of the service nodes from the configuration information of each service node and the load information received, and ranks the service nodes in order of current processing capacity; when a client service request arrives, load is distributed according to this ranking;

The preposition node periodically receives the load information of the whole heartbeat ring and, according to this load information, controls the distribution of client service requests and the reconstruction of the heartbeat ring;

If the preposition node fails to collect the heartbeat information of a failed service node within a specified time, the whole heartbeat ring is reconstructed, the failed service node is excluded from the new heartbeat ring, and client service requests are no longer distributed to it;

After a failed service node recovers, it can apply to rejoin the heartbeat ring.

Both the service nodes and the preposition node are provided with the node service monitor (LifeGuard) module, which monitors the state of the system's processes, services and resources; specifically:

Step A: Obtain the configuration information and start the related processes and services;

Step B: Monitor the state of resources, processes and services;

Step C: Determine whether a system resource has reached the waterline; if not, go to step E;

Step D: Send an early warning to the preposition node; go to step B;

Step E: Determine whether any process or service is abnormal; if not, go to step B;

Step F: Restart the process or service;

Step G: Determine whether the restart succeeded; if so, go to step B;

Step H: Send a failure message to the preposition node and restart the system; end.
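The decision logic of steps C to H can be sketched as a single function. This is only an illustrative sketch: the names `NodeStatus` and `lifeguard_step`, the representation of resource usage as a fraction, and the returned action strings are assumptions made for the example and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class NodeStatus:
    """Hypothetical snapshot produced by step B (monitoring)."""
    resource_usage: float  # fraction (0.0 to 1.0) of the key resource in use
    services_ok: bool      # True when all monitored processes and services are healthy


def lifeguard_step(status: NodeStatus, waterline: float,
                   restart: Callable[[], bool]) -> str:
    """Run steps C to H once and return the action that was chosen."""
    # Steps C and D: the resource has reached the waterline, so warn the preposition node.
    if status.resource_usage >= waterline:
        return "warn_preposition_node"
    # Step E: nothing is abnormal, so go back to monitoring (step B).
    if status.services_ok:
        return "continue_monitoring"
    # Steps F and G: try a restart; on success go back to monitoring.
    if restart():
        return "restarted"
    # Step H: the restart failed, so report the failure and reboot the system.
    return "report_failure_and_reboot"
```

A caller would invoke `lifeguard_step` in a loop after each monitoring pass, supplying its own `restart` callback that returns `True` when the restart succeeded.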

The service node is also provided with a command executor (Executor) module; the Executor and LifeGuard modules monitor each other so that the node runs with high reliability.

The flow by which the LifeGuard module monitors the Executor module in a service node is as follows:

Step 10: Initialize the system and start the related processes, including the Executor;

Step 11: Carry out the node's normal transaction processing;

Step 12: Monitor the state of the Executor; if it is normal, repeat step 12;

Step 13: Restart the Executor;

Step 14: Carry out the node's other transaction processing.

The flow by which the Executor module monitors the LifeGuard module in a service node is as follows:

Step 20: Initialize the system and start the related processes, including LifeGuard;

Step 21: Carry out the node's normal transaction processing;

Step 22: Monitor the state of LifeGuard; if it is normal, repeat step 22;

Step 23: Restart LifeGuard;

Step 24: Carry out the node's other transaction processing.
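The mutual monitoring described in the two flows above can be illustrated with a small in-memory simulation. This is a sketch under assumptions: the `Module` class, its `alive` flag and the `make_pair` helper are hypothetical, and a real implementation would check and restart operating-system processes rather than flip a flag.

```python
class Module:
    """Hypothetical stand-in for the LifeGuard or Executor process."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.restarts = 0
        self.peer = None

    def check_peer(self):
        """Steps 12-13 / 22-23: restart the peer if it is found dead.

        Returns True when a restart was performed.
        """
        if self.alive and self.peer is not None and not self.peer.alive:
            self.peer.alive = True      # simulated restart of the peer
            self.peer.restarts += 1
            return True
        return False


def make_pair():
    """Create a LifeGuard/Executor pair that watch each other."""
    lifeguard = Module("LifeGuard")
    executor = Module("Executor")
    lifeguard.peer, executor.peer = executor, lifeguard
    return lifeguard, executor
```

Because each module only restarts its peer, the pair survives the failure of either one, but not the simultaneous failure of both; the sketch makes no attempt to cover that case.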

The preposition node periodically receives the load information of the whole heartbeat ring and, according to this load information, controls the distribution of client requests and the reconstruction of the heartbeat ring.

If the preposition node fails to collect the heartbeat information of a failed node within a specified time, the whole heartbeat ring is reconstructed, the failed node is excluded from the new heartbeat ring, and client service requests are no longer distributed to it. After a failed service node recovers, it can apply to rejoin the heartbeat ring.

The present invention uses LifeGuard and Executor to give each service node double protection, so that the service node has three kinds of self-recovery capability: process level, application level and system level. Using the heartbeat ring to transmit load information achieves truly dynamic load balancing and avoids the service unavailability caused by the collapse of a single service node (or some of the service nodes); the processing capacity of the service nodes is thus fully utilized to provide highly available service to users.

Description of drawings

Fig. 1 is a schematic diagram of the computer cluster of the present invention;

Fig. 2 is a flow chart of system monitoring and automatic recovery by the LifeGuard module of the present invention;

Fig. 3 is a flow chart of the LifeGuard module of the present invention monitoring the Executor module;

Fig. 4 is a flow chart of the Executor module of the present invention monitoring the LifeGuard module;

Fig. 5 is a flow chart of the LifeGuard module monitoring system processes, services and resources in a specific embodiment of the present invention.

Embodiment

The present invention is described in detail below with reference to specific embodiments and the drawings.

The structure of the computer cluster with dual monitoring is shown in Fig. 1. The system consists of client A, preposition node B and a number of service nodes. Client A communicates with preposition node B, and all requests from client A are sent to preposition node B. Heartbeat lines connect service node 1, service node 2, ... and preposition node B into a heartbeat ring, and agent programs on each service node collect the load information of that node and pass it to preposition node B over the heartbeat ring. Preposition node B determines the current processing capacity of the service nodes from the configuration information and the collected load information and ranks the service nodes by priority accordingly; when the next client request arrives, load is distributed in order of priority. In this way resources are balanced dynamically and the processing capacity of the service node hardware is put to full use.
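The ranking performed by the preposition node can be sketched as follows. The patent does not state how configuration and load are combined, so the formula used here (configured capacity multiplied by the idle fraction) is purely an assumption for illustration, as are the function name `rank_nodes` and the shape of its inputs.

```python
def rank_nodes(config, load):
    """Return node names ordered from highest to lowest current capacity.

    config: node name -> configured capacity weight (assumed, e.g. CPU count)
    load:   node name -> current load as a fraction between 0.0 and 1.0
    Nodes with no load report are left out of the ranking.
    """
    # Assumed formula: remaining capacity = configured weight * idle fraction.
    capacity = {n: config[n] * (1.0 - load[n]) for n in config if n in load}
    # Highest capacity first; the name breaks ties so the order is deterministic.
    return sorted(capacity, key=lambda n: (-capacity[n], n))
```

For example, with `config = {"node1": 4, "node2": 8, "node3": 4}` and `load = {"node1": 0.5, "node2": 0.9, "node3": 0.25}`, the heavily loaded `node2` ranks last even though it has the largest configuration, which is the behaviour the least-connections and round-robin methods cannot provide.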

Besides passing the load information of the service nodes, the heartbeat ring mechanism has another function: detecting service node faults in real time. If one (or more) service nodes fail, the heartbeat ring is interrupted at the failed node. If the node downstream of the failed node does not receive the ring load information from its upstream node within a certain period, it reports to the preposition node that its upstream node has failed. On receiving the report, the preposition node notifies the upstream and downstream neighbours of the failed node to update their own downstream and upstream nodes respectively, thereby reconstructing the whole heartbeat ring; the failed node is excluded from the new ring and no service requests are distributed to it. If the preposition node times out without receiving the load information of the whole heartbeat ring, the ring is rebuilt and initialized as if all nodes were joining it anew. After a failed node recovers, it can apply to rejoin the heartbeat ring. With this mechanism, the preposition node can keep track of the load of all service nodes in time and balance load dynamically, while avoiding service unavailability caused by service node faults.
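The ring reconstruction described above can be sketched with the ring held as an ordered list. The function names, the list representation and the choice to append a rejoining node at the end of the ring are assumptions for illustration; the patent does not specify where a recovered node is reinserted.

```python
def neighbours(ring, node):
    """Return (upstream, downstream) of a node in the ring."""
    i = ring.index(node)
    # ring[i - 1] wraps to the last element when i == 0.
    return ring[i - 1], ring[(i + 1) % len(ring)]


def remove_failed(ring, failed):
    """Rebuild the ring without the failed node.

    The failed node's former upstream and downstream become direct neighbours.
    """
    return [n for n in ring if n != failed]


def rejoin(ring, node):
    """Add a recovered node back to the ring (assumed: at the end)."""
    return ring if node in ring else ring + [node]
```

For a ring `["front", "s1", "s2", "s3"]`, removing `s2` leaves `s1` and `s3` adjacent, which corresponds to the preposition node telling the two neighbours to update their downstream and upstream nodes.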

Two modules are provided on each service node: the LifeGuard module and the Executor module. The LifeGuard module is also deployed on the preposition node and is responsible for monitoring the state of the system's processes, services and resources.

Different types of services place different demands on resources. For example, some applications are computation-intensive, and the resource they care about most is the CPU, while others make heavy demands on I/O, memory or the network. The physical hardware of each service node is always limited, however, and no particular hardware configuration suits all applications. In an extreme case, most of a service node's resources may be idle while one critical resource is exhausted, so that the service cannot be provided. The system's resources therefore need to be monitored, and a waterline is set for each critical resource: once the utilization of such a resource exceeds the waterline, the node raises an alarm to the preposition node, which adjusts its forwarding strategy and stops forwarding such requests to this service node, so that the node's resources are not exhausted.
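The waterline check and the resulting change in forwarding can be sketched as below. The resource names, the use of utilization fractions and both function names are assumptions for the example; the patent only states that a waterline is set per critical resource and that the preposition node stops forwarding to a node that raises the alarm.

```python
def over_waterline(usage, waterlines):
    """Return the sorted names of resources whose utilization exceeds its waterline.

    usage:      resource name -> current utilization (0.0 to 1.0)
    waterlines: resource name -> configured waterline (0.0 to 1.0)
    """
    return sorted(r for r, limit in waterlines.items() if usage.get(r, 0.0) > limit)


def eligible_nodes(nodes_usage, waterlines):
    """Return the nodes the preposition node may still forward requests to."""
    return [n for n, usage in nodes_usage.items()
            if not over_waterline(usage, waterlines)]
```

With waterlines `{"cpu": 0.9, "memory": 0.8}`, a node at 30% CPU but 95% memory is excluded even though most of its resources are idle, which is the extreme case the paragraph above describes.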

The high availability of a service ultimately derives from the availability of the service nodes, and every service is provided by a series of related processes; ensuring the availability of the application services on a service node and of the processes they depend on is therefore the basis for providing highly available service.

The processes that jointly provide a service often depend on one another. When a service is started, the processes it depends on must be started first, and processes must be restarted in a specific order, as shown in Fig. 2:

Step B: Monitor the state of resources, processes and services;

Step E: Determine whether any process or service is abnormal; if not, go to step B;

Step F: Restart the process or service;

Step G: Determine whether the restart succeeded; if so, go to step B;

Step H: Send a failure message to the preposition node and restart the system; end.

LifeGuard thus becomes the guarantee of a service node's availability; if LifeGuard itself fails, the node's self-recovery capability is lost entirely. The present invention therefore has another module, the Executor, and LifeGuard monitor each other: once one of them fails, the other actively recovers it. The monitoring is implemented as shown in Fig. 3 and Fig. 4.

The flow by which the LifeGuard module monitors the Executor module in a service node is as follows:

Step 11: Carry out the node's normal transaction processing;

Step 13: Restart the Executor;

Step 14: Carry out the node's other transaction processing.

The flow by which the Executor module monitors the LifeGuard module in a service node is as follows:

Step 21: Carry out the node's normal transaction processing;

Step 23: Restart LifeGuard;

Step 24: Carry out the node's other transaction processing.

The flow by which the LifeGuard module of the present invention monitors the system and recovers it automatically is shown in Fig. 5:

Step 101: If the system has received a signal, go to step 120; otherwise proceed in order from step 102;

Step 102: Read the services running on the node;

Step 103: Read the resources that need to be monitored for this service;

Step 104: Determine whether there is a resource to be monitored; if not, go to step 130;

Step 105: Check this resource;

Step 106: Determine whether the resource check is normal; if it is not, go to step 111;

Step 107: Determine whether this resource has reached the warning line; if not, go to step 110;

Step 108: Handle the warning line and adjust the weight of this resource among all resources;

Step 109: Notify the master node that the resource has reached the warning line;

Step 110: Read the next resource to be monitored; go to step 104;

Step 111: Record the error; go to step 110;

Step 120: Determine whether the signal is SIGKILL; if so, end;

Step 121: Determine whether the signal is SIGHUP; if not, go to step 102;

Step 122: Re-read the resource waterline settings; go to step 102;

Step 130: Read the queue of processes that need to be monitored;

Step 131: Determine whether there is a process to be monitored; if not, go to step 136;

Step 132: Check this process;

Step 133: Determine whether this process is abnormal; if it is not, go to step 135;

Step 134: Recover this process;

Step 135: Get the next process that needs to be monitored; go to step 131;

Step 136: Determine whether the service is abnormal; if not, go to step 138;

Step 137: Report the service fault to the preposition node;

Step 138: Get the next service of this node;

Step 139: Determine whether there are more services to be monitored; if so, go to step 103;

Step 140: Sleep for a waiting period, then go to step 101.
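One pass of the loop in steps 101 to 139 can be sketched as a pure function. This is an illustrative reading, not the patented implementation: the data layout of `services`, the event tuples, the use of `None` to represent a failed resource check, and the `reload_waterlines` callback are all assumptions. Signals are passed in as plain strings because, on POSIX systems, a process cannot actually catch SIGKILL.

```python
def monitor_pass(services, waterlines, signal=None, reload_waterlines=None):
    """Run one monitoring pass; return (events, waterlines), or None to stop.

    services: name -> {"resources": {res: usage or None},
                       "processes": {proc: alive flag},
                       "service_ok": bool}          (assumed layout)
    """
    if signal == "SIGKILL":                           # step 120: terminate
        return None
    if signal == "SIGHUP" and reload_waterlines:      # steps 121-122: re-read settings
        waterlines = reload_waterlines()
    events = []
    for name, svc in services.items():                # steps 102, 138-139
        for res, used in svc["resources"].items():    # steps 103-110
            if used is None:                          # check failed: record error (step 111)
                events.append(("error", name, res))
            elif used >= waterlines.get(res, 1.0):    # steps 107-109: warning line reached
                events.append(("warning", name, res))
        for proc, alive in svc["processes"].items():  # steps 130-135
            if not alive:
                events.append(("recover", name, proc))
        if not svc["service_ok"]:                     # steps 136-137: report service fault
            events.append(("report_fault", name, None))
    return events, waterlines
```

A caller would sleep between passes (step 140) and act on the returned events, for example by restarting the listed processes and sending the warnings and fault reports to the preposition node.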

As the above flow shows, the LifeGuard module provides self-recovery capability at three levels:

Process level: LifeGuard can monitor the state of all processes, and the processes related to each service are configured; once any of these processes fails, it can be restarted according to a predefined policy.

Application level: a service usually takes the form of a service program or daemon. Sometimes a service is unavailable even though all the processes it depends on are normal; for example, after a socket port is released, a certain delay must pass before it can be used again. LifeGuard's policy for application-level programs is to test the application with a local client or test program; if it is determined that the service cannot be provided, LifeGuard notifies the preposition node to suspend task distribution and restarts the server program.

System level: when a serious error occurs, many programs become unavailable and the system can only be recovered by restarting it. LifeGuard's policy is to use a software watchdog (softdog) mechanism to restart and recover the system.

Finally, it should be noted that the above embodiments only illustrate the present invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications and substitutions shall fall within the scope of the claims of the present invention.

Claims

1. A method for improving the serviceability of a business cluster, characterized in that:

2. The method for improving the serviceability of a business cluster according to claim 1, characterized in that: each service node is provided with an agent module for collecting the load information of that node and sending it to the preposition node over the heartbeat ring.

3. The method for improving the serviceability of a business cluster according to claim 1 or 2, characterized in that: the node service monitor module provided in the service node is used to monitor the state of the processes, services and resources of the business cluster system, and the monitoring flow is as follows:

Step A: The node service monitor module of the service node obtains the configuration information of the service node and starts the related processes and services of the service node;

Step B: The node service monitor module of the service node monitors the state of the resources, processes and services of the service node;

Step C: The node service monitor module of the service node determines whether a resource of the service node has reached the resource utilization waterline; if not, go to step E;

Step E: The node service monitor module of the service node determines whether any process or service of the service node is abnormal; if not, go to step B;

Step F: Restart the service node's process or service;

Step G: Determine whether the restart succeeded; if so, go to step B;

Step H: Send a failure message to the preposition node and restart the business cluster system; end.

4. The method for improving the serviceability of a business cluster according to claim 1 or 2, characterized in that: the service node is also provided with a command executor module, which and the node service monitor module of the service node monitor each other, so that the service node runs with high reliability.

5. The method for improving the serviceability of a business cluster according to claim 4, characterized in that: the flow by which the node service monitor module in the service node monitors the command executor module in the service node is as follows:

Step 10: Initialize the business cluster system and start the related processes, including the command executor module;

Step 11: Carry out the service node's normal application service transaction processing;

Step 12: Monitor the state of the command executor module; if it is normal, repeat step 12;

Step 13: Restart the command executor module;

Step 14: Carry out the service node's other application service transaction processing.

6. The method for improving the serviceability of a business cluster according to claim 4, characterized in that: the flow by which the command executor module in the service node monitors the node service monitor module in the service node is as follows:

Step 20: Initialize the service node system and start the related processes, including the node service monitor module;

Step 21: Carry out the service node's normal application service transaction processing;

Step 22: Monitor the state of the node service monitor module; if it is normal, repeat step 22;

Step 23: Restart the node service monitor module;

Step 24: Carry out the service node's other transaction processing.