GB2330431A

GB2330431A - Client/server computing with failure detection

Info

Publication number: GB2330431A
Application number: GB9721914A
Authority: GB
Inventors: Amanda Elizabeth Chessell
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1997-10-17
Filing date: 1997-10-17
Publication date: 1999-04-21
Anticipated expiration: 2017-10-17
Also published as: GB2330431B; GB9721914D0

Abstract

A method operable in a client/server computing system for processing a distributed transaction, uses a superior coordinator located in a first server and a subordinate coordinator and a server resource having local data located in a second server. The method, taking place at the second server, comprises steps of: checking whether the operating system process of the superior coordinator has failed; and upon detecting such a failure, commanding said server resource to roll back any changes said server resource has made to its local data during the carrying out of said transaction.

Description

APPARATUS, METHOD AND COMPUTER PROGRAM PRODUCT FOR CLIENT/SERVER COMPUTING WITH FAILURE DETECTION

Field of the Invention

The invention relates to the field of client/server (also known as "distributed") computing, where one computing device ("the client") requests another computing device ("the server") to perform part of the client's work. The client and server can also both be located on the same physical computing device.

Background of the Invention

Client/server computing has become more and more important over the past few years in the information technology world. This type of distributed computing allows one machine to delegate some of its work to another machine that might be, for example, better suited to perform that work. For example, the server could be a high-powered computer running a database program managing the storage of a vast amount of data, while the client is simply a desktop personal computer (PC) which requests information from the database to use in one of its local programs.

The benefits of client/server computing have been even further enhanced by the use of a well-known computer programming technology called object-oriented programming (OOP), which allows the client and server to be located on different (heterogeneous) "platforms". A platform is a combination of the specific hardware/software/operating system/communication protocol which a machine uses to do its work. OOP allows the client application program and server application program to operate on their own platforms without worrying how the client application's work requests will be communicated and accepted by the server application. Likewise, the server application does not have to worry about how the OOP system will receive, translate and send the server application's processing results back to the requesting client application.

Details of how OOP techniques have been integrated with heterogeneous client/server systems are explained in US Patent No. 5,440,744 and European Patent Published Application No. EP 0 677,943 A2. These latter two publications are hereby incorporated by reference. However, an example of the basic architecture will be given below for contextual understanding of the invention's environment.

As shown in Fig. 1, the client computer 10 (which could, for example, be a personal computer having the IBM OS/2 operating system installed thereon) has an application program 40 running on its operating system ("IBM" and "OS/2" are trademarks of the International Business Machines Corporation). The application program 40 will periodically require work to be performed on the server computer 20 and/or data to be returned from the server 20 for subsequent use by the application program 40. The server computer 20 can be, for example, a high-powered mainframe computer running on IBM's MVS operating system ("MVS" is also a trademark of the IBM Corp.). For the purposes of the present invention it is irrelevant whether the requests for communications services to be carried out by the server are instigated by user interaction with the first application program 40, or whether the application program 40 operates independently of user interaction and makes the requests automatically during the running of the program.

When the client computer 10 wishes to make a request for the server computer 20's services, the first application program 40 informs the first logic means 50 of the service required. It may for example do this by sending the first logic means the name of a remote procedure along with a list of input and output parameters. The first logic means 50 then handles the task of establishing the necessary communications with the second computer 20 with reference to definitions of the available communications services stored in the storage device 60. All the possible services are defined as a cohesive framework of object classes 70, these classes being derived from a single object class. Defining the services in this way gives rise to a great number of advantages in terms of performance and reusability.

To establish the necessary communication with the server 20, the first logic means 50 determines which object class in the framework needs to be used, and then creates an instance of that object at the server, a message being sent to that object so as to cause that object to invoke one of its methods. This gives rise to the establishment of the connection with the server computer 20 via the connection means 80, and the subsequent sending of a request to the second logic means 90.

The second logic means 90 then passes the request on to the second application program 100 (hereafter called the service application) running on the server computer 20 so that the service application 100 can perform the specific task required by that request, such as running a data retrieval procedure. Once this task has been completed the service application may need to send results back to the first computer 10. The server application 100 interacts with the second logic means 90 during the performance of the requested tasks and when results are to be sent back to the first computer 10. The second logic means 90 establishes instances of objects, and invokes appropriate methods of those objects, as and when required by the server application 100, the object instances being created from the cohesive framework of object classes stored in the storage device 110.

Using the above technique, the client application program 40 is not exposed to the communications architecture. Further the service application 100 is invoked through the standard mechanism for its environment; it does not know that it is being invoked remotely.

The Object Management Group (OMG) is an international consortium of organizations involved in various aspects of client/server computing on heterogeneous platforms with distributed objects as is shown in Fig. 1.

The OMG has set forth published standards by which client computers (e.g. 10) communicate (in OOP form) with server machines (e.g. 20). As part of these standards, an Object Request Broker (called CORBA, the Common Object Request Broker Architecture) has been defined, which provides the object-oriented bridge between the client and the server machines. The ORB decouples the client and server applications from the object-oriented implementation details, performing at least part of the work of the first and second logic means 50 and 90 as well as the connection means 80.

As part of the CORBA software structure, the OMG has set forth standards related to "transactions" and these standards are known as the OTS or Object Transaction Service. See, e.g., CORBA Object Transaction Service Specification 1.0, OMG Document 94.8.4. Computer implemented transaction processing systems are used for critical business tasks in a number of industries. A transaction defines a single unit of work that must either be fully completed or fully purged without action. For example, in the case of a bank automated teller machine from which a customer seeks to withdraw money, the actions of issuing the money, reducing the balance of money on hand in the machine and reducing the customer's bank balance must all occur or none of them must occur.

Failure of one of the subordinate actions would lead to inconsistency between the records and the actual occurrences.
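By way of illustration only, the all-or-nothing property described above can be sketched as follows. The function name, the dictionary keys and the snapshot-based rollback are assumptions made for this example and are not taken from the OTS standard.

```python
class InsufficientFunds(Exception):
    """Raised when one of the subordinate actions cannot be carried out."""


def withdraw(state, amount):
    """Apply all of the teller-machine updates, or none of them."""
    snapshot = dict(state)  # remember the state before the transaction
    try:
        if state["cash_in_machine"] < amount:
            raise InsufficientFunds("machine is short of cash")
        state["cash_in_machine"] -= amount
        if state["customer_balance"] < amount:
            raise InsufficientFunds("customer balance too low")
        state["customer_balance"] -= amount
        state["cash_issued"] += amount
        return True  # commit: every change stands
    except InsufficientFunds:
        state.clear()
        state.update(snapshot)  # rollback: no change survives
        return False
```

In this sketch a failure of the second action also undoes the first, so the records never disagree with the actual occurrences.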

Distributed transaction processing involves a transaction that affects resources at more than one physical or logical location. In the above example, a transaction affects resources managed at the local automated teller device as well as bank balances managed by a bank's main computer. Such transactions involve one particular client computer (e.g., 10) communicating with one particular server computer (e.g., 20) over a series of client requests which are processed by the server. The OMG's OTS is responsible for coordinating these distributed transactions.

An application running on a client process begins a transaction which may involve calling a plurality of different servers, each of which will initiate a server process to make changes to its local data according to the instructions contained in the transaction. The transaction finishes by either committing the transaction (and thus all servers finalize the changes to their local data) or aborting the transaction (and thus all servers "roll back" or ignore the changes to their local data made during the transaction). To communicate with the servers during the transaction (e.g., instructing them to either commit or abort their part in the transaction) one of the processes involved must maintain state data for the transaction. According to the OTS standard, this involves the process setting up a series of objects, one of which is a coordinator object which coordinates the transaction with respect to the various servers.

The main purpose of this coordinator object is to keep track of which server objects are involved in the transaction, so that when the transaction is finished, each server object involved in the transaction can be told to commit the changes made locally to the local database associated with that server object, in a single unified effort. This ensures that no server object makes a data change final without the other server objects which are also involved in the same transaction doing so.

Thus, each server object which is to join a transaction must first register with the coordinator object so that the coordinator object will know of the server object's existence, its wish to join the transaction, and where to find the server object (e.g., which server machine the server object resides on) when it comes time to complete the transaction (where the coordinator object instructs all server objects to make the changes to their respective local data final).
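The bookkeeping role of the coordinator object described above might be sketched as follows. This is a simplified, single-process illustration: the class and method names are hypothetical, and the prepare phase that a full two-phase commit would include is omitted.

```python
class Resource:
    """Hypothetical server object holding local data."""

    def __init__(self):
        self.data = {}      # changes that have been made final
        self.pending = {}   # changes made during the transaction

    def update(self, key, value):
        self.pending[key] = value

    def commit(self):
        self.data.update(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()


class Coordinator:
    """Keeps track of which server objects are involved in the transaction."""

    def __init__(self):
        self.registered = []

    def register_resource(self, resource):
        if resource not in self.registered:
            self.registered.append(resource)

    def commit(self):
        # Tell every registered object to make its changes final.
        for resource in self.registered:
            resource.commit()
        self.registered.clear()

    def rollback(self):
        # Tell every registered object to discard its changes.
        for resource in self.registered:
            resource.rollback()
        self.registered.clear()
```

Because completion is driven from the list of registered objects, a server object that has not registered would never be told the outcome, which is why registration must come first.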

A server object responsible for updating data (referred to hereinbelow as a resource object) gets involved in a transaction when another server object (or the original client object which started the transaction) sends a request to the resource object for the resource object to do some work. This request carries some information, called the transaction context, to inform the resource object that the request is part of a transaction. Once a resource object finds out that it is to be involved in a transaction, it then makes a registration request with the coordinator object.
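As an illustrative sketch only, a resource object might react to the transaction context carried by a request as shown below. All names are hypothetical; in particular, the real transaction context carries more than a reference to the coordinator.

```python
class Coordinator:
    def __init__(self):
        self.registered = []

    def register_resource(self, resource):
        self.registered.append(resource)


class TransactionContext:
    """Information carried with a request that is part of a transaction."""

    def __init__(self, coordinator):
        self.coordinator = coordinator


class ResourceObject:
    def __init__(self):
        self.data = {}
        self.joined = []  # contexts this object has already registered under

    def handle_request(self, key, value, context=None):
        # Register once, the first time a given transaction context is seen.
        if context is not None and context not in self.joined:
            context.coordinator.register_resource(self)
            self.joined.append(context)
        self.data[key] = value
```

A request that carries no context is treated here as non-transactional work, so no registration request is made.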

When the resource object is located in a different operating system process from the coordinator object, it has been found to be useful to use a subordinate coordinator object (222 in Fig. 2) located in the same operating system process as the resource object (223 or 224). The main coordinator object is then called the "superior coordinator object" 211.

During registration of a resource object 223 to the transaction, the subordinate coordinator 222 is set up locally inside the server machine 22 which houses the resource object 223 and the resource object 223 communicates directly with this subordinate coordinator object 222 when it makes a registration request. (It should be noted that while the term "server machine" is used here, the term "server process" could also be used, to thus indicate that the distributed server objects could, in fact, be located on the same server machine but on different operating system processes running on the server machine, and hereinafter the term "server" will be used to refer to both terms.) The subordinate coordinator 222, in turn, registers itself with the superior coordinator object 211 (which is located in another process, possibly on another server machine) as if it were a resource object.

The subordinate coordinator object 222 thus provides a representation of the existence of the transaction within the server housing the resource object. Instead of communicating directly with the superior coordinator object 211, the resource objects 223 and 224 first communicate with their local subordinate coordinator object 222 which in turn communicates with the superior coordinator object. This greatly reduces the number of cross-operating-system-process calls.
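The reduction in cross-process calls can be illustrated with the following sketch, in which a counter stands in for calls that would cross an operating system process boundary. The classes and the counter are assumptions made for this example only.

```python
class Resource:
    def __init__(self):
        self.state = "active"

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


class SuperiorCoordinator:
    def __init__(self):
        self.participants = []
        self.cross_process_calls = 0  # stand-in for remote invocations

    def register_resource(self, participant):
        self.cross_process_calls += 1  # a registration crosses processes
        self.participants.append(participant)

    def commit(self):
        for participant in self.participants:
            self.cross_process_calls += 1  # one remote call per participant
            participant.commit()


class SubordinateCoordinator:
    """Local representative of the transaction inside the resource's server."""

    def __init__(self, superior):
        self.superior = superior
        self.resources = []
        self.registered_with_superior = False

    def register_resource(self, resource):
        self.resources.append(resource)  # local call, same process
        if not self.registered_with_superior:
            # Register with the superior as if this were a resource object.
            self.superior.register_resource(self)
            self.registered_with_superior = True

    def commit(self):
        for resource in self.resources:
            resource.commit()

    def rollback(self):
        for resource in self.resources:
            resource.rollback()
```

With two local resource objects, this sketch makes one cross-process registration and one cross-process commit call, rather than two of each.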

A problem, however, exists in such systems. If the operating system process containing the superior coordinator object 211 fails during a transaction, the resource objects 223, 224 could be left waiting while holding locks to valuable server resources for a long period of time, until the operating system process recovers from the failure so that the superior coordinator can instruct its subordinate coordinators that the transaction should be rolled back (cancelled) and the held locks on the server resources released. In the prior art, only upon such notification did the server resources release their locks for use by other transactions. Thus, the prior art system has been very inefficient in terms of unnecessarily tying up the resources of servers.

One possible solution is for the client application that starts the transaction to include a transaction timeout so that if the transaction goes on for longer than the timeout period, the transaction is terminated. However, this requires that the application writer take an additional step to put in this timeout. Further, the timeout is not an optimal solution as it could be set too long, resulting in resource locks still being held for too long a time period. Of course, if it is set too short, a transaction could be prematurely terminated before it has a chance to finish.
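The trade-off just described can be made concrete with a toy model. The function below and every number used with it are purely illustrative assumptions; it simply reports the outcome of a transaction and how long the resource locks were held under a fixed timeout.

```python
def outcome_with_timeout(timeout_s, work_s, coordinator_fails_at_s=None):
    """Return (result, seconds for which the resource locks were held)."""
    failed = (
        coordinator_fails_at_s is not None
        and coordinator_fails_at_s < min(work_s, timeout_s)
    )
    if failed:
        # Nothing tells the resources about the failure, so they keep
        # their locks until the timeout finally expires.
        return ("rolled back at timeout after failure", timeout_s)
    if work_s > timeout_s:
        # The timeout was set too short for the work actually needed.
        return ("aborted prematurely", timeout_s)
    return ("committed", work_s)
```

For instance, with a hypothetical 600 second timeout and a failure 5 seconds into the transaction, the model reports locks held for the full 600 seconds, whereas a 10 second timeout aborts a healthy transaction that needs 20 seconds of work.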

Accordingly, there is a great need in the art for a better way of determining when server resource locks can be released, in particular, when a transaction must be aborted due to operating system process failure.

Summary of the Invention

According to a first aspect, the present invention provides a first server in a client/server computing system for processing a distributed transaction, a superior coordinator is located in a second server of said system and a subordinate coordinator is located in said first server, the first server comprising: a server resource having local data associated therewith; and said subordinate coordinator having means for checking whether the operating system process of the superior coordinator has failed, and, upon detecting such a failure, commanding said server resource to roll back any changes said server resource has made to its local data since the beginning of the transaction.

Preferably, the means for checking periodically checks the superior coordinator's operating system process for a failure by sending a request to a recovery coordinator assigned to the resource.

According to a second aspect, the present invention provides a method operable in a client/server computing system for processing a distributed transaction, a superior coordinator is located in a first server of said system and a subordinate coordinator and a server resource having local data are located in a second server, the method, taking place at the second server, comprising steps of: checking whether the operating system process of a superior coordinator has failed; and upon detecting such a failure, commanding said server resource to roll back any changes said server resource has made to its local data during the carrying out of said transaction.

Preferably, the checking step involves calling a recovery coordinator object assigned to said server resource. Further preferably, the checking step includes periodically checking to determine whether a failure has occurred.

According to a third aspect, the invention provides a computer program product, stored on a computer-readable storage medium for, when run on a computer, performing a method in a client/server computing system for processing a distributed transaction, a superior coordinator is located in a first server of said system and a subordinate coordinator and a server resource having local data are located in a second server, the method, taking place at the second server, comprising steps of: checking whether the operating system process of a superior coordinator has failed; and upon detecting such a failure, commanding said server resource to roll back any changes said server resource has made to its local data during the carrying out of said transaction.

With the present invention, the subordinate coordinator associated with a resource checks whether the operating system process containing the superior coordinator object has failed, and takes the immediate action of commanding the resource to release locks and rollback any database changes upon detection of such failure. This has the strong advantage of freeing up locks on resources at an early time so that such resources can be used by other transactions.

Brief Description of the Drawings

The invention will be better understood from the description below of preferred embodiments thereof, to be read while referring to the following figures.

Figure 1 is a block diagram of a well-known heterogeneous client/server architecture using object technology, in the context of which preferred embodiments of the present invention can be applied;

Figure 2 is a block diagram showing the various objects instantiated within two co-transactional servers according to a conventional design in the context of which the preferred embodiment of the present invention is applied;

Figure 3 is another block diagram showing the various objects instantiated within two co-transactional servers according to a conventional design in the context of which the preferred embodiment of the present invention is applied; and

Figure 4 is a flowchart showing the steps involved within a server, according to a preferred embodiment of the present invention.

Detailed Description of the Embodiments

Fig. 3 is the same as Fig. 2 except that another object, called a recovery coordinator object 312, is shown. In general, and as is well known, a recovery coordinator object is created and assigned to a resource object when that resource object registers with a coordinator.

The usual function of the recovery coordinator object is that it receives a call from a resource object after a failure occurs with respect to the resource object or with respect to the resource object's operating system process and after the resource object or process has recovered from the failure. The recovery coordinator returns the status of the transaction to the resource object so that the resource object can attempt to rejoin the transaction. Thus, when a subordinate coordinator 322 registers (as if it were a resource object 323, 324) with the superior coordinator 311, it is given a recovery coordinator object 312. This latter object 312 is running in the same operating system process as the superior coordinator 311.

The preferred embodiment of the present invention makes use of the recovery coordinator object 312 and the steps taken at a server (e.g., server 32) will now be explained with reference to the flowchart of Fig. 4.

At step 41, the subordinate coordinator object 322 of server 32 calls the recovery coordinator object 312 to determine its status. In the preferred embodiment, this call is made periodically, using the object 312's CosTransactions::RecoveryCoordinator::replay_completion interface, which is part of the OTS standard cited above. If this call results in the reply CORBA::COMM_FAILURE being returned as an exception, then the subordinate coordinator 322 determines, at step 42, that a failure has occurred in the operating system process containing the superior coordinator object 311. The YES branch is then taken to step 43.

If, however, this failure reply is not received by the subordinate coordinator 322, the NO branch is taken at step 42 and control returns to step 41, where another call will be made. Again, in the preferred embodiment this call will take place after a predetermined time interval, but other embodiments are contemplated where the call takes place based on some criterion other than time (e.g., after a certain number of local data updates).

Assuming the YES branch is taken at step 42, the subordinate coordinator object 322 rolls back each of its registered resource objects 323, 324. That is, since the subordinate coordinator object 322 has found out that the operating system process containing the superior coordinator object 311 has failed, the subordinate coordinator object 322 does not need to wait until the failed operating system process recovers.

When the failed operating system process eventually recovers, it will instruct all of its subordinate coordinators (including coordinator 322) to roll back all of their resource objects to the initial state of the transaction. However, if rollback is delayed until this point, the resource objects 323, 324 will still be left holding valuable locks on their local data (resources). The present invention avoids this undesirable result.

Specifically, the subordinate coordinator object 322 immediately commands its associated resource objects 323, 324 to rollback any changes they have made to their local databases during the failed transaction.
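Purely as an illustration, steps 41 to 43 of Fig. 4 might be sketched as follows. The sketch does not use a real ORB: CommFailure is a local stand-in for CORBA::COMM_FAILURE, FakeRecoveryCoordinator is a hypothetical test double for object 312, and the max_polls and sleep parameters are assumptions added so that the loop terminates in a demonstration.

```python
import time


class CommFailure(Exception):
    """Local stand-in for the CORBA::COMM_FAILURE exception."""


class FakeRecoveryCoordinator:
    """Answers a fixed number of calls, then acts as if its process had died."""

    def __init__(self, healthy_calls):
        self.healthy_calls = healthy_calls

    def replay_completion(self, resource):
        if self.healthy_calls <= 0:
            raise CommFailure("superior coordinator process unreachable")
        self.healthy_calls -= 1
        return "active"


class Resource:
    def __init__(self):
        self.locked = True
        self.pending = {"balance": 90}

    def rollback(self):
        self.pending.clear()  # discard changes made during the transaction
        self.locked = False   # release the lock for other transactions


class SubordinateCoordinator:
    def __init__(self, recovery_coordinator, resources, interval_s=1.0):
        self.recovery_coordinator = recovery_coordinator
        self.resources = resources
        self.interval_s = interval_s

    def superior_has_failed(self):
        # Steps 41 and 42: call the recovery coordinator and inspect the reply.
        try:
            self.recovery_coordinator.replay_completion(self)
        except CommFailure:
            return True
        return False

    def monitor(self, max_polls, sleep=time.sleep):
        for _ in range(max_polls):
            if self.superior_has_failed():
                # Step 43: roll back every locally registered resource.
                for resource in self.resources:
                    resource.rollback()
                return True
            sleep(self.interval_s)  # wait for the predetermined interval
        return False
```

In this sketch the resources release their locks as soon as a poll fails, without waiting for the superior coordinator's process to recover.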

Then, when the failed operating system process containing the superior coordinator eventually recovers and the superior coordinator instructs a rollback to all of its registered subordinate coordinators, the subordinate coordinator that has already commanded its locally registered resource objects to rollback and release locks can simply ignore the superior coordinator's rollback command.
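One way of ignoring the later rollback command is to have the subordinate coordinator remember that it has already rolled back. The following sketch assumes a simple flag for this purpose; the flag and the class names are hypothetical and are not prescribed by the description above.

```python
class Resource:
    def __init__(self):
        self.rollback_count = 0

    def rollback(self):
        self.rollback_count += 1


class SubordinateCoordinator:
    def __init__(self, resources):
        self.resources = resources
        self.rolled_back = False

    def rollback(self):
        """Roll back local resources once; ignore any repeated command."""
        if self.rolled_back:
            return False  # already done locally, nothing further to do
        for resource in self.resources:
            resource.rollback()
        self.rolled_back = True
        return True
```

The first call would correspond to the early rollback triggered by failure detection, and a second call to the command eventually issued by the recovered superior coordinator.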

While the recovery coordinator object 312 has been described in the preferred embodiment as the element which is contacted by the subordinate coordinator object 322 to determine the status of the operating system process containing the superior coordinator object 311, any other element capable of providing such status could also be used.

Claims

1. A first server in a client/server computing system for processing a distributed transaction, a superior coordinator is located in a second server of said system and a subordinate coordinator is located in said first server, the first server comprising: a server resource having local data associated therewith; and said subordinate coordinator having means for checking whether the operating system process of the superior coordinator has failed, and, upon detecting such a failure, commanding said server resource to roll back any changes said server resource has made to its local data since the beginning of the transaction.
2. The server of claim 1 wherein said means periodically checks the superior coordinator's operating system process for a failure by sending a request to a recovery coordinator assigned to the resource.
3. A method operable in a client/server computing system for processing a distributed transaction, a superior coordinator is located in a first server of said system and a subordinate coordinator and a server resource having local data are located in a second server, the method, taking place at the second server, comprising steps of: checking whether the operating system process of a superior coordinator has failed; and upon detecting such a failure, commanding said server resource to roll back any changes said server resource has made to its local data during the carrying out of said transaction.
4. The method of claim 3 wherein the checking step involves calling a recovery coordinator object assigned to said server resource.
5. The method of claim 3 wherein said checking step includes periodically checking to determine whether a failure has occurred.
6. A computer program product stored on a computer-readable storage medium for, when run on a computer, performing a method in a client/server computing system for processing a distributed transaction, a superior coordinator is located in a first server of said system and a subordinate coordinator and a server resource having local data are located in a second server, the method, taking place at the second server, comprising steps of: checking whether the operating system process of a superior coordinator has failed; and upon detecting such a failure commanding said server resource to roll back any changes said server resource has made to its local data during the carrying out of said transaction.
7. The program product of claim 6 wherein the checking step involves calling a recovery coordinator object assigned to said server resource.
8. The program product of claim 6 wherein said checking step includes periodically checking to determine whether a failure has occurred.