WO2001050263A1

WO2001050263A1 - System and method for device failure recognition

Info

Publication number: WO2001050263A1
Application number: PCT/US2000/035722
Authority: WO
Inventors: William Gaske
Original assignee: Computer Associates Think Inc
Current assignee: CA Inc
Priority date: 1999-12-30
Filing date: 2000-12-29
Publication date: 2001-07-12
Anticipated expiration: 2002-06-30
Also published as: AU2612501A

Abstract

A method of monitoring a device, the method including sending a plurality of monitoring requests (S2) to the device and processing responses to the plurality of monitoring requests after a response to each monitoring request has been received (S10) or the request has timed out (S16), wherein processing is not performed for any of the monitoring requests unless a valid response to at least one of the plurality of monitoring requests is received.

Description

SYSTEM AND METHOD FOR DEVICE FAILURE RECOGNITION

Reference to Related Application

The present application claims the benefit of provisional Application Serial No.

60/174,085 filed on December 30, 1999, which is hereby incorporated herein by reference.

BACKGROUND OF THE DISCLOSURE

1. Field

The present disclosure relates to device monitoring, and in particular to a system and

method for device failure recognition.

2. Description of Related Art

Many network monitoring and management software applications have the ability to

monitor many different components within a device such as: physical network interface

availability (ping, echo or diagnostic socket request); TCP/IP port connectivity (ability to

connect to a port); system event log and system error log files; application log files;

services or processes running on a device; and physical performance data, for example.

In order to communicate efficiently with multiple devices and objects, most operating

systems today implement the concept of a thread. A thread is a code sequence that runs multi-

tasked with other threads on either a uni-processor or multi-processor operating system. Many system operations, and in particular, calls to remote devices including monitoring requests,

require a period of time in which the task performing the request is blocked from executing until

the request is answered or otherwise completes. However, the processor can still be processing

other threads while the blocked thread is inactive. These system operations often include

monitoring operations for monitoring components within a device such as those mentioned

above. Threads provide an efficient way to improve the throughput and scalability of

monitoring, by allowing multiple monitoring requests to execute concurrently, each in different

threads. It is thus possible to quickly perform multiple monitoring requests. An example of a

monitoring operation that can be performed by a thread is a "ping" to remote network devices.

An example of a ping is a UNIX utility used to determine whether a specified address can be

reached. A ping command can use the Internet Control Message Protocol (ICMP) to determine if

a node can respond.

In some of the types of monitored components listed above (such as physical network

interface availability monitoring utilizing ping type requests) the monitoring application verifies

availability of a device or interface using network APIs (application program interfaces). APIs

are generally application object methods which allow external applications to access object

features in the application or operating system providing the API.

In other types of monitoring, such as the monitoring of system event logs, log files, and

services or process information, APIs which are specific to the OS (Operating System) platform

use some type of remote procedure call (RPC) mechanism to gather information. RPCs enable

systems to search for information from either the system making the RPC (a client) or from a

remote system or server. In essence, RPCs are protocols that enable computers to transmit data or to request services of other computers or devices.

Regardless of the type of remote call that is executed, in most cases the actual monitoring

request is blocked with respect to the current execution path. In other words, after a task makes a

request (for example, to open a remote event log), the task is blocked until the call completes. It

is possible that on a single device, many instances of monitoring of the above-mentioned

components can exist. For example, on a single device one or more interfaces may be

monitored, one or more ports may be monitored, one or more log files may be monitored, one or

more application log files may be monitored, one or more services and processes may be

monitored, and/or one or more types of physical discrete performance data may be gathered, etc.

If there are many instances of the above components being monitored on a single device,

when the device catastrophically fails (such as a failure which is followed by a cold start), each

of the monitoring methods will fail within varying periods of time based on the complexity and

code path of the API being used to perform the monitoring. This may require the monitoring

system to be tied up for substantial lengths of time. Accordingly, optimal methods are required

for determining whether a particular device is completely down and for suppressing the

processing of the failure of each of the individual calls when a catastrophic failure of the device

occurs.

There are several methods which might be used to accomplish this goal. One method

may be to correlate failure of specific requests to a single device, and thereby establish whether

the failures are due to a catastrophic failure of the device itself. Another method may select a

specific monitoring type (for example, a specific interface Ping), which when failed can be used

to determine that the device is down. Unfortunately, neither of these methods is optimal. In the first case, correlating failure of

the requests may itself take a considerable amount of time and processing. In the second case, it

is difficult to select just a single service as the critical service on a device. In each case it may be

possible that a specific service can be unavailable without the device actually being down.

SUMMARY

A method of monitoring a device, the method including sending a plurality of monitoring

requests to the device and processing responses to the plurality of monitoring requests after a

response to each monitoring request has been received or the request has timed out, wherein

processing is not performed for any of the monitoring requests unless a valid response to at least

one of the plurality of monitoring requests is received. The method may further comprise

discarding any responses to monitoring requests if a valid response to at least one of the plurality

of monitoring requests is not received. Each of the plurality of monitoring requests may run via

a respective monitoring thread. If a valid response to at least one of the plurality of monitoring

requests is received, each of the respective monitoring threads may be instructed to process

results of their respective requests. If all of the monitoring requests failed or timed out, each of

the respective monitoring threads may be instructed to discard results of their respective requests.

The method may further include running a status checking thread, wherein if a valid response to

at least one of the plurality of monitoring requests is received, information indicating the valid

response may be forwarded to the status checking thread. The plurality of monitoring requests

may be sent to the device at substantially the same time. The plurality of requests sent to the device may be for monitoring at least one of physical network interface availability, TCP/IP port

connectivity, system event log files, system error log files, application log files, services running

on a device, processes running on the device and physical performance data. The method may

further include starting a timer at substantially the same time the plurality of monitoring requests

are sent, wherein a monitoring request may be considered to have timed out, if a response is not

received within a predefined amount of time from when the timer was started.

A monitoring system for monitoring at least one device on a network may include storage

for storing information and a processing system capable of running at least one monitoring

application, the at least one monitoring application sending a plurality of monitoring requests to

the device and processing responses to the plurality of monitoring requests only after a response

to each monitoring request has been received or the monitoring request times out, wherein

processing of responses is not performed for any of the monitoring requests unless a valid

response to at least one of the plurality of monitoring requests is received.

A storage medium storing a monitoring application includes computer executable code

for running a plurality of monitoring threads, each monitoring thread monitoring a request made

to a remote device and computer executable code for running a status checking thread, wherein

after responses to all requests have been received or the requests time out, if any of the requests

have received a valid response, each of the monitoring threads process the results of their

respective requests.

A method of monitoring a device includes sending a plurality of monitoring requests to

the device, each monitoring request having a time out time associated therewith, wherein a

monitoring request is considered to have timed out, if a response to the request is not received prior to the time out time expiring and processing responses to the plurality of monitoring

requests only after a response to each monitoring request has been received or the monitoring

request times out, wherein processing of responses is not performed for any of the monitoring

requests unless a valid response to at least one of the plurality of monitoring requests is received.

A method of monitoring a device may include sending a set of monitoring requests to the

device, the set of monitoring requests being commenced in multiple threads at substantially the

same time, wherein each of these requests times out if no response is received within a set

amount of time and wherein responses to any requests, whether a success or a failure, are not

processed until all requests in the set complete or time out, and wherein at a completion or

time-out of all requests, if none of the requests completed successfully, then none of the individual

responses are treated as failures, but rather the entire device is considered down, and the results

of the requests are not processed.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete appreciation of the present embodiments and many of the attendant

advantages thereof will be readily obtained as the same becomes better understood by

reference to the following detailed description when considered in connection with the

accompanying drawings, wherein:

Fig. 1 is a block diagram of a network including a monitoring system according to an embodiment;

Figs. 2A, 2B and 2C depict a flow chart for describing a monitoring method according to an embodiment;

Fig. 3 is a block diagram of a monitoring system according to an embodiment;

Fig. 4 is a block diagram depicting exemplary components capable of being monitored;

Figs. 5A, 5B and 5C depict a flowchart for describing a monitoring method according to

another embodiment; and

Fig. 6 depicts a monitoring queue according to an embodiment.

DETAILED DESCRIPTION

In describing the preferred embodiments illustrated in the drawings, specific

terminology is employed for sake of clarity. However, the disclosed embodiments are not

intended to be limited to the specific terminology so selected and it is to be understood that

each specific element includes all technical equivalents which operate in a similar manner.

Fig. 1 depicts an exemplary network to which the present system and method may be

applied. For example, in this embodiment, a monitoring system 102 may be connected to remote

devices such as a network printer 104, network facsimile device 106, server 108, computer

workstations 110, 112 and a router 118, via network 116. Applications running on monitoring

system 102 can communicate with these other remote devices on the network 116 via suitable

interfaces and protocols determined based on the operating systems on the network and the

network architecture used. Network 116 may be, for example, a local area network (LAN). Of

course, the present system and method may be implemented on other types of networks such as,

for example, a wide area network (WAN), the Internet, etc. According to the present

embodiments, monitoring system 102 includes a monitoring application that may be used to

monitor one or more of the remote devices on network 116. Of course, the monitoring application can be provided on one of the other remote devices shown in Fig. 1 or on a device

remote to network 116 and used to monitor one or more of the other devices on network 116. A

monitoring device may itself be monitored by another device running a monitoring application.

It is assumed hereinafter that suitable protocols and network operating systems are in

place and that the interfaces, files, ports and other components of the remote devices can be

accessed through APIs or appropriate RPCs. Such arrangements are very well known and utilized

on most networked devices today.

Monitoring system 102 may be a standard PC, laptop, mainframe, etc. capable of running

a monitoring application according to an embodiment described herein. Fig. 3 depicts a block

diagram of exemplary components monitoring system 102 may include. Of course, monitoring

system 102 may not include each component shown and/or may include additional components

not shown. As shown, monitoring system 102 may include a central processing unit (CPU) 2, a

memory 4, a clock circuit 6, a printer interface 8, a display unit 10, a LAN data transmission

controller 12, a LAN interface 14, a network controller 16, an internal bus 18 and one or more

input devices 20 such as, for example, a keyboard and mouse.

CPU 2 controls the operation of system 102 and is capable of running applications stored

in memory 4. Memory 4 may include, for example, RAM, ROM, removable CDROM, DVD,

etc. Memory 4 may also store various types of data necessary for the execution of the

applications, as well as a work area reserved for use by CPU 2. Clock circuit 6 may include a

circuit for generating information indicating the present time, and may be capable of being

programmed to count down a predetermined or set amount of time.

The LAN interface 14 allows communication between the network 116, which in this example is a LAN, and the LAN data transmission controller 12. The LAN data transmission

controller 12 uses a predetermined protocol suite to exchange information and data with the other

devices on network 116. Monitoring system 102 may also be capable of communicating with

and/or monitoring devices on other networks via router 118. System 102 may also be capable of

communicating with other devices via a Public Switched Telephone Network (PSTN) using

network controller 16. Internal bus 18, which may actually consist of a plurality of buses,

allows communication between each of the components connected thereto.

Each of the devices on network 116, including server 108, computers 110, 112, facsimile

106, printer 104 and router 118, may include one or more of the components shown in Fig. 4

which are capable of being monitored by system 102. Of course, these are just examples of the

types of components that may be monitored and the present embodiments envision the

monitoring of any component capable of being monitored. As shown in Fig. 4, a device may

include one or more system event logs 40, one or more system error log files 42, one or more

application log files 44, TCP/IP port (connectivity) information 46, services/processes running

status information 48, physical performance data 50 and/or physical network interface

(availability) information 52.

Monitoring system 102 may include an operating system capable of running different

applications independently and at the same time. In this type of operating system, an application

can spawn multiple execution threads each of which runs independently at the same time. Each

of these threads is capable of performing a monitoring request for monitoring one of the

components shown in Fig. 4, for example. Accordingly, system 102 is capable of simultaneously

executing multiple threads, each for monitoring a different component on a device being monitored. Each thread performing a monitoring request will typically execute until a blocking

request is made which is dependent on the asynchronous completion of a remote procedure call

(RPC), which will then return its results. At the same time, other threads can be executing and

can either execute until they block or time-out, or can be processing (e.g., storing) the response

of a completed monitoring request.

It is often desirable to determine when a catastrophic failure occurs on a device being

monitored. To achieve this and other benefits, the monitoring application design of the present

embodiment starts a set of monitoring requests to a specific device in multiple threads

(monitoring threads) as a main group and at substantially the same time. Ideally, a basic interval

will be set, at the end of which each monitoring thread in the main group will have begun. A

status checking thread can be commenced at substantially the same time as the monitoring

threads for checking the status of the monitoring threads.

When a monitoring request has results returned from the device, the results are stored by

the monitoring thread performing the monitoring request. The monitoring thread also forwards

information indicating success of the request to the status checking thread. The status checking

thread waits for responses from each of the monitoring threads. Theoretically, a monitoring

thread may wait indefinitely for a result to be returned from the device being monitored.

According to an embodiment described herein, in order to avoid an indefinite wait state, a timer

is set to a defined amount of time. After this time expires, it is determined that a response from

the device is not forthcoming. After the time expires, the monitoring request or requests that

have not received results are designated as having timed out. If one or more requests time out, an

indication is stored by the associated monitoring thread(s), and this information is also forwarded to the status checking thread.

Once the status checking thread has received all the success/failure messages for the

different monitoring requests or information indicating that the request(s) have timed out, it

examines these results to determine the status of the requests. If a request receives a response

indicating a failure (e.g., an error, failure or invalid message), the request is considered to have

received a failed response. In addition, a request that has timed out, for purposes of this

description, is also considered a failed response. If all requests receive failed responses, the

device is considered down and a message is sent to all of the processing threads to terminate, and

to not process their results. If one or more of the threads received a valid response from the

device, responses to each of the threads are processed in a normal manner.
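
The decision rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the names `Response`, `device_is_down` and `results_to_process` are hypothetical, and each request is assumed to have already been classified as valid, failed or timed out.

```python
from enum import Enum


class Response(Enum):
    VALID = "valid"
    FAILED = "failed"        # error, failure or invalid message from the device
    TIMED_OUT = "timed out"  # counted as a failed response for this decision


def device_is_down(responses):
    """Return True when no request in the group received a valid response."""
    return not any(r is Response.VALID for r in responses.values())


def results_to_process(responses):
    """Discard every result if the device is down; otherwise keep them all."""
    return {} if device_is_down(responses) else dict(responses)
```

Note that a single valid response is enough for every result in the group, including the individual failures, to be processed normally.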

A method for monitoring according to an embodiment is described below by reference to

Figures 2A-2C. In step S1, the monitoring application provided on system 102 is started. The

monitoring application may be manually started by an operator or system 102 may be

programmed to automatically run the monitoring application periodically or at a set time each

day, week, etc. In step S2, a set of the specific types of monitoring requests to be performed can

be input by an operator and/or a predetermined set of monitoring requests can be retrieved from

memory 4 (Fig. 3). In step S3, an overall monitoring time can be input and set by the operator or

retrieved from memory and used to set a time-out timer (e.g., Figure 3, clock circuit 6).

According to this embodiment, each monitoring request in the set of requests is performed by a

separate monitoring thread. In step S4, each of the monitoring request threads is commenced at

substantially the same time. At substantially the same time, a status checking thread is also

commenced. In step S6, the time-out timer is started. Each of the monitoring threads then monitors for a result to its request or for a time-out (Step S8). If a monitoring result is received,

(Yes, Step S10), the result is forwarded by the monitoring thread to the status checking thread

and stored (Step S14) and monitoring continues. If a monitoring thread result is not received

(No, Step S10), a determination is made whether the time-out timer has timed out. If the

time-out timer has not timed out (No, Step S16), the process returns to Step S8 and monitoring

continues. If the time-out timer has timed out (Yes, Step S16), monitoring requests that have not

received a result are considered to have timed out. Information identifying the monitoring

requests that have not received a result is then forwarded by each of the respective monitoring

threads to the status checking thread and stored (Step S18). The status checking thread then

examines the request results (Step S22). In Step S24, a determination is made whether each of

the monitoring requests have received "failed responses", either having received an error, failure

or invalid message from the device being monitored or having timed out. If all monitoring

requests have received "failed responses" (Yes, Step S24), it is determined that a catastrophic

failure of the device has occurred and all threads are terminated (Step S26). Notification of the

catastrophic failure may then be reported, for example, to an operator via the display or other

output device and/or stored. If one or more of the monitoring requests did not fail, a message is

sent to all monitoring threads to process all results (Step S28) and the results of the monitoring

requests are processed in a normal manner and can be reported, for example, to the operator via

the display or other output device and/or stored.
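
The flow of Figs. 2A-2C can be approximated by the following Python sketch, offered as one possible reading rather than the actual implementation. It assumes each monitoring request is a hypothetical callable (a "probe") that returns a result or raises on failure, uses a `queue.Queue` to stand in for forwarding results to the status checking thread, and uses a single shared deadline as the time-out timer. Python cannot forcibly terminate a thread, so requests that are still blocked at the deadline are simply left behind as daemon threads.

```python
import queue
import threading
import time


def monitor_device(probes, overall_timeout):
    """Run all probes concurrently and apply the all-or-nothing rule.

    probes: dict mapping a request name to a callable (hypothetical interface).
    overall_timeout: seconds allowed for the whole group of requests.
    """
    reports = queue.Queue()  # monitoring threads forward their outcome here

    def run(name, probe):
        try:
            reports.put((name, True, probe()))
        except Exception as exc:  # error, failure or invalid response
            reports.put((name, False, exc))

    # Start every monitoring thread at substantially the same time.
    for name, probe in probes.items():
        threading.Thread(target=run, args=(name, probe), daemon=True).start()

    # Status checking: collect reports until all arrive or the timer expires.
    deadline = time.monotonic() + overall_timeout
    results = {}
    while len(results) < len(probes):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            name, ok, payload = reports.get(timeout=remaining)
        except queue.Empty:
            break
        results[name] = (ok, payload)

    # Requests without a result are considered to have timed out.
    for name in probes:
        results.setdefault(name, (False, "timed out"))

    # No valid response at all: treat the device as down, discard results.
    if not any(ok for ok, _ in results.values()):
        return "device down", {}
    return "process", results
```

A caller would pass one probe per monitored component and act on the returned verdict, reporting a single device failure instead of one failure per request when the verdict is "device down".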

Each of the components capable of being monitored may not take the same amount of

time to respond to a monitoring request. Fig. 5 depicts a method for monitoring according to

another embodiment, which takes into account these variations in monitoring times. According to this next embodiment, each monitoring request has a time-out time associated therewith. For

example, each monitoring request is monitored for a result. If a result is not received within the

time-out time associated with that monitoring request, a time-out indication is provided to the

monitoring thread.
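
As a rough illustration of per-request time-outs, the sketch below classifies a single request against an elapsed time counter. The `Request` fields and the `status_at` function are hypothetical names introduced for this example, and the arrival time of a response is supplied as data so that the logic can be followed without any real network calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    name: str
    timeout: float                        # time-out for this request, in seconds
    responded_at: Optional[float] = None  # elapsed time of the response, if any
    valid: bool = False                   # whether that response was valid


def status_at(req, elapsed):
    """Classify a request given the current elapsed time counter value."""
    answered = req.responded_at is not None and req.responded_at <= elapsed
    if answered and req.responded_at <= req.timeout:
        return "valid" if req.valid else "failed"
    if elapsed >= req.timeout:
        return "timed out"  # no result within this request's own time-out
    return "pending"
```

Because each request carries its own time-out, a slow component such as a large log file can be given more time than a simple ping without delaying the time-out decision for the faster requests.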

In this embodiment, each monitoring request is arranged in a monitor queue such as that

shown in Fig. 6 and each is individually monitored in turn for either a result or a time-out

indication. As shown in Fig. 6, each monitoring request includes a "Time-out" time associated

therewith. Referring now to Figs. 5A-5C, in step S50, the monitoring application provided on

system 102 is started. The monitoring application may be manually started by an operator or

system 102 may be programmed to automatically run the monitoring application periodically or

at a set time each day, week, etc. In step S52, a set of the specific types of monitoring requests to

be performed can be input by an operator and/or a predetermined set of monitoring requests can

be retrieved from memory 4 (Fig. 3). In step S53, a monitoring time can be input and set by the

operator or retrieved from memory for each monitoring request. Each monitoring request in the

set of requests is performed by a separate monitoring thread. In step S54, each of the monitoring

request threads is commenced at substantially the same time. At substantially the same time, a

status checking thread is also commenced. In step S56, the time-out timer is started. In this

embodiment, the time-out timer may be, for example, an elapsed time counter. Each of the

monitoring threads then monitors for a result to its request or for a time-out (Step S58). The

status checking thread then monitors for results from the monitoring threads. In Step S60, a

determination is made whether a result to the first monitoring request in the monitor queue has

been received. If a result has been received, (Yes, Step S60), information indicating the result (e.g., valid message or error message) is forwarded to the status checking thread and stored (Step

S62) and the next monitoring request in the queue is selected (Step S64). That monitoring

request is then monitored for a result (Step S58). If a result for the monitoring request is not

received (No, Step S60), a determination is made whether the request has timed out. This

determination can be made by determining whether the time-out time associated with that

monitoring request as shown in Fig. 6, has expired, by referring to the elapsed time counter

started in Step S56. If the request has not timed out (No, Step S66), the next monitoring request

in the queue is selected (Step S67) and it is checked whether a result for that request has been

received (Step S58). On the other hand, if the monitoring request timed out (Yes, Step S66),

time-out information identifying the monitoring request that timed out is forwarded to the status

checking thread and that monitoring request is, in effect, removed from the monitoring queue. In

Step S70, a determination is made whether each monitoring request in the queue has either

received a result or has timed out. If No in Step S70, the next monitoring request in the queue is

selected (Step S67) and the process returns to Step S58. If all monitoring requests have either

received a result or timed out (Step S70), the status checking thread then examines the request

results (Step S72). In Step S74, a determination is made whether all requests have failed, either

having received an error or invalid message from the device being monitored or having timed

out. If all requests failed (Yes, Step S74), it is determined that a catastrophic failure of the

device has occurred and all processing threads are terminated (Step S76). Notification may then

be provided to an operator and/or stored. If one or more of the requests did not fail, a message is

sent to all processing threads to process all results (Step S78) and the results are processed in a

normal manner.

If certain types of monitoring requests are required more frequently than other types to a

particular device, extra calls can be commenced independently of the main group and either

monitored with the main group or monitored independently from but in a manner similar to that

performed on the main group. If certain requests are required less frequently, they can be made

on alternate start schedules, or whatever other proportion is desired. Alternatively, a schedule of

start times can be set up, and any particular monitoring request or requests can be assigned to the

nearest start time, with those starting at substantially the same time being monitored as a group.
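
The start-time scheduling alternative mentioned above might look like the following sketch, which assigns each request to its nearest scheduled start time so that requests sharing a start time can be monitored as one group. The function name `group_by_nearest_start` and the use of plain numbers (for example, seconds) for times are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_nearest_start(desired, start_times):
    """Map each scheduled start time to the requests assigned to it.

    desired: dict mapping a request name to its preferred start time.
    start_times: the available scheduled start times.
    """
    groups = defaultdict(list)
    for name, wanted in desired.items():
        # Pick the scheduled time closest to the preferred one;
        # on a tie, the earlier entry in start_times wins.
        nearest = min(start_times, key=lambda s: abs(s - wanted))
        groups[nearest].append(name)
    return dict(groups)
```

Each resulting group can then be started together and evaluated with the same all-or-nothing rule as the main group.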

The present embodiments may be conveniently implemented using one or more

conventional general purpose digital computers and/or servers programmed according to the

teachings of the present specification. Appropriate software coding can readily be prepared by

skilled programmers based on the teachings of the present disclosure. The present

embodiments may also be implemented by the preparation of application specific integrated

circuits or by interconnecting an appropriate network of conventional component circuits.

Numerous additional modifications and variations of the present embodiments are

possible in view of the above teachings. It is therefore to be understood that within the scope

of the appended claims, the present embodiments may be practiced other than as specifically

described herein.

Claims

WHAT IS CLAIMED IS:

1. A method of monitoring a device, said method comprising:

sending a plurality of monitoring requests to the device; and

processing responses to the plurality of monitoring requests after a response to each

monitoring request has been received or the request has timed out, wherein processing is not

performed for any of the monitoring requests unless a valid response to at least one of the

plurality of monitoring requests is received.

2. A method as recited in claim 1, further comprising discarding any responses to

monitoring requests if a valid response to at least one of the plurality of monitoring requests is

not received.

3. A method as recited in claim 1, wherein each of the plurality of monitoring requests is

run via a respective monitoring thread.

4. A method as recited in claim 3, wherein if a valid response to at least one of the plurality

of monitoring requests is received, each of the respective monitoring threads are instructed to

process results of their respective requests.

5. A method as recited in claim 4, wherein if all of the monitoring requests failed or timed out, each of the respective monitoring threads is instructed to discard results of their respective

requests.

6. A method as recited in claim 4, further comprising a status checking thread, wherein if a

valid response to at least one of the plurality of monitoring requests is received, information

indicating the valid response is forwarded to the status checking thread.

7. A method as recited in claim 1, wherein the plurality of monitoring requests are sent to

the device at substantially the same time.

8. A method as recited in claim 1, wherein the plurality of requests sent to the device are for

monitoring at least one of physical network interface availability, TCP/IP port connectivity,

system event log files, system error log files, application log files, services running on a device,

processes running on the device and physical performance data.

9. A method as recited in claim 1, further comprising starting a timer at substantially the

same time the plurality of monitoring requests are sent, wherein a monitoring request is

considered to have timed out, if a response is not received within a predefined amount of time

from when the timer was started.

10. A monitoring system for monitoring at least one device on a network, said system

comprising: storage for storing information;

a processing system capable of running at least one monitoring application, the at least

one monitoring application sending a plurality of monitoring requests to the device and

processing responses to the plurality of monitoring requests only after a response to each

monitoring request has been received or the monitoring request times out, wherein processing of

responses is not performed for any of the monitoring requests unless a valid response to at least

one of the plurality of monitoring requests is received.

11. A system as recited in claim 10, wherein said processing system discards results of all of

the monitoring requests if a valid response to at least one of the plurality of monitoring requests

is not received.

12. A system as recited in claim 10, wherein each of the plurality of monitoring requests is

run via a respective monitoring thread by the processing system.

13. A system as recited in claim 12, wherein if a valid response at least one ofthe plurality of

monitoring requests is received, each ofthe respective monitoring tlireads are instructed to

process results of their respective requests.

14. A system as recited in claim 13, wherein if all of the monitoring requests failed or timed

out, each of the respective monitoring threads is instructed to discard results of their respective

requests.

15. A system as recited in claim 13, wherein the processing system runs a status checking

thread, wherein if a valid response to at least one of the plurality of monitoring requests is

received, information indicating the valid response is forwarded to the status checking thread.

16. A system as recited in claim 10, wherein the plurality of monitoring requests are sent to

the device at substantially the same time.

17. A system as recited in claim 10, wherein the plurality of requests sent to the device are

for monitoring at least one of physical network interface availability, TCP/IP port connectivity,

processes running on the device and physical performance data.

18. A storage medium storing a monitoring application, comprising:

computer executable code for running a plurality of monitoring threads, each monitoring

thread monitoring a request made to a remote device;

computer executable code for running a status checking thread, wherein after responses to

all requests have been received or the requests time out, if any of the requests have received a

valid response, each of the monitoring threads processes the results of its respective request.

19. A storage medium as recited in claim 18, wherein the computer executable code for

running the plurality of monitoring threads starts each of the plurality of monitoring threads at

substantially the same time.

20. A method of monitoring a device, said method comprising:

sending a plurality of monitoring requests to the device, each monitoring request having a

time out time associated therewith, wherein a monitoring request is considered to have timed out,

if a response to the request is not received prior to the time out time expiring; and

responses is not performed for any of the monitoring requests unless a valid response to at least

one of the plurality of monitoring requests is received.

21. A method as recited in claim 20, further comprising discarding results of all of the

not received.

22. A method as recited in claim 20, wherein each of the plurality of monitoring requests is

run via a respective monitoring thread.

23. A method as recited in claim 22, wherein if a valid response to at least one of the plurality

process results of their respective requests.

24. A method as recited in claim 23, wherein if all of the monitoring requests failed or timed

out, each of the respective monitoring threads is instructed to discard results of their respective requests.

25. A method as recited in claim 23, further comprising a status checking thread, wherein if a

indicating the valid response is forwarded to the status checking thread.

26. A method as recited in claim 20, wherein the plurality of monitoring requests are sent to

the device at substantially the same time.

27. A method as recited in claim 20, wherein the plurality of requests sent to the device are

system event log files, system error log files, application log files, services running on a device,

processes running on the device and physical performance data.

28. A method of monitoring a device, comprising:

sending a set of monitoring requests to the device, the set of monitoring requests being

commenced in multiple threads at substantially the same time, wherein each of these requests

times out if no response is received within a set amount of time and wherein responses to any

requests, whether a success or a failure, are not processed until all requests in the set complete or

time-out, and wherein at a completion or time-out of all requests, if none of the requests

completed successfully, then none of the individual responses are treated as failures, but rather

the entire device is considered down, and the results of the requests are not processed.

29. A method as recited in claim 28, wherein if any of the requests complete

successfully then all requests are processed including both those which were successful and those

that failed.
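
The behavior recited in claims 28 and 29 can be illustrated with a minimal sketch. It is not the applicant's implementation; it assumes each monitoring request is a zero-argument Python callable that returns a value on success and raises an exception on failure. The names `monitor_device`, `checks`, `timeout_s`, and the example check functions are hypothetical and chosen only for illustration. All requests are started in separate threads at substantially the same time, no result is examined until every request has completed or the shared timeout has elapsed, and if no request succeeded the device is reported as down and the individual results are discarded.

```python
import concurrent.futures as cf
import time


def monitor_device(checks, timeout_s):
    """Run every check in its own thread; defer all processing until each
    check has completed or the shared timeout has elapsed.

    checks    -- dict mapping a check name to a zero-argument callable
                 (hypothetical interface; raising an exception means failure)
    timeout_s -- seconds allowed, measured from when the checks are started
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(checks)))
    # Start all requests at substantially the same time.
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    # Block until all requests complete or the timeout expires.
    _, not_done = cf.wait(futures.values(), timeout=timeout_s)
    # Do not block on threads that are still running past the timeout.
    pool.shutdown(wait=False, cancel_futures=True)

    outcomes = {}
    for name, fut in futures.items():
        if fut in not_done:
            outcomes[name] = ("timeout", None)
        elif fut.exception() is not None:
            outcomes[name] = ("failed", fut.exception())
        else:
            outcomes[name] = ("ok", fut.result())

    # No valid response at all: treat the whole device as down and
    # discard the individual results instead of reporting each failure.
    if not any(status == "ok" for status, _ in outcomes.values()):
        return {"device": "down", "processed": {}}

    # At least one valid response: process every result, both the
    # successes and the failures or timeouts.
    return {"device": "up", "processed": outcomes}


# Hypothetical checks used only to exercise the sketch.
def ping_ok():
    return 42


def port_refused():
    raise RuntimeError("connection refused")


def slow_log_read():
    time.sleep(0.5)
    return "log data"
```

Calling `monitor_device({"ping": ping_ok, "port": port_refused}, 0.1)` processes both results because one request succeeded, whereas calling it with only `port_refused` and `slow_log_read` yields a single device-down result with nothing processed. A timed-out worker thread keeps running in the background in this sketch; a production design would need its own cancellation mechanism.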