SYSTEM AND METHOD FOR DEVICE FAILURE RECOGNITION
Reference to Related Application
The present application claims the benefit of provisional Application Serial No.
60/174,085 filed on December 30, 1999, which is hereby incorporated herein by reference.
BACKGROUND OF THE DISCLOSURE
1. Field
The present disclosure relates to device monitoring, and in particular to a system and
method for device failure recognition.
2. Description of Related Art
Many network monitoring and management software applications have the ability to
monitor many different components within a device such as: physical network interface
availability (ping, echo or diagnostic socket request); TCP/IP port connectivity (ability to
connect to a port); system event log and system error log files; application log files;
services or processes running on a device; and physical performance data, for example.
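By way of illustration only, one of the checks listed above, TCP/IP port connectivity, might be sketched as follows. The function name `port_is_reachable` and the default time-out value are assumptions made for this example and do not appear in the disclosure; the sketch relies only on the standard Python `socket` module.

```python
import socket

def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: a refused, unreachable or timed-out connection
    attempt is reported as False rather than raised to the caller.
    """
    try:
        # create_connection blocks until connected, refused, or timed out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The call blocks the calling task until the connection attempt completes or times out, which is the blocking behavior discussed in the paragraphs that follow.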
In order to communicate efficiently with multiple devices and objects, most operating
systems today implement the concept of a thread. A thread is a code sequence that runs multi-
tasked with other threads on either a uni-processor or multi-processor operating system. Many
system operations, and in particular, calls to remote devices including monitoring requests,
require a period of time in which the task performing the request is blocked from executing until
the request is answered or otherwise completes. However, the processor can still be processing
other threads while the blocked thread is inactive. These system operations often include
monitoring operations for monitoring components within a device such as those mentioned
above. Threads provide an efficient way to improve the throughput and scalability of
monitoring, by allowing multiple monitoring requests to execute concurrently, each in different
threads. It is thus possible to quickly perform multiple monitoring requests. An example of a
monitoring operation that can be performed by a thread is a "ping" to remote network devices.
An example of a ping is a UNIX utility used to determine whether a specified address can be
reached. A ping command can use an Internet Control Message Protocol (ICMP) to determine if
a node can respond.
In some of the types of monitored components listed above (such as physical network
interface availability monitoring utilizing ping type requests) the monitoring application verifies
availability of a device or interface using network APIs (application program interfaces). APIs
are generally application object methods which allow external applications to access object
features in the application or operating system providing the API.
In other types of monitoring, such as the monitoring of system event logs, log files, and
services or process information, APIs which are specific to the OS (Operating System) platform
use some type of remote procedure call (RPC) mechanism to gather information. RPCs enable
systems to search for information from either the system making the RPC (a client) or from a
remote system or server. In essence, RPCs are protocols that enable computers to transmit data
or to request services of other computers or devices.
Regardless of the type of remote call that is executed, in most cases the actual monitoring
request is blocked with respect to the current execution path. In other words, after a task makes a
request (for example, to open a remote event log), the task is blocked until the call completes. It
is possible that on a single device, many instances of monitoring of the above-mentioned
components can exist. For example, on a single device one or more interfaces may be
monitored, one or more ports may be monitored, one or more log files may be monitored, one or
more application log files may be monitored, one or more services and processes may be
monitored, and/or one or more types of physical discrete performance data may be gathered, etc.
If there are many instances of the above components being monitored on a single device,
when the device catastrophically fails (such as a failure which is followed by a cold start), each
of the monitoring methods will fail within varying periods of time based on the complexity and
code path of the API being used to perform the monitoring. This may require the monitoring
system to be tied up for substantial lengths of time. Accordingly, optimal methods are required
for determining whether a particular device is completely down and for suppressing the
processing of the failure of each of the individual calls when a catastrophic failure of the device
occurs.
There are several methods which might be used to accomplish this goal. One method
may be to correlate failure of specific requests to a single device, and thereby establish whether
the failures are due to a catastrophic failure of the device itself. Another method may select a
specific monitoring type (for example, a specific interface Ping), which when failed can be used
to determine that the device is down.
Unfortunately, neither of these methods is optimal. In the first case, correlating failure of
the requests may itself take a considerable amount of time and processing. In the second case, it
is difficult to select just a single service as the critical service on a device. In each case it may be
possible that a specific service can be unavailable without the device actually being down.
SUMMARY
A method of monitoring a device, the method including sending a plurality of monitoring
requests to the device and processing responses to the plurality of monitoring requests after a
response to each monitoring request has been received or the request has timed out, wherein
processing is not performed for any of the monitoring requests unless a valid response to at least
one of the plurality of monitoring requests is received. The method may further comprise
discarding any responses to monitoring requests if a valid response to at least one of the plurality
of monitoring requests is not received. Each of the plurality of monitoring requests may run via
a respective monitoring thread. If a valid response to at least one of the plurality of monitoring
requests is received, each of the respective monitoring threads may be instructed to process
results of their respective requests. If all of the monitoring requests failed or timed out, each of
the respective monitoring threads may be instructed to discard results of their respective requests.
The method may further include running a status checking thread, wherein if a valid response to
at least one of the plurality of monitoring requests is received, information indicating the valid
response may be forwarded to the status checking thread. The plurality of monitoring requests
may be sent to the device at substantially the same time. The plurality of requests sent to the
device may be for monitoring at least one of physical network interface availability, TCP/IP port
connectivity, system event log files, system error log files, application log files, services running
on a device, processes running on the device and physical performance data. The method may
further include starting a timer at substantially the same time the plurality of monitoring requests
are sent, wherein a monitoring request may be considered to have timed out, if a response is not
received within a predefined amount of time from when the timer was started.
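The decision rule summarized above can be illustrated with a minimal sketch. The names `Outcome` and `decide`, and the string values returned, are hypothetical and chosen only for this example; the sketch assumes each monitoring request has already been classified as valid, failed, or timed out.

```python
from enum import Enum

class Outcome(Enum):
    VALID = "valid"
    FAILED = "failed"        # an error, failure or invalid message
    TIMED_OUT = "timed_out"  # no response within the allowed time

def decide(outcomes):
    """Decide what to do once every request has responded or timed out.

    outcomes maps a request name to its Outcome. Results are processed
    only if at least one request received a valid response; otherwise
    the device as a whole is treated as down and results are discarded.
    """
    if not outcomes:
        raise ValueError("at least one monitoring request is required")
    if any(o is Outcome.VALID for o in outcomes.values()):
        return "process"
    return "device_down"
```

For example, a timed-out ping together with a valid port check yields "process", whereas a timed-out ping together with a failed log request yields "device_down".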
A monitoring system for monitoring at least one device on a network may include storage
for storing information and a processing system capable of running at least one monitoring
application, the at least one monitoring application sending a plurality of monitoring requests to
the device and processing responses to the plurality of monitoring requests only after a response
to each monitoring request has been received or the monitoring request times out, wherein
processing of responses is not performed for any of the monitoring requests unless a valid
response to at least one of the plurality of monitoring requests is received.
A storage medium storing a monitoring application includes computer executable code
for running a plurality of monitoring threads, each monitoring thread monitoring a request made
to a remote device and computer executable code for running a status checking thread, wherein
after responses to all requests have been received or the requests time out, if any of the requests
have received a valid response, each of the monitoring threads processes the results of their
respective requests.
A method of monitoring a device includes sending a plurality of monitoring requests to
the device, each monitoring request having a time out time associated therewith, wherein a
monitoring request is considered to have timed out, if a response to the request is not received
prior to the time out time expiring and processing responses to the plurality of monitoring
requests only after a response to each monitoring request has been received or the monitoring
request times out, wherein processing of responses is not performed for any of the monitoring
requests unless a valid response to at least one of the plurality of monitoring requests is received.
A method of monitoring a device may include sending a set of monitoring requests to the
device, the set of monitoring requests being commenced in multiple threads at substantially the
same time, wherein each of these requests times out if no response is received within a set
amount of time and wherein responses to any requests, whether a success or a failure, are not
processed until all requests in the set complete or time out, and wherein at a completion or time-out
of all requests, if none of the requests completed successfully, then none of the individual
responses are treated as failures, but rather the entire device is considered down, and the results
of the requests are not processed.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete appreciation of the present embodiments and many of the attendant
advantages thereof will be readily obtained as the same becomes better understood by
reference to the following detailed description when considered in connection with the
accompanying drawings, wherein:
Fig. 1 is a block diagram of a network including a monitoring system according to an embodiment;
Figs. 2A, 2B and 2C depict a flow chart for describing a monitoring method according to an embodiment;
Fig. 3 is a block diagram of a monitoring system according to an embodiment;
Fig. 4 is a block diagram depicting exemplary components capable of being monitored;
Figs. 5A, 5B and 5C depict a flowchart for describing a monitoring method according to
another embodiment; and
Fig. 6 depicts a monitoring queue according to an embodiment.
DETAILED DESCRIPTION
In describing the preferred embodiments illustrated in the drawings, specific
terminology is employed for sake of clarity. However, the disclosed embodiments are not
intended to be limited to the specific terminology so selected and it is to be understood that
each specific element includes all technical equivalents which operate in a similar manner.
Fig. 1 depicts an exemplary network to which the present system and method may be
applied. For example, in this embodiment, a monitoring system 102 may be connected to remote
devices such as a network printer 104, network facsimile device 106, server 108, computer
workstations 110, 112 and a router 118, via network 116. Applications running on monitoring
system 102 can communicate with these other remote devices on the network 116 via suitable
interfaces and protocols determined based on the operating systems on the network and the
network architecture used. Network 116 may be, for example, a local area network (LAN). Of
course, the present system and method may be implemented on other types of networks such as,
for example, a wide area network (WAN), the Internet, etc. According to the present
embodiments, monitoring system 102 includes a monitoring application that may be used to
monitor one or more of the remote devices on network 116. Of course, the monitoring
application can be provided on one of the other remote devices shown in Fig. 1 or on a device
remote to network 116 and used to monitor one or more of the other devices on network 116. A
monitoring device may itself be monitored by another device running a monitoring application.
It is assumed hereinafter that suitable protocols and network operating systems are in
place and that the interfaces, files, ports and other components of the remote devices can be
accessed through APIs or appropriate RPCs. Such arrangements are very well known and utilized
on most networked devices today.
Monitoring system 102 may be a standard PC, laptop, mainframe, etc. capable of running
a monitoring application according to an embodiment described herein. Fig. 3 depicts a block
diagram of exemplary components monitoring system 102 may include. Of course, monitoring
system 102 may not include each component shown and/or may include additional components
not shown. As shown, monitoring system 102 may include a central processing unit (CPU) 2, a
memory 4, a clock circuit 6, a printer interface 8, a display unit 10, a LAN data transmission
controller 12, a LAN interface 14, a network controller 16, an internal bus 18 and one or more
input devices 20 such as, for example, a keyboard and mouse.
CPU 2 controls the operation of system 102 and is capable of running applications stored
in memory 4. Memory 4 may include, for example, RAM, ROM, removable CDROM, DVD,
etc. Memory 4 may also store various types of data necessary for the execution of the
applications, as well as a work area reserved for use by CPU 2. Clock circuit 6 may include a
circuit for generating information indicating the present time, and may be capable of being
programmed to count down a predetermined or set amount of time.
The LAN interface 14 allows communication between the network 116, which in this
example is a LAN, and the LAN data transmission controller 12. The LAN data transmission
controller 12 uses a predetermined protocol suite to exchange information and data with the other
devices on network 116. Monitoring system 102 may also be capable of communicating with
and/or monitoring devices on other networks via router 118. System 102 may also be capable of
communicating with other devices via a Public Switched Telephone Network (PSTN) using
network controller 16. Internal bus 18, which may actually consist of a plurality of buses,
allows communication between each of the components connected thereto.
Each of the devices on network 116, including server 108, computers 110, 112, facsimile
106, printer 104 and router 118, may include one or more of the components shown in Fig. 4
which are capable of being monitored by system 102. Of course, these are just examples of the
types of components that may be monitored and the present embodiments envision the
monitoring of any component capable of being monitored. As shown in Fig. 4, a device may
include one or more system event logs 40, one or more system error log files 42, one or more
application log files 44, TCP/IP port (connectivity) information 46, services/processes running
status information 48, physical performance data 50 and/or physical network interface
(availability) information 52.
Monitoring system 102 may include an operating system capable of running different
applications independently and at the same time. In this type of operating system, an application
can spawn multiple execution threads each of which runs independently at the same time. Each
of these threads is capable of performing a monitoring request for monitoring one of the
components shown in Fig. 4, for example. Accordingly, system 102 is capable of simultaneously
executing multiple threads, each for monitoring a different component on a device being
monitored. Each thread performing a monitoring request will typically execute until a blocking
request is made which is dependent on the asynchronous completion of a remote procedure call
(RPC), which will then return its results. At the same time, other threads can be executing and
can either execute until they block or time-out, or can be processing (e.g., storing) the response
of a completed monitoring request.
It is often desirable to determine when a catastrophic failure occurs on a device being
monitored. To achieve this and other benefits, the monitoring application design of the present
embodiment starts a set of monitoring requests to a specific device in multiple threads
(monitoring threads) as a main group and at substantially the same time. Ideally, a basic interval
will be set, at the end of which each monitoring thread in the main group will have begun. A
status checking thread can be commenced at substantially the same time as the monitoring
threads for checking the status of the monitoring threads.
When a monitoring request has results returned from the device, the results are stored by
the monitoring thread performing the monitoring request. The monitoring thread also forwards
information indicating success of the request to the status checking thread. The status checking
thread waits for responses from each of the monitoring threads. Theoretically, a monitoring
thread may wait indefinitely for a result to be returned from the device being monitored.
According to an embodiment described herein, in order to avoid an indefinite wait state, a timer
is set to a defined amount of time. After this time expires, it is determined that a response from
the device is not forthcoming. After the time expires, the monitoring request or requests that
have not received results are designated as having timed out. If one or more requests time out, an
indication is stored by the associated monitoring thread(s), and this information is also forwarded
to the status checking thread.
Once the status checking thread has received all the success/failure messages for the
different monitoring requests or information indicating that the request(s) have timed out, it
examines these results to determine the status of the requests. If a request receives a response
indicating a failure (e.g., an error, failure or invalid message), the request is considered to have
received a failed response. In addition, a request that has timed out, for purposes of this
description, is also considered to have received a failed response. If all requests receive failed
responses, the device is considered down and a message is sent to all of the processing threads to
terminate, and to not process their results. If one or more of the threads received a valid response
from the device, the responses to each of the threads are processed in a normal manner.
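The interaction between the monitoring threads and the status checking thread described above may be sketched as follows. This is an illustrative arrangement only: the function names, the use of a `queue.Queue` for reports, and the use of a `threading.Event` to broadcast the decision are assumptions of this example and are not specified in the disclosure. Each request is modeled as a callable that returns a result or raises an exception on failure or time-out.

```python
import queue
import threading

def monitor(name, request, reports, decision, processed):
    """Monitoring thread: run one request, report, then await the verdict."""
    try:
        result, ok = request(), True
    except Exception as exc:            # error or time-out raised by the request
        result, ok = exc, False
    reports.put((name, ok))             # success/failure message to the checker
    decision["ready"].wait()            # hold the result until a decision is made
    if decision["process"]:
        processed[name] = result        # process in the normal manner

def status_checker(expected, reports, decision):
    """Status checking thread: collect one report per request, then decide."""
    any_valid = False
    for _ in range(expected):
        _, ok = reports.get()
        any_valid = any_valid or ok
    decision["process"] = any_valid     # all failed: device considered down
    decision["ready"].set()             # release every monitoring thread

def run_group(requests):
    """Start the checker and all monitoring threads; return (alive, processed)."""
    reports = queue.Queue()
    decision = {"ready": threading.Event(), "process": False}
    processed = {}
    threads = [threading.Thread(target=status_checker,
                                args=(len(requests), reports, decision))]
    threads += [threading.Thread(target=monitor,
                                 args=(n, r, reports, decision, processed))
                for n, r in requests.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return decision["process"], processed
```

In this sketch no monitoring thread acts on its own result until the status checking thread has heard from every request, so a device that answers nothing produces a single "down" verdict rather than one failure per request.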
A method for monitoring according to an embodiment is described below by reference to
Figures 2A-2C. In step S1, the monitoring application provided on system 102 is started. The
monitoring application may be manually started by an operator or system 102 may be
programmed to automatically run the monitoring application periodically or at a set time each
day, week, etc. In step S2, a set of the specific types of monitoring requests to be performed can
be input by an operator and/or a predetermined set of monitoring requests can be retrieved from
memory 4 (Fig. 3). In step S3, an overall monitoring time can be input and set by the operator or
retrieved from memory and used to set a time-out timer (e.g., Figure 3, clock circuit 6).
According to this embodiment, each monitoring request in the set of requests is performed by a
separate monitoring thread. In step S4, each of the monitoring request threads is commenced at
substantially the same time. At substantially the same time, a status checking thread is also
commenced. In step S6, the time-out timer is started. Each of the monitoring threads then
monitors for a result to its request or for a time-out (Step S8). If a monitoring result is received
(Yes, Step S10), the result is forwarded by the monitoring thread to the status checking thread
and stored (Step S14) and monitoring continues. If a monitoring thread result is not received
(No, Step S10), a determination is made whether the time-out timer has timed out. If the time-out
timer has not timed out (No, Step S16), the process returns to Step S8 and monitoring
continues. If the time-out timer has timed out (Yes, Step S16), monitoring requests that have not
received a result are considered to have timed-out. Information identifying the monitoring
requests that have not received a result is then forwarded by each of the respective monitoring
threads to the status checking thread and stored (Step S18). The status checking thread then
examines the request results (Step S22). In Step S24, a determination is made whether each of
the monitoring requests has received a "failed response", either having received an error, failure
or invalid message from the device being monitored or having timed out. If all monitoring
requests have received "failed responses" (Yes, Step S24), it is determined that a catastrophic
failure of the device has occurred and all threads are terminated (Step S26). Notification of the
catastrophic failure may then be reported, for example, to an operator via the display or other
output device and/or stored. If one or more of the monitoring requests did not fail, a message is
sent to all monitoring threads to process all results (Step S28) and the results of the monitoring
requests are processed in a normal manner and can be reported, for example, to the operator via
the display or other output device and/or stored.
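A compact sketch of the flow of Figs. 2A-2C, with a single overall time-out shared by all requests, might look like the following. It uses the standard Python `concurrent.futures` module in place of explicitly managed threads, which is a substitution made for brevity; the function name `poll_device` and the returned labels are hypothetical. Each request is again modeled as a callable that returns a result or raises an exception.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def poll_device(requests, overall_timeout):
    """Start all requests together and apply one shared time-out.

    requests maps a request name to a zero-argument callable.
    Returns (verdict, results, outcomes).
    """
    # One worker per request so that all requests start at about the same time.
    pool = ThreadPoolExecutor(max_workers=len(requests))
    futures = {pool.submit(fn): name for name, fn in requests.items()}
    done, not_done = wait(futures, timeout=overall_timeout)
    pool.shutdown(wait=False)  # do not block on requests that are still pending

    # Requests still pending when the timer expires are treated as timed out.
    outcomes = {futures[f]: "timed_out" for f in not_done}
    results = {}
    for f in done:
        name = futures[f]
        if f.exception() is None:
            outcomes[name] = "valid"
            results[name] = f.result()
        else:
            outcomes[name] = "failed"

    # All requests failed or timed out: the device is considered down
    # and no individual result is processed.
    if "valid" not in outcomes.values():
        return "device_down", {}, outcomes
    return "processed", results, outcomes
```

Note that in this sketch a request that is still blocked when the timer expires is simply abandoned; its worker thread runs to completion in the background, which a production implementation would likely need to handle more carefully.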
Each of the components capable of being monitored may not take the same amount of
time to respond to a monitoring request. Fig. 5 depicts a method for monitoring according to
another embodiment, which takes into account these variations in monitoring times. According
to this next embodiment, each monitoring request has a time-out time associated therewith. For
example, each monitoring request is monitored for a result. If a result is not received within the
time-out time associated with that monitoring request, a time-out indication is provided to the
monitoring thread.
In this embodiment, each monitoring request is arranged in a monitor queue such as that
shown in Fig. 6 and each is individually monitored in turn for either a result or a time-out
indication. As shown in Fig. 6, each monitoring request includes a "Time-out" time associated
therewith. Referring now to Figs. 5A-5C, in step S50, the monitoring application provided on
system 102 is started. The monitoring application may be manually started by an operator or
system 102 may be programmed to automatically run the monitoring application periodically or
at a set time each day, week, etc. In step S52, a set of the specific types of monitoring requests to
be performed can be input by an operator and/or a predetermined set of monitoring requests can
be retrieved from memory 4 (Fig. 3). In step S53, a monitoring time can be input and set by the
operator or retrieved from memory for each monitoring request. Each monitoring request in the
set of requests is performed by a separate monitoring thread. In step S54, each of the monitoring
request threads is commenced at substantially the same time. At substantially the same time, a
status checking thread is also commenced. In step S56, the time-out timer is started. In this
embodiment, the time-out timer may be, for example, an elapsed time counter. Each of the
monitoring threads then monitors for a result to its request or for a time-out (Step S58). The
status checking thread then monitors for results from the monitoring threads. In Step S60, a
determination is made whether a result to the first monitoring request in the monitor queue has
been received. If a result has been received, (Yes, Step S60), information indicating the result
(e.g., valid message or error message) is forwarded to the status checking thread and stored (Step
S62) and the next monitoring request in the queue is selected (Step S64). That monitoring
request is then monitored for a result (Step S58). If a result for the monitoring request is not
received (No, Step S60), a determination is made whether the request has timed out. This
determination can be made by determining whether the time-out time associated with that
monitoring request as shown in Fig. 6, has expired, by referring to the elapsed time counter
started in Step S56. If the request has not timed out (No, Step S66), the next monitoring request
in the queue is selected (Step S67) and it is checked whether a result for that request has been
received (Step S58). On the other hand, if the monitoring request timed out (Yes, Step S66),
time-out information identifying the monitoring request that timed out is forwarded to the status
checking thread and that monitoring request is, in effect, removed from the monitoring queue. In
Step S70, a determination is made whether each monitoring request in the queue has either
received a result or has timed out. If No in Step S70, the next monitoring request in the queue is
selected (Step S67) and the process returns to Step S58. If all monitoring requests have either
received a result or timed out (Step S70), the status checking thread then examines the request
results (Step S72). In Step S74, a determination is made whether all requests have failed, either
having received an error or invalid message from the device being monitored or having timed
out. If all requests failed (Yes, Step S74), it is determined that a catastrophic failure of the
device has occurred and all processing threads are terminated (Step S76). Notification may then
be provided to an operator and/or stored. If one or more of the requests did not fail, a message is
sent to all processing threads to process all results (Step S78) and the results are processed in a
normal manner.
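The queue-based variant of Figs. 5A-5C and Fig. 6, in which each request carries its own time-out, can be illustrated with the sketch below. To keep the example deterministic, the elapsed time counter and the check for a received result are passed in as the hypothetical callables `clock` and `poll`; these names, and the function name `run_monitor_queue`, are assumptions of this example.

```python
from collections import deque

def run_monitor_queue(requests, poll, clock):
    """Visit queued requests in turn until each has a result or has timed out.

    requests: list of (name, timeout) pairs, as in the monitor queue.
    poll(name): returns None while pending, else (status, payload) where
                status is "valid" or "failed".
    clock(): returns the time elapsed since the requests were started.
    Returns (device_down, status_by_name).
    """
    pending = deque(requests)
    status = {}
    while pending:
        name, timeout = pending.popleft()
        outcome = poll(name)
        if outcome is not None:
            status[name] = outcome[0]         # result received
        elif clock() >= timeout:
            status[name] = "timed_out"        # removed from the queue
        else:
            pending.append((name, timeout))   # check again on a later pass
    # The device is considered down only if no request got a valid response.
    device_down = all(s != "valid" for s in status.values())
    return device_down, status
```

A practical implementation would likely pause briefly between passes over the queue rather than polling continuously, as this simplified loop does.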
If certain types of monitoring requests are required more frequently than other types to a
particular device, extra calls can be commenced independently of the main group and either
monitored with the main group or monitored independently from but in a manner similar to that
performed on the main group. If certain requests are required less frequently, they can be made
on alternate start schedules, or whatever other proportion is desired. Alternatively, a schedule of
start times can be set up, and any particular monitoring request or requests can be assigned to the
nearest start time, with those starting at substantially the same time being monitored as a group.
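The assignment of monitoring requests to the nearest scheduled start time may be sketched as follows. The function name `group_by_start_time` and the representation of times as plain numbers (for example, seconds from the start of a cycle) are assumptions of this example.

```python
from collections import defaultdict

def group_by_start_time(desired, schedule):
    """Assign each request to the nearest scheduled start time.

    desired: maps a request name to its desired start time.
    schedule: list of available start times.
    Returns a dict mapping each used start time to the list of requests
    that start then and are therefore monitored together as a group.
    """
    groups = defaultdict(list)
    for name, wanted in desired.items():
        # On a tie, the earlier entry in the schedule is chosen.
        nearest = min(schedule, key=lambda s: abs(s - wanted))
        groups[nearest].append(name)
    return dict(groups)
```

With start times of 0, 60 and 120, requests desired at 50 and 70 would both be assigned to the start time of 60 and monitored as one group.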
The present embodiments may be conveniently implemented using one or more
conventional general purpose digital computers and/or servers programmed according to the
teachings of the present specification. Appropriate software coding can readily be prepared by
skilled programmers based on the teachings of the present disclosure. The present
embodiments may also be implemented by the preparation of application specific integrated
circuits or by interconnecting an appropriate network of conventional component circuits.
Numerous additional modifications and variations of the present embodiments are
possible in view of the above-teachings. It is therefore to be understood that within the scope
of the appended claims, the present embodiments may be practiced other than as specifically
described herein.