.\" Watchdog Code [1e99dd]: watchdog.8
.TH WATCHDOG 8 "February 1996"
.UC 4
.SH NAME
watchdog \- a software watchdog daemon
.SH SYNOPSIS
.B watchdog
[
.I -f | --force
] [
.I -c filename | --config-file filename
] [
.I -v | --verbose
] [
.I -s | --sync
] [
.I -b | --softboot
] [
.I -q | --no-action
]
.br
.SH DESCRIPTION
Watchdog is a daemon that checks whether your system is still working. If
programs in user space are no longer executed, it will hard reset the system.

The kernel provides /dev/watchdog, which when open must be written
to within a minute or the machine will reboot. Each write delays the reboot
time by another minute. After a minute without a write the watchdog hardware
will cause the reset. In the case of the software watchdog the ability to
reboot will depend on the state of the machine and its interrupts.

Watchdog can be stopped without causing a reboot if the device /dev/watchdog
is closed correctly, unless of course your kernel is compiled with the
CONFIG_WATCHDOG_NOWAYOUT option enabled.
.LP
.SH TESTS
Watchdog itself does several additional tests to check the system status:
.TP
Check whether the process table is full.
.TP
Check whether some given files are accessible.
.TP
Check whether some given files change in a given interval.
.TP
Check whether the load average exceeds a predefined maximal value.
.TP
Check whether a file table overflow occurred.
.TP
Check whether some given IP addresses answer a ping message.
.TP
Check the temperature (if available).
.TP
Execute a user defined binary to do arbitrary tests.
.LP
If any of these checks fails, watchdog will cause a shutdown. Should any of
these tests, except the user defined binary, last longer than one minute, the
machine will be rebooted, too.
.LP
.SH OPTIONS
Available command line options are the following:
.TP
-v | --verbose
Set verbose mode. Only implemented if compiled with the SYSLOG feature. In
this mode watchdog logs several pieces of information to the LOG_DAEMON
facility with priority LOG_INFO. This is useful if you want to see exactly
what happened before watchdog rebooted the system. Currently it logs the
temperature (if available), the load average, the change date of the files
it checks and how often it went to sleep.
.TP
-s | --sync
Try to sync the filesystem every time the process is awake. Be aware that
the system is rebooted if for any reason syncing lasts longer than a minute.
.TP
-b | --softboot
Soft-boot the system if an error occurs during the main loop, e.g. if the
file given with option -n is not accessible via the stat call. Note that
this does not apply to the open calls to /dev/watchdog and /proc/loadavg
which are opened before the main loop starts.
.TP
-f | --force
Force the usage of the interval given or the maximal load average given 
in the config file.
.TP
-c <config file> | --config-file <config file>
Use <config file> as config file instead of the default /etc/watchdog.conf.
.TP
-q | --no-action
Do not reboot or halt the machine. This is for testing purposes. All checks
are executed and the results are logged as usual, but no action is taken.
In addition, neither your hardware card nor the kernel software watchdog
driver is enabled. Note that temperature checking is also disabled since this
triggers the hardware watchdog on some cards.
.LP
.SH FUNCTION
After starting, watchdog puts itself into the background and then tries all
checks specified in its config file in turn. Between each two tests it
triggers the kernel device. After finishing all tests watchdog goes to sleep
for some time. The kernel driver expects a write to the watchdog device every
minute; otherwise the system will be rebooted. By default watchdog sleeps for
only 10 seconds, so it triggers the device early enough.
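.LP
The cycle described above can be sketched roughly as follows. This is an
illustration only, not the daemon's actual code: the function name run_cycle,
the list of (name, check) pairs and the keepalive byte are assumptions made
for this sketch.
.nf
```python
import io


def run_cycle(device, checks):
    """Run each check in turn and trigger the device after every passed check.

    device is any writable binary file object (hypothetical stand-in for
    the opened watchdog device). checks is a list of (name, callable) pairs.
    Returns the name of the first failing check, or None if all passed.
    """
    for name, check in checks:
        if not check():
            return name  # caller decides how to react to the failure
        device.write(b"w")  # assumption: the byte value is arbitrary here
        device.flush()
    return None
```
.fi
.LP
In a real loop the caller would open the device once, for example with
open("/dev/watchdog", "wb", buffering=0), call run_cycle, and then sleep for
the configured interval before the next cycle.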

Under high system load watchdog might be swapped out of memory and may fail
to be swapped back in in time. Under these circumstances the Linux kernel will
hard reset the machine. To avoid unnecessary reboots, make sure the variable
\&'realtime' is set to yes in the config file watchdog.conf. This adds
real-time support to watchdog: it locks itself into memory, so there should
be no problem even under the highest of loads.

You can also specify a maximal allowed load average. Once this load average
is reached the system is rebooted. You may specify maximal load averages for
1 minute, 5 minutes or 15 minutes. The default values are 12, 9 and 6
respectively. Be careful not to set this parameter too low. To set a value
less than the predefined minimal value of 2, you have to use the -f option.
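.LP
As an illustration of the load average test, the following sketch parses the
text of /proc/loadavg and compares the three averages with the default limits
named above. The function names are hypothetical, and treating only a value
strictly above the limit as a failure is an assumption of this sketch.
.nf
```python
DEFAULT_MAX_LOAD = (12.0, 9.0, 6.0)  # 1, 5 and 15 minute defaults


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("not enough data in loadavg text")
    return tuple(float(field) for field in fields[:3])


def load_exceeded(text, limits=DEFAULT_MAX_LOAD):
    """Return True if any of the three averages is above its limit."""
    loads = parse_loadavg(text)
    return any(load > limit for load, limit in zip(loads, limits))
```
.fi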

If you have a watchdog card with a temperature sensor you can specify
the maximal allowed temperature. Once this temperature is reached the
system is halted. The default value is 120. There is no unit conversion, so
make sure you use the same unit as your hardware. Watchdog will issue warnings
once the temperature reaches 90%, 95% and 98% of this value.
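.LP
The warning thresholds can be illustrated with a small sketch. The function
name and the returned labels are hypothetical; only the default of 120 and
the 90%, 95% and 98% levels come from the text above.
.nf
```python
def temperature_status(temp, max_temp=120):
    """Classify a reading against the maximal allowed temperature.

    Returns "halt" at or above the maximum, "warn-98", "warn-95" or
    "warn-90" at the matching warning level, and "ok" otherwise.
    """
    if temp >= max_temp:
        return "halt"
    for percent in (98, 95, 90):
        # Integer comparison avoids rounding problems with percentages.
        if temp * 100 >= max_temp * percent:
            return "warn-%d" % percent
    return "ok"
```
.fi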

When using file mode watchdog will try to stat the given files. Errors
returned by stat will
.I not
cause a reboot. For a reboot the stat call has to last at least one minute.
This may happen if the file is located on an NFS mounted filesystem. If your
system relies on an NFS mounted filesystem you might try this option.
However, in such a case the sync option may not work if the NFS server is
not answering.

Watchdog will periodically try to fork itself to see whether the process
table is full. The child will remain as a zombie process until watchdog wakes
up again and catches it.
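.LP
A fork test of this kind might look like the sketch below. The function name
is hypothetical. Unlike the behaviour described above, this sketch reaps the
child immediately instead of leaving a zombie until the next wake-up.
.nf
```python
import os


def can_fork():
    """Return True if a child process could be created, False otherwise."""
    try:
        pid = os.fork()
    except OSError:
        return False  # for example when the process table is full
    if pid == 0:
        os._exit(0)  # child: leave at once without cleanup handlers
    os.waitpid(pid, 0)  # parent: collect the child so no zombie remains
    return True
```
.fi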

In ping mode watchdog tries to ping the given addresses. These addresses do
not have to be a single machine. It is possible to ping a broadcast
address instead to see if at least one machine in a subnet is still alive.
Watchdog will send out three ping packets and wait up to <interval> seconds
for the reply, with <interval> being the time it goes to sleep between two
times triggering the watchdog device. Thus an unreachable network will not
cause a hard reset but a soft reboot.

By using an external check binary watchdog can run user defined tests.
These may last longer than the time slice defined for the kernel device
without a problem. However, note that in this case error messages are
written to the syslog facility. If you have enabled softboot on error
the machine will be rebooted if the binary doesn't exit in half the time
watchdog sleeps between two tries triggering the kernel device.

If you specify a repair binary it will be started instead of shutting down
the system. If this binary is not able to fix the problem watchdog will
still cause a reboot afterwards.

If the machine is eventually halted an email is sent to notify a human that
the machine is going down.
.LP
.SH SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for every
error that is found. Since there might be no more processes available,
watchdog does it all by itself. That means:
.TP
1) Kill all processes with SIGTERM.
.TP
2) After a short pause kill all remaining processes with SIGKILL.
.TP
3) Record a shutdown entry in wtmp.
.TP
4) Save the random seed from /dev/urandom. If the device is nonexistent or
the filename to save to is empty this step is skipped.
.TP
5) Turn off accounting.
.TP
6) Turn off quota and swap.
.TP
7) Unmount all partitions except the root partition.
.TP
8) Remount the root partition read-only.
.TP
9) Shut down all network interfaces.
.TP
10) Finally reboot.
.LP
.SH CHECK BINARY
If the return code of the check binary is not zero watchdog will assume an
error and reboot the system. Be careful with this if you are using the
real-time properties of watchdog since watchdog will wait for the return of
this binary before proceeding. A positive exit code is interpreted as a
system error code (see errno.h for details). Negative values are special to
watchdog:
.TP
-1 reboot the system. This is not exactly an error message but a command to
watchdog. If the return code is -1 watchdog will not try to run a shutdown
script instead.
.TP
-2 reset the system. This is not exactly an error message but a command to
watchdog. If the return code is -2 watchdog will simply refuse to write the
kernel device again.
.TP
-3 max load average exceeded.
.TP
-4 the temperature inside is too high.
.TP
-5 /proc/loadavg contains no (or not enough) data.
.TP
-6 Given file was not changed in the given interval.
.TP
-7 free for personal use
.TP
...
.LP
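The interpretation of return codes listed above can be summarised in a small
lookup sketch. The table and function names are hypothetical, and the label
returned for codes from -7 downwards is a placeholder chosen for this sketch.
.nf
```python
import os

# Negative codes and their meanings as listed in this section.
WATCHDOG_CODES = {
    -1: "reboot the system",
    -2: "reset the system",
    -3: "max load average exceeded",
    -4: "temperature too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "given file was not changed in the given interval",
}


def describe_code(code):
    """Return a human readable description of a check binary result."""
    if code == 0:
        return "ok"
    if code > 0:
        return os.strerror(code)  # positive values are errno codes
    return WATCHDOG_CODES.get(code, "unlisted negative code")
```
.fi
.LP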
.SH REPAIR BINARY
The repair binary is started with one parameter: the error number that
caused watchdog to initiate the reboot process. After trying to repair the
system the binary should exit with 0 if the system was successfully repaired
and thus there is no need to reboot anymore. A return value not equal to 0
tells watchdog to reboot. The return code of the repair binary should be the
error number of the error causing watchdog to reboot. Be careful with this if
you are using the real-time properties of watchdog since watchdog will wait
for the return of this binary before proceeding.
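.LP
A repair program following this contract could be structured like the sketch
below. The names repair, main and fixers are hypothetical, as is the fallback
code of 1 for a malformed command line. The fixers mapping stands for whatever
repair actions a site defines per error number.
.nf
```python
def repair(error_number, fixers):
    """Try the fixer registered for error_number.

    Returns 0 if the fixer reports success, otherwise the original
    error number so that watchdog goes ahead with the reboot.
    """
    fixer = fixers.get(error_number)
    if fixer is not None and fixer():
        return 0
    return error_number


def main(argv, fixers):
    """Entry point: argv[1] is the error number passed by watchdog."""
    if len(argv) != 2:
        return 1  # assumption: generic failure for a bad command line
    try:
        error_number = int(argv[1])
    except ValueError:
        return 1
    return repair(error_number, fixers)
```
.fi
.LP
A real script would pass the result of main(sys.argv, fixers) to sys.exit
as its exit status.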
.SH BUGS
None known so far.
.LP
.SH AUTHORS
The original code is an example written by Alan Cox
<alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
additions were written by Michael Meskes <meskes@debian.org>. Johnie Ingram
<johnie@netgod.net> had the idea of testing the load average. He also took
over the Debian specific work. Dave Cinege <dcinege@psychosis.com> brought
up some hardware watchdog issues and helped testing this stuff.
.LP
.SH FILES
.nf
/dev/watchdog  The watchdog device
/var/run/watchdog.pid The PID of the running watchdog
.fi
.SH "SEE ALSO"
.BR watchdog.conf (5)