TASK_KILLABLE

By Jonathan Corbet
July 1, 2008

Like most versions of Unix, Linux has two fundamental ways in which a process can be put to sleep. A process which is placed in the TASK_INTERRUPTIBLE state will sleep until either (1) something explicitly wakes it up, or (2) a non-masked signal is received. The TASK_UNINTERRUPTIBLE state, instead, ignores signals; processes in that state will require an explicit wakeup before they can run again.

There are advantages and disadvantages to each type of sleep. Interruptible sleeps enable faster response to signals, but they make the programming harder. Kernel code which uses interruptible sleeps must always check to see whether it woke up as a result of a signal, and, if so, clean up whatever it was doing and return -EINTR back to user space. The user-space side, too, must realize that a system call was interrupted and respond accordingly; not all user-space programmers are known for their diligence in this regard. Making a sleep uninterruptible eliminates these problems, but at the cost of being, well, uninterruptible. If the expected wakeup event does not materialize, the process will wait forever and there is usually nothing that anybody can do about it short of rebooting the system. This is the source of the dreaded, unkillable process which is shown to be in the "D" state by ps.
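As an illustration of the diligence required on the user-space side, here is a minimal sketch of a write loop which copes with interrupted system calls. The helper name write_all() is hypothetical and not part of any library; the sketch assumes a blocking file descriptor and simply restarts the call whenever write() fails with EINTR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * write_all() is a hypothetical helper for illustration only.
 * It keeps calling write() until all len bytes have been written,
 * restarting the call whenever a signal interrupts it (EINTR).
 * Returns 0 on success, or -1 with errno set on a genuine error.
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL || len == 0);

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data moved: retry */
			return -1;		/* genuine error */
		}
		p += n;			/* short write: advance and go around again */
		len -= (size_t)n;
	}
	return 0;
}
```

Code which calls write() once and ignores a short count or an EINTR failure is the sort of application bug that makes interruptible sleeps risky to introduce.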

Given the highly obnoxious nature of unkillable processes, one would think that interruptible sleeps should be used whenever possible. The problem with that idea is that, in many cases, the introduction of interruptible sleeps is likely to lead to application bugs. As recently noted by Alan Cox:

> Unix tradition (and thus almost all applications) believe file store writes to be non signal interruptible. It would not be safe or practical to change that guarantee.

So it would seem that we are stuck with the occasional blocked-and-immortal process forever.

Or maybe not. A while back, Matthew Wilcox realized that many of these concerns about application bugs do not really apply if the application is about to be killed anyway. It does not matter if the developer thought about the possibility of an interrupted system call if said system call is doomed to never return to user space. So Matthew created a new sleeping state, called TASK_KILLABLE; it behaves like TASK_UNINTERRUPTIBLE with the exception that fatal signals will interrupt the sleep.

With TASK_KILLABLE comes a new set of primitives for waiting for events and acquiring locks:

	int wait_event_killable(wait_queue_head_t queue, condition);
	long schedule_timeout_killable(signed long timeout);
	int mutex_lock_killable(struct mutex *lock);
	int wait_for_completion_killable(struct completion *comp);
	int down_killable(struct semaphore *sem);

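For each of these functions except schedule_timeout_killable(), the return value will be zero for a normal, successful return, or a negative error code in case of a fatal signal. In the latter case, kernel code should clean up and return, enabling the process to be killed. schedule_timeout_killable() instead follows the schedule_timeout() convention of returning the remaining timeout, in jiffies.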
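The calling pattern might be sketched as follows. This is not code from the kernel: so that the fragment can be compiled outside of it, struct mutex, mutex_lock_killable(), and mutex_unlock() are replaced here by user-space stand-ins which only mimic the zero-or-negative return convention, and struct my_dev, my_dev_update(), and fake_fatal_signal are hypothetical names invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * User-space stand-ins (assumptions, for illustration only). In the
 * kernel, struct mutex, mutex_lock_killable() and mutex_unlock() come
 * from <linux/mutex.h>; these mocks only mimic the return convention.
 */
struct mutex { bool locked; };
static bool fake_fatal_signal;	/* pretend a fatal signal arrived while sleeping */

static int mutex_lock_killable(struct mutex *lock)
{
	if (fake_fatal_signal)
		return -EINTR;	/* sleep was cut short; the lock is NOT held */
	lock->locked = true;
	return 0;
}

static void mutex_unlock(struct mutex *lock)
{
	assert(lock->locked);	/* unlocking a mutex we do not hold is a bug */
	lock->locked = false;
}

/* Hypothetical driver state protected by a mutex. */
struct my_dev {
	struct mutex lock;
	int value;
};

/* Returns 0 on success, or a negative error code if a fatal signal arrived. */
static int my_dev_update(struct my_dev *dev, int value)
{
	int ret = mutex_lock_killable(&dev->lock);

	if (ret)
		return ret;	/* nothing to undo yet: just get out of the way */

	dev->value = value;
	mutex_unlock(&dev->lock);
	return 0;
}
```

The important point is that a nonzero return means the lock was never acquired, so the error path must not touch the protected data or call mutex_unlock().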

The TASK_KILLABLE patch was merged for the 2.6.25 kernel, but that does not mean that the unkillable process problem has gone away. The number of places in the kernel (as of 2.6.26-rc8) which are actually using this new state is quite small - as in, one need not worry about running out of fingers while counting them. The NFS client code has been converted, which can only be a welcome development. But there are very few other uses of TASK_KILLABLE, and none at all in device drivers, which is often where processes get wedged.

It can take time for a new API to enter widespread use in the kernel, especially when it supplements an existing functionality which works well enough most of the time. Additionally, the benefits of a mass conversion of existing code to killable sleeps are not entirely clear. But there are almost certainly places in the kernel which could be improved by this change, if users and developers could identify the spots where processes get hung. It also makes sense to use killable sleeps in new code unless there is some pressing reason to disallow interruptions altogether.


TASK_KILLABLE

Posted Jul 3, 2008 3:55 UTC (Thu) by jwb (guest, #15467)

This is a great idea.  Signals are the worst, stupidest part of Unix (yes, they are even more
stupid than creat) and EINTR has a long history of exposing errors in programs.  I have never
seen any program which I could confidently claim handles all signals correctly.  The nature of
the asynchronous delivery and the completely undefined state of the program which takes the
signal makes it impossible to prove or even convincingly demonstrate that Unix programs are
correct in this regard.

I'd be very happy to see Linux moving over to the BSD kqueue API, where signals are handled in
a program's main i/o loop instead of being delivered to magical handlers.  This greatly
simplifies the programming and makes it possible to have confidence in the correctness of a
program.

TASK_KILLABLE

Posted Jul 3, 2008 5:17 UTC (Thu) by zlynx (guest, #2285)

signalfd is what you're looking for, although I'm not clear on its merge status.

signalfd

Posted Jul 3, 2008 13:43 UTC (Thu) by i3839 (guest, #31386)

Assuming the signalfd manpage can be trusted, RTFM. ;-)

> signalfd()  is available on Linux since kernel 2.6.22.  Working support
> is provided in glibc since version 2.8.

agree

Posted Jul 3, 2008 13:27 UTC (Thu) by alex (subscriber, #1355)

I can only nod in agreement, having spent many, many hours trying to get signal handling to
behave in an above-the-OS DBT. The number of corner cases is quite surprising.

w00t for exception-throwing languages

Posted Jul 4, 2008 11:35 UTC (Fri) by walles (guest, #954)

Some languages throw exceptions when bad stuff (like an interrupted file write) happens.  They
are easy to write in and are what most things should be written in.

Some languages (like C, assembler, C++) return error codes when bad stuff happens.  These
error codes often get ignored.  And programs written in these languages often have bugs of the
kind that are mentioned in this article.  These languages are good for writing kernels and not
that much else.

Now, if people could just stick to the first kind of language for their user-space apps we
wouldn't be having these problems.  If people could start writing their *kernels* in the first
kind of language, we would have even fewer problems (but there are a bunch of things that have
to be resolved before this happens).

Unfortunately, in practice, many people tend to go with the second kind of language for no
particular reason.

Things are getting better though.  So maybe 10 years from now the uninterruptible sleeps can
be removed from the kernel.

Anyway, w00t for exception-throwing (and bounds-checking and garbage-collecting) languages!

w00t for exception-throwing languages

Posted Jul 5, 2008 23:39 UTC (Sat) by giraffedata (guest, #1954)

> Things are getting better though. So maybe 10 years from now the uninterruptible sleeps can be removed from the kernel.

There will have to be advances on other fronts for us to go that far. The article covers one of the major reasons for uninterruptible sleep today: that your client can't deal with a half-finished operation. But the other reason programs do uninterruptible sleep is that they themselves lack the intelligence to finish a partially done operation -- i.e. there's nowhere to get off the highway before the destination.

I have found it takes some very hard work to make it possible to get out of something in the middle. You have to carefully decide whether to spend time and bug tolerance for that or use it for something else.

Don't forget the ten commandments...

Posted Jul 7, 2008 5:14 UTC (Mon) by walles (guest, #954)

... for C programmers. Let me quote the sixth commandment for you:

"
If a function be advertised to return an error code in the event of difficulties, thou shalt check for that code, yea, even though the checks triple the size of thy code and produce aches in thy typing fingers, for if thou thinkest ``it cannot happen to me'', the gods shall surely punish thee for thy arrogance.
"

Writing broken code for yourself is OK. Writing broken code for others is not.

w00t for exception-throwing languages

Posted Jul 8, 2008 17:56 UTC (Tue) by leoc (guest, #39773)

> Some languages throw exceptions when bad stuff (like an interrupted file write) happens.

Yes, but bad programmers use those languages too, and seem to actively enjoy doing things like wrapping large amounts of logic with empty or otherwise useless exception handlers.

w00t for exception-throwing languages

Posted Jul 10, 2008 11:02 UTC (Thu) by renox (guest, #23785)

That's *very* different: catching all possible cases is difficult for C programmers, so it's
also difficult for code review to find the missing cases.

It's much easier to catch those guilty of wrapping their code with an empty catch{} and send
them off to do something other than programming.

w00t for exception-throwing languages

Posted Jul 8, 2022 13:55 UTC (Fri) by NachoGomez (guest, #159570)

> If people could start writing their *kernels* in the first kind of language, we would have even fewer problems (but there are a bunch of things that have to be resolved before this happens).

Well, 14 years later Linus Torvalds hinted at the use of Rust in the Linux kernel by 2023, so it seems the bunch of things are going to be resolved soon ;-)

https://thenewstack.io/rust-in-the-linux-kernel-by-2023-l...

TASK_KILLABLE

Posted Jul 21, 2008 18:06 UTC (Mon) by mcortese (guest, #52099)

Is it really impossible to get rid of a process stuck in an UNINTERRUPTIBLE wait? What prevents the kernel from just removing it from any queue and freeing its allocated memory?

TASK_KILLABLE

Posted Jul 21, 2008 21:09 UTC (Mon) by nix (subscriber, #2304)

How can you tell what queues it's on? If it's holding a lock (which it 
generally is, or the sleep would be interruptible) it's probably doing 
that because some data structure protected by that lock is in an 
inconsistent state. That data structure may very well not be per-process. 
How do you clean it up?

(And the answer is not always 'discard it': it may hold an inode lock and 
the inode in question has dirty data associated with it. Discarding that 
would not be a good idea!)

I suspect that uninterruptible sleep will always be with us in *some* form 
(at least until every single data structure in the kernel, and every 
single code flow path, gains cleanup handlers: and we know from C++ 
exceptions just how very easy that is to make work right and how very easy 
it is to trap all code flow paths that may need cleanups of some kind. Oh, 
sorry, did I say 'easy'? That should be 'difficult'.)

TASK_KILLABLE

Posted Jul 24, 2008 17:49 UTC (Thu) by mcortese (guest, #52099)

> If it's holding a lock (which it generally is, or the sleep would be interruptible) it's probably doing that because some data structure protected by that lock is in an inconsistent state.

...whereas a KILLABLE task, despite having data in inconsistent state and not knowing how to deal with most incoming signals (and in that being much like the UNINTERRUPTIBLE variety), despite all that, it still knows how to deal with just one type of signal, kill.

Are there a lot of such tasks out there? (irony not intended, I really want to understand how much this change can improve the kernel)

TASK_KILLABLE

Posted Jul 25, 2008 20:28 UTC (Fri) by nix (subscriber, #2304)

If it's in KILLABLE state, it *does* know how to deal with signals within 
the kernel, and its in-kernel state can be cleanly unwound. The reason 
that this doesn't propagate up to userspace as an EINTR is simply that 
there is in effect a Unix guarantee that filesystem operations cannot be 
interrupted, and the vast majority of userspace code relies on this 
guarantee and will malfunction if it starts getting EINTRs from tasks. 
(This is what the old 'intr' option did, and boy were the results messy.)

That's why it only responds to SIGKILL: because SIGKILL, by definition, 
doesn't get propagated to userspace, because the process's userspace 
component is killed by the SIGKILL.