EPOLL_CTL_DISABLE, epoll, and API design

By Michael Kerrisk
October 23, 2012

In an article last week, we saw that the EPOLL_CTL_DISABLE operation proposed by Paton Lewis provides a way for multithreaded applications that cache information about file descriptors to safely delete those file descriptors from an epoll interest list. For the sake of brevity, in the remainder of this article we'll use the term "the EPOLL_CTL_DISABLE problem" to label the underlying problem that EPOLL_CTL_DISABLE solves.

This article revisits the EPOLL_CTL_DISABLE story from a different angle, with the aim of drawing some lessons about the design of the APIs that the kernel presents to user space. The initial motivation for pursuing this angle arises from the observation that the EPOLL_CTL_DISABLE solution has some difficulties of its own. It is neither intuitive (it relies on some non-obvious details of the epoll implementation) nor easy to use. Furthermore, the solution is somewhat limiting, since it forces the programmer to employ the EPOLLONESHOT flag. Of course, these difficulties arise at least in part because EPOLL_CTL_DISABLE is designed so as to satisfy one of the cardinal rules of Linux development: interface changes must not break existing user-space applications.

If there had been an awareness of the EPOLL_CTL_DISABLE problem when the epoll API was originally designed, it seems likely that a better solution would have been built, rather than bolting on EPOLL_CTL_DISABLE after the fact. Leaving aside the question of what that solution might have been, there's another interesting question: could the problem have been foreseen?

One might suppose that predicting the EPOLL_CTL_DISABLE problem would have been quite difficult. However, the synchronized-state problem is well known and the epoll API was designed to be thread friendly. Furthermore, the notion of employing a user-space cache of the ready list to prevent file descriptor starvation was documented in the epoll(7) man page (see the sections "Example for Suggested Usage" and "Possible Pitfalls and Ways to Avoid Them") that was supplied as part of the original implementation.

In other words, almost all of the pieces of the puzzle were known when the epoll API was designed. The one fact whose implications might not have been clear was the presence of a blocking interface (epoll_wait()) in the API. One wonders if more review (and building of test applications) as the epoll API was being designed might have uncovered the interaction of epoll_wait() with the remaining well-known pieces of the puzzle, and resulted in a better initial design that addressed the EPOLL_CTL_DISABLE problem.

So, the first lesson from the EPOLL_CTL_DISABLE story is that more review is necessary in order to create better API designs (and we'll see further evidence supporting that claim in a moment). Of course, the need for more review is a general problem in all aspects of Linux development. However, the effects of insufficient review can be especially painful when it comes to API design. The problem is that once an API has been released, applications come to depend on it, and it becomes at the very least difficult, or, more likely, impossible to later change the aspects of the API's behavior that applications depend upon. As a consequence, a mistake in API design by one kernel developer can create problems that thousands of user-space developers must live with for many years.

A second lesson about API design can be found in a comment that Paton made when responding to a question from Andrew Morton about the design of EPOLL_CTL_DISABLE. Paton was speculating about whether a call of the form:

    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &epoll_event);

could be used to provide the required functionality. The EPOLL_CTL_DEL operation does not currently use the fourth argument of epoll_ctl(), and applications should specify it as NULL (but more on that point in a moment). The idea would be that "epoll_ctl [EPOLL_CTL_DEL] could set a bit in epoll_event.events (perhaps called EPOLLNOTREADY)" to notify the caller that the file descriptor was in use by another thread.

But Paton noted a shortcoming of this approach:

However, this could cause a problem with any legacy code that relies on the fact that the epoll_ctl epoll_event parameter is ignored for EPOLL_CTL_DEL. Any such code which passed an invalid pointer for that parameter would suddenly generate a fault when running on the new kernel code, even though it worked fine in the past.

In other words, although the EPOLL_CTL_DEL operation doesn't use the epoll_event argument, the caller is not required to specify it as NULL. Consequently, existing applications are free to pass random addresses in epoll_event. If the kernel now started using the epoll_event argument for EPOLL_CTL_DEL, it seems likely that some of those applications would break. Even though those applications might be considered poorly written, that's no justification for breaking them. Quoting Linus Torvalds:

We care about user-space interfaces to an insane degree. We go to extreme lengths to maintain even badly designed or unintentional interfaces. Breaking user programs simply isn't acceptable.

The lesson here is that when an API doesn't use an argument, usually the right thing to do is for the implementation to include a check that requires the argument to have a suitable "empty" value, such as NULL or zero. Failure to do that means that we may later be prevented from making the kind of API extensions that Paton was talking about. (We can leave aside the question of whether this particular extension to the API was the right approach. The point is that the option to pursue this approach was unavailable.) The kernel-user-space API provides numerous examples of failure to do this sort of checking.
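
As an illustration of the kind of check meant here, consider the following minimal sketch. It does not reproduce any real kernel code: the names demo_ctl(), DEMO_ADD, DEMO_DEL, and struct demo_event are invented for this example, and the function merely follows the in-kernel convention of returning zero or a negative errno value.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations for an illustrative interface (not a real API). */
enum demo_op { DEMO_ADD = 1, DEMO_DEL = 2 };

/* Hypothetical argument structure, loosely modeled on struct epoll_event. */
struct demo_event {
    unsigned int events;
};

/* Returns 0 on success or a negative errno value on failure. */
int demo_ctl(enum demo_op op, int fd, const struct demo_event *ev)
{
    if (fd < 0)
        return -EBADF;

    switch (op) {
    case DEMO_ADD:
        /* DEMO_ADD needs the argument, so it must be supplied. */
        if (ev == NULL)
            return -EFAULT;
        return 0;           /* registration would happen here */

    case DEMO_DEL:
        /*
         * DEMO_DEL has no use for the argument today. Insisting on
         * NULL means no caller can come to depend on passing
         * garbage, which keeps the argument free for later use.
         */
        if (ev != NULL)
            return -EINVAL;
        return 0;           /* removal would happen here */

    default:
        return -EINVAL;
    }
}
```

With a check like this in place from the first release, giving the argument a meaning for the delete operation later on could not break a correctly working caller, since every such caller is already passing NULL.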

However, there is yet more life in this story. Although there have been many examples of system calls that failed to check that "empty" values were passed for unused arguments, it turns out that epoll_ctl(EPOLL_CTL_DEL) fails to include the check for another reason. Quoting the BUGS section of the epoll_ctl() man page:

In kernel versions before 2.6.9, the EPOLL_CTL_DEL operation required a non-NULL pointer in event [the epoll_event argument], even though this argument is ignored. Since Linux 2.6.9, event can be specified as NULL when using EPOLL_CTL_DEL. Applications that need to be portable to kernels before 2.6.9 should specify a non-NULL pointer in event.

In other words, applications that use EPOLL_CTL_DEL are not only permitted to pass random values in the epoll_event argument: if they want to be portable to Linux kernels before 2.6.9 (which fixed the problem), they are required to pass a pointer to some random, but valid user-space address. (Of course, most such applications would simply allocate an unused epoll_event structure and pass a pointer to that structure.) Here, we're back to the first lesson: more review of the initial epoll API design would almost certainly have uncovered this fairly basic design error. (It's this writer's contention that one of the best ways to conduct that sort of review is by thoroughly documenting the API, but he admits to a certain bias on this point.)
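
For an application in that position, a small wrapper along the following lines is one way to remain portable. This is only a sketch: the helper name epoll_del_portable() is invented here and is not part of any library, while epoll_ctl(), EPOLL_CTL_DEL, and struct epoll_event are the standard interfaces from <sys/epoll.h>.

```c
#include <sys/epoll.h>

/*
 * Remove fd from the interest list of epfd. The event argument is
 * ignored by EPOLL_CTL_DEL, but kernels before 2.6.9 rejected a NULL
 * pointer, so we pass the address of a dummy structure instead.
 * Returns the result of epoll_ctl(): 0 on success, -1 with errno set
 * on failure.
 */
int epoll_del_portable(int epfd, int fd)
{
    struct epoll_event unused = { 0 };  /* contents are never examined */

    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &unused);
}
```

Code that only needs to run on Linux 2.6.9 and later can simply pass NULL as the fourth argument.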

Failing to check that unused arguments (or unused pieces of arguments) have "empty" values can cause subtle problems long after the fact. Anyone looking for further evidence on that point does not need to go far: the epoll_ctl() system call provides another example.

Linux 3.5 added a new epoll flag, EPOLLWAKEUP, that can be specified in the epoll_event.events field passed to epoll_ctl(). The effect of this flag is to prevent the system from being suspended while epoll readiness events are pending for the corresponding file descriptor. Since this flag has a system-wide effect, the caller must have a capability, CAP_BLOCK_SUSPEND (initially misnamed CAP_EPOLLWAKEUP).

In the initial EPOLLWAKEUP implementation, if the caller did not have the CAP_BLOCK_SUSPEND capability, then epoll_ctl() returned an error so that the caller was informed of the problem. However, Jiri Slaby reported that the new flag caused a regression: an existing program failed because it was setting formerly unused bits in epoll_event.events when calling epoll_ctl(). When one of those bits acquired a meaning (as EPOLLWAKEUP), the call failed because the program lacked the required capability. The problem of course is that epoll_ctl() has never checked the flags in epoll_event.events to ensure that the caller has specified only flag bits that are actually implemented in the kernel. Consequently, applications were free to pass random garbage in the unused bits.
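
The missing check is a simple one. The sketch below shows the general pattern of rejecting unknown bits in a bit-mask argument; the flag names and values are hypothetical stand-ins rather than the real EPOLL* constants, and, as noted above, epoll_ctl() itself performs no such test.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits; these are not the real EPOLL* values. */
#define MASK_IN       (1u << 0)
#define MASK_OUT      (1u << 1)
#define MASK_ONESHOT  (1u << 2)

/* Every bit that the current implementation gives a meaning to. */
#define MASK_KNOWN    (MASK_IN | MASK_OUT | MASK_ONESHOT)

/*
 * Returns 0 if 'events' contains only implemented flags, or -EINVAL
 * if any other bit is set.
 */
int mask_check_events(uint32_t events)
{
    if (events & ~MASK_KNOWN)
        return -EINVAL;
    return 0;
}
```

When a new flag is later added, its bit is added to MASK_KNOWN. No existing application can have been relying on that bit, because every earlier call that set it would have failed.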

When one of those random bits suddenly caused the application to fail, what should be done? Following the logic outlined above, of course the answer is that the kernel must change. And that is exactly what happened in this case. A patch was applied so that if the EPOLLWAKEUP flag was specified in a call to epoll_ctl() and the caller did not have the CAP_BLOCK_SUSPEND capability, then epoll_ctl() silently ignored the flag instead of returning an error. Of course, in this case, the calling application might easily carry on, unaware that the request for EPOLLWAKEUP semantics had been ignored.
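
The behavior after the patch can be modeled roughly as follows. This is a user-space sketch of the logic described above, not the kernel's code: the function name, the SKETCH_WAKEUP bit, and the boolean standing in for the capability check are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit standing in for EPOLLWAKEUP. */
#define SKETCH_WAKEUP (1u << 29)

/*
 * Returns the event mask that would actually be honored.
 * 'has_block_suspend' stands in for a CAP_BLOCK_SUSPEND check.
 */
uint32_t sketch_filter_events(uint32_t requested, bool has_block_suspend)
{
    /* Without the capability, the flag is dropped and no error is reported. */
    if ((requested & SKETCH_WAKEUP) && !has_block_suspend)
        requested &= ~SKETCH_WAKEUP;

    return requested;
}
```

Since the caller sees success either way, an application that really depends on the wakeup semantics would presumably need to verify for itself that it holds the capability.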

One might observe that there is a certain arbitrariness about the approach taken to dealing with the EPOLLWAKEUP breakage. Taken to the extreme, this type of logic would say that the kernel can never add new flags to APIs that didn't hitherto check their bit-mask arguments—and there is a long list of such system calls (mmap(), splice(), and timer_settime(), to name just a few). Nevertheless, new flags are added. So, for example, Linux 2.6.17 added the epoll event flag EPOLLRDHUP, and since no one complained about a broken application, the flag remained. It seems likely that the same would have happened for the original implementation of EPOLLWAKEUP that returned an error when CAP_BLOCK_SUSPEND was lacking, if someone hadn't chanced to make an error report.

As an aside to the previous point, in cases where someone reports a regression after an API change has been officially released, there is a conundrum. On the one hand, there may be old applications that depend on the previous behavior; on the other hand, newer applications may already depend on the newly implemented change. At that point, there is no simple remedy: to fix things almost certainly means that some applications must break.

We can conclude with two observations, one specific, and the other more general. The specific observation is that, ironically, EPOLL_CTL_DISABLE itself seems to have had surprisingly little review before being accepted into the 3.7 merge window. And in fact, now that more attention has been focused on it, it looks as though the proposed API will see some changes. So, we have a further, very current, piece of evidence that there is still insufficient review of kernel-user-space APIs.

More generally, the problem seems to be that—while the kernel code gets reviewed on many dimensions—it is relatively uncommon for kernel-user-space APIs to be reviewed on their own merits. The kernel has maintainers for many subsystems. By now, the time seems ripe for there to be a kernel-user-space API maintainer—someone whose job it is to actively review and ack every kernel-user-space API change, and to ensure that test cases and sufficient documentation are supplied with the implementation of those changes. Lacking such a maintainer, it seems likely that we'll see many more cases where kernel developers add badly designed APIs that cause years of pain [PDF] for user-space developers.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 23, 2012 20:27 UTC (Tue) by Yorick (guest, #19241) [Link] (13 responses)

This is unfortunate. There is nothing more important to review than immutable interfaces (except for security audits). I'm going to be grossly unfair here, but in neglecting basic software engineering principles, the developers here come out as bumbling amateurs. (Of course I've made similar mistakes myself, but with slightly less severe consequences.)

With a cast-in-stone policy, these APIs must be subject to extreme scrutiny. I would like to go further than the editor of the excellent article: There should be working applications present, not just tiny test cases or proofs of concept, in addition to full documentation, before any proposals can be accepted. It's not just a matter of verifying that the APIs work, but that they are useful and complete as well.

The importance of checking unused parameter bits was learned dearly in the 1960s, over and over again, both for hardware and software. We should know this by now.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 23, 2012 20:30 UTC (Tue) by dlang (guest, #313) [Link] (8 responses)

even working applications aren't going to identify all the drawbacks and problems with an API.

The early applications that use an API are going to be written by people who are very familiar with the API and know not only what the API does, but how it _should_ be used.

a few years later, you get applications developed by people who don't know how it _should_ be used, and as they start trying to use it in other ways (some of them very good ways), they expose limitations that the people who were involved with the development, testing, and reviews of the API missed.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 23, 2012 21:01 UTC (Tue) by Yorick (guest, #19241) [Link] (7 responses)

You are right, but there is not much we can do about that other than devising more sophisticated mechanisms for API evolution than the rather simplistic policy in Linux. User-space libraries (with versioned symbols or similar ways of evolving APIs) provide one possible answer, but from what I understand, the kernel maintainers aren't too keen on such layers.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 23, 2012 21:07 UTC (Tue) by dlang (guest, #313) [Link] (6 responses)

I disagree.

If applications will be broken, then the API shouldn't change.

If there are no applications using an API, that API can be removed.

But if applications are using that API, creating new versions of the API doesn't solve the problem. Existing applications are not going to disappear.

This attitude that "the APIs are versioned, we can drop support for old versions" is exactly what's causing the problems in the Linux Desktop Environment world.

It doesn't matter how justified you think you are in getting rid of an old version, if it breaks users it's a regression and you should not do so.

Maintaining lots of different, incompatible versions of an API is a huge amount of work, so just versioning the API isn't nearly enough, and it's questionable if it really helps in the long run.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 23, 2012 23:04 UTC (Tue) by nybble41 (subscriber, #55106) [Link] (1 responses)

In cases like this one, where the problem is just unused parameters which weren't checked, it seems like a simple enough thing to fix. Just create a new API for new applications that follow the rules, and make the old one forward to the new one after masking out the unused parts. For example, epoll_ctl(epfd, op, fd, event) could call epoll_ctl_v2(epfd, op, fd, NULL), ignoring event; anything that wants to use the last parameter would call epoll_ctl_v2 directly. The same principle would work for bitfields. The new API could easily be made automatic for new programs through versioned library symbols.

However, I agree that in general maintaining multiple API versions, where the old one is not a simple subset of the new one, is not likely to go well.
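
The forwarding idea described in this comment might be sketched as follows. Everything here is hypothetical: fwd_ctl_v1() and fwd_ctl_v2() are invented stand-ins for the old and new entry points, and the sketch discards the argument only for the delete operation, on the assumption that the add operation still needs it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations and argument type for illustration only. */
enum fwd_op { FWD_ADD = 1, FWD_DEL = 2 };

struct fwd_event {
    unsigned int events;
};

/* New, strict entry point: unused arguments must be NULL. */
int fwd_ctl_v2(int op, int fd, struct fwd_event *ev)
{
    if (fd < 0)
        return -EBADF;
    if (op == FWD_DEL)
        return ev == NULL ? 0 : -EINVAL;
    if (op == FWD_ADD)
        return ev != NULL ? 0 : -EFAULT;
    return -EINVAL;
}

/*
 * Old entry point: preserves the historical lax behavior by throwing
 * away the argument that FWD_DEL always ignored, then forwarding.
 */
int fwd_ctl_v1(int op, int fd, struct fwd_event *ev)
{
    if (op == FWD_DEL)
        ev = NULL;
    return fwd_ctl_v2(op, fd, ev);
}
```

Old binaries keep calling the v1 entry point and see no change, while newly built programs bind to the strict v2 entry point, which leaves room for the argument to acquire a meaning later.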

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 25, 2012 9:42 UTC (Thu) by dgm (subscriber, #49227) [Link]

> However, I agree that in general maintaining multiple API versions, where the old one is not a simple subset of the new one, is not likely to go well.

For that reason we should be putting much more effort into API design. In fact, taking into account that an API _will_ outlive its implementation (maybe several of them), most of the effort should go towards getting the API right.

Re API (and ABI) design and maintenance

Posted Oct 26, 2012 18:31 UTC (Fri) by davecb (subscriber, #1574) [Link] (3 responses)

Creating new versions is one *step* in a disciplined (anal) process of evolution. It's been done continuously since the days of the mainframe, but people don't realize it's happening.

I spent three years at the (late, lamented) Sun Microsystems doing ABI stability, and ended up sensitized enough that I notice it when it happens.

An easy example was a company's linker, which needed a different data structure for a method call that it had. We added a new "record type" in the linker, and converted the compilers to produce only it. We supported the old format for a year, then made its use produce a warning, and a year after that, made it require a link-line option.

We got two complaints, total, about the option. Out of all our customers, only two were so very very far behind that they ran one of the old compilers. A year or two later a hardware change made those compilers produce impossibly lousy code, and the two outliers upgraded to the new compilers. Then we retired the old interface.

If we had moved any faster, we would have annoyed at least two customers. If we had moved any slower in switching the compilers over, we would have made the OS developers unhappy. The time between the first switch and the final stages of the retirement made the maintainers unhappy, but there weren't many bugs in that interface, nor were there many users of it, so the time cost of maintaining it was low.

We did have to manage it, and we had to do a fair bit of work behind the scenes to make it invisible to the users, but we succeeded at evolution.

Just as with humans, if you don't evolve, you might just die out. Homo habilis, anyone?

--dave

Re API (and ABI) design and maintenance

Posted Nov 13, 2012 12:39 UTC (Tue) by k3ninho (subscriber, #50375) [Link] (2 responses)

> We did have to manage it, and we had to do a fair bit of work behind the scenes to make it invisible to the users, but we succeeded at evolution.

I suspect that Linus' view on never, ever, binning old binary interfaces will mean that there will be (un)dead interfaces supported forever. My thoughts on a management plan look like:
(*) build a test suite round existing, in-use interfaces to maintain their intended functionality
(*) build a versioning API where a version-aware program can call in to use versioned interfaces
(*) attach version info to the existing APIs and handle a 'this interface is not implemented in your version of Linux' error
(*) develop a plumbing metalanguage to support disabled/legacy interface functionality via newer interfaces
(*) have all the interfaces configurable in the makefile, defaulting to enabled
(*) stop talking and show you some code

K3n.

Re API (and ABI) design and maintenance

Posted Nov 15, 2012 17:15 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

That's pointless. With memory sizes continuing to shoot up as they are, the memory overhead of maintaining old interfaces is minimal: and as long as the interfaces have no significant maintenance overhead, why not maintain them? I mean, Linux still supports uselib(), which is IIRC useless unless you're still using a.out shared libraries and libc4!

Re API (and ABI) design and maintenance

Posted Nov 19, 2012 6:44 UTC (Mon) by k3ninho (subscriber, #50375) [Link]

This article itself supplies an example where an older interface design has caused internal plumbing designs to be less than the best. This will happen again, and will cause a kind of scarring to the structure of the Linux Kernel. Wouldn't you like to be able to offer end-users of the Kernel a clean set of interfaces, easy to comprehend and unhindered by history's quirks?

K3n.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 24, 2012 4:25 UTC (Wed) by daniels (subscriber, #16193) [Link] (3 responses)

> the developers here come out as bumbling amateurs

Err? They made a mistake, which in the context of something as complex and genuinely impressive as the Linux kernel, can be forgiven. How many kernels which scale from embedded to clusters have you written?

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 24, 2012 10:40 UTC (Wed) by Yorick (guest, #19241) [Link] (2 responses)

As the article points out, changes to the internal kernel code tend to get much more scrutiny than those that affect user-visible interfaces, despite the fact that those are the ones that really need to be right from the start. That comes out as sloppy, but is really more a sign of a development process in the need of improvement.

When we suffer from badly thought-out APIs made by Microsoft, say, we pour scorn on the developers who designed the mess and call them incompetent, fairly or not. Linux kernel programmers are not exempt and should not be.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 24, 2012 10:51 UTC (Wed) by mkerrisk (subscriber, #1978) [Link] (1 responses)

> That comes out as sloppy, but is really more a sign of a development process in the need of improvement.

Yes. Having watched what goes on for quite a while now, I consider this mainly a process problem, rather than a problem of individual developers (though obviously some do a better job than others).

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 25, 2012 9:45 UTC (Thu) by dgm (subscriber, #49227) [Link]

A good start would be to find out how it is that some do a better job than others, and how we can get that to happen more often.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 24, 2012 16:30 UTC (Wed) by mjthayer (guest, #39183) [Link] (10 responses)

This should probably have been a comment on the previous article, but I'm not sure if anyone is watching that any more.

I am sure that there is a good reason why it won't work, but couldn't the original problem be solved, in user space, if the user space file descriptor cache included not just a "should be deleted" flag, but also a reference count of threads currently using a file descriptor? Then, before accessing the descriptor, a thread could check the "should be deleted" flag and if it is set decrease the reference count instead of accessing it, freeing the resources if the count reached zero.
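
A rough sketch of the scheme proposed in this comment, using C11 atomics, might look like the following. All of the names are invented for illustration. The sketch assumes that a thread takes its reference while the entry is still known to be alive (for example, under whatever lock protects the cache), and it does not by itself close the window between epoll_wait() returning and the reference being taken, which is the crux of the original problem.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space cache entry for one file descriptor. */
struct fd_entry {
    int fd;
    atomic_int refs;     /* the cache itself plus each thread using the entry */
    atomic_bool doomed;  /* the "should be deleted" flag */
};

struct fd_entry *entry_new(int fd)
{
    struct fd_entry *e = malloc(sizeof *e);

    if (e == NULL)
        return NULL;
    e->fd = fd;
    atomic_init(&e->refs, 1);        /* reference held by the cache */
    atomic_init(&e->doomed, false);
    return e;
}

/* Take a reference; the caller must know the entry is still alive. */
void entry_get(struct fd_entry *e)
{
    atomic_fetch_add(&e->refs, 1);
}

/* Drop a reference; returns true if it was the last one and e was freed. */
bool entry_put(struct fd_entry *e)
{
    if (atomic_fetch_sub(&e->refs, 1) == 1) {
        free(e);
        return true;
    }
    return false;
}

/* Request deletion: set the flag and drop the cache's own reference.
   Must be called at most once per entry. */
bool entry_mark_deleted(struct fd_entry *e)
{
    atomic_store(&e->doomed, true);
    return entry_put(e);
}

/*
 * Called by a thread holding a reference before it touches e->fd.
 * Returns true if the descriptor may be used; otherwise the caller's
 * reference has been dropped and e must not be touched again.
 */
bool entry_try_use(struct fd_entry *e)
{
    if (atomic_load(&e->doomed)) {
        entry_put(e);
        return false;
    }
    return true;
}
```

A worker that gets true from entry_try_use() would call entry_put() itself once it has finished with the descriptor.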

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 25, 2012 0:02 UTC (Thu) by kjp (guest, #39639) [Link] (9 responses)

See my comment about 'weak reference' in the comments of the original article. I agree that this looks fundamentally like a solution in search of a problem.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 25, 2012 21:57 UTC (Thu) by Jandar (subscriber, #85683) [Link] (8 responses)

I had missed this comment in the original article; reading LWN too early has its disadvantages. There should be a way to step later through unread comments.

The solution using the cookie is simple and elegant. I don't understand the comments about not using the cookie because someone would like to use it otherwise. This line of reasoning means nobody should use the cookie.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 25, 2012 22:03 UTC (Thu) by dlang (guest, #313) [Link] (7 responses)

take a look at https://lwn.net/Comments/unread

it shows you comments for all stories that you haven't read since the last time you viewed that page (and that you haven't read by going to the specific article page)

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 26, 2012 0:21 UTC (Fri) by Jandar (subscriber, #85683) [Link] (2 responses)

Nearly all content of https://lwn.net/Comments/unread I have already seen.

I read LWN in the "One big page" mode and go to the comments (in a new tab) with the "Comments (xxx posted)" button. After I have read the comments for one article I close the tab. Is there something I can do to make Comments/unread more useful?

What I would really like would be some means to hop from one unread comment to the next within the complete comment-section to see the surrounding context. It could be a link at each unread comment pointing to an anchor at the next. E.g. http://lwn.net/Articles/520198/#Comments-UnRead42.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 26, 2012 0:46 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

If you were logged in when you saw those other comments, and still see them on the /unread page, there is a bug that you should send to lwn@lwn.net

However, I suspect that if you go to the unread page again, you will not see all those comments any more and you will find it much more useful.

you should also look at the greasemonkey script for lwn, I think it does more of what you are looking for.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 26, 2012 1:22 UTC (Fri) by Jandar (subscriber, #85683) [Link]

I was logged in while reading this weekly edition. Next week I'll run a few tests and see whether the bug is reproducible.

I use konqueror not firefox but greasemonkey with fancyLWNComments seems to be a reason to switch for reading LWN. Thanks for pointing me to it.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 26, 2012 5:01 UTC (Fri) by dirtyepic (guest, #30178) [Link] (3 responses)

What would be really useful is a way to mark an article as watched and have any new comments on it sent by email. Seeing all unread comments is just too much noise for me.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 26, 2012 6:00 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

if you are reading while logged in, you should see that when you revisit an article, new posts show up with an orange border while old posts show up with a grey border.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 26, 2012 17:03 UTC (Fri) by Jandar (subscriber, #85683) [Link]

For me there is no visual difference between old and new comments. In "LWN account customization" there is a setting "Old (seen) comment background color" but this doesn't work, neither in konqueror nor in iceweasel.

Greasemonkey's fancyLWNComments is the first working method to tell read and unread apart.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 27, 2012 2:54 UTC (Sat) by dirtyepic (guest, #30178) [Link]

Yep, and that's handy. But I'd like a better way to follow comments than bookmarking individual articles and scanning through the whole list periodically.

EPOLL_CTL_DISABLE, epoll, and API design

Posted Oct 25, 2012 9:05 UTC (Thu) by ncm (guest, #165) [Link]

I'm coming around to wahern's position noted in comments on the previous article: it's an architectural design error for threads to share a resource but not keep track of which are using it. If you design that way, no amount of EPOLL_CTL_DISABLE can make your design good. No end of similar problems will be waiting to surface.

I'm also much more impressed with kjp's solution than is our esteemed author. It attacks the problem at the root, even enabling rescue of such poorly architected designs (anyway until the next flaw uncloaks). Using the field suggested doesn't "burn" it: other uses can piggyback on the same hash node. By such reasoning any use at all would burn it, so no use can be deserving enough, and it never gets used for anything.

Unless I am misunderstanding the argument...

EPOLL_CTL_DISABLE, epoll, and API design

Posted Nov 1, 2012 0:45 UTC (Thu) by kevinm (guest, #69913) [Link] (1 responses)

> By now, the time seems ripe for there to be a kernel-user-space API maintainer: someone whose job it is to actively review and ack every kernel-user-space API change, and to ensure that test cases and sufficient documentation are supplied with the implementation of those changes.

Hark, is that the sound of volunteering? ;)

EPOLL_CTL_DISABLE, epoll, and API design

Posted Nov 1, 2012 13:19 UTC (Thu) by Karellen (subscriber, #67644) [Link]

I thought we already had one, and his name was Linus