Toward a kernel events interface

Last week's article on network channels suggested that channels might not be the way of the future at all. Since then, there has been a great deal of discussion on how networking might move forward on many levels, some of which might yet include channels. Your editor plans to gain an understanding of the Grand Unified Flow Cache and related concepts (such as Rusty's plans to thrash up netfilter yet again) for a future article; for now, we'll look at a different aspect of networking (and beyond): a user-space events interface.

Unlike some other operating systems, Linux currently lacks a system call for generalized event reporting. Linux applications, instead, use calls like poll() to figure out when there is work to be done. Unfortunately, poll() does not solve the entire problem, so application event loops must resort to complicated workarounds to deal with things like signals. Handling asynchronous I/O within a traditional Linux event loop can be especially tricky. If there were a single interface which provided an application with all of the event information it needed, applications would get simpler. There is also the potential for significant performance improvements.

There are two active proposals for event interfaces for Linux: the kevent mechanism and the event channel API proposed by Ulrich Drepper at this year's Ottawa Linux Symposium. Of the two, kevents currently have the advantage for one simple reason: there is an existing, working implementation to look at. So most of the discussion has concerned how kevents can be improved.

The original kevent API is seen as being a bit difficult; it relies on a single multiplexer system call (kevent_ctl()), an approach which is generally frowned upon. The call also requires the application to construct an array with two different types of structures, which is a bit awkward. So one of the first suggestions has been to separate out various parts of the API. The current kevent patch (as of August 1) contains a new system call:

    int kevent_get_events(int ctl_fd,
                          unsigned int min_nr,
                          unsigned int max_nr,
                          unsigned int timeout,
                          void *buf,
                          unsigned flags);

This call would return between min_nr and max_nr events, storing them sequentially in buf, subject to the given timeout (specified in milliseconds). The flags argument is unused in the current implementation.
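
Since kevent_get_events() exists only in the kevent patch, glibc provides no wrapper, so an application would have to reach it through syscall(). The following is a minimal usage sketch; the syscall number and the event-record layout are placeholders invented for illustration, not the patch's actual definitions.

    /* Sketch only: the syscall number and event layout are placeholders. */
    #include <sys/syscall.h>
    #include <unistd.h>

    #define __NR_kevent_get_events  320     /* placeholder, not the real number */

    struct ukevent_stub {                   /* stand-in for the patch's event record */
        unsigned int type;                  /* event source: socket, timer, AIO, ... */
        unsigned int event;                 /* what happened */
        unsigned int id;                    /* which object it happened to */
        void        *user_data;             /* cookie supplied at registration time */
    };

    extern void handle_event(struct ukevent_stub *ev);   /* application-defined */

    void event_loop(int ctl_fd)
    {
        struct ukevent_stub events[64];

        for (;;) {
            /* Block until at least one event is available, take up to 64,
               and give up after 1000 ms. */
            long n = syscall(__NR_kevent_get_events, ctl_fd, 1U, 64U,
                             1000U, events, 0U);
            if (n <= 0)
                continue;                   /* timeout or error */
            for (long i = 0; i < n; i++)
                handle_event(&events[i]);
        }
    }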

There are a number of things which might be improved with this interface, but, as it happens, its final form is likely to look quite different. The current interface still requires frequent system calls to retrieve events; Linux system calls are fast, but, in a high-bandwidth situation, it still would be preferable to spend more time in user space if possible. With a different approach to event reporting, it might just be possible.

The idea which has been discussed is to map an array of kevent structures between kernel and user space. This array would be treated as a circular buffer, perhaps managed using a cache-friendly, channel-like index mechanism. The kernel would place events into the buffer when they occur, and user-space would consume them. Whenever there are events to process, the application could obtain them without entering the kernel at all. Once this mechanism is in place, the kevent_get_events() call could go away, replaced by a simple "wait for events" interface (though glibc would almost certainly provide a synchronous "get events" function). The result should be a very fast interface, especially when the number of events is large.
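
No ring layout had been agreed upon when this was written, so the following consumer loop is only a sketch under assumed names: the ring structure, the kevent_wait() call, and the handle_event() helper are all hypothetical, and the memory barriers a real implementation would need are omitted.

    /* Sketch only: the structure layout and kevent_wait() are invented. */
    struct kevent_slot_stub {               /* placeholder event record */
        unsigned int type, event, id;
        void        *user_data;
    };

    struct kevent_ring_stub {
        volatile unsigned int head;         /* written by the kernel (producer) */
        volatile unsigned int tail;         /* written by the application (consumer) */
        unsigned int          size;         /* number of slots, a power of two */
        struct kevent_slot_stub ring[];     /* the events themselves */
    };

    extern int  kevent_wait(int ctl_fd, unsigned int timeout);   /* hypothetical */
    extern void handle_event(struct kevent_slot_stub *ev);

    void consume(int ctl_fd, struct kevent_ring_stub *r)
    {
        for (;;) {
            /* Drain everything currently queued without entering the kernel. */
            while (r->tail != r->head) {
                handle_event(&r->ring[r->tail & (r->size - 1)]);
                r->tail++;                  /* publish consumer progress */
            }
            /* Only when the ring is empty do we pay for a system call,
               sleeping until the kernel produces more events. */
            kevent_wait(ctl_fd, 1000);
        }
    }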

There are a couple of issues to be worked out, still. One has to do with what happens when the buffer fills. The current asynchronous I/O interface does not allow there to be more outstanding operations than there are available control block structures; that way, there is guaranteed to be space to report on the status of each operation. That can be important, since the place in the kernel which wants to do the reporting is often running at software or hardware interrupt level. If one envisions using kevents to track thousands of open sockets, an unknown number of connection events, etc., however, preallocating all of the event structures becomes increasingly impractical. So something intelligent will have to be done when the buffer fills.

The other issue has to do with "level-triggered" events which correspond more to a specific status than a real event which has occurred. "This socket can be written to" is such an event. When an interface like poll() is used to query whether a write would block, the kernel can check the status and return immediately if the given file descriptor can be written to. Reporting this sort of status through a circular buffer is rather harder to do. So, one way or another, applications will have to explicitly poll for such events.
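
As a concrete illustration, that kind of explicit status check can already be expressed with poll() and a zero timeout; a minimal sketch:

    #include <poll.h>

    /* Returns non-zero if fd could be written to right now without blocking.
       This is the sort of level-triggered status query that would still be
       made explicitly rather than delivered through the event ring. */
    static int fd_is_writable(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };

        /* A timeout of 0 samples the current status and never sleeps. */
        return poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLOUT);
    }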

Given the current level of interest, some way of dealing with these issues seems likely to surface in the near future. That could clear the path for merging kevents into the mainline, perhaps as early as 2.6.20.


Why *just* a ring buffer?

Posted Aug 3, 2006 6:54 UTC (Thu) by AnswerGuy (guest, #1256) [Link] (1 responses)

So, regarding the level-triggering issue, why have just a ring buffer?

Why not have a structure which consists of one fixed region (set of flags, a fixed array of some sort) ... and a ring buffer?

For example, one might have a bit vector representing the clear/blocking state of each file descriptor. These bits would be updated by the kernel as it maintains this memory mapping (along with the head/tail pointers of the ring, of course). Then checking for "would block" is simply a matter of a scaled array dereference and some AND masking.

The question becomes: what is the optimal set of additional fields for this structure? For example, a large bit vector would be a pain to scan through when searching for "writable" descriptors, but having the kernel update some tree of indices might be gilding the lily. That tree of indices might work like this: one 32-bit value gives the status of each of 32 words, so each bit in the index says that some of the bits in the corresponding word are set. This saves you from having to loop through all 32 words checking for the first non-zero value; you can determine whether they are all clear in one operation, and find the first one that has any set bits with only one register fetch and a small number of quick operations. If you extend this structure to two layers, you can manage 32K bits in 1057 words (the top index, the 32 middle indices, and the 1024 words for the bit vector at the bottom), and finding the first available set bit would be a linear three fetches plus a handful of masking, shifting and bit-check operations. The kernel would have two extra stores and a handful of OR operations to update these; with just one index over one 32-word bit vector you could cover 1024 file descriptors using 33 words.
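
Here is a minimal sketch of the single-level variant described above (one 32-bit summary word over a 32-word bit vector, so 1024 descriptors in 33 words); the structure and function names are made up for illustration:

    #include <strings.h>        /* ffs() */

    /* bits[w] bit b is set  <=>  descriptor (w * 32 + b) is ready;
       bit w of 'summary' is set whenever bits[w] is non-zero. */
    struct fd_index_stub {
        unsigned int summary;   /* one bit per word of bits[] */
        unsigned int bits[32];  /* one bit per file descriptor */
    };

    /* The kernel-side update: two stores and a couple of ORs. */
    static void mark_ready(struct fd_index_stub *m, unsigned int fd)
    {
        m->bits[fd / 32] |= 1u << (fd % 32);
        m->summary       |= 1u << (fd / 32);
    }

    /* The user-side lookup: the first ready descriptor, or -1 if none,
       without scanning all 32 words. */
    static int first_ready(const struct fd_index_stub *m)
    {
        int w = ffs((int)m->summary);       /* 1-based position of first set bit */
        if (w == 0)
            return -1;                      /* summary clear: nothing is ready */
        return (w - 1) * 32 + ffs((int)m->bits[w - 1]) - 1;
    }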

Of course I'm only giving a silly example based on one specific use case that might arise from the need described in that penultimate paragraph. I really have no idea how many "level-triggered" events/checks would be needed in this unified event interface, nor how critical finding the first set bit in such a vector might be, nor even how many file descriptors might have to be supported.

My point is that the basic technique doesn't have to be limited to just a ring buffer.

Why *just* a ring buffer?

Posted Aug 3, 2006 16:34 UTC (Thu) by cventers (guest, #31465) [Link]

The vector you refer to would likely fall apart as you add more and more CPUs. If you have several threads doing I/O on several processors, each CPU would be busy doing atomic word updates to the same cache line, which would get flushed /constantly/. The greater the data density, the more it hurts, and a *bitmap* is as dense as you can get.

Keep in mind that I haven't actually measured the effects. It is of course very difficult at times to avoid caching issues (since even a simple lock suffers from this problem).

Performance aspects

Posted Aug 3, 2006 12:30 UTC (Thu) by pphaneuf (guest, #23480) [Link] (3 responses)

Maybe I just haven't done high enough bandwidth applications, but a system call to get the events from the kernel isn't exactly murder. Hopefully, you do a lot of work in-between the calls to the event dispatching system call.

The problem isn't the system call, really, but the copying of the events from the kernel buffers to the user buffers. In fact, when it is said that the kevent_get_events() call could go away and be replaced by a simple "wait for events" interface, what would that "interface" be, if not a system call itself? Putting a user process to sleep basically requires a system call! You could only take the system call out in the overload case, where you'd basically run circles around the ring buffer as fast as you could (which can be an interesting feature). How would one indicate to the kernel where it is in the ring buffer, and how would the user-space code know where the end is (and whether it has to go to sleep or not)? I am not familiar with the interfaces of

Is the volume of events really so large that copying them to user space is significant?

Something I would like to have, and which would reduce the system call overhead, would be a way of specifying, for a specific file descriptor, that it should not be marked readable until there is a certain amount of data. Or that it should not be marked writable until at least a certain amount of buffer space is available. This should of course be overridden if there is an error on the file descriptor (say, EOF would mark an fd as readable). This would allow applications that use a certain block size to read whole blocks at a time (with a single read() syscall).

Performance aspects

Posted Aug 3, 2006 15:44 UTC (Thu) by kleptog (subscriber, #1183) [Link] (2 responses)

The thing is, there already are a whole lot of features to deal with this, many of which are not used as often as they should.

For example, the "minimum amount of data" option has existed for ages for sockets (SO_LOWWAT or some such). There are POSIX realtime signals, which can be queued and can even start a new thread running a particular callback when the signal arrives.

All these things are aimed at the really high end though, 99% of programs don't need to worry about this much...

Performance aspects

Posted Aug 3, 2006 16:20 UTC (Thu) by pphaneuf (guest, #23480) [Link]

According to socket(7), there are SO_RCVLOWAT and SO_SNDLOWAT, but they are not changeable on Linux; they are always set to 1 (i.e. nothing special).
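
For reference, setting the receive low-water mark looks like the following; since the kernel of the time pinned the value at 1, this is shown only to illustrate the intended interface (the wrapper name is made up):

    #include <sys/socket.h>

    /* Ask not to be reported readable until at least 'bytes' are queued.
       On the Linux kernels discussed here the value stays at 1, so this
       call illustrates the interface rather than a working feature. */
    static int set_read_low_watermark(int fd, int bytes)
    {
        return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes, sizeof(bytes));
    }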

Waiting for POSIX signals is a system call too (in fact, getting them too). I don't think there's anything about starting new threads like that on Linux.

Performance-wise, other than holding off readiness events until a certain amount of data is available (for which, it appears, an API already exists but is not implemented), I'm fairly happy. Back when I used epoll in edge-triggered mode, I would have liked a way to batch calls to epoll_ctl() when I re-armed my events, but now I just let the kernel-side level-triggered mode do the work.

Performance aspects

Posted Aug 3, 2006 16:37 UTC (Thu) by cventers (guest, #31465) [Link]

Realtime signals and new-thread callbacks were mentioned in Ulrich's OLS paper. He went on to show how truly suboptimal they are.

API aspects

Posted Aug 3, 2006 13:31 UTC (Thu) by pphaneuf (guest, #23480) [Link]

One thing that has always bothered me with the existing event delivery mechanisms on Unix/Linux is that they do not allow for distributing events to multiple components of an application.

But this has been discussed before by Linus and others, a long time ago.

Now, I don't like to cite Win32 too often, but they have a lot of experience with centralized event delivery, as almost all Win32 programs end up having to deal with it eventually. Specifically, on that platform, you can associate a callback with messages (events), and that callback will be called to handle the message. It is done in a different way than the API Linus had proposed: callbacks are really associated with "windows"; one can create an invisible window (those are rather lightweight objects) to handle events; the GetMessage() function does not automatically call the callbacks (there is a separate DispatchMessage() function to do that, so it can be done selectively by the application getting the messages); and one can also get only the messages for a given window (leaving the messages for other windows alone). There are also a few more message types, including, most significantly, timer messages.
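
For readers who have not run into it, the loop being described looks roughly like this (a sketch of the standard Win32 pattern, not a complete program):

    #include <windows.h>

    /* GetMessage() pulls the next message for any window owned by this
       thread; DispatchMessage() hands it to the callback (window procedure)
       registered for the target window.  Because the two steps are separate,
       the application can filter or handle messages itself before (or
       instead of) dispatching. */
    int message_loop(void)
    {
        MSG msg;

        while (GetMessage(&msg, NULL, 0, 0) > 0) {
            TranslateMessage(&msg);     /* keyboard translation; optional here */
            DispatchMessage(&msg);      /* invoke the target window's callback */
        }
        return (int) msg.wParam;        /* exit code from the WM_QUIT message */
    }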

Distribution of timers is also important. As I mentioned an equally long time ago, a timer object that would be represented as a file descriptor would be very useful. As things are, the code calling the event delivery function has sole control over the timeout.
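
Linux did eventually grow exactly such an object (timerfd), though well after this discussion took place. Assuming that later interface, a timer-as-file-descriptor looks like the sketch below: the descriptor becomes readable when the timer expires, so any component can drop it into whatever poll/epoll set it already manages.

    #include <sys/timerfd.h>
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    /* Create a one-shot timer that fires after 'seconds'.  The returned
       descriptor becomes readable on expiry, so it needs no special-case
       timeout handling in the event loop that waits on it. */
    static int make_timer_fd(long seconds)
    {
        struct itimerspec its = { .it_value = { .tv_sec = seconds } };
        int fd = timerfd_create(CLOCK_MONOTONIC, 0);

        if (fd >= 0)
            timerfd_settime(fd, 0, &its, NULL);
        return fd;
    }

    /* Reading the descriptor yields the number of expirations so far. */
    static uint64_t timer_expirations(int fd)
    {
        uint64_t n = 0;
        read(fd, &n, sizeof(n));
        return n;
    }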

What does all this give us? Well, a classic example is making an asynchronous DNS resolver that is almost as easy to use as gethostbyname(). A DNS resolver is interested in two events: one where it waits for a reply on a file descriptor, and another where it has a timeout after which it either resends its request or fails. A hypothetical asynchronous resolver on Win32 could simply create an invisible window, bind its socket readiness events to it, register a timer event for the retransmits/failure, send its requests, and return. It could take a simple function pointer as a callback that it would call upon finishing its work, and this would all "simply happen". If it returned the handle to its invisible window, you could then make a synchronous version by simply implementing an event loop that filters on that window handle until the callback got called (if you wanted to use it the "normal" asynchronous way, you'd simply ignore the return value, which is easy to do).

This is simply impossible to do on Linux right now without ridiculous overhead. Assuming a distributed event dispatching mechanism like the one Linus proposed, one would still need a thread or a subprocess writing to a file descriptor to implement the timeout! A timer file descriptor object would be quite an easy and orthogonal extension, and it goes hand-in-hand with distributed event dispatching. There are some distributed event dispatchers, like the GLib mainloop, libevent and others, but unification is key.

Note that epoll's file descriptors could be seen as similar to the invisible window, but there are several differences. Epoll is missing the "call the callback" part of Linus' proposal, which is key. If it had this, epoll descriptors could be used in a hierarchy, though rather inefficiently: the callback for a lower-level epoll would get called, which would then dispatch its own events, possibly across multiple levels. How would libraries know which epoll handle to register their own with? There would need to be a "wait for events on all epoll handles of this process" operation, but then, this would absolutely require the callback mechanism to be centralized (what would the main application event loop do with events meant for the asynchronous DNS resolver?). And the timers are still missing...
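
To make the comparison concrete: an epoll descriptor is itself pollable and can be registered in another epoll set, so the nesting part already works today; what is missing is the callback dispatch. A sketch (the function and parameter names are made up):

    #include <sys/epoll.h>

    /* A library keeps its own epoll set (library_epfd) and hands the
       application one descriptor to register.  When the inner set has
       ready events, the outer set reports library_epfd as readable, and
       the application must know to call the library's dispatch routine --
       the step a kernel-level callback mechanism would automate. */
    static int nest_epoll(int outer_epfd, int library_epfd)
    {
        struct epoll_event ev = {
            .events  = EPOLLIN,             /* "the inner set has work to do" */
            .data.fd = library_epfd,
        };
        return epoll_ctl(outer_epfd, EPOLL_CTL_ADD, library_epfd, &ev);
    }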

The key issue is having unified event dispatching. The Portland Project is doing some work on unifying mainloops, but mostly in the area of GUI toolkits; I think this should be a system-level facility, not just for GUI programs (their needs are just much more obvious, as when a library tries to pop up a GTK+ dialog in a Qt program without falling apart). We could all simply use libevent, say, but the problem is getting everyone to jump ship. A mechanism originating from the kernel and/or the glibc people, with promises of better performance, could be key to adoption.


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds