
OLS: A proposal for a new networking API

Ulrich Drepper has been the maintainer of the core glibc library since 1995; he also represents the community in the POSIX standardization effort. So, when Ulrich proposes a new user-space API, more than the usual number of people are likely to listen. Ulrich has been putting his mind to the problems of high-performance network I/O; the results were presented in his Ottawa Linux Symposium talk.

The current POSIX APIs are, increasingly, not up to the task. The socket abstraction has served us for a long time, but it is a synchronous interface which is not well suited to zero-copy I/O. POSIX does provide an asynchronous I/O interface, but it was never intended for use with networking, and does not provide the requisite functionality. So it has been clear for a while that something better is needed; the developers working on network channels have also been talking about the need for a new networking API.

There are three components to a new networking API, all of which will lead to a more complex - but much more efficient - interface for high-performance situations. The first of those is to address the need for zero-copy I/O. As the data bandwidth through the system increases, the cost of copying data (in CPU utilization and cache pressure) increases. Much of this cost can be avoided by transferring data directly between the network interface and buffers in user space. Direct user-space I/O requires cooperation from both the kernel and the application, however.

Ulrich proposes the creation of an interface for the explicit management of user-space DMA areas. Such an area would be created with a call that looks something like:

    int dma_alloc(dma_mem_t *handle, size_t size, int flags);

If all goes well, the result would be a memory area of the given size, suitable for DMA purposes. Note that user space gets an opaque handle type in return - there is, at this point, no virtual address which is directly accessible to the application.

To use a DMA area for network I/O, the application must associate it with a socket. The call for this operation would look like:

    int dma_assoc(int socket, dma_mem_t handle, size_t size, int flags);

There is still the issue of actually managing memory within this DMA area. An application which is generating data to send over the net would request a buffer from the kernel with a call like:

    int sio_reserve(dma_mem_t handle, void **buffer, size_t size);

If all goes well, the result will be a pointer (stored in *buffer) to an area where the outgoing data can be constructed. For incoming data, the application will receive a pointer to the buffer from the kernel (just how is something we'll get to shortly); the application will own the given buffer until it returns it to the kernel with:

    int sio_release(dma_mem_t handle, size_t size);
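
Putting the proposed calls together, the send-side setup might look roughly like the sketch below. Every function and type here comes from Ulrich's proposal, so none of it exists in any kernel; the buffer sizes, the sock and message variables, and the error handling are invented for illustration:

```c
dma_mem_t handle;
void *buf;

/* Create a 1MB DMA area and associate it with an open socket. */
if (dma_alloc(&handle, 1 << 20, 0) != 0)
    /* no DMA area available; fall back to ordinary send() */ ;
if (dma_assoc(sock, handle, 1 << 20, 0) != 0)
    /* this socket cannot use the area */ ;

/* Reserve a buffer inside the area and construct the outgoing data. */
if (sio_reserve(handle, &buf, 4096) == 0) {
    memcpy(buf, message, 4096);  /* build the packet payload in place */
    /* ... hand the buffer to an asynchronous send operation ... */
}
```

Presumably sio_release() would be called only once the transmission completes; until then the kernel may still be DMAing out of the buffer.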

Before an application can start to use asynchronous network I/O, however, it must have a way to learn about the results of its operations. To that end, Ulrich proposes the addition of an event reporting API to the kernel. This mechanism, which he calls "event channels," would have an interface like:

    ec_t ec_create(int flags); /* Create a channel */
    ec_next_event();           /* Get the next event */
    ec_to_fd();                /* Send events to a file descriptor */
    ec_delay();                /* Wait for an event directly */

The exact form of this interface (like all of those discussed here) is subject to change. But the core idea is that it is a quick way for the kernel to return notifications of events (such as I/O completions) to user space. Most applications would be likely to use the file descriptor interface, which would allow events to enter an application's main loop via poll() or epoll_wait().
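
How the file descriptor interface might fit into a conventional event loop is sketched below; all of the ec_* calls are proposed, their signatures were not given in the talk, and the ones shown here are guesses:

```c
ec_t ec = ec_create(0);
int ecfd = ec_to_fd(ec);   /* guessed signature: returns a pollable fd */

struct pollfd pfd = { .fd = ecfd, .events = POLLIN };
for (;;) {
    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
        /* drain completion notifications from the channel */
        ec_next_event(ec /* , ... */);
    }
}
```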

The final step is to make some extensions to the existing POSIX asynchronous I/O interface. The aiocb structure would be extended to include an event channel descriptor; that channel would be used to report the results of asynchronous operations back to user space. Then, an application could initiate data transmission with a call like:

    int aio_send(int socket, void *buffer, size_t size, int flags);

(One presumes there would be an aiocb argument as well, but Ulrich's slides did not show one). This call would start the process of transmitting data from the given buffer, with completion likely happening sometime after the call returns. For data reception, the call would look like:

    int aio_recv(int socket, void **buffer, size_t size, int flags);

The relevant point here being that buffer is a double pointer; the kernel would pick the actual destination for the data and tell the calling application where to look.
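
The receive path would then tie the pieces together along these lines; process_packet() is a stand-in for application logic, and, as above, every API call is hypothetical:

```c
void *buf;

/* Post an asynchronous receive; the kernel chooses the buffer,
   which will land somewhere in the associated DMA area. */
aio_recv(sock, &buf, 4096, 0);

/* ... later, when the event channel reports completion ... */
process_packet(buf);           /* buf now points at the received data */
sio_release(handle, 4096);     /* return the buffer to the kernel */
```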

The result of all these changes would be a complete programming interface for high-performance, asynchronous network I/O. As an added bonus, the use of an event channel interface would simplify the work of porting applications from other operating systems.

All of these interfaces, says Ulrich, are simply a proposal and subject to massive change. The core purpose is to allow applications to get their work done while giving the kernel the greatest possible latitude to optimize the data transfers. This proposal is not the only one out there; Evgeniy Polyakov's kevent proposal is similar in many ways, though it does not have the explicit management of DMA areas. It may be some time before something is actually adopted - a new API will stay around for many years and should not be added in haste - but the discussion is getting started in earnest.

Index entries for this article
Kernel/Asynchronous I/O
Kernel/Events reporting
Kernel/Kevent
Kernel/Networking



OLS: A proposal for a new networking API

Posted Jul 23, 2006 7:54 UTC (Sun) by kleptog (subscriber, #1183) [Link] (4 responses)

I think explicit DMA buffer management from user space is a good idea. Right now network cards use DMA extensively, but so do graphics cards via DRI. Additionally, sound cards, scanners, cameras, printers, MPEG decoders/encoders, TV cards, etc. could all benefit from a clear way to manage DMA buffers from user space, rather than each subsystem coming up with its own allocation/deallocation routines.

OLS: A proposal for a new networking API

Posted Jul 23, 2006 11:01 UTC (Sun) by Los__D (guest, #15263) [Link] (3 responses)

Management should only be pushed to userspace as long as there is a clear reason for it; we don't want a situation where the userspace app is expected to write its own driver to use the device.

OLS: A proposal for a new networking API

Posted Jul 23, 2006 13:07 UTC (Sun) by mattdm (subscriber, #18) [Link]

Even then, there'd be no reason for *every* app to do it -- that's what libraries are for....

OLS: A proposal for a new networking API

Posted Jul 23, 2006 16:51 UTC (Sun) by iabervon (subscriber, #722) [Link]

In cases where the DMA is of bulk uninterpreted or standard data, a driver wouldn't be necessary in userspace. It should work for userspace to ask the kernel for a DMA handle to which an MPEG-2 stream can be sent for decoding by the video card's hardware decoder; userspace then plays video by passing the video data, unmodified, to the handle. Obviously, there's more going on, but the kernelspace driver would take care of that.

In the case of networking, the data is going to be the application's data, to be sent over the network, and it wouldn't matter what the device is to the userspace side.

OLS: A proposal for a new networking API

Posted Jul 27, 2006 3:02 UTC (Thu) by ianw (guest, #20143) [Link]

Well, why not? We might call this a "userspace driver", and abstract it to a library.

Suddenly (with an IOMMU) this driver can not crash the system any more than any other userspace process. If it does crash, we have a much higher chance that we can just restart it.

OLS: A proposal for a new networking API

Posted Jul 23, 2006 9:53 UTC (Sun) by flewellyn (subscriber, #5047) [Link] (9 responses)

Sounds good. But, and call me crazy for asking this, why couldn't sockets be extended to support such a thing? Must we invent a totally new API?

OLS: A proposal for a new networking API

Posted Jul 23, 2006 11:09 UTC (Sun) by nix (subscriber, #2304) [Link] (8 responses)

One of the possible underlying media over which these events may flow *is* a netlink socket.

But fundamentally the BSD sockets API is ugly, quite non-Unixlike, and hard to use and even harder to use correctly, and the thought of using it for *everything* including files is horrible. So no matter what the underlying layer is composed of, we need something better for people to actually *use*. (Ulrich is focused on that, not the implementation details alone: he *is* the libc maintainer.)

The replacement for select() had me cheering. At last, One Inner Loop! :)

OLS: A proposal for a new networking API

Posted Jul 23, 2006 23:58 UTC (Sun) by flewellyn (subscriber, #5047) [Link] (7 responses)

I'm curious what a new, socket-replacing API would look like. Keeping in mind that sockets are more than just "networking" (they're "generic IPC"), I imagine one improvement would be to eliminate the distinction between "Unix domain" and "network" sockets. Currently you have to specify one or the other, which means either limiting your application to local connections only, using a relatively heavyweight IP or other networking protocol when its overhead isn't necessary, or coding unpleasant and inelegant conditionals to detect which situation you're in, for each connection or message.

Yick. It would be really really nice if, instead of having to say "I want local namespace" or "I want internet namespace", you could just tell the kernel "I want a socket", and the kernel would determine based on the destination of your message what type of internal semantics was needed. That way, you could code for both local performance and network transparency.

Along with that, I could see real value in adding permissions to sockets themselves. Right now, you can get permissions on Unix domain sockets through the filesystem, but internet sockets don't have that protection. If you could, instead, assign permissions to a socket itself, you could have a daemon that would only accept connections from certain users, or with certain credentials such as RSA keys or Kerberos tickets.

Alternatively, imagine making ALL sockets files, including internet domain sockets. Now, a socket would have a file name, no matter what namespace or protocol you use, and you would still get the benefit of permissions on the sockets. Plus, instead of using special system calls to connect with them, all you'd need is calls to open(), read(), write(), and close(). A server program would just call "makesocket()", which would take parameters for address, protocol, port, and so on, to create a socket, and then start reading on it (assuming blocking I/O); a client would call "makesocket()" and then open it to start writing data. The kernel would handle connection details (if TCP or something else connection-oriented), or else just start sending datagrams (UDP or something else connectionless) or byte streams (address was for a local socket), and neither the client nor server programs themselves would need to know or care where the connection was coming from.

Hmm...now that I think about it, this has potential.

OLS: A proposal for a new networking API

Posted Jul 24, 2006 5:47 UTC (Mon) by BrucePerens (guest, #2510) [Link] (2 responses)

I'm curious what a new, socket-replacing API would look like.

open("/net/localhost/http", O_RDWR, 0)

Maybe this is not quite what you were asking for :-) . Of course, it's inspired by Plan 9. Using this, or using the socket calls, you get back the same object: a file descriptor. I think socket calls come from BBN's ARPA-sponsored Unix TCP/IP implementation, filtered through Berkeley BSD. They are indeed non-Unix-like, as are the net devices which live in their own unique name space.

Once you get the FD, you can call magic DMA functions on it as Ulrich proposes, which should be valid for plain files, not just network devices. Sometimes, however, what you need to do will fit the simpler sendfile().

Bruce

OLS: A proposal for a new networking API

Posted Jul 24, 2006 6:59 UTC (Mon) by neilbrown (subscriber, #359) [Link] (1 responses)

The (a) problem with
   open("/net/localhost/http", O_RDWR, 0)
type approaches is that they combine "socket" and "connect" into one
call, and so there is no room for doing anything in between like bind
or setsockopt.  Those things aren't always needed, but sometimes they are.

You could try
  open("/net/local/http:bind=1023;sourceaddr=xx.xx.xx.xx;tcp_syncnt=5")
but I don't think you would get very far.

Unfortunately, IPC is simply very different from file I/O, and trying
to use the same API will be a problem.  Having an open_url library call
is about the best you can do.

OLS: A proposal for a new networking API

Posted Jul 24, 2006 15:14 UTC (Mon) by mheily (subscriber, #27123) [Link]

Or better yet, extend the path hierarchy to include the IP address and port, then allow setsockopt to operate on such file descriptors:

fd = open("/net/localhost/ip/192.168.0.1/1023", O_RDWR, 0);
setsockopt(fd, O_TCPSYNCNT, 5);

You could bind to all interfaces by binding to 0.0.0.0 like so:

fd = open("/net/localhost/ip/0.0.0.0/1023", O_RDWR, 0);

OLS: A proposal for a new networking API

Posted Jul 24, 2006 6:15 UTC (Mon) by drag (guest, #31333) [Link] (3 responses)

The way things are going, everything should be network capable.

In fact, there was a group that did a presentation this year at the Usenix conference that did something very odd with computer buses and IP.

Basically they made the entire computer and all its communication run IP.
See http://www.usenix.org/events/usenix06/tech/ben-yehuda.html

Of course it's closed to non-members until next year... (Is anybody here a member?)

The way things are going, pretty soon ethernet networks will be faster than local disk I/O. If latencies decrease along with that, or something like that, then shared memory schemes may start to work.

Distributed operating systems maybe? Or something like treating a small cluster of machines as a single NUMA machine with a single kernel?

If most IPC used something that could just as easily be local as networked, that would make something like that more easily workable, like how X is mostly network transparent.

er.. or something like that.

OLS: A proposal for a new networking API

Posted Jul 24, 2006 9:39 UTC (Mon) by dlang (guest, #313) [Link] (2 responses)

I am a member and went and read the paper.

what they did was to create a PCI card that claimed to be a VGA/keyboard/mouse card and remoted that out to another machine via IP (the purpose being to investigate the potential for more 'legacy free' motherboards, including being free of any graphics, keyboard, USB, IDE, etc. ports)

they found a 40-60% performance hit for doing so, but point out that normal sysadmin tasks don't use these channels so it was deemed acceptable.

while this is a very interesting idea for server manufacturers (and performance can be improved with a custom ASIC instead of the slow FPGA) it's of much less interest for home use (where you don't normally have another machine to serve as the head for your systems, and gamers aren't willing to sacrifice any performance)

while an interesting paper, it doesn't seem to be very relevant to the issues being discussed here.

OLS: A proposal for a new networking API

Posted Jul 24, 2006 10:03 UTC (Mon) by drag (guest, #31333) [Link]

ah, I see.

Thank you. I was curious about what exactly they were talking about.

Old news?

Posted Aug 4, 2006 5:36 UTC (Fri) by ringerc (subscriber, #3071) [Link]

I'm fairly sure I've seen cards on the market that do exactly that - and have for a while. The IBM RSA (Remote Supervisor Adapter) comes to mind, and I think Dell do something similar for their servers.

(By the way: I WANT ONE VERY BADLY. I hate the way x86 servers - even Intel server boards etc. - can't handle serial console access for POST and BIOS. It's completely ridiculous, and one area where Sun has been trouncing `PC' servers for years.)

There are also older approaches, such as the `PC Weasel' VGA+keyboard -> serial console cards that've been around for yonks.

That said, you just wrote a summary, so I'm sure there's more in the details.

OLS: A proposal for a new networking API

Posted Jul 23, 2006 15:55 UTC (Sun) by dps (guest, #5725) [Link] (4 responses)

One of the limitations of poll(2) and select(2) is that determining which file descriptors met the conditions requires a loop over the whole set. If there are a large number of sockets and high performance is critical, that is arguably suboptimal. Switching to SIGIO is not a complete solution, because two fds changing generate only one SIGIO, unless I have misunderstood something.

An equivalent interface which more directly indicates the file descriptors that met the conditions would be useful.

One could propose implementing zero-copy I/O by marking the pages that read() or write() refer to copy-on-write, using them directly in kernel space, and giving those that scribble on those pages a copy. I can see that read(2) might need to know this happened and perform a copy after all. Disclaimer: I have not investigated the limitations of real hardware or the size of any mm changes required.

OLS: A proposal for a new networking API

Posted Jul 23, 2006 17:16 UTC (Sun) by cventers (guest, #31465) [Link]

> One could propose implementing zero copy I/O by marking the pages that
> read() or write() refer to copy on write, using them directly in kernel
> space, and giving those that scribble on those pages a copy. I can see
> that read(2) might need to know this happenned and perform a copy after
> all. Disclaimer: I have not investigated the limitations of real
> hardware or size of any mm changes required.

Something like this was already proposed. The trouble is that the faulting
process is fairly expensive, and once you have to do the copy _plus_ the
TLB flushing you've just spent measurably more time than you would have
just doing the copy in the first place.

Copy on write is good in some places. During a fork, you have to
invalidate the TLB anyways, so it doesn't hurt too much to implement CoW
there (especially since many of the pages won't ever be copied, either due
to the application calling execve() or the application being a daemon like
Apache wherein every child only has some portion of non-shared data).

But playing tricks with virtual memory elsewhere (such as in the
networking hot path) is a really bad idea.

Further discussion: http://kerneltrap.org/node/6506

OLS: A proposal for a new networking API

Posted Jul 23, 2006 19:10 UTC (Sun) by busterb (subscriber, #560) [Link]

There are a number of poll/select alternatives in various operating systems that work around this limitation by returning a list of pointers to only the descriptors that have received an event. On Linux, see epoll(4). libevent is a nice library that abstracts away OS-specific mechanisms such as these.
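
For reference, a minimal epoll sketch using only standard Linux calls; epoll_wait() fills in only the descriptors that actually have pending events, so no scan over the whole set is needed (the single fd here is just a stand-in for a larger set):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable; returns the number
   of ready descriptors reported by epoll_wait() (0 on timeout, -1 on
   error). */
int wait_one_ready(int fd, int timeout_ms)
{
    int epfd = epoll_create(1);        /* size hint, ignored since 2.6.8 */
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    struct epoll_event ready;
    int n = epoll_wait(epfd, &ready, 1, timeout_ms);
    close(epfd);
    return n;
}
```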

OLS: A proposal for a new networking API

Posted Jul 24, 2006 13:01 UTC (Mon) by kleptog (subscriber, #1183) [Link]

This is where real-time signals come in. They queue, so if two signals were sent you receive two (mostly). You can also arrange to have data attached to the signal so you know who sent it.

I say mostly because there's one caveat: while the signals do queue, they don't queue indefinitely. I think Linux cuts the queue at 32. Once the queue is full, the program has to go back to polling the sockets, which is what you're trying to avoid.

I believe that's the reason they never caught on.

OLS: A proposal for a new networking API

Posted Aug 4, 2006 4:43 UTC (Fri) by efexis (guest, #26355) [Link]

...or even swapping with a fresh page? For example, you allocate a 4k page, construct the message, and 'send' it. At that moment, the page table entry that the user process sees is swapped with the one the kernel/driver sees. The user process now has an empty page at that location, ready to write the next message into, and the kernel has access to the page to send to the device.

Same with receiving; the device writes the page to memory, the kernel figures out where it has to go, and adds it to the virtual address space of the process, where the process has requested it. That page is now allocated to that process, so the next read would come in at a different address. There's no need for copy-on-write, as a page is no longer needed after being passed along, so an empty page in its place would suffice.

What about other headers?

Posted Jul 24, 2006 11:50 UTC (Mon) by christian.convey (guest, #39159) [Link] (5 responses)

I hope this question isn't born out of ignorance...

The DMA idea sounds great, but I'm curious how this works when protocol layers want to add headers to the region of memory that immediately precedes my application-level data.

For example, suppose I have application-data messages that are 256 bytes long. So I request a 256-byte-long user-space DMA region, and it's mapped to my process's VM address range 0x10000000 - 0x100000FF. And I then populate all 256 bytes of that region with application-level data.

If the TCP and IP layers are going to bolt their headers onto the beginning of the data I'm sending, won't each of those layers (1) allocate a buffer big enough for that layer's header + the data from the higher protocol level, and then (2) copy the higher-level's data into that new buffer? If so, I don't see how zero-copy is achieved.

It seems to me like we'd almost need the application to announce the purpose for which it intends to use the DMA region, so that the allocator can include extra space at the beginning/end for the network stack to use.
For example (not ideal, but just to clarify my point):
    int dma_alloc(AF_INET, SOCK_STREAM, dma_mem_t *handle, size_t size, int flags);

What about other headers?

Posted Jul 24, 2006 15:55 UTC (Mon) by dlang (guest, #313) [Link] (4 responses)

Not necessarily; the kernel supports scatter-gather internally, which would let you identify the TCP header (in one area of memory) and the data (in another area or areas of memory) and tell the driver where they are; the driver will get all the pieces in one operation. (This is even easier with an IOMMU: with that, the kernel can program the address space so that the driver doesn't know the pieces are separated.)

I don't know if this can be done with the userspace DMA work, but I'm not seeing a reason it couldn't be.
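
The same idea is already visible at the system-call level in writev(2), where a header and a payload that live in separate buffers can go out in a single operation, with no copy into a combined buffer; a minimal userspace sketch (the pipe in the usage note stands in for a socket):

```c
#include <sys/uio.h>
#include <unistd.h>

/* Write a header and a payload that live in separate buffers with a
   single writev() call.  Returns the total byte count written, or -1
   on error. */
ssize_t send_scattered(int fd, const void *hdr, size_t hdr_len,
                       const void *payload, size_t payload_len)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,     .iov_len = hdr_len },
        { .iov_base = (void *)payload, .iov_len = payload_len },
    };
    return writev(fd, iov, 2);
}
```

Called as, say, send_scattered(fd, "HDR:", 4, "data", 4), the receiver sees the contiguous byte stream "HDR:data" even though the two pieces were never copied into one buffer in userspace.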

What about other headers?

Posted Jul 26, 2006 18:33 UTC (Wed) by mikov (guest, #33179) [Link] (3 responses)

Do all (most) Ethernet adapters support scatter-gather DMA ?

What about other headers?

Posted Jul 26, 2006 19:11 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

the higher-end cards support scatter-gather, but with an Opteron and its IOMMU this functionality is available without the card needing to support it (the IOMMU remaps the address space so that the card thinks it's doing old-fashioned single-area DMA, when it's really pulling/pushing the data to and from multiple sections of RAM)

David Lang

What about other headers?

Posted Jul 27, 2006 1:11 UTC (Thu) by mikov (guest, #33179) [Link] (1 responses)

Thanks for pointing this out - I must be falling behind the times, because I didn't know about the Opteron's IOMMU. I found a good description here: http://www.amd.com/us-en/assets/content_type/white_papers...

Is there equivalent functionality in Intel's chipsets (since they can't put it in the CPU)? If not, it doesn't seem prudent to rely on it for zero-copy.

What about other headers?

Posted Jul 27, 2006 1:51 UTC (Thu) by dlang (guest, #313) [Link]

My understanding is that the Intel chips do not have an IOMMU at this time.

however, this just means that the kernel and drivers have to fall back to the non-zero-copy mechanism if the hardware doesn't support scatter-gather natively

it's a matter of using it if it's there, and doing the best you can if it's not (and remember that if you compile the kernel for Opteron-only as opposed to generic K8 architecture, you get more performance because you get to leave out the tests for the IOMMU and the code to handle it, so it doesn't always cost you :-)

Memory mapped files

Posted Jul 28, 2006 10:44 UTC (Fri) by addw (guest, #1771) [Link]

Will you be able to do this with memory mapped files ?

If so things like samba & apache could benefit greatly.

OLS: A proposal for a new networking API

Posted Aug 4, 2006 21:37 UTC (Fri) by krause (guest, #39696) [Link]

Why not use the Sockets Extensions API, which provides both explicit memory management suitable for copy avoidance and asynchronous communication and event management calls? The API is available today from the Open Group and was developed by socket implementers based on experience with a variety of socket and AIO / POSIX implementations.


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds