Ringing in a new asynchronous I/O API

By Jonathan Corbet
January 15, 2019

While the kernel has had support for asynchronous I/O (AIO) since the 2.5 development cycle, it has also had people complaining about AIO for about that long. The current interface is seen as difficult to use and inefficient; additionally, some types of I/O are better supported than others. That situation may be about to change with the introduction of a proposed new interface from Jens Axboe called "io_uring". As might be expected from the name, io_uring introduces just what the kernel needed more than anything else: yet another ring buffer.

Setting up

Any AIO implementation must provide for the submission of operations and the collection of completion data at some future point in time. In io_uring, that is handled through two ring buffers used to implement a submission queue and a completion queue. The first step for an application is to set up this structure using a new system call:

    int io_uring_setup(int entries, struct io_uring_params *params);

The entries parameter is used to size both the submission and completion queues. The params structure looks like this:

    struct io_uring_params {
	__u32 sq_entries;
	__u32 cq_entries;
	__u32 flags;
	__u16 resv[10];
	struct io_sqring_offsets sq_off;
	struct io_cqring_offsets cq_off;
    };

On entry, this structure (with the possible exception of flags as described later) should simply be initialized to zero. On return from a successful call, the sq_entries and cq_entries fields will be set to the actual sizes of the submission and completion queues; the code is set up to allocate the number of submission entries requested in entries, and twice that many completion entries.
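As a rough illustration of that sizing rule, the sketch below computes the queue sizes an application might expect to read back. The rounding up to a power of two is an assumption (suggested by the ring_mask fields shown later, not stated in the patch description here), and round_up_pow2() and expected_ring_sizes() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round n up to the next power of two. */
static uint32_t round_up_pow2(uint32_t n)
{
    uint32_t p = 1;

    assert(n >= 1 && n <= (1u << 30));
    while (p < n)
        p <<= 1;
    return p;
}

/* Local stand-in for the two size fields filled in on return. */
struct ring_sizes {
    uint32_t sq_entries;
    uint32_t cq_entries;
};

/* Assumption: the submission queue is rounded up to a power of two;
 * the completion queue is twice the submission queue, as described. */
static struct ring_sizes expected_ring_sizes(uint32_t entries)
{
    struct ring_sizes s;

    s.sq_entries = round_up_pow2(entries);
    s.cq_entries = 2 * s.sq_entries;
    return s;
}
```

Real code should always use the values the kernel writes into sq_entries and cq_entries rather than predicting them.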

The return value from io_uring_setup() is a file descriptor that can then be passed to mmap() to map the buffer into the process's address space. More specifically, three calls are needed to map the two ring buffers and an array of submission-queue entries; the information needed to do this mapping will be found in the sq_off and cq_off fields of the io_uring_params structure. In particular, the submission queue, which is a ring of integer array indices, is mapped with a call like:

    subqueue = mmap(0, params.sq_off.array + params.sq_entries*sizeof(__u32),
                    PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                    ring_fd, IORING_OFF_SQ_RING);

Where params is the io_uring_params structure, and ring_fd is the file descriptor returned from io_uring_setup(). The addition of params.sq_off.array to the length of the region accounts for the fact that the ring is not located right at the beginning. The actual array of submission-queue entries, instead, is mapped with:

    sqentries = mmap(0, params.sq_entries*sizeof(struct io_uring_sqe),
                    PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                    ring_fd, IORING_OFF_SQES);

This separation of the queue entries from the ring buffer is needed because I/O operations may well complete in an order different from the submission order. The completion queue is simpler, since the entries are not separated from the queue itself; the incantation required is similar:

    cqentries = mmap(0, params.cq_off.cqes + params.cq_entries*sizeof(struct io_uring_cqe),
                    PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                    ring_fd, IORING_OFF_CQ_RING);
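
The three length computations can be gathered into one helper. The following is only a sketch: the structures are local stand-ins rather than the kernel's definitions, the entry sizes are passed in as parameters instead of being taken from the real headers, and all of the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The values an application would read from io_uring_params. */
struct sketch_ring_geometry {
    uint32_t sq_entries;    /* params.sq_entries */
    uint32_t cq_entries;    /* params.cq_entries */
    uint32_t sq_array_off;  /* params.sq_off.array */
    uint32_t cq_cqes_off;   /* params.cq_off.cqes */
};

/* Lengths for the three mmap() calls. */
struct sketch_map_lengths {
    size_t sq_ring;  /* submission ring header plus index array */
    size_t sqes;     /* array of submission-queue entries */
    size_t cq_ring;  /* completion ring header plus inline entries */
};

static struct sketch_map_lengths
sketch_compute_map_lengths(const struct sketch_ring_geometry *g,
                           size_t sqe_size, size_t cqe_size)
{
    struct sketch_map_lengths m;

    assert(g != NULL && sqe_size > 0 && cqe_size > 0);
    m.sq_ring = g->sq_array_off + (size_t)g->sq_entries * sizeof(uint32_t);
    m.sqes    = (size_t)g->sq_entries * sqe_size;
    m.cq_ring = g->cq_cqes_off + (size_t)g->cq_entries * cqe_size;
    return m;
}
```

In a real program, sqe_size and cqe_size would be sizeof(struct io_uring_sqe) and sizeof(struct io_uring_cqe).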

It's perhaps worth noting at this point that Axboe is working on a user-space library that will hide much of the complexity of this interface from most users.

I/O submission

Once the io_uring structure has been set up, it can be used to perform asynchronous I/O. Submitting an I/O request involves filling in an io_uring_sqe structure, which looks like this (simplified a bit):

    struct io_uring_sqe {
	__u8	opcode;		/* type of operation for this sqe */
	__u8	flags;		/* IOSQE_ flags */
	__u16	ioprio;		/* ioprio for the request */
	__s32	fd;		/* file descriptor to do IO on */
	__u64	off;		/* offset into file */
	void	*addr;		/* buffer or iovecs */
	__u32	len;		/* buffer size or number of iovecs */
	union {
	    __kernel_rwf_t	rw_flags;
	    __u32		fsync_flags;
	};
	__u64	user_data;	/* data to be passed back at completion time */
	__u16	buf_index;	/* index into fixed buffers, if used */
    };

The opcode describes the operation to be performed; options include IORING_OP_READV, IORING_OP_WRITEV, IORING_OP_FSYNC, and a couple of others that we will return to. There are clearly a number of parameters that affect how the I/O is performed, but most of them are relatively straightforward: fd describes the file on which the I/O will be performed, for example, while addr and len describe a set of iovec structures pointing to the memory where the I/O is to take place.
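
Filling in an entry for a vectored read might look something like the sketch below. It uses a local mirror of the simplified structure above rather than the kernel header, and the opcode value, the sketch_ names, and the choice to zero every unused field are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Assumed opcode value, for illustration only; the real value
 * comes from the kernel's header file. */
#define SKETCH_OP_READV 1

/* Local mirror of the simplified structure shown above. */
struct sketch_sqe {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t ioprio;
    int32_t  fd;
    uint64_t off;
    void    *addr;
    uint32_t len;
    uint32_t rw_flags;
    uint64_t user_data;
    uint16_t buf_index;
};

/* Describe a readv-style operation: addr points at the iovec array
 * and len holds the number of iovecs, not a byte count. */
static void sketch_prep_readv(struct sketch_sqe *sqe, int fd,
                              struct iovec *iov, uint32_t nr_iov,
                              uint64_t offset, uint64_t user_data)
{
    assert(sqe != NULL && iov != NULL && nr_iov > 0);
    memset(sqe, 0, sizeof(*sqe));  /* unused fields stay zero */
    sqe->opcode = SKETCH_OP_READV;
    sqe->fd = fd;
    sqe->off = offset;
    sqe->addr = iov;
    sqe->len = nr_iov;
    sqe->user_data = user_data;    /* comes back in the completion */
}
```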

As mentioned above, the io_uring_sqe structures are kept in an array that is mapped into both user and kernel space. Actually submitting one of those structures requires placing its index into the submission queue, which is defined this way:

    struct io_uring {
	u32 head;
	u32 tail;
    };

    struct io_sq_ring {
	struct io_uring		r;
	u32			ring_mask;
	u32			ring_entries;
	u32			dropped;
	u32			flags;
	u32			array[];
    };

The head and tail values are used to manage entries in the ring; if the two values are equal, the ring is empty. User-space code adds an entry by putting its index into array[r.tail] and incrementing the tail pointer; only the kernel side should change r.head. Once one or more entries have been placed in the ring, they can be submitted with a call to:

    int io_uring_enter(unsigned int fd, u32 to_submit, u32 min_complete, u32 flags);

Here, fd is the file descriptor associated with the ring, and to_submit is the number of entries in the ring that the kernel should submit at this time. The return value should be zero if all goes well.
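
The user-space half of that protocol can be sketched against an ordinary in-memory structure; no kernel is involved here. This sketch assumes that head and tail are free-running counters that are masked with ring_mask when indexing the array, and it uses C11 atomics to stand in for the memory barriers a real implementation would need. The sketch_ names and the fixed ring size are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SQ_ENTRIES 4u  /* must be a power of two */

/* In-memory stand-in for the mapped submission ring. */
struct sketch_sq_ring {
    _Atomic uint32_t head;  /* advanced only by the kernel */
    _Atomic uint32_t tail;  /* advanced only by user space */
    uint32_t ring_mask;
    uint32_t ring_entries;
    uint32_t array[SKETCH_SQ_ENTRIES];
};

static void sketch_sq_init(struct sketch_sq_ring *sq)
{
    atomic_init(&sq->head, 0);
    atomic_init(&sq->tail, 0);
    sq->ring_entries = SKETCH_SQ_ENTRIES;
    sq->ring_mask = SKETCH_SQ_ENTRIES - 1;
}

/* Queue the index of a filled-in sqe; returns false if the ring is full. */
static bool sketch_sq_push(struct sketch_sq_ring *sq, uint32_t sqe_index)
{
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_acquire);
    uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_relaxed);

    assert((sq->ring_entries & sq->ring_mask) == 0);
    if (tail - head >= sq->ring_entries)
        return false;
    sq->array[tail & sq->ring_mask] = sqe_index;
    /* Publish the new tail only after the array slot has been written. */
    atomic_store_explicit(&sq->tail, tail + 1, memory_order_release);
    return true;
}
```

After one or more successful pushes, the application would call io_uring_enter() with to_submit set to the number of entries added.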

Completion events will find their way into the completion queue as operations are executed. If flags contains IORING_ENTER_GETEVENTS and min_complete is nonzero, io_uring_enter() will block until at least that many operations have completed. The actual results can be found in the completion structure:

    struct io_uring_cqe {
	__u64	user_data;	/* sqe->user_data submission passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;
    };

Where user_data is a value passed from user space when the operation was submitted and res is the return code for the operation. The flags field will contain IOCQE_FLAG_CACHEHIT if the request could be satisfied without needing to perform I/O — an option that may yet have to be reconsidered given the current concern about using the page cache as a side channel.

These structures live in the completion queue, which looks similar to the submission queue:

    struct io_cq_ring {
	struct io_uring		r;
	u32			ring_mask;
	u32			ring_entries;
	u32			overflow;
	struct io_uring_cqe	cqes[];
    };

In this ring, the r.head index points to the first available completion event, while r.tail points to the last; user space should only change r.head.

The interface as described so far is enough to enable a user-space program to enqueue multiple I/O operations and to collect the results as those operations complete. The functionality is similar to what the current AIO interface provides, though the interface is quite different. Axboe claims that it is far more efficient, but no benchmark results have been included yet to back up that claim. Among other things, this interface can do asynchronous buffered I/O without a context switch in cases where the desired data is in the page cache; buffered I/O has always been a bit of a sore spot for Linux AIO.

Advanced features

There are, however, some more features worthy of note in this interface. One of those is the ability to map a program's I/O buffers into the kernel. This mapping normally happens with each I/O operation so that data can be copied into or out of the buffers; the buffers are unmapped when the operation completes. If the buffers will be used many times over the course of the program's execution, it is far more efficient to map them once and leave them in place. This mapping is done by filling in yet another structure describing the buffers to be mapped:

    struct io_uring_register_buffers {
	struct iovec *iovecs;
	__u32 nr_iovecs;
    };

That structure is then passed to another new system call:

    int io_uring_register(unsigned int fd, unsigned int opcode, void *arg);

In this case, the opcode should be IORING_REGISTER_BUFFERS. The buffers will remain mapped for as long as the initial file descriptor remains open, unless the program explicitly unmaps them with IORING_UNREGISTER_BUFFERS. Mapping buffers in this way is essentially locking memory into RAM, so the usual resource limit that applies to mlock() applies here as well. When performing I/O to premapped buffers, the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED operations should be used.
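
Preparing the argument for IORING_REGISTER_BUFFERS could look roughly like this. The sketch carves one region into equal-sized buffers and fills in a local mirror of the structure above; the sketch_ names are hypothetical, and the io_uring_register() call itself is left out since it needs a kernel with the patches applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Local mirror of the registration structure shown above. */
struct sketch_register_buffers {
    struct iovec *iovecs;
    uint32_t nr_iovecs;
};

/* Describe nr buffers of buf_size bytes each, laid out back to back
 * starting at base. Returns the total number of bytes that would be
 * pinned, which is what counts against the locked-memory limit. */
static size_t sketch_describe_buffers(void *base, size_t buf_size, uint32_t nr,
                                      struct iovec *iov,
                                      struct sketch_register_buffers *reg)
{
    size_t total = 0;
    uint32_t i;

    assert(base != NULL && iov != NULL && reg != NULL && buf_size > 0);
    for (i = 0; i < nr; i++) {
        iov[i].iov_base = (char *)base + (size_t)i * buf_size;
        iov[i].iov_len = buf_size;
        total += buf_size;
    }
    reg->iovecs = iov;
    reg->nr_iovecs = nr;
    return total;
}
```

The returned total can be compared against the RLIMIT_MEMLOCK value from getrlimit() before attempting the registration.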

There is also an IORING_REGISTER_FILES operation that can be used to optimize situations where many operations will be performed on the same file(s).

In many high-bandwidth settings, it can be more efficient for the application to poll for completion events rather than having the kernel collect them and wake the application up; that is the motivation behind the existing block-layer polling interface, for example. Polling is most efficient in situations where, by the time the application gets around to doing a poll, there is almost certainly at least one completion ready for it to consume. This polling mode can be enabled for io_uring by setting the IORING_SETUP_IOPOLL flag when calling io_uring_setup(). In such rings, an occasional call to io_uring_enter() (with the IORING_ENTER_GETEVENTS flag set) is mandatory to ensure that completion events actually make it into the completion queue.

Finally, there is also a fully polled mode that (almost) eliminates the need to make any system calls at all. This mode is enabled by setting the IORING_SETUP_SQPOLL flag at ring setup time. A call to io_uring_enter() will kick off a kernel thread that will occasionally poll the submission queue and automatically submit any requests found there; completion-queue polling is also performed if it has been requested. As long as the application continues to submit I/O and consume the results, I/O will happen with no further system calls.

Eventually, though (after one second currently), the kernel will get bored if no new requests are submitted and the polling will stop. When that happens, the flags field in the submission queue structure will have the IORING_SQ_NEED_WAKEUP bit set. The application should check for this bit and, if it is set, make a new call to io_uring_enter() to start the mechanism up again.
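
The check an application makes after queuing new entries can be sketched as follows. The bit value assigned to the flag and the sketch_ names are assumptions; a real program would use the definition from the kernel header and would read the flags word from the mapped submission ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit value, for illustration only. */
#define SKETCH_SQ_NEED_WAKEUP (1u << 0)

/* Returns true if the kernel's polling thread has gone idle and a
 * call to io_uring_enter() is needed to start it up again. */
static bool sketch_sq_needs_wakeup(_Atomic uint32_t *sq_flags)
{
    uint32_t flags;

    assert(sq_flags != NULL);
    /* The kernel updates this word, so it must be re-read each time. */
    flags = atomic_load_explicit(sq_flags, memory_order_acquire);
    return (flags & SKETCH_SQ_NEED_WAKEUP) != 0;
}
```

In the common case this returns false and the submission costs no system call at all.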

This patch set is in its third version as of this writing, though that is a bit deceptive since there were (at least) ten revisions of the polled AIO patch set that preceded it. While it is possible that the interface is beginning to stabilize, it would not be surprising to see some significant changes yet. One review comment that has not yet been addressed is Matthew Wilcox's request that the name be changed to "something that looks a little less like io_urine". That could yet become the biggest remaining issue — as we all know, naming is always the hardest part in the end. But, once those details are worked out, the kernel may yet have an asynchronous I/O implementation that is not a constant source of complaints.

For the curious, Axboe has posted a complete example of a program that uses the io_uring interface.


What about async metadata

Posted Jan 16, 2019 6:58 UTC (Wed) by mokki (subscriber, #33200) [Link] (12 responses)

Lots of software would also benefit from async directory listings, file open/close etc.

Did anything ever come out of the various syslets/fibrils/sys_indirect ways of making most syscalls async?

There was lots of discussion in 2007, see: https://lwn.net/Articles/221913/ and
https://lwn.net/Articles/259068/

What about async metadata

Posted Jan 16, 2019 13:00 UTC (Wed) by Sesse (subscriber, #53779) [Link] (9 responses)

I sometimes wonder whether something as mundane as tar should use parallel I/O to use RAID efficiently (assuming files smaller than the chunk size).

What about async metadata

Posted Jan 16, 2019 14:22 UTC (Wed) by farnz (subscriber, #17727) [Link]

Not just RAID - any device that can do multiqueue I/O will benefit from parallel I/O, such as an NVMe SSD (which can have 65,536 parallel queues to the device).

What about async metadata

Posted Jan 17, 2019 1:03 UTC (Thu) by dw (subscriber, #12017) [Link] (7 responses)

I have it on my todo list to write a fully CPU/IO-parallel ZIP implementation (because it's fairly straightforward), with an article around it highlighting that most of the traditional UNIX tooling is utterly obsolete on pretty much all modern devices. Naturally it can't really benefit from the work here due to the parent comment, but yeah, the problem is very real, and frankly an entirely ridiculous state of affairs.

What about async metadata

Posted Jan 17, 2019 12:30 UTC (Thu) by Sesse (subscriber, #53779) [Link] (1 responses)

So you want to demonstrate that something is obsolete by implementing… an obsolete compression algorithm? :-)

(zlib/deflate is still around pretty much only due to huge transition costs, and a fragmented market among the alternatives. Try something like zstd if you want to make a clean break.)

What about async metadata

Posted Jan 17, 2019 12:33 UTC (Thu) by dw (subscriber, #12017) [Link]

There's always newer and better technology around, but tech is only useful when it's compatible with what you already have :) And ZIPs are eeeverywhere

What about async metadata

Posted Jan 22, 2019 10:09 UTC (Tue) by epa (subscriber, #39769) [Link] (4 responses)

It would be convenient to have a system call that declares 'I plan to read this file in the near future'. The kernel would make a best effort to get that file into the page cache, using background I/O, while your process continues. So if you are about to zip up a directory, call plan_to_read() on each file, then continue reading them sequentially as normal. It wouldn't be quite as fast as a true parallel implementation, but for some tasks it could give you 80% of the performance gains without having to rewrite your creaky old sequential code.

What about async metadata

Posted Jan 22, 2019 11:35 UTC (Tue) by dw (subscriber, #12017) [Link] (2 responses)

Isn't this basically what posix_fadvise() gives us already? But IIRC that interface currently or previously blocked while readahead happened.

For zipping, imagine something like a 100k item maildir of tiny 1.5kb messages. While the compression is still relatively expensive, a huge chunk of the operation will be wasted on ceremonial serialized filesystem round-trips (open/close/read/stat/getdents/etc). To avoid that I'm not sure there is any way around it except a whole bunch of threads keeping as many FS operations in flight (either doing the CPU bits or any IO bits for uncached data) to get even close to a genuinely busy computer.

What about async metadata

Posted Jan 22, 2019 12:22 UTC (Tue) by epa (subscriber, #39769) [Link]

Yes, I was thinking of a few large files, where the overhead really is in I/O and not in bookkeeping.

How about a generalized stat() that lets you open a directory and get info on all the files it contains? That would save a lot of time, and not just for parallel code. Network filesystems, for example.

What about async metadata

Posted Jan 22, 2019 12:38 UTC (Tue) by epa (subscriber, #39769) [Link]

You mentioned posix_fadvise(). That is useful but not quite the stupidly simple interface I had in mind. It requires an open file handle. I envisaged a call that takes a filename and nothing else, works entirely in the background, and does not fail (not even if the file doesn't exist or whatever; it just does nothing in that case).

You could then sprinkle these calls all over your code -- including scripting languages -- and get a handy speedup without having to do any real programming.
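
A user-space approximation of that idea can be sketched with posix_fadvise(), mentioned earlier in the thread. The name plan_to_read() is taken from the comment above and is hypothetical; whether the hint is handled entirely in the background depends on the kernel and filesystem, so this may still block briefly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hint that the whole file will be read soon. Failures are not
 * treated as errors by callers; the return value only reports
 * whether a hint was actually issued (0) or skipped (-1). */
static int plan_to_read(const char *path)
{
    int fd, err;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;  /* missing or unreadable file: do nothing */
    /* A length of zero means "through the end of the file". */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    close(fd);
    return err == 0 ? 0 : -1;
}
```

Callers can ignore the return value entirely, which matches the "does not fail" behavior described above.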

What about async metadata

Posted Feb 26, 2019 1:53 UTC (Tue) by josh (subscriber, #17465) [Link]

> It would be convenient to have a system call that declares 'I plan to read this file in the near future'.

The readahead system call does that.

What about async metadata

Posted Jan 16, 2019 15:27 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

Honestly, the easiest path to something syslets-like is to express the kernel program in eBPF. The nice thing about this approach is that eBPF is expressive enough to include arbitrary error-handling and cleanup logic.

What about async metadata

Posted Jan 16, 2019 15:48 UTC (Wed) by smurf (subscriber, #17840) [Link]

It's far less disruptive (to performance) to simply use a thread or two to do your directory-level operations. Actual data R/W on the other hand can take every performance improvement you can throw at it.

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 10:03 UTC (Wed) by kitanatahu (guest, #44605) [Link] (12 responses)

In my opinion it would be a hell of a lot more useful if it were pollable using the already existing polling mechanisms, including epoll, and it wouldn't get "bored".

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 18:07 UTC (Wed) by axboe (subscriber, #904) [Link] (11 responses)

io_uring will grow support for IORING_OP_POLL, but it's outside the scope of the initial implementation. This will work similarly to the recently added IOCB_CMD_POLL support for aio.

Apart from that, I do think you're mixing up the polling with the io polling. One provides a way to signal when data is ready, the other skips IRQs in favor of busy polling for completion events.

Ringing in a new asynchronous I/O API

Posted Jan 17, 2019 3:12 UTC (Thu) by samroberts (subscriber, #46749) [Link] (10 responses)

I think you can forgive the OP for calling what epoll() does "polling", though it doesn't fit your definition, given it's the name of the syscall!

The point stands: io_uring should be easily usable with poll/select/epoll so it can be integrated with existing event-loop-based code; networking code in particular is a heavy user of these calls. Specifically, this fd

> The return value from io_uring_setup() is a file descriptor that can then be passed to mmap() to map the buffer into the process's address space.

should be epoll()able.

Ringing in a new asynchronous I/O API

Posted Jan 17, 2019 3:28 UTC (Thu) by axboe (subscriber, #904) [Link] (9 responses)

I'm addressing what is a misconception in the original comment. It referred to the "bored" part, which is when you have the kernel side doing polling for you. That has nothing to do with epoll(), and epoll() would not be a solution for this at all.

If the ring_fd should be pollable, in terms of epoll, absolutely. That would be trivial to add. It would NOT work for IORING_SETUP_IOPOLL for obvious reasons, as you can't sleep for those kinds of completions. But for "normal", IRQ driven IO, adding epoll() support for the CQ side of the ring_fd is straightforward. On the SQ ring side, there's nothing to epoll for. The application knows if the ring is writeable (e.g., can hold new entries) without entering the kernel.

Outside of that, my IOCB_CMD_POLL reference has to do with this:

https://lwn.net/Articles/743714/

and adding IORING_OP_POLL for similar functionality on the io_uring side.

Ringing in a new asynchronous I/O API

Posted Jan 17, 2019 16:26 UTC (Thu) by axboe (subscriber, #904) [Link] (8 responses)

Just to follow up on this, if you check the latest repo, the ring_fd is now both pollable (in terms of poll(2)/epoll(2), not io-pollable), and io_uring also supports IORING_OP_POLL to offer the same functionality that aio does in that regard.

I believe this caters to both of your needs.

Ringing in a new asynchronous I/O API

Posted Jan 17, 2019 23:16 UTC (Thu) by nix (subscriber, #2304) [Link] (7 responses)

This is such a nice interface, I'm wondering if heavy ioctl() users might be able to reuse some of these ideas to redo their ioctl madness into a nice high-performance command/response ring :) I mean yeah it's not exactly simple for userspace to use, but if the complexity is wrapped away from users its properties are seriously slobbersome.

(yes, I help maintain one of those monsters, making heavy use of ioctl() passing intricate structures into and out of the kernel, and the massive use of ioctl() is one thing I at least am hoping to get rid of in the process of getting it ready for upstreaming.)

Ringing in a new asynchronous I/O API

Posted Jan 17, 2019 23:34 UTC (Thu) by axboe (subscriber, #904) [Link] (6 responses)

Since it already has all the mechanics to do buffered IO async, any ioctl could easily be channeled through the API. We're already grabbing the files/mm of the original process. Depending on how many arguments you need to the ioctl, we'd need to tweak the sqe a bit.

With liburing, it should be _very_ easy for applications to use. If you go native, yes, you need to be a bit more careful, and it's more hairy. But even with the basic support liburing has now, you just do:

    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        io_uring_queue_init(queue_depth, &ring, 0);

        sqe = io_uring_get_sqe(&ring);
        sqe->opcode = IORING_OP_READV;
        sqe->fd = fd;
        [...]

        io_uring_submit(&ring);

        io_uring_wait_completion(&ring, &cqe);
    }

as a very basic example.

Ringing in a new asynchronous I/O API

Posted Jan 18, 2019 4:21 UTC (Fri) by axboe (subscriber, #904) [Link] (1 responses)

Might even make sense to just move away from an ioctl, and provide an sqe entry into the driver through the file_operations, for instance.

Ringing in a new asynchronous I/O API

Posted Jan 19, 2019 18:57 UTC (Sat) by nix (subscriber, #2304) [Link]

We're in "passing massive structures and/or arrays of structures" land, and trying to *not* produce a horror show like perf_event_open() is fairly high on my priority list! (Though, really, arguing against myself: the reason perf_event_open() is horrifying is that it does a lot, and honestly the same would be true of any attempt to wrap it in a uring, or in anything else. It's not really a win to move from 'this ioctl/syscall has security holes because the interface has too many edges to test exhaustively' to 'this uringed interface has security holes because the interface has too many edges to test exhaustively', alas. :( )

Ringing in a new asynchronous I/O API

Posted Jan 19, 2019 19:00 UTC (Sat) by nix (subscriber, #2304) [Link] (2 responses)

Ooh that's honestly nicer than plain read()/write() would be. No -EINTR worries, no short reads, the only bit that makes me squint is the cqe/sqe acronyms, which are perhaps too concise to be easily understandable (it's a bit longer, but maybe _cmd and _stat suffixes would be easier to read? You're really only granting yourself one variable letter in the existing scheme, which is... a small naming budget.)

Ringing in a new asynchronous I/O API

Posted Jan 19, 2019 20:27 UTC (Sat) by zdzichu (subscriber, #17118) [Link] (1 responses)

sqe and cqe are for “submit” and “completion” queues. Your confusion demonstrates the need for less brief identifiers :)

Ringing in a new asynchronous I/O API

Posted Jan 20, 2019 10:20 UTC (Sun) by nix (subscriber, #2304) [Link]

Yeah, exactly. I was confused despite being sure I would be confused and checking the parent article before posting (my fault, not the article's).

Ringing in a new asynchronous I/O API

Posted Jan 24, 2019 12:35 UTC (Thu) by joib (subscriber, #8541) [Link]

What are the mechanics of doing buffered AIO? I faintly recall all those old attempts (syslets, fibrils, whatever) failed partly because there was no natural in-kernel context for progressing those IOs? An in-kernel threadpool is of course always a possibility, but is that noticeably better than a user-space threadpool?

Physically-contiguous buffers

Posted Jan 16, 2019 16:14 UTC (Wed) by abatters (✭ supporter ✭, #6932) [Link] (3 responses)

This looks awesome.

Regarding IORING_REGISTER_BUFFERS, would it be possible to have the kernel use high-order allocations (if available) to get larger physically-contiguous buffers? If you are using the same buffers for DMA over and over again, then setting up a larger physically-contiguous buffer would reduce the number of scatterlist entries required for each I/O. For example, imagine you are submitting 1 MiB per I/O. With 4096-byte pages, that would take 256 scatterlist entries. But with high-order allocations, you might be able to do it with just a few scatterlist entries, saving a lot of overhead. This could be done either by having the kernel allocate the memory using high-order allocations and then letting userspace mmap() it, or by trying to compact the memory when it is submitted to IORING_REGISTER_BUFFERS.
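
The arithmetic in that example can be written out as a small sketch. It assumes the worst case in which every segment of the buffer is physically discontiguous from its neighbors, so one scatterlist entry is needed per segment; sg_entries_needed() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Number of scatterlist entries needed to describe io_bytes of data
 * when each physically contiguous segment holds segment_bytes. */
static size_t sg_entries_needed(size_t io_bytes, size_t segment_bytes)
{
    assert(segment_bytes > 0);
    /* Round up so that a partial trailing segment still gets an entry. */
    return (io_bytes + segment_bytes - 1) / segment_bytes;
}
```

With 4096-byte pages a 1 MiB request needs 256 entries, while 256 KiB contiguous chunks would bring that down to four.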

Physically-contiguous buffers

Posted Jan 16, 2019 18:03 UTC (Wed) by axboe (subscriber, #904) [Link] (1 responses)

Not only that, but you could also pre-map the SG lists, instead of having to do map SG and unmap SG for each IO. Right now the registered buffers only avoid the get_user_pages() and put_pages() for each IO, which is (by far) the biggest overhead. But if we fix the kernel parts as well, then we can avoid the dma map/unmap for each IO. That'd bypass the split as well, some quick mental math shows we should be able to kill ~5% of the overhead on my box with that.

In general we have various pieces of low hanging fruit on the block layer side, which are readily apparent now that we have an efficient interface into the kernel. Work in progress! But I'd like to wrap up io_uring first.

Physically-contiguous buffers

Posted Jan 16, 2019 23:31 UTC (Wed) by ms-tg (subscriber, #89231) [Link]

This sort of comment by the patch author is a major example of the value of LWN

Physically-contiguous buffers

Posted Jan 17, 2019 9:45 UTC (Thu) by nilsmeyer (guest, #122604) [Link]

And you could potentially allocate the buffers as huge pages, right?

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 16:15 UTC (Wed) by me@jasonclinton.com (subscriber, #52701) [Link] (8 responses)

> It's perhaps worth noting at this point that Axboe is working on a user-space library that will hide much of the complexity of this interface from most users.

What's the procedure for a user-space library tightly coupled to a kernel API, like this one, getting into glibc (or any of the rest of the libcs)?

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 16:19 UTC (Wed) by nix (subscriber, #2304) [Link]

You can just leave it as a separate library, like libaio is now, and libattr, and libacl, and many others. (Though if it's useful and glibc wants to use it itself, it's not unimaginable that it might find its way in there in time -- but glibc has harsh backward-compatibility constraints that argue in favour of a trial period in an external library in any case, until we know whether the API the library provides works well.)

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 18:05 UTC (Wed) by axboe (subscriber, #904) [Link] (6 responses)

System calls should go into libc, but anything else will reside in liburing. You can clone that here:

git://git.kernel.dk/liburing

though not a lot of items are in there yet. It does contain helpers to setup/teardown the ring, and submit/complete helpers for applications that don't want (or need) to muck with the ring itself. This will grow some more features, the intent is that most applications will _probably_ end up using that instead of handling all the details themselves.

Ringing in a new asynchronous I/O API

Posted Jan 17, 2019 23:12 UTC (Thu) by nix (subscriber, #2304) [Link] (5 responses)

Nice library name. The plan is to lure people into using it, right? :)

Ringing in a new asynchronous I/O API

Posted Jan 17, 2019 23:30 UTC (Thu) by axboe (subscriber, #904) [Link]

Don't give away all my secrets :-)

Ringing in a new asynchronous I/O API

Posted Jan 18, 2019 10:36 UTC (Fri) by NAR (subscriber, #1313) [Link] (3 responses)

My first association/misread was uring - urine, and that's not that nice a name. Maybe I just need new glasses.

Ringing in a new asynchronous I/O API

Posted Jan 18, 2019 12:06 UTC (Fri) by zdzichu (subscriber, #17118) [Link] (2 responses)

Not only yours.

Ringing in a new asynchronous I/O API

Posted Jan 18, 2019 17:22 UTC (Fri) by axboe (subscriber, #904) [Link] (1 responses)

It's the brain making that leap, since uring isn't a word it currently recognizes. This will go away as it becomes a bit more ubiquitous. I see no reason to change the name.

Ringing in a new asynchronous I/O API

Posted Jan 24, 2019 16:09 UTC (Thu) by Wol (subscriber, #4433) [Link]

Can't you stick a "t" in there? uring -> turing

Cheers,
Wol

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 19:24 UTC (Wed) by arjan (subscriber, #36785) [Link] (1 responses)

void *addr; /* buffer or iovecs */

hmm that makes 32/64 compat funky.. wonder if it really should just be a u64

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 19:30 UTC (Wed) by axboe (subscriber, #904) [Link]

It is, check the git repo!

Ringing in a new asynchronous I/O API

Posted Jan 16, 2019 20:09 UTC (Wed) by HIGHGuY (subscriber, #62277) [Link]

Would the interface support operations that require 2 file descriptors, like splicing from a file to a socket and vice versa? I have the impression it doesn’t but could have easily missed something.

Slightly more complex operations could be useful in combination with primitives like the P2P PCIe transfers that are being worked on to avoid going through main memory altogether.

Ringing in a new asynchronous I/O API

Posted Jan 20, 2019 13:22 UTC (Sun) by zse (guest, #120483) [Link] (1 responses)

Nice that we're finally getting a decent AIO option.

I haven't found the complete list of opcodes that are proposed, so don't know if this is already in the works, but I'd think you'll also need synchronization primitives (e.g. a barrier so that all io ops before it need to complete before those after the barrier can start).

In general this proposal kind of reminds me of the command queues you have for graphics hardware (OpenGL/Vulkan). I'm wondering if there is potential for (partial) unification or at least mutual inspiration...

Ringing in a new asynchronous I/O API

Posted Jan 24, 2019 12:37 UTC (Thu) by joib (subscriber, #8541) [Link]

For inspiration in the IO world, there's IOCP (https://en.wikipedia.org/wiki/Input/output_completion_port ).

How to register more files while using some registered files?

Posted May 11, 2019 10:35 UTC (Sat) by hnakamur (guest, #123503) [Link]

My understanding is you can register new set of file descriptors after unregistering all of old ones.

io_uring_register(ring_fd, IORING_REGISTER_FILES, fds, nr_files);
io_uring_register(ring_fd, IORING_UNREGISTER_FILES);
io_uring_register(ring_fd, IORING_REGISTER_FILES, fds2, nr_files2);

But, what to do if you want to add some more file descriptors while using some of already registered
file descriptors?

Ringing in a new asynchronous I/O API

Posted May 17, 2019 19:30 UTC (Fri) by crzbear (guest, #132097) [Link]

this sounds awesome

is there any particular reason the kernel has to allocate those buffers
couldn't they be passed from userspace in the setup call
and then the kernel maps those into its address space

while this might obviously lead to not properly aligned buffers,
the kernel can check that and return with an error if needed

this would do away with the mmapping

Guidance on using io_uring to support 60,000+ TCP connections with <1ms RTT

Posted Nov 9, 2023 2:30 UTC (Thu) by Tushar (guest, #167911) [Link]

Hi,

I am working on building a new application which is required to support 60,000+ tcp connections on a single server (preferably a single POSIX thread) with <1ms RTT. I am considering using io_uring for this.

I have not found any data on similar uses of io_uring in other applications. Do you have a benchmark that I might refer to, to see whether something like this may be possible?

Thanks for your time and help.