The iov_iter interface

By Jonathan Corbet
December 9, 2014

One of the most common tasks in the kernel is processing a buffer of data supplied by user space, possibly in several chunks. Perhaps unsurprisingly, this is a task that kernel code often gets wrong, leading to bugs and, possibly, security problems. The kernel contains a primitive (called "iov_iter") meant to make this task simpler. While iov_iter use is mostly confined to the memory-management and filesystem layers currently, it is slowly spreading out into other parts of the kernel. This interface is currently undocumented, a situation this article will attempt to remedy.

The iov_iter concept is not new; it was first added by Nick Piggin for the 2.6.24 kernel in 2007. But there has been an effort over the last year to expand this API and use it in more parts of the kernel; the 3.19 merge window should see it making its first steps into the networking subsystem, for example.

An iov_iter structure is essentially an iterator for working through an iovec structure, defined in <uapi/linux/uio.h> as:

    struct iovec
    {
	void __user *iov_base;
	__kernel_size_t iov_len;
    };

This structure matches the user-space iovec structure defined by POSIX and used with system calls like readv(). As the "vec" portion of the name would suggest, iovec structures tend to come in arrays; as a whole, an iovec describes a buffer that may be scattered in both physical and virtual memory.
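
As a user-space illustration of how an iovec array describes one logical buffer spread over several chunks, the following sketch gathers two strings into a pipe with writev() and scatters them back into two differently sized buffers with readv(). Both calls are standard POSIX; the helper name gather_scatter_demo() and the strings used are assumptions made up for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "hello, " and "world" through a pipe as one
 * gathered write, then read the bytes back scattered across the two
 * caller-supplied buffers. Returns the byte count from readv(), or -1.
 */
ssize_t gather_scatter_demo(char *first, size_t first_len,
                            char *second, size_t second_len)
{
    char part1[] = "hello, ";
    char part2[] = "world";
    struct iovec out[2] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
    };
    struct iovec in[2] = {
        { .iov_base = first,  .iov_len = first_len },
        { .iov_base = second, .iov_len = second_len },
    };
    int fds[2];
    ssize_t n;

    assert(first && second);
    if (pipe(fds) < 0)
        return -1;
    /* The two source chunks arrive in the pipe as one 12-byte stream. */
    if (writev(fds[1], out, 2) < 0)
        n = -1;
    else
        n = readv(fds[0], in, 2);   /* fills "first", then "second" */
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

The chunk boundaries on the reading side need not match those on the writing side; the iovec array only says where each piece of the stream lives in memory.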

The actual iov_iter structure is defined in <linux/uio.h>:

    struct iov_iter {
	int type;
	size_t iov_offset;
	size_t count;
	const struct iovec *iov; /* SIMPLIFIED - see below */
	unsigned long nr_segs;
    };

The type field describes the type of the iterator. It is a bitmask containing, among other things, either READ or WRITE depending on whether data is being read into the iterator or written from it. The data direction, thus, refers not to the iterator itself, but to the other part of the data transaction; an iov_iter created with a type of READ will be written to.

Beyond that, iov_offset contains the offset to the first byte of interesting data in the first iovec pointed to by iov. The total amount of data pointed to by the iovec array is stored in count, while the number of iovec structures is stored in nr_segs. Note that most of these fields will change as code "iterates" through the buffer. They describe a cursor into the buffer, rather than the buffer as a whole.
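
The cursor behavior can be modeled in user space. The sketch below is not the kernel's code: struct model_iter and both functions are hypothetical stand-ins that mirror the simplified fields shown above, and the model assumes that count never exceeds the sum of the segment lengths.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* User-space model of the cursor fields; not the kernel's definition. */
struct model_iter {
    size_t iov_offset;          /* offset into the current segment */
    size_t count;               /* bytes remaining in the whole buffer */
    const struct iovec *iov;    /* current segment */
    unsigned long nr_segs;      /* segments remaining, current included */
};

void model_iter_init(struct model_iter *i, const struct iovec *iov,
                     unsigned long nr_segs, size_t count)
{
    assert(i && iov);
    i->iov_offset = 0;
    i->count = count;
    i->iov = iov;
    i->nr_segs = nr_segs;
}

/* Step the cursor forward by up to size bytes without touching any data. */
void model_iter_advance(struct model_iter *i, size_t size)
{
    if (size > i->count)
        size = i->count;
    i->count -= size;
    while (size > 0) {
        size_t left = i->iov->iov_len - i->iov_offset;

        if (size < left) {
            /* The cursor stays inside the current segment. */
            i->iov_offset += size;
            break;
        }
        /* The current segment is used up; move on to the next one. */
        size -= left;
        i->iov++;
        i->nr_segs--;
        i->iov_offset = 0;
    }
}
```

Advancing by six bytes through segments of four and six bytes leaves the model pointing at the second segment with an iov_offset of two, a count of four, and an nr_segs of one; the iovec array itself is never written to.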

Working with struct iov_iter

Before use, an iov_iter must be initialized to contain an (already populated) iovec with:

    void iov_iter_init(struct iov_iter *i, int direction,
		       const struct iovec *iov, unsigned long nr_segs,
		       size_t count);

Then, for example, data can be moved between the iterator and user space with either of:

    size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
    size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);

The naming here can be a little confusing until one gets the hang of it. A call to copy_to_iter() will copy bytes bytes of data from the buffer at addr to the user-space buffer indicated by the iterator. So copy_to_iter() can be thought of as being like a variant of copy_to_user() that takes an iterator rather than a single buffer. Similarly, copy_from_iter() will copy the data from the user-space buffer to addr. The similarity with copy_to_user() does not extend to the return value, though: copy_to_user() returns the number of bytes not copied, while these functions return the number of bytes that were actually copied.

Note that these calls will "advance" the iterator through the buffer to correspond to the amount of data transferred. In other words, the iov_offset, count, nr_segs, and iov fields of the iterator will all be changed as needed. So two calls to copy_from_iter() will copy two successive areas from user space. Among other things, this means that the code owning the iterator must remember the base address for the iovec array, since the iov value in the iov_iter structure may change.
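
A user-space model may help to show what "advancing" means in practice. In the sketch below, struct model_iter and model_copy_from_iter() are hypothetical; memcpy() stands in for the user-space access done by the real function, and count is assumed not to exceed the total segment length. The model returns the number of bytes it copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* User-space model of the cursor fields; not the kernel's definition. */
struct model_iter {
    size_t iov_offset;          /* offset into the current segment */
    size_t count;               /* bytes remaining in the whole buffer */
    const struct iovec *iov;    /* current segment */
    unsigned long nr_segs;      /* segments remaining, current included */
};

/*
 * Copy up to "bytes" bytes from the segments behind the iterator into
 * addr, advance the cursor past them, and return the number copied.
 */
size_t model_copy_from_iter(void *addr, size_t bytes, struct model_iter *i)
{
    char *to = addr;
    size_t copied = 0;

    assert(addr && i);
    if (bytes > i->count)
        bytes = i->count;
    while (copied < bytes) {
        size_t left = i->iov->iov_len - i->iov_offset;
        size_t want = bytes - copied;
        size_t chunk = want < left ? want : left;

        memcpy(to + copied,
               (const char *)i->iov->iov_base + i->iov_offset, chunk);
        copied += chunk;
        i->iov_offset += chunk;
        if (i->iov_offset == i->iov->iov_len) {
            /* Segment exhausted: step the cursor to the next one. */
            i->iov++;
            i->nr_segs--;
            i->iov_offset = 0;
        }
    }
    i->count -= copied;
    return copied;
}
```

With segments holding "abcd" and "efghij", a call asking for three bytes yields "abc" and a following call asking for five yields "defgh". The iovec array is left unmodified throughout, but the iov pointer in the iterator moves, which is why the owner needs to keep its own pointer to the start of the array.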

Various other functions exist. To move data referenced by a page structure into or out of an iterator, use:

    size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
			     struct iov_iter *i);
    size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
			       struct iov_iter *i);

Only the single page provided will be copied to or from, so these functions should not be asked to copy data that would cross the page boundary.
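
A caller can stay within the page by clamping the requested length first. The sketch below shows the arithmetic only; clamp_to_page() and MODEL_PAGE_SIZE are hypothetical names, and the 4KB page size is an assumption (kernel code would use PAGE_SIZE).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumption: 4KB pages */

/*
 * Return the largest length, no greater than "bytes", that can be copied
 * starting at "offset" without running past the end of the page.
 */
size_t clamp_to_page(size_t offset, size_t bytes)
{
    size_t room;

    assert(offset < MODEL_PAGE_SIZE);
    room = MODEL_PAGE_SIZE - offset;
    return bytes < room ? bytes : room;
}
```

A request for 200 bytes at offset 4000, for example, is trimmed to the 96 bytes that remain in the page; the rest would have to go to the next page in a separate call.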

Code running in atomic context can attempt to obtain data from user space with:

    size_t iov_iter_copy_from_user_atomic(struct page *page, struct iov_iter *i,
					  unsigned long offset, size_t bytes);

Since this copy will be done in atomic mode, it will only succeed if the data is already resident in RAM; callers must thus be prepared for a higher-than-normal chance of failure.

If it is necessary to map the user-space buffer into the kernel, one of these calls can be used:

    ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
                               size_t maxsize, unsigned maxpages, size_t *start);
    ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
                                     size_t maxsize, size_t *start);

Either function turns into a call to get_user_pages_fast(), causing (hopefully) the pages to be brought in and their locations stored in the pages array. The difference between them is that iov_iter_get_pages() expects the pages array to be allocated by the caller, while iov_iter_get_pages_alloc() will do the allocation itself. In that case, the array returned in pages must eventually be freed with kvfree(), since it might have been allocated with either kmalloc() or vmalloc().
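
Sizing the pages array requires knowing how many pages a buffer touches, which depends on where the buffer starts within its first page. The sketch below shows that arithmetic under stated assumptions: the helper names and the 4KB MODEL_PAGE_SIZE are hypothetical, and the in-page offset it computes is what this example assumes the start parameter is meant to report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumption: 4KB pages */

/* Offset of an address within the page that contains it. */
size_t page_start_offset(uintptr_t addr)
{
    return addr & (MODEL_PAGE_SIZE - 1);
}

/* Number of pages touched by the byte range [addr, addr + len). */
size_t pages_spanned(uintptr_t addr, size_t len)
{
    /* Guard the rounding arithmetic below against wrapping. */
    assert(len <= SIZE_MAX - 2 * MODEL_PAGE_SIZE);
    if (len == 0)
        return 0;
    return (page_start_offset(addr) + len + MODEL_PAGE_SIZE - 1)
           / MODEL_PAGE_SIZE;
}
```

A 32-byte buffer starting 16 bytes before a page boundary occupies two pages, so an unaligned buffer can need one more array slot than its length alone would suggest.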

Advancing through the iterator without moving any data can be done with:

    void iov_iter_advance(struct iov_iter *i, size_t size);

The buffer referred to by an iterator (or a portion thereof) can be cleared with:

    size_t iov_iter_zero(size_t bytes, struct iov_iter *i);

Information about the iterator is available from a number of helper functions:

    size_t iov_iter_single_seg_count(const struct iov_iter *i);
    int iov_iter_npages(const struct iov_iter *i, int maxpages);
    size_t iov_length(const struct iovec *iov, unsigned long nr_segs);

A call to iov_iter_single_seg_count() returns the length of the data in the first segment of the buffer. iov_iter_npages() reports the number of pages occupied by the buffer in the iterator, while iov_length() returns the total data length. The latter function must be used with care, since it trusts the iov_len field in the iovec structures. If that data comes from user space, it could cause integer overflows in the kernel.
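
The overflow risk comes from summing untrusted lengths. A defensive version of that sum, sketched here in user space, refuses any array whose total would exceed a cap; checked_iov_length() is a hypothetical helper and the INT_MAX cap is an assumption chosen for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Sum the segment lengths into *total. Returns 0 on success, or -1
 * (leaving *total untouched) if the sum would exceed INT_MAX.
 */
int checked_iov_length(const struct iovec *iov, unsigned long nr_segs,
                       size_t *total)
{
    size_t sum = 0;

    assert(iov && total);
    for (unsigned long seg = 0; seg < nr_segs; seg++) {
        /* sum never exceeds INT_MAX, so this subtraction cannot wrap. */
        if (iov[seg].iov_len > (size_t)INT_MAX - sum)
            return -1;
        sum += iov[seg].iov_len;
    }
    *total = sum;
    return 0;
}
```

Comparing each length against the remaining headroom, rather than adding first and checking afterward, means the running sum never wraps.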

Not just iovecs

The definition of struct iov_iter shown above does not quite match what is actually found in the kernel. Instead of a single field for the iov array, the real structure has (in 3.18):

    union {
	const struct iovec *iov;
	const struct bio_vec *bvec;
    };

In other words, the iov_iter structure is also set up to work with arrays of bio_vec structures, the page/offset/length segments used by the block layer's BIO structures. Such iterators are marked by having ITER_BVEC included in the type field bitmask. Once such an iterator is created, all of the above calls will work with it as if it were an "ordinary" iterator using iovec structures. Currently, the use of bio_vec-based iterators in the kernel is minimal; they can only be found in the swap and splice() code.
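
The pattern is a tagged union: a bit in type says which member of the union is valid. A stripped-down user-space model might look like the following; every name in it is hypothetical, the flag value is made up, and struct model_bvec only imitates the page, length, and offset layout of a bio_vec.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define MODEL_ITER_BVEC 16      /* hypothetical flag value */

/* Stand-in for a bio_vec: one piece of one page. */
struct model_bvec {
    void *page;
    unsigned int len;
    unsigned int offset;
};

struct model_iter {
    int type;                   /* bitmask; may contain MODEL_ITER_BVEC */
    size_t iov_offset;          /* bytes already consumed in first segment */
    union {                     /* anonymous union, as in C11 */
        const struct iovec *iov;
        const struct model_bvec *bvec;
    };
};

/* Bytes left in the first segment, whichever kind of segment it is. */
size_t model_first_seg_remaining(const struct model_iter *i)
{
    assert(i);
    if (i->type & MODEL_ITER_BVEC)
        return i->bvec->len - i->iov_offset;
    return i->iov->iov_len - i->iov_offset;
}
```

Callers of a helper like this never look at the union directly, which is what lets the same calls serve both kinds of iterator.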

Coming in 3.19

The 3.19 kernel is likely to see a substantial rewrite of the iov_iter code aimed at reducing the vast amount of boilerplate code needed to implement all of the above-mentioned functions. The code is indeed shorter afterward, but at the cost of introducing a fair amount of mildly frightening preprocessor macro magic to generate the needed boilerplate on demand.

The iov_iter code already works if the "user-space" buffer is actually located in kernel space. In 3.19, things will be formalized and optimized a bit. Such an iterator will be created with:

    void iov_iter_kvec(struct iov_iter *i, int direction,
		       const struct kvec *iov, unsigned long nr_segs,
		       size_t count);

There will also be a new kvec field added to the union shown above for this case.

Finally, some functions have been added to help with the networking case; it will be possible, for example, to copy a buffer and generate a checksum in the process.
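
The appeal of a combined copy-and-checksum operation is that the data is only walked once. The sketch below illustrates the idea in user space over a single contiguous buffer; copy_and_checksum() is a hypothetical name, and the choice of the 16-bit ones'-complement Internet checksum from RFC 1071 is an assumption made for the example, not a description of the kernel's helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst and return the RFC 1071 Internet
 * checksum of the data, computed during the same pass over the bytes.
 */
uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint8_t *from = src;
    uint8_t *to = dst;
    uint64_t sum = 0;

    assert(dst && src);
    for (size_t pos = 0; pos < len; pos++) {
        to[pos] = from[pos];
        /* Bytes at even offsets are the high half of a 16-bit word. */
        if (pos & 1)
            sum += from[pos];
        else
            sum += (uint64_t)from[pos] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7, the example used in RFC 1071, the folded sum is 0xddf2 and the function returns its complement, 0x220d.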

The end result is that the iov_iter interface is slowly becoming the standard way of hiding many of the complexities associated with handling user-space buffers. We can expect to see its use encouraged in more places in the future. It only took seven years or so, but iov_iter appears to be reaching a point of being an interface that most kernel developers will want to be aware of.


The iov_iter interface

Posted Dec 11, 2014 8:41 UTC (Thu) by viro (subscriber, #7872)

Some additions (and I'll probably need to write more or less coherent description of the entire thing):

1) iov_iter_count() works for any type, and it's fast - much faster than iov_length() (that one is iovec-only).

2) bvec is not bio-backed; it's bio_vec-backed. I.e. it's an array of triples <page, offset, length>. Note that pages do _not_ need to be mapped - primitives do everything themselves. bvec ones are basically sglists. struct bio_vec is a misnomer - something like struct part_of_page would be more accurate. bio uses those, but unlike bio they are not related to block layer.

3) we are actually pretty careful about validation of iovec-backed ones - they are guaranteed to pass access_ok() on all components and to have total length of segments fit into signed 32bit integer.

4) right now most of the ->aio_read()/->aio_write() instances got replaced by ->read_iter()/->write_iter(); the sockets are the major exception and they are going to get converted in 3.20. Along with ->aio_read/->aio_write, ->splice_write is almost extinct (also derived from ->write_iter). ->splice_read is trickier, but hopefully it (and ->sendpage) will get dealt with. I very much hope to get rid of the piles of read/write methods. For me the original reason to get into that area was the locking mess with ->splice_write() vs. ->aio_write() - basically, they kept breeding AB-BA violations of locking order *and* code duplication from hell. What we had kinda-sorta worked if ->i_mutex had been the only lock involved, but anything trickier (XFS, cluster filesystems, etc.) got very ugly very fast.

5) a lot of recent work had been in networking code - some of that will be in 3.19, some - 3.20. The basic problem was that sendmsg and recvmsg took an iovec and buggered the hell out of it; some left it unchanged, some drained it (i.e. incremented ->iov_base/decremented ->iov_len as they dealt with more and more data). Some were even nastier and drained them partially - e.g. eat the first 4 bytes, no matter how much got sent. For syscalls it didn't matter - we copied iovecs from userland and discarded them in the end. But for _kernel_ users of that stuff (network filesystems, tun/tap, etc.) the things were very unpleasant. In some cases the code knew which protocol family it was dealing with and could use the knowledge of what specific ->sendmsg() or ->recvmsg() instance did to iovec. In most of those places, however, it had to construct a throwaway copy of iovec and then drain the original itself. There's quite a collection of unhappy comments along those lines. Worse yet, there were actual bugs when a copy was needed but hadn't been made.

With the ongoing rewrite (mostly done; ->recvmsg() conversion is in net-next, ->sendmsg() one is in vfs.git#iov_iter-net and will miss this merge window) we get iov_iter (->msg_iter) instead of ->msg_iov/->msg_iovlen in kernel-side struct msghdr. Underlying iovec (or bio_vec, or kvec - the code is type-agnostic now) is never modified and ->msg_iter is always advanced by the amount of bytes actually transferred. That, of course, allows to avoid a lot of PITA in the kernel callers. I have some patches in that direction in a local queue, still need more massage.

_Maybe_ we'll end up dropping 'size' argument of ->sendmsg() and ->recvmsg(); it's redundant (always equal to the value of iov_iter_count(&msg->msg_iter) at the entry into method), but I'm not sure if it's worth doing.

6) next fun work in that area will be around ->splice_read() and page-stealing in particular. I have some of that plotted out, but that's a story for larger posting (or article, for that matter). So's the general background of the whole thing, actually - a coherent description of that would be interesting (to write down, if nothing else), but it's way too large for LWN comment...
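
To make the "draining" described in point 5 concrete, here is a user-space sketch of that style of consumption. drain_iovec() is a hypothetical function written for this illustration; it is not taken from any ->sendmsg() or ->recvmsg() instance.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Old-style "draining": consume n bytes by editing the caller's iovec
 * array in place, bumping iov_base and shrinking iov_len as data is used.
 */
void drain_iovec(struct iovec *iov, unsigned long nr_segs, size_t n)
{
    assert(iov);
    for (unsigned long seg = 0; seg < nr_segs && n > 0; seg++) {
        size_t chunk = n < iov[seg].iov_len ? n : iov[seg].iov_len;

        iov[seg].iov_base = (char *)iov[seg].iov_base + chunk;
        iov[seg].iov_len -= chunk;
        n -= chunk;
    }
}
```

After draining five bytes from segments of four and six bytes, the first segment has length zero and the second starts one byte later than it did. A caller that still needs the original description of the buffer has to copy the array before the call; with a separate cursor, the array is never written to and no such copy is needed.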

The iov_iter interface

Posted Dec 15, 2014 15:23 UTC (Mon) by willy (subscriber, #9762)

> 2) bvec is not bio-backed; it's bio_vec-backed. I.e. it's an array of triples <page, offset, length>. Note that pages do _not_ need to be mapped - primitives do everything themselves. bvec ones are basically sglists. struct bio_vec is a misnomer - something like struct part_of_page would be more accurate. bio uses those, but unlike bio they are not related to block layer.

We currently have three data structures for exactly this purpose, struct skb_frag_struct and struct page_frag being the other two. Note that both of the others are slightly better packed than bio_vec on 32-bit machines.

The iov_iter interface

Posted Feb 4, 2018 9:44 UTC (Sun) by m0r1k (guest, #122374)

>> size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
>> size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);

>> The naming here can be a little confusing until one gets the hang of it. A call to copy_to_iter() will copy bytes data from the buffer at addr to the user-space buffer indicated by the iterator. So copy_to_iter() can be thought of as being like a variant of copy_to_user() that takes an iterator rather than a single buffer.
>> Similarly, copy_from_iter() will copy the data from the user-space buffer to addr. The similarity with copy_to_user() continues through to the return value, which is the number of bytes not copied

actually, these functions return the number of bytes that WERE COPIED; I lost two days because of this mistake in the current documentation
please fix it

linux-4.12.9, lib/iov_iter.c:

size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
{
    char *to = addr;
    if (unlikely(i->type & ITER_PIPE)) {
        WARN_ON(1);
        return 0;
    }
    iterate_and_advance(i, bytes, v,
        __copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
                 v.iov_len),
        memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
                 v.bv_offset, v.bv_len),
        memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
    )

    return bytes;
}
EXPORT_SYMBOL(copy_from_iter);

#define iterate_and_advance(i, n, v, I, B, K) {         \
    if (unlikely(i->count < n))             \
        n = i->count;                   \
    if (i->count) {                     \
        size_t skip = i->iov_offset;            \
        if (unlikely(i->type & ITER_BVEC)) {        \
            const struct bio_vec *bvec = i->bvec;   \
            struct bio_vec v;           \
            struct bvec_iter __bi;          \
            iterate_bvec(i, n, v, __bi, skip, (B))  \
            i->bvec = __bvec_iter_bvec(i->bvec, __bi);  \
            i->nr_segs -= i->bvec - bvec;       \
            skip = __bi.bi_bvec_done;       \
        } else if (unlikely(i->type & ITER_KVEC)) { \
            const struct kvec *kvec;        \
            struct kvec v;              \
            iterate_kvec(i, n, v, kvec, skip, (K))  \
            if (skip == kvec->iov_len) {        \
                kvec++;             \
                skip = 0;           \
            }                   \
            i->nr_segs -= kvec - i->kvec;       \
            i->kvec = kvec;             \
        } else {                    \
            const struct iovec *iov;        \
            struct iovec v;             \
            iterate_iovec(i, n, v, iov, skip, (I))  \
            if (skip == iov->iov_len) {     \
                iov++;              \
                skip = 0;           \
            }                   \
            i->nr_segs -= iov - i->iov;     \
            i->iov = iov;               \
        }                       \
        i->count -= n;                  \
        i->iov_offset = skip;               \
    }                           \
}

size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
    const char *from = addr;
    if (unlikely(i->type & ITER_PIPE))
        return copy_pipe_to_iter(addr, bytes, i);
    iterate_and_advance(i, bytes, v,
        __copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
                   v.iov_len),
        memcpy_to_page(v.bv_page, v.bv_offset,
                   (from += v.bv_len) - v.bv_len, v.bv_len),
        memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
    )

    return bytes;
}
EXPORT_SYMBOL(copy_to_iter);

static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
                struct iov_iter *i)
{
    struct pipe_inode_info *pipe = i->pipe;
    size_t n, off;
    int idx;

    if (!sanity(i))
        return 0;

    bytes = n = push_pipe(i, bytes, &idx, &off);
    if (unlikely(!n))
        return 0;
    for ( ; n; idx = next_idx(idx, pipe), off = 0) {
        size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
        memcpy_to_page(pipe->bufs[idx].page, off, addr, chunk);
        i->idx = idx;
        i->iov_offset = off + chunk;
        n -= chunk;
        addr += chunk;
    }
    i->count -= bytes;
    return bytes;
}

static size_t push_pipe(struct iov_iter *i, size_t size,
            int *idxp, size_t *offp)
{
    struct pipe_inode_info *pipe = i->pipe;
    size_t off;
    int idx;
    ssize_t left;

    if (unlikely(size > i->count))
        size = i->count;
    if (unlikely(!size))
        return 0;

    left = size;
    data_start(i, &idx, &off);
    *idxp = idx;
    *offp = off;
    if (off) {
        left -= PAGE_SIZE - off;
        if (left <= 0) {
            pipe->bufs[idx].len += size;
            return size;
        }
        pipe->bufs[idx].len = PAGE_SIZE;
        idx = next_idx(idx, pipe);
    }
    while (idx != pipe->curbuf || !pipe->nrbufs) {
        struct page *page = alloc_page(GFP_USER);
        if (!page)
            break;
        pipe->nrbufs++;
        pipe->bufs[idx].ops = &default_pipe_buf_ops;
        pipe->bufs[idx].page = page;
        pipe->bufs[idx].offset = 0;
        if (left <= PAGE_SIZE) {
            pipe->bufs[idx].len = left;
            return size;
        }
        pipe->bufs[idx].len = PAGE_SIZE;
        left -= PAGE_SIZE;
        idx = next_idx(idx, pipe);
    }
    return size - left;
}

----

Regards, Roman Chechnev