The evolution of pipe buffers

[Posted January 18, 2005 by corbet]

Last week, this page looked at the new circular buffer structure used to implement Unix pipes in 2.6.11-rc1, and noted that the plan was to evolve that structure into something more general. Since then, Linus has taken a couple more steps; it must be time to catch up.

One change which has already been merged is the addition of a set of operations for pipe buffers:

    struct pipe_buf_operations {
            int can_merge;
            void *(*map)(struct file *, struct pipe_inode_info *,
                         struct pipe_buffer *);
            void (*unmap)(struct pipe_inode_info *, struct pipe_buffer *);
            void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
    };

The can_merge flag addresses one of the issues raised last week: coalescing of writes into existing pages in the buffer. If can_merge is non-zero, coalescing will be performed. Otherwise, each write to a pipe buffer will result in the creation of a new circular buffer entry, and, by default, the allocation of a new page.
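The effect of can_merge can be illustrated with a small user-space sketch. This is not the kernel code: the sketch_* names, the 4096-byte page size, and the use of malloc() in place of page allocation are all assumptions made for illustration, and the buffer here only ever fills (no reader is modeled, so the circular indexing is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size */
#define SKETCH_PIPE_BUFS 16     /* entries in the buffer array */

/* Hypothetical stand-ins for the kernel structures. */
struct sketch_ops {
    int can_merge;              /* non-zero: later writes may be appended */
};

struct sketch_buf {
    char *page;                 /* stands in for a struct page */
    size_t len;                 /* bytes of valid data in the page */
    const struct sketch_ops *ops;
};

struct sketch_pipe {
    struct sketch_buf bufs[SKETCH_PIPE_BUFS];
    unsigned int nrbufs;        /* entries currently in use */
};

/*
 * Write at most one page of data. Returns the number of bytes written,
 * or -1 if the pipe is full or no memory is available.
 */
static long sketch_pipe_write(struct sketch_pipe *pipe,
                              const struct sketch_ops *ops,
                              const void *data, size_t len)
{
    assert(len <= SKETCH_PAGE_SIZE);

    /* First try to coalesce into the most recent entry. */
    if (pipe->nrbufs > 0) {
        struct sketch_buf *last = &pipe->bufs[pipe->nrbufs - 1];

        if (last->ops->can_merge && last->len + len <= SKETCH_PAGE_SIZE) {
            memcpy(last->page + last->len, data, len);
            last->len += len;
            return (long)len;
        }
    }

    /* No merge possible: use a new entry and allocate a new page. */
    if (pipe->nrbufs == SKETCH_PIPE_BUFS)
        return -1;

    struct sketch_buf *buf = &pipe->bufs[pipe->nrbufs];

    buf->page = malloc(SKETCH_PAGE_SIZE);
    if (!buf->page)
        return -1;
    memcpy(buf->page, data, len);
    buf->len = len;
    buf->ops = ops;
    pipe->nrbufs++;
    return (long)len;
}
```

In this sketch, two small writes made with can_merge set end up sharing one entry, while the same two writes with can_merge clear consume two entries and two pages.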

The map() and unmap() methods are charged with controlling the visibility of pipe buffer pages in the kernel's virtual address space. The default map() operation for buffers implementing Unix pipes is quite simple:

    static void *anon_pipe_buf_map(struct file *file, 
                                   struct pipe_inode_info *info, 
                                   struct pipe_buffer *buf)
    {
            return kmap(buf->page);
    }

Since the mapping operation has been abstracted out, there are now fewer assumptions regarding how data is really stored within a pipe buffer. This opens the door to different pipe implementations, such as pipes which implement a direct window into device memory.

The release() method should clean things up when the pipe buffer is no longer needed.
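To see why abstracting these three methods matters, consider a user-space analogue of the operations table. Everything named demo_* or heap_* below is hypothetical, and the heap-backed buffer merely stands in for a page; the point of the sketch is that the reader function reaches the data only through map(), unmap() and release(), so a different backing store could be substituted without changing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct demo_buffer;

/* User-space analogue of pipe_buf_operations (hypothetical names). */
struct demo_buf_ops {
    void *(*map)(struct demo_buffer *);
    void (*unmap)(struct demo_buffer *);
    void (*release)(struct demo_buffer *);
};

struct demo_buffer {
    void *mem;                      /* backing storage, stands in for buf->page */
    size_t len;                     /* bytes of valid data */
    int mapped;                     /* outstanding map() calls */
    const struct demo_buf_ops *ops;
};

/* Heap-backed buffer: "mapping" simply hands back the pointer. */
static void *heap_map(struct demo_buffer *buf)
{
    buf->mapped++;
    return buf->mem;
}

static void heap_unmap(struct demo_buffer *buf)
{
    buf->mapped--;
}

static void heap_release(struct demo_buffer *buf)
{
    free(buf->mem);
    buf->mem = NULL;
    buf->len = 0;
}

static const struct demo_buf_ops heap_ops = { heap_map, heap_unmap, heap_release };

/* Fill a buffer with a copy of data. Returns 0 on success, -1 on failure. */
static int demo_heap_init(struct demo_buffer *buf, const void *data, size_t len)
{
    buf->mem = malloc(len ? len : 1);
    if (!buf->mem)
        return -1;
    memcpy(buf->mem, data, len);
    buf->len = len;
    buf->mapped = 0;
    buf->ops = &heap_ops;
    return 0;
}

/* Reader that knows nothing about how the data is stored. */
static size_t demo_read(struct demo_buffer *buf, char *dst, size_t n)
{
    assert(buf->ops != NULL);

    char *src = buf->ops->map(buf);

    if (n > buf->len)
        n = buf->len;
    memcpy(dst, src, n);
    buf->ops->unmap(buf);
    return n;
}
```

A caller would pair demo_heap_init() with a final buf->ops->release(buf) once the data has been consumed, mirroring the role the article gives to release().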

Linus has also created an initial implementation of a splice() system call, though this work is clearly not ready for merging at this point. This system call looks like:

    long sys_splice(int fdin, int fdout, size_t len, unsigned long flags);

fdin and fdout are two file descriptors, one of which is expected to be a pipe; a call to sys_splice() will result in up to len bytes being copied from fdin to fdout. The flags argument is not currently used by the sample implementation.

To make sys_splice() work, Linus added two new methods to the ever-expanding file_operations structure:

    ssize_t (*splice_write)(struct inode *in_pipe, struct file *out, 
                            size_t len, unsigned long flags);
    ssize_t (*splice_read)(struct file *in, struct inode *out_pipe, 
                           size_t len, unsigned long flags);

The patch includes a generic splice_read() implementation suitable for filesystem-backed file descriptors. It simply populates the page cache with some pages from the file, then loads those pages into the pipe buffer represented by out_pipe. Like ordinary read() and write() methods, the splice variants can transfer fewer bytes than requested. Linus's version will stop at the maximum capacity of a pipe buffer - 16 pages, currently.
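Because a single call can come up short, a caller wanting to move a large amount of data has to loop, just as with read() and write(). The following user-space sketch models only the arithmetic of that loop. The demo_* names are hypothetical, the 4096-byte page size is an assumption, and the pipe is assumed to be fully drained between calls.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE  4096UL  /* assumed page size */
#define DEMO_PIPE_PAGES 16UL    /* pipe capacity given in the article */

/*
 * Hypothetical stand-in for a splice_read() method: it "moves" at most
 * one pipe's worth of data and never more than the file has left.
 */
static long demo_splice_read(size_t remaining_in_file, size_t len)
{
    size_t cap = DEMO_PAGE_SIZE * DEMO_PIPE_PAGES;
    size_t n = len;

    if (n > remaining_in_file)
        n = remaining_in_file;
    if (n > cap)
        n = cap;
    return (long)n;
}

/*
 * Keep calling until len bytes have moved or the file runs out.
 * Returns the number of calls that transferred data; the total byte
 * count is stored in *moved.
 */
static unsigned int demo_splice_all(size_t file_size, size_t len, size_t *moved)
{
    unsigned int calls = 0;
    size_t total = 0;

    assert(moved != NULL);

    while (total < len) {
        long n = demo_splice_read(file_size - total, len - total);

        if (n <= 0)
            break;              /* end of file */
        total += (size_t)n;
        calls++;
    }
    *moved = total;
    return calls;
}
```

Under these assumptions, moving 200000 bytes takes four calls: three full 65536-byte transfers and a final one of 3392 bytes.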

As Linus acknowledges, there are a number of shortcomings to the current implementation - it is incomplete, the interfaces are ugly, and it will oops the system if anything goes wrong. It is, however, an indication of where he expects this work will lead. Stay tuned.


the realities of splice

Posted Jan 20, 2005 4:17 UTC (Thu) by bronson (guest, #4806)

The code we saw last week was sleek and pretty. This week, in making it more useful, it appears to have grown all sorts of hair. It will be interesting to see how well it can be cleaned back up.

Is splice only intended for pipes? I hope that eventually I could splice, say, a file to a socket (a la sendfile), socket to a socket, etc.

the realities of splice

Posted Jan 20, 2005 4:40 UTC (Thu) by bradfitz (subscriber, #4378)

I hope that eventually I could splice, say, a file to a socket (a la sendfile), socket to a socket, etc.

That's the plan, as I've been reading it.

the realities of splice

Posted Jan 20, 2005 5:13 UTC (Thu) by dlang (guest, #313)

From his posts on the 19th, it will remain pipe <-> other.

So you can do file -> pipe -> socket, but not file -> socket.

the realities of splice

Posted Jan 20, 2005 14:29 UTC (Thu) by alangley (guest, #23266)

Actually, you have to wonder why it's a new syscall at all. The interface is almost exactly the same as sendfile, so why has Linus added another syscall at all?

(and the outfd, infd arguments are round the wrong way at the moment)

AGL

the realities of splice

Posted Oct 11, 2006 15:19 UTC (Wed) by Niam (guest, #41009)

So I don't understand what this new syscall is for!
Sendfile seems much better, because it can work with any fd...

The only "plus" is that it's faster [but it works with pipes!].
It seems to me that it would be better to modify the sendfile call to add a pipe mode.

Now, if I write a new program, I should write

    #ifdef __splice
    splice(..)
    #elif __tee
    tee(...)
    #else
    sendfile(..)
    #endif

I can't see what they are really for...