KS2007: Containers

By Jonathan Corbet
September 10, 2007

LWN.net Kernel Summit 2007 coverage

For the first time in a few years, virtualization was not on the agenda at the 2007 kernel summit. The related field of containers, however, was deemed worth talking about. The virtualization problem has been mostly solved, at least at the kernel level, but there is still a lot of work to do in the containers area.

Paul Menage talked about the process containers patch, which has recently been rebranded "control groups." The control groups API is currently being used by the CFS scheduler, cpusets, and the memory controller code. Work in progress includes rlimits and an interface to the process freezer used by the suspend/resume code. Controlling the freezer via control groups allows user space to freeze specific groups of processes, which, in turn, is very useful when implementing checkpointing and live migration. In particular, with control groups, it will be possible to freeze an entire group of processes in an atomic way.

Control groups have very little overhead when not in use. There is an approximately 1% hit on the fork() and exit() calls when control groups are being used. The control groups code is managed by way of a virtual filesystem. This filesystem is a user-space API which must be managed carefully; there needs to be consistency across the various controllers which can work with control groups. To that end, parts of this interface are being pushed into generic code when possible. One other issue is the use of control groups within containers. It would be nice if a containerized system could manage control groups for processes within the container, but that is not yet implemented.
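Because the interface is a filesystem, user space can drive it with ordinary file operations. What follows is a minimal sketch, assuming the conventional layout in which each group is a directory and a task is moved into a group by writing its PID to that directory's "tasks" file. The mount point in the closing comment and all helper names are hypothetical; real use needs a mounted hierarchy and suitable privileges.

```python
from pathlib import Path


def create_group(hierarchy_root, name):
    """Create a control group as a subdirectory of a mounted hierarchy.

    On a real control group filesystem the kernel populates the new
    directory with control files; on an ordinary directory (a dry run)
    it simply stays empty.
    """
    group = Path(hierarchy_root) / name
    group.mkdir(parents=True, exist_ok=True)
    return group


def attach_task(group_dir, pid):
    """Move one task into a group by writing its PID to the 'tasks' file."""
    with open(Path(group_dir) / "tasks", "w") as f:
        f.write(f"{pid}\n")


def list_tasks(group_dir):
    """Return the PIDs currently listed in the group's 'tasks' file."""
    text = (Path(group_dir) / "tasks").read_text()
    return [int(line) for line in text.split()]


# Hypothetical use against a real hierarchy (assumed mount point, needs root):
#   group = create_group("/dev/cgroup", "batch_jobs")
#   attach_task(group, 1234)
```

Pointing the helpers at a scratch directory gives a harmless dry run, since nothing in the sketch depends on the kernel beyond the file layout.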

Eric Biederman talked about the container situation in general. Implementing containers requires the creation of container-specific namespaces for all of the global resources found on the system. Namespaces for UTS (system name information), SYSV interprocess communication primitives, and users are in the mainline now. There is a process ID namespace patch in -mm which is getting close. Network namespaces are in development now. Resources which still need to have namespaces created for them include system time (important to keep time from moving backward when containers are migrated from one system to another) and devices.
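The namespace idea can be illustrated with a toy model, which is not how the kernel implements it: each namespace keeps its own table of names, so the same identifier (here a process ID) can exist in two containers at once, and a migrated container can get its old identifiers back. The class and method names are invented for illustration.

```python
class PidNamespace:
    """Toy model: a private numbering of processes, starting at 1."""

    def __init__(self, label):
        self.label = label
        self._next_pid = 1
        self._procs = {}  # namespace-local PID -> command name

    def spawn(self, command):
        """Register a process and return its namespace-local PID."""
        pid = self._next_pid
        self._next_pid += 1
        self._procs[pid] = command
        return pid

    def restore(self, pid, command):
        """Re-create a process under a specific PID, as a migration would."""
        if pid in self._procs:
            raise ValueError(f"PID {pid} already in use in {self.label}")
        self._procs[pid] = command
        self._next_pid = max(self._next_pid, pid + 1)

    def lookup(self, pid):
        """Return the command registered under a namespace-local PID."""
        return self._procs[pid]


container_a = PidNamespace("container A")
container_b = PidNamespace("container B")

# Each container has its own PID 1; the two do not conflict.
init_a = container_a.spawn("init")
init_b = container_b.spawn("init")

# A migrated container is restored into a fresh namespace on the target
# machine, so the PIDs it was using are guaranteed to be free there.
migrated = PidNamespace("container A after migration")
migrated.restore(init_a, container_a.lookup(init_a))
```

Restoring the same PID into a namespace that already uses it (for example `container_b.restore(init_a, "init")`) raises an error in this model, which is the collision a single global ID space risks during migration.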

Each namespace which is created requires an option to the clone() system call to say whether it should be shared or not. It seems that there may not be enough clone bits to go around; how that problem will be solved is not clear.
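To make the bit budget concrete: every namespace option is a single bit in the flags word passed to clone(). The sketch below uses the CLONE_NEW* values as defined in the mainline kernel's <linux/sched.h> (some of these were merged only after this session); the dictionary and helper functions are hypothetical.

```python
# Namespace-related clone() flags; each one occupies a single bit.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUTS = 0x04000000   # UTS (system name) namespace
CLONE_NEWIPC = 0x08000000   # SYSV IPC namespace
CLONE_NEWUSER = 0x10000000  # user namespace
CLONE_NEWPID = 0x20000000   # process ID namespace
CLONE_NEWNET = 0x40000000   # network namespace

NAMESPACE_FLAGS = {
    "mount": CLONE_NEWNS,
    "uts": CLONE_NEWUTS,
    "ipc": CLONE_NEWIPC,
    "user": CLONE_NEWUSER,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
}


def clone_flags(*private):
    """Build a flags word asking for a new instance of each named namespace."""
    flags = 0
    for name in private:
        flags |= NAMESPACE_FLAGS[name]
    return flags


def shared_namespaces(flags):
    """List the namespaces a child created with these flags would share."""
    return sorted(name for name, bit in NAMESPACE_FLAGS.items()
                  if not flags & bit)
```

A child created with `clone_flags("pid", "net")` would get private PID and network namespaces and share the other four with its parent; every further kind of namespace needs another bit of the same word.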

So, how close are we to having a working container solution? It is still somewhat distant, says Eric. But, when it's done, the support for containers in Linux will be more general and more capable than the options which are available now. It is, he says, a more general solution than OpenVZ, and, unlike Solaris Zones, it will have network namespaces. An important milestone will be the incorporation of PID namespaces, which will make it possible to start actually playing with Linux containers. That code should, with luck, be merged before too long, though it is proving to be a bit of a challenge: kernel code has process IDs hidden away in a number of unexpected places.

Stay tuned; perhaps, by the next kernel summit, containers will be considered to be a solved problem as well.

KS2007: Containers

Posted Sep 10, 2007 22:59 UTC (Mon) by kolyshkin (guest, #34342) [Link]

By the way, slides used for this session are available here.

An important milestone will be the incorporation of PID namespaces, which will make it possible to start actually playing with Linux containers. That code should, with luck, be merged before too long

(Most of) the PID namespaces code is already in the -mm tree.

It is, he says, a more general solution than OpenVZ

Yes, in the sense that one can use only parts of the container functionality (like having only a PID namespace, or a network namespace) -- which makes sense in some situations. Currently, the OpenVZ kernel lets you use only some parts separately (like beancounters, or the fair CPU scheduler), and this is only from the kernel side -- user-level tools can only deal with "full scale" containers. On the other hand, checkpointing is only possible when a container is a closed object, so "half-containers" cannot be checkpointed.

So, how close are we to having a working container solution?

A big part here is resource management. The memory controller that is now in -mm is just the very beginning -- there is a whole lot more than RSS and page cache (on the other hand, Pavel Emelyanov already sent a kernel memory controller patchset as an RFC). Group-based CFS scheduling is not yet merged AFAIK. Group I/O scheduling (based on Jens Axboe's CFQ) will probably be sent for review soon, but scheduling delayed writes requires some dirty page tracking mechanism that only exists in OpenVZ for now (described in Pavel's paper); a discussion of how to implement that for mainstream has not even started.

At the end -- there are a lot of issues to be solved, but given the latest progress, most of the functionality could be there in a year or so, so I more or less agree with your optimistic forecast. :)

When containers are ready, we can start work on checkpointing.

What is a network namespace?

Posted Sep 11, 2007 19:33 UTC (Tue) by cajal (guest, #4167) [Link] (2 responses)

I'm puzzled by this quote "unlike Solaris Zones, it will have network namespaces." What is a network namespace?

What is a network namespace?

Posted Sep 12, 2007 9:13 UTC (Wed) by zdzichu (subscriber, #17118) [Link] (1 responses)

It's the ability to have different network stacks running side by side. It's network stack virtualization. And, contrary to the comment above, it's available in Solaris 10u4 and OpenSolaris. It's nicknamed project Crossbow.

What is a network namespace?

Posted Sep 12, 2007 14:42 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

Odd. I don't think I actually made that comment.

KS2007: Containers

Posted Sep 12, 2007 14:41 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

When I claimed the current kernel infrastructure is more general than
vserver and OpenVZ what I meant is that we have to support the entire
kernel and everything it can do, and do it with code that can pass
a code review by the kernel community. Ensuring that every architecture and
subarchitecture will work, and that every weird kernel subsystem will work
appears to me to be more than the out of tree projects have tackled.

Doing this with namespaces decomposes the problem so we can
have an incremental merge (simplifying things). It also makes things a
little harder as we have to handle all of the weird partial interactions.

The question asked of me is how long until we have in kernel support that
is equal to OpenVZ, or Solaris Zones. Getting there pretty much requires
us to get everything complete and will take a while.

If you only need a subset of that functionality (like a lot of projects)
we should have something interesting when we get things like the pid
namespace merged.

Having the additional resource management seems to be a big part of the
existing out of tree solutions because when you load the machine heavily
you have more contention between users. However for some uses like a
better chroot for rpm installs or an isolated set of processes for
checkpoint restart you don't need additional resource management.

For global resources there are two approaches that a designer can choose
from: namespaces, where you allow two instances of the same global name to
exist in different namespaces, and pure isolation (which is almost
exclusively what vserver provides), which only allows you to see a subset
of the global names. If you are not supporting process migration they
are about the same. Without namespaces process migration is in trouble
because there is no guarantee that you can restore your global identifiers
and keep running.

What little I know of Solaris Zones is that they grew out of efforts to
improve chroot type solutions, and thus do primarily global resource
isolation and do not provide namespaces. The implication of that is that
Solaris Zones do not provide an easy path to container migration from one
machine to another. However everything is evolving and even if my
understanding was right at one time, Solaris may have changed since then.

As for the question of what network namespaces are: they are a way to
make it appear to user space as if you have multiple network stacks, each
logical stack with its own routing tables, firewall tables, network
devices and the works. Fundamentally they aren't too hard to implement but
they need a bit of work on how the network stack handles global data.

Eric

KS2007: Containers

Posted Sep 12, 2007 20:51 UTC (Wed) by kolyshkin (guest, #34342) [Link]

Gerrit Huizenga's coverage of the same containers session is here:
http://gh-linux.blogspot.com/2007/09/linux-kernel-summit-...