The unified control group hierarchy in 3.16

By Jonathan Corbet
June 11, 2014

The idea of reworking the kernel's control group implementation is not exactly new; see this article from early 2012, for example. However, that talk has not yet translated into much in the way of user-visible changes to the kernel. That situation will change in the 3.16 release, which will include the new unified control group hierarchy code. This article will be an overview of how the unified hierarchy will work at the user level.

At its core, the control group subsystem is simply a way of organizing processes into hierarchies; controllers can then be applied to the hierarchies to enforce policies on the processes contained therein. From the beginning, control groups have allowed the creation of multiple hierarchies, each of which can contain a different mix of processes. So one could, for example, create one hierarchy and attach the CPU scheduler controller to it. Another hierarchy could be created for the memory controller; it could contain the same processes, but with a different organization. That would allow memory usage policy to be applied to different groupings of the same processes.

This flexibility has a certain appeal, but it has its costs. It can be expensive for the kernel to keep track of all the controllers that apply to a given process. Controllers also cannot effectively cooperate with each other, since they may be operating on entirely different hierarchies. In some cases (memory and block I/O bandwidth control, for example), better cooperation is needed to effectively control resource use. And, in the end, there has been little real-world use of this feature. So the plan has long been to get rid of the multiple-hierarchy feature, though it has always been known that this change would take a long time to effect fully.

Work on the unified control group hierarchy has been underway for some time, with much of the preparatory work being merged into the 3.14 and 3.15 kernels. In 3.16, this feature will be available, but only to users who ask for it explicitly. To use the unified hierarchy, the new control group virtual filesystem should be mounted with a command like:

    mount -t cgroup -o __DEVEL__sane_behavior cgroup <mount-point>

Obviously, the __DEVEL__sane_behavior option is not intended to be a permanent fixture. It may still be some time, though, before the unified hierarchy becomes available as a default feature.

It is worth noting that the older, multiple-hierarchy mode continues to work even if the unified hierarchy mode is used; it will be kept around for as long as it seems to be needed. The unified hierarchy can be instantiated alongside older hierarchies, but controllers cannot be shared between the unified hierarchy and any others. The care that has been taken in this area should allow users to experiment with the unified mode while avoiding changes that would break existing systems.

In current kernels, controllers are attached to control groups by specifying options to the mount command that creates the hierarchy. In the unified hierarchy world, instead, all controllers are attached to the root of the hierarchy. (Strictly speaking, that's not quite true; controllers attached to old-style hierarchies will not be available in the unified hierarchy, but that's a detail that can be ignored for now.) Controllers can be enabled for specific subtrees of the hierarchy, subject to a small set of rules. For the purposes of illustrating these rules, imagine a control group hierarchy like the one in the figure: groups A and B live directly under the root control group, while C and D are children of B.

[Figure: Control group hierarchy]

Each control group in the hierarchy has (in its associated control directory) a file called cgroup.controllers that lists the controllers that can be enabled for children of that group. Another file, cgroup.subtree_control, lists the controllers that are actually enabled; writing to that file can turn controllers on or off. It is worth repeating that these files manage the controllers attached to the children of the group; in the unified hierarchy, a control group is thought of as delegating its resources to subgroups for management. There are some interesting implications resulting from this design.
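As a rough illustration of how these two files might be driven from a shell, consider the sketch below. The `+name` and `-name` syntax written to cgroup.subtree_control comes from the kernel's unified-hierarchy documentation rather than from the description above; the helper names are hypothetical, and the directory passed in is assumed to be a control group inside a mounted unified hierarchy.

```shell
#!/bin/sh
# Hypothetical helpers; "$1" is the path of a control group directory
# inside a mounted unified hierarchy (for example <mount-point>/B).

# Print the controllers that can be enabled for the group's children.
available_controllers() {
    cat "$1/cgroup.controllers"
}

# Print the controllers currently enabled for the group's children.
enabled_controllers() {
    cat "$1/cgroup.subtree_control"
}

# Enable controller $2 for the children of group $1.
enable_controller() {
    echo "+$2" > "$1/cgroup.subtree_control"
}

# Disable controller $2 for the children of group $1.
disable_controller() {
    echo "-$2" > "$1/cgroup.subtree_control"
}
```

With the hierarchy from the example mounted, something like `enable_controller <mount-point>/B memory` would be expected to turn the memory controller on for C and D.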

One of those is that a control group must apply a controller to all of its children or none. If the memory controller is enabled in B's cgroup.subtree_control file, it will apply to both C and D; there is no way (from B's point of view) to apply the controller to only one of those subgroups. Further, a controller can only be enabled in a specific control group if it is enabled in that group's parent; a controller cannot be enabled in group C unless it is already enabled in group B. That suggests that all controllers that are actually meant to be used must be enabled in the root control group, at which point they will apply to the entire hierarchy. It is, however, possible to disable a controller at a lower level. So, if the CPU controller is enabled in the root, it can be disabled in group A, exempting all of A's descendant groups from CPU control.
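The "enabled in the parent first" rule can be checked from user space before attempting a write. The following is only a sketch: the function names are hypothetical, and it assumes that a group's cgroup.controllers file holds a whitespace-separated list of the controllers that the group may enable for its children.

```shell
#!/bin/sh
# Hypothetical helpers; "$1" is a control group directory, "$2" a controller name.

# Succeed only if controller $2 appears as a whole word in
# $1/cgroup.controllers, meaning that the group may enable it for its children.
controller_available() {
    for c in $(cat "$1/cgroup.controllers" 2>/dev/null); do
        [ "$c" = "$2" ] && return 0
    done
    return 1
}

# Enable controller $2 for the children of $1, but only if it is available.
enable_if_available() {
    if controller_available "$1" "$2"; then
        echo "+$2" > "$1/cgroup.subtree_control"
    else
        echo "controller $2 is not available in $1" >&2
        return 1
    fi
}
```

In the example hierarchy, `enable_if_available <mount-point>/B memory` would be expected to succeed only after the memory controller has been enabled in the root group's cgroup.subtree_control file.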

Another new rule is that the cgroup.subtree_control file can only be used to change the set of active controllers if the associated group contains no processes. So, for example, if group B has controllers enabled in its cgroup.subtree_control file, it cannot contain any processes; those processes must all be placed into group C or D. This rule prevents situations where processes in the parent control group are competing with those in the child groups — situations that current controllers handle inconsistently and, often, badly. The one exception to the "no processes" rule is the root control group.
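A management script could check the "no processes" rule itself before touching cgroup.subtree_control. The sketch below assumes that the group's own processes are listed, one PID per line, in its cgroup.procs file (a file not discussed above); the helper names are hypothetical. The kernel enforces the rule on its own, so the check only provides an earlier and clearer error.

```shell
#!/bin/sh
# Hypothetical helpers; "$1" is a control group directory in the unified hierarchy.

# Succeed if the group itself (not its descendants) contains processes.
# cgroup.procs is assumed to be empty for a group without processes.
has_processes() {
    [ -n "$(cat "$1/cgroup.procs" 2>/dev/null)" ]
}

# Enable controller $2 for the children of $1, refusing up front if $1
# still holds processes. The root group is exempt from the rule in the
# kernel; this sketch does not special-case it.
enable_in_empty_group() {
    if has_processes "$1"; then
        echo "$1 still contains processes; move them to a child group first" >&2
        return 1
    fi
    echo "+$2" > "$1/cgroup.subtree_control"
}
```

For group B in the example, the processes would first have to be moved into C or D before `enable_in_empty_group <mount-point>/B memory` could succeed.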

One other control file found in the unified hierarchy is called cgroup.populated; reading it will return a nonzero value if there are any processes in the group (or its descendants). By using poll() on this file, a process can be notified if a control group becomes completely empty; the process would presumably respond by cleaning up and removing the group. Current kernels, instead, create a helper process to provide the notification; this technique has been frowned on for years.
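Plain shell has no equivalent of poll(), so the sketch below only performs a one-shot check of cgroup.populated and is not a substitute for the notification mechanism described above. It assumes that the file reads as `0` for a completely empty group, and it removes the group with rmdir, the usual way of deleting a control group. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; "$1" is a control group directory in the unified hierarchy.

# Succeed if the group or any of its descendants still contains processes.
# An unreadable file is treated as "populated" so nothing is removed by mistake.
is_populated() {
    [ "$(cat "$1/cgroup.populated" 2>/dev/null)" != "0" ]
}

# Remove the group if it is completely empty; fail without touching it otherwise.
remove_if_empty() {
    if is_populated "$1"; then
        return 1
    fi
    rmdir "$1"
}
```

A real manager would more likely wait for a change with poll() from C or another language and then run the equivalent of `remove_if_empty` once notified.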

The unified hierarchy will allow a privileged process to delegate access to control group functionality by changing the owner of the associated control files. But this delegation only works to an extent: an unprivileged process with access to the control files can create child control groups and move processes between groups, but it cannot change any controller settings. This policy is there partly to keep unprivileged processes from disrupting the system, but the intent is also to restrict access to the more advanced control knobs. These knobs are currently deemed to expose too much information about the kernel's internals, so there is a desire to avoid having applications depend on them.

All of this work has been extensively discussed for years, with most of the major users of control groups having had their say. So it should be suitable for most of the known uses today, but that is no substitute for actually seeing things work. The 3.16 kernel will provide an opportunity for interested users to try out the new mode and find out which problems remain; actual migration by users to the new scheme cannot be expected to happen for a few more development cycles at the earliest, though. But, at some point, the control group rework will cease being something that's mostly talked about and become just another big job that eventually got done.



The unified control group hierarchy in 3.16

Posted Jun 12, 2014 6:25 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Well, cgroups unified hierarchy is getting better. It looks like it's going to be possible to delegate a subtree to a container by bind-mounting it and making sure that the container's cgroups controller has adequate capabilities.

I can live with that, it's certainly better than The One Daemon To Rule Them All approach that systemd loves and wants.

Now, the ability to grant unprivileged users permission to move processes between cgroups is baffling. It's not really of much use at all without a corresponding ability to change knobs. I understand that developers are hesitant to allow manipulation of some settings, but perhaps they can divide settings into 'good' and 'bad' sets and allow unrestricted access only to the 'good' set?

The unified control group hierarchy in 3.16

Posted Jun 12, 2014 16:17 UTC (Thu) by raven667 (subscriber, #5198) [Link] (1 responses)

This seems to have been the plan from the start, when a unified hierarchy was announced: they would take away delegation and then add it back piece by piece as the security implications were better understood and the implementation refactored. I think this is evidence that all the people who thought the kernel cgroups maintainers were conspiring with the systemd maintainers to pee in everyone's favorite breakfast cereal may have been mistaken.

The unified control group hierarchy in 3.16

Posted Jun 12, 2014 16:31 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

There were plans to make the cgroups tree modifiable only by a _single_ process. As far as I remember, there were plans to add a special 'pid' file to which the pid of the authorized process would be written.

And systemd was all for it. See here as an example: https://lwn.net/Articles/555922/
>This hierarchy becomes private property of systemd. systemd will set
>it up. Systemd will maintain it. Systemd will rearrange it. Other
>software that wants to make use of cgroups can do so only through
>systemd's APIs. This single-writer logic is absolutely necessary

There were similar (in spirit) messages from Tejun Heo.

So I guess somebody hit the cgroups developers hard enough to make them see the light and re-introduce a sane delegation mechanism.

And now cgrouproot can live in /proc?

Posted Jun 15, 2014 17:40 UTC (Sun) by alison (subscriber, #63752) [Link] (3 responses)

What is the rationale for cgroups having their own mount in the first place? Assuredly cgroups are part of UAPI to the kernel, and as such they'd make more sense in /proc than /sys. With just one hierarchy, having cgroups in /proc would be more consistent with what's already there.

And now cgrouproot can live in /proc?

Posted Aug 4, 2014 14:55 UTC (Mon) by kloczek (guest, #6391) [Link] (2 responses)

Originally procfs was about managing processes.
But you know .. Linux is now a mature OS, so it cannot suddenly change its UAPI (despite that in Documentation directory still you can find document listing why Linux does not need stable KAPI/UAPI).
Linux has some kind of schizophrenia. In procfs you can even find some old attempts to manage not only processes and threads but groups of processes as well, like /proc/<PID>/task/*, but who cares that the current attempt to catch up with something that has been working for more than a decade in other OSes is breaking something that already exists?
Cgroups development started in 2007. Who cares that after 7 years it is still useless at providing very basic functionality?
Let's give the new generation of kernel developers a chance to contribute to the growing entropy of the Linux kernel .. shall we?

And now cgrouproot can live in /proc?

Posted Aug 4, 2014 16:49 UTC (Mon) by dlang (guest, #313) [Link]

> despite that in Documentation directory still you can find document listing why Linux does not need stable KAPI/UAPI

that document says that the internal API of the kernel is not stable.

the User API to the kernel is very stable.

And now cgrouproot can live in /proc?

Posted Aug 5, 2014 23:39 UTC (Tue) by nix (subscriber, #2304) [Link]

Um, /proc/$pid/task *is* how procfs provides information on threads. It's not 'groups of processes', it's groups of *kernel tasks*, i.e. schedulable entities: what POSIX calls threads. There is nothing in procfs to track groups of processes in any other sense (you can't even follow the pid -> ppid hierarchy via the directory hierarchy or via symlinks, you have to parse /proc/$pid/status).

You really don't know very much about Linux at this level, do you?

The unified control group hierarchy in 3.16

Posted Jun 16, 2014 22:24 UTC (Mon) by kleptog (subscriber, #1183) [Link] (1 responses)

While I understand that having lots of independent control groups makes it difficult, especially if controllers need to cooperate, I don't see why the conclusion needs to be only a single group. Fewer groups, yes, but only one?

For example, having a hierarchy for the processes like systemd wants and a hierarchy for resources seems like it could work. And would satisfy more people I believe.

The unified control group hierarchy in 3.16

Posted Jun 17, 2014 8:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

There are perhaps two systemd cgroups controllers that should be separate: cpuacct and freezer.

Cpuacct is harmless - it's a read-only accounting tool and can be used without too much consideration for its overhead or cross-controller interactions.

Freezer is different - it's often necessary to stop multiple processes atomically and they very well might be on separate levels.

The unified control group hierarchy in 3.16

Posted Jun 17, 2014 9:22 UTC (Tue) by MrWim (subscriber, #47432) [Link]

What I'd quite like to see is a clone/unshare flag which would put the child/process into its own cgroup as a sub-cgroup of the cgroup it was originally in. This could be an unprivileged operation even if moving processes between cgroups requires privileges.

This would be useful for things like make where you might want to by default avoid slowing your other applications when you pass -j20.

I would find it most useful for unit test runners where you want to be certain that you've killed all the processes that were started by the test when the test ends. Essentially it would be process groups that actually work.

How many years will take Linus&co to develop Solaris contractfs+project

Posted Aug 4, 2014 14:37 UTC (Mon) by kloczek (guest, #6391) [Link] (2 responses)

Does anyone know how long it may take?
Why is Linux still suffering from NIH (Not Invented Here) syndrome?
Why must something as simple as managing tasks and processes be driven by yet-another-stupid-fs?
Why is no one among the Linux developers able to sit down, study existing solutions to these problems, then develop as a first step a consistent base API with a plan for extending the base functionality, and after that stick to the agreed/approved plan?
Why .. ?
Why .. ?
.
.

Ten years after developing DTrace on Solaris most of the time spend on SystemTap, LTT, LTTng and many other attempts can be put in garbage and now more people on Linux is using DTrace delivered by commercial company.

Why do Linux developers keep repeating the same errors again and again, expecting that at some point it will Work(tm)?

How many years will take Linus&co to develop Solaris contractfs+project

Posted Aug 4, 2014 16:48 UTC (Mon) by dlang (guest, #313) [Link]

you seem to be trolling, but I will answer one thing

> Ten years after developing DTrace on Solaris

Sun licensed DTrace in a way that is deliberately incompatible with the GPLv2 license of the Linux kernel. As a result, it can't legally be distributed for Linux.

So blame this one on Sun/Oracle not Linux developers.

How many years will take Linus&co to develop Solaris contractfs+project

Posted Aug 5, 2014 23:37 UTC (Tue) by nix (subscriber, #2304) [Link]

> now more people on Linux is using DTrace delivered by commercial company

Well, I'd be very interested to hear where you got this information from. I'm one of the DTrace for Linux developers, and, y'know, I don't have that information. Possibly my bosses have it, but if so they haven't told me. To be honest I have no idea how anyone could know this sort of thing without horrendously invasive spying on users, or wildly unreliable usage surveys which have as far as I know not been conducted.

(But maybe you mean some other Linux DTrace developed by a commercial company? Or perhaps you mean not 'more people than use SystemTap / perf / something else' but rather 'more people than used to use it', which is trivially true if it is used by anyone at all, since it has not always existed.)