The past, present, and future of control groups

By Jonathan Corbet
November 20, 2013

Much has been said about the problems surrounding control groups and the changes that will need to be made with this kernel subsystem. At the 2013 Korea Linux Forum, control group co-maintainer Tejun Heo provided a comprehensive overview of how we got into the current situation, what the problems are, and what is being done to fix them.

The idea behind control groups is relatively simple: divide processes into a hierarchy of groups and use those groups to provision resources in the system. The reality has turned out to be rather messier. So, Tejun asked: how did we get to this point? To begin with, he said, much of what is being done with control groups is new; all of it is new to Linux in particular, and some is new in general. So the community did not have any sort of model to follow when designing this new feature.

Beyond that, though, it is worth looking at who did the work. Control groups started as a new interface to the "cpuset" mechanism, which is used to partition the CPUs in a system among groups of processes. Few people, Tejun said, cared much about this feature, which is used mostly by the high-performance computing crowd. So few kernel developers paid much attention to what was being done.

Then control groups gained the memory controller, a mechanism for restricting the amount of memory used by each group. The core memory management developers did not really care about this work, so they did not participate in it and did not want to hear about it. The block controller came about the same way; Tejun does work in the block subsystem, but he had no real interest in the block controller and mostly just wanted it to stay out of his way. This environment led to a situation where controllers were written by developers without extensive experience in the subsystems they were working with; those controllers had to work on a non-interference basis so that the core developers could ignore them. As a result, controllers have been "bolted onto the side" of the existing kernel subsystems.

The result, Tejun said, is "not pretty." Even worse, though, is that the barriers between the controllers and the subsystems they work with inevitably broke down over time. So control groups are, as a whole, isolated and poorly integrated with the rest of the kernel, but they still manage to complicate the development of the rest of the kernel.

The developers who did all this work were good programmers, Tejun said, but they were not all that experienced with kernel development. So the code they produced was "kind of alien," not conforming to the usual coding style and practices. They repeated a lot of mistakes that the community has made and fixed in the past — repetition that could have been avoided with more review, but, he said, few people were paying active attention to the work being done in this area.

Mistakes were made

What kinds of mistakes were made? Start with hierarchy support — or the lack thereof — in a number of controllers. Control groups allow the organization of processes into a true hierarchy, with policies applied at various levels in the tree. But making a truly hierarchical controller is hard, so a number of controller developers simply didn't bother; instead, they ignored the tree structure and treated every group as if it were placed directly under the root. This was not a good decision, Tejun said; if a controller could not be made hierarchical, it should have at least refused to work with nested control groups. That would have indicated to users that things wouldn't work as they might expect and avoided the creation of a non-hierarchical interface that must now be supported.

The ".use_hierarchy" flag added by the memory controller to enable hierarchical behavior in subtrees was an especially confusing touch, he said.

Another clear mistake was the "release_agent" mechanism. The idea was to notify some process when a control group becomes empty; it was a good idea, he said, in that it allows that process to clean up groups that are no longer in use. But it was implemented as a user-mode helper — every time a control group becomes empty, the kernel creates a new process to run the release agent program. This is an expensive and complex operation for the simple task of sending a notification to user space. The rest of the kernel had moved away from this kind of process-based notification years ago, but the control group developers reimplemented it. We have much better notification mechanisms that should have been used instead, but nobody who could have pointed that out was paying attention when this code was merged.

Yet another problem is the heavy entanglement with the virtual filesystem (VFS) layer. Many years ago, the original sysfs implementation was also deeply tied to the VFS with the idea that it would simplify things. But that didn't work; the results were, instead, lots of memory used and locking-related problems. So sysfs was reworked to look a lot like a distributed filesystem, and things have worked better ever since. When the control group developers set out to create their administrative filesystem, though, they repeated the sysfs mistake. So now control groups have a number of related problems, such as locking hassles whenever an operation needs to work across multiple groups. Tejun is now working on separating things properly; some of that work was merged for the 3.13 kernel.

In engineering, Tejun said, nothing is free; everything comes down to a tradeoff between objectives. Or, in other words, "extremes suck," but control groups went to an extreme with regard to flexibility. Allowing multiple, independent hierarchies is the biggest example; this feature results in a situation where the kernel cannot tell which control group a given process belongs to. Instead, that membership is expressed by a list of arbitrary length. Controllers are all entirely separate from each other, with no coordination between them; they also behave in inconsistent ways. All this flexibility makes it difficult to make changes to the code, since there is no way to know what might break.

Flexibility also led to a range of implementation issues beyond the lack of hierarchical support in some controllers. The core code is complex and fragile. Developers took a lot of shortcuts in areas like security, leading to problems like denial-of-service issues. But, perhaps worst of all, the kernel community committed to a new ABI for control groups without really thinking about it; as a result, we ended up with a lot of accidental features. The ability to assign a process's threads to different control groups is one of those — most controllers only make sense at the process level. The control interface is filesystem-based, but no thought went into permissions, so it is possible to change the ownership of subdirectories, essentially delegating ownership of a subtree of groups to another user. The control group developers have, for all practical purposes, created a new set of system calls without the kind of review that system calls must normally go through.

What now?

The first step has been to fix the controllers that do not support the full control group hierarchy. Unfortunately, they cannot simply be fixed in place without breaking existing users. So there will have to be a "version 2" of the control interface that users can opt into. Controllers must be fully hierarchical or they will simply be unavailable in the new interface. The interface change will also allow the developers to enforce a certain degree of consistency between controllers. It will be possible, Tejun said, to mix use of the old and new interfaces without breaking things.

The multiple control group hierarchies will be going away. Most users will not really notice the change, but some were using multiple hierarchies to avoid enabling expensive controllers for processes that don't need them. In the new scheme, that need will be met by making it possible to enable or disable specific controllers at any level of the hierarchy. But all controllers will see the same process hierarchy; among other things, that will make it possible for them to cooperate more effectively. The resulting system will not be as flexible as multiple hierarchies are, but there seems to be an emerging consensus that it will suffice for the known use cases out there.

A lot of controllers will need updates to work in the new scheme, he said. There are a number of people working in the problem and the work is "70-80% there" at this point.

There will be, Tejun said, "no more faking things that we can't do properly." That is especially true when it comes to security which, he said, is a matter of noting and dealing with all of the relevant details — something that has not been done in the control group subsystem. In particular, the whole concept of delegating subtrees of the control group hierarchy to untrusted users is "broken"; there is no way to prevent denial-of-service attacks (or worse) under that scenario. To allow users to move to the new API without breaking things, it will still be possible to do this kind of delegation by changing the ownership of control group directories, but, he said, it will not be secure, just like it is not secure now.

A more secure approach might be the use of a trusted user-space agent process — something that is likely to be necessary in the future anyway. A number of these agents already exist: systemd is one, Google has its own, Ubuntu has one based on Google's code, and Android has an agent as well. In the Android case, Google actually has to "break the kernel" to make it work the way it wants. There is a need for some kind of common scheme so that processes can interoperate with any agent without having to know which one it is.

Tejun had hoped to have a prototype implementation of a reworked control group subsystem available by about now, but that has not happened. It may be ready by the end of the year, with, hopefully, the work being complete around the middle of 2014.

In summary, he repeated that control groups embody a lot of functionality that has not existed in Linux before. When he looks at the current code, he often gets angry at the mistakes that were made, but he is quite confident that he is making plenty of horrible mistakes of his own. So he fully expects future developers to be just as angry with him. That just goes with the territory. The important thing, he said, is to minimize the commitment that is made to user space; in that way, he hopes, we will not get locked into too many mistakes in the future.

[Your editor thanks the Linux Foundation for travel assistance to attend the Korea Linux Forum.]

Index entries for this article
Kernel	Control groups
Conference	Korea Linux Forum/2013

to post comments

The past, present, and future of control groups

Posted Nov 21, 2013 5:08 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

> In particular, the whole concept of delegating subtrees of the control group hierarchy to untrusted users is "broken"; there is no way to prevent denial-of-service attacks (or worse) under that scenario.
I've heard that multiple times, but so far I haven't seen examples of these attacks except the dumb: "set the <SOMETHING> share to 11 and screw the siblings".

The past, present, and future of control groups

Posted Nov 22, 2013 10:50 UTC (Fri) by error27 (subscriber, #8346) [Link] (6 responses)

That is the attack.

People want to use control groups to divide resources between users.

The past, present, and future of control groups

Posted Nov 25, 2013 16:34 UTC (Mon) by nix (subscriber, #2304) [Link] (3 responses)

Well, obviously if you delegate control of a subtree to a user, that user has to be trusted to make changes to that subtree. That's why you did it! If the user then proceeds to make changes, those changes are not an 'attack': they are the intended use of the API.

You might as well say that kill(2) is insecure because it allows you to send signals to processes that they might not be ready for.

The past, present, and future of control groups

Posted Nov 25, 2013 16:50 UTC (Mon) by corbet (editor, #1) [Link] (1 responses)

The point is that said user should not be able to affect processes outside of the subtree, and things don't work that way now. There are also, apparently, denial-of-service issues and such — nothing keeps that user from creating an arbitrary number of control groups and running the kernel out of memory, for example. Or so I understand it, but I'm often confused.

The past, present, and future of control groups

Posted Nov 26, 2013 1:59 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

The only example of this I've seen so far is the ability to steal resources from sibling cgroups. But that's easily fixed by including an intermediate group to insulate delegated subtree from its siblings.

I.e.:

root -- cgroup1 (1000)
     -- cgroup2 (1000)

In this case if I delegate the 'cgroup2' to a malicious user, then they can set the resource share to 1000000 and starve the 'cgroup1'.

But that can be easily fixed:

root -- cgroup1 (1000)
     -- delegate2 (1000) - cgroup2

In this case I can safely delegate "cgroup2" to an untrusted user, they won't be able to starve other cgroups because its resource allocation is limited by the intermediate 'delegate2' group.

Now, some cgroups don't properly support hierarchy so it's not possible to create such hierarchy but these cgroups are being fixed right now.

The past, present, and future of control groups

Posted Nov 26, 2013 4:05 UTC (Tue) by raven667 (subscriber, #5198) [Link]

If kill was suid root then maybe it would be a more apt comparison.

The past, present, and future of control groups

Posted Nov 27, 2013 22:54 UTC (Wed) by heijo (guest, #88363) [Link] (1 responses)

WTF?

Any intelligent person would design an hierarchical interface so that a node can only give to children what he could have been able to get himself.

BTW, *of course*, the user that is being contained by a cgroup must only be able to create and modify subcgroups and not the constraints of the cgroup itself.

Are the cgroup developers so incompetent to not realize this?

The past, present, and future of control groups

Posted Nov 30, 2013 20:48 UTC (Sat) by foom (subscriber, #14868) [Link]

It apparently wasn't actually designed to have edit permissions shared with non-privileged users, that was just an accidental side-effect of the API being filesystem directories with settable permissions.

Which is why the kernel devs are trying to take that ability away now.

The past, present, and future of control groups

Posted Nov 21, 2013 12:11 UTC (Thu) by sorokin (guest, #88478) [Link] (1 responses)

> In particular, the whole concept of delegating subtrees of the control group hierarchy to untrusted users is "broken" [...]
> A more secure approach might be the use of a trusted user-space agent process — something that is likely to be necessary in the future anyway.

Could someone explain why secure delegation of cgroup subtrees should require user-space agent and can not be made in kernel?

From my understanding, the main function of the kernel is ability to share resources between different processes securely, so no one could affect the others. So we have filesystems as the ability to share disk storage, network stack as the ability to share network throughput.

To me the necessity of having user-space agent for delegating cgroups subtrees sounds similar to necessity of having user-space agent for sharing network interface e.g. kernel provides exclusive access to network interface to one process that provides it to the rest of the system.

The past, present, and future of control groups

Posted Nov 21, 2013 19:52 UTC (Thu) by raven667 (subscriber, #5198) [Link]

I think your comparison is apt. I think the issue is that the cgroup developers aren't too keen on complicating the kernel cgroup internals with all of the policy knobs and namespace separation that would be needed to securely share this resource. Maybe after a few more years of operational experience the cgroup developers will feel more up to the task of handling this in-kernel rather than userspace.