Resource beancounters

[Posted August 29, 2006 by corbet]

Your editor remembers a time when "the computer" was a single, large machine shared among many users. This large machine was, one might say, not quite as powerful as the systems we work on - or carry around to play music on - today, so sharing it among dozens of people (or more) was bound to lead to conflicts. Accordingly, most timesharing systems in those days implemented complex resource quota mechanisms to keep users in bounds. When these systems worked well, they let people get their work done while minimizing violence in the hallways.

It is probably safe to say that almost all deployed Linux systems spend most of their time serving a single user or task. There is little need to keep users from stepping on each other's toes within a single system; instead, they can fight over the use of external resources like network bandwidth. So patches which implement such mechanisms (such as the class-based kernel resource management system) have generally not gotten very far. The driving need to fence users within a portion of a system's resources just has not been there.

Virtualization and containers may change that situation, however. The purpose of these systems is to isolate users from each other. But if one container is able to use a disproportionate amount of some vital system resource, the others will feel its presence. The illusion of having a machine to one's self loses some of its credibility if that machine, say, has no memory available to it. As these projects gather steam, they are motivating another look at resource usage management structures.

CKRM, now known as resource groups, may well make a resurgence. In the meantime, however, another approach has been proposed in the form of the resource beancounters patch. The beancounter developers appear to have tried to take a lighter-weight approach, but this patch still ends up touching a number of places in the kernel.

The core object in this mechanism is, yes, the "beancounter." Each beancounter in the system tracks the resource usage of a group of processes - presumably all of the processes running within a specific container. Beancounters contain a reference count, a unique ID, and an array of resource values; for each tracked resource, this array contains a pair of limits, current usage, historical minimum and maximum use, and a count of how many times an attempt to increase usage of that resource was denied. Each process in the system contains a pointer to its (probably shared) beancounter object. There is also a second beancounter, called fork_bc, which is used for any child processes created with fork().
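As a rough illustration of the layout described above, here is a minimal user-space sketch in C. It is not taken from the patch: the type names, field names, the `exec_bc` name for a task's current beancounter, and the choice to have a child inherit `fork_bc` in both slots are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical resource index; the posted patch tracks only kernel memory. */
enum bc_resource { BC_KMEMSIZE, BC_RESOURCES };

/* Per-resource values, one field per item listed in the text. */
struct bc_resource_parm {
    unsigned long barrier;  /* soft limit */
    unsigned long limit;    /* hard limit */
    unsigned long held;     /* current usage */
    unsigned long minheld;  /* historical minimum usage */
    unsigned long maxheld;  /* historical maximum usage */
    unsigned long failcnt;  /* denied attempts to increase usage */
};

struct beancounter {
    int refcount;
    unsigned int id;
    struct bc_resource_parm parms[BC_RESOURCES];
};

/* Stand-in for the two pointers each process carries (names assumed). */
struct task {
    struct beancounter *exec_bc;  /* charged for this task's activity */
    struct beancounter *fork_bc;  /* handed to children at fork() */
};

static void bc_init(struct beancounter *bc, unsigned int id)
{
    memset(bc, 0, sizeof(*bc));
    bc->refcount = 1;           /* the creator holds the first reference */
    bc->id = id;
}

static struct beancounter *bc_get(struct beancounter *bc)
{
    assert(bc != NULL);
    bc->refcount++;
    return bc;
}

/* Assumption: the child also keeps fork_bc as its own fork beancounter. */
static void task_fork(const struct task *parent, struct task *child)
{
    child->exec_bc = bc_get(parent->fork_bc);
    child->fork_bc = bc_get(parent->fork_bc);
}
```

In this sketch a parent whose current beancounter differs from its `fork_bc` produces children that are charged to the latter, which is one way a management process could launch work inside a container without being accounted there itself.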

A new system call, get_bcid(), returns the ID number for the current process's beancounter object. A suitably privileged user can call:

    int set_bcid(bcid_t id);

to change its current and fork IDs to a new value. Privileged processes can also change the limits attached to any beancounter with:

    int set_bclimit(bcid_t id, unsigned long resource, unsigned long *limits);

Here, resource identifies which resource limit is being changed, and limits points to an array of two values holding the "barrier" and "limit" values. The barrier value is intended to be a sort of soft limit, where some allocations might fail, but others are allowed to proceed.
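To make the calling convention concrete, the following C sketch packs a barrier and a limit into the two-element array that `set_bclimit()` expects. Since the system call exists only in the proposed patch, `set_bclimit()` here is a user-space stand-in with the same signature; the `bcid_t` width, the `BC_KMEMSIZE` resource number, the table size, and the rule that the barrier may not exceed the limit are all assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef unsigned int bcid_t;                 /* assumed width */

enum { BC_KMEMSIZE = 0, BC_RESOURCES = 1 };  /* hypothetical resource numbers */
enum { MOCK_MAX_BC = 8 };                    /* size of the stand-in table */

struct mock_limits { unsigned long barrier, limit; };
static struct mock_limits mock_table[MOCK_MAX_BC][BC_RESOURCES];

/* Stand-in with the article's signature; NOT the real system call. */
static int set_bclimit(bcid_t id, unsigned long resource, unsigned long *limits)
{
    if (id >= MOCK_MAX_BC || resource >= BC_RESOURCES || limits == NULL)
        return -EINVAL;
    if (limits[0] > limits[1])      /* assumption: barrier <= limit */
        return -EINVAL;
    mock_table[id][resource].barrier = limits[0];
    mock_table[id][resource].limit = limits[1];
    return 0;
}

/* Convenience wrapper: limits[0] is the barrier, limits[1] is the limit. */
static int cap_kernel_memory(bcid_t id, unsigned long barrier,
                             unsigned long limit)
{
    unsigned long limits[2] = { barrier, limit };

    return set_bclimit(id, BC_KMEMSIZE, limits);
}
```

A caller would then write something like `cap_kernel_memory(3, 1UL << 20, 2UL << 20)` to give beancounter 3 a one-megabyte barrier and a two-megabyte limit.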

In the posted patch, only one resource is tracked: kernel memory. For this resource, the "barrier" limit applies to most allocations; once the barrier is hit, allocation attempts will fail. The allocation of page tables and related structures, however, can go all the way to the "limit" value. So, while a process may start to see operations failing as a result of excessive kernel memory use, it should still be able to have its page faults handled normally while it tries to recover.
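The two-level check can be sketched in a few lines of C. This is a simplified model rather than the patch's code: the structure, the `bc_charge()` and `bc_uncharge()` names, and the `-ENOMEM` return value are assumptions. Ordinary charges are tested against the barrier, while charges marked as critical (page tables, in the article's example) are tested against the higher limit.

```c
#include <assert.h>
#include <errno.h>

/* Which ceiling a charge is checked against (names are hypothetical). */
enum bc_severity { BC_BARRIER, BC_LIMIT };

struct bc_counter {
    unsigned long barrier;  /* ceiling for ordinary allocations */
    unsigned long limit;    /* ceiling for critical allocations */
    unsigned long held;     /* current usage */
    unsigned long maxheld;  /* historical maximum usage */
    unsigned long failcnt;  /* number of denied charges */
};

/* Try to account for `size` more units; returns 0 or -ENOMEM. */
static int bc_charge(struct bc_counter *c, unsigned long size,
                     enum bc_severity strict)
{
    unsigned long ceiling = (strict == BC_LIMIT) ? c->limit : c->barrier;

    /* Compare by subtraction so that held + size cannot overflow. */
    if (c->held > ceiling || size > ceiling - c->held) {
        c->failcnt++;
        return -ENOMEM;
    }
    c->held += size;
    if (c->held > c->maxheld)
        c->maxheld = c->held;
    return 0;
}

/* Give back units that were charged earlier. */
static void bc_uncharge(struct bc_counter *c, unsigned long size)
{
    assert(size <= c->held);
    c->held -= size;
}
```

With a barrier of 100 and a limit of 150, a counter already holding 90 units refuses an ordinary charge of 20 but accepts the same charge when it is made against the limit; that gap is what lets page faults proceed while other operations fail.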

The kernel allocates memory in many places, and not all of those should be charged to the process that happens to be running at the time. The beancounter patch adds a couple of new GFP flags to make the difference explicit. In the default case, memory allocations are not charged to any specific beancounter. Whenever an allocation function is called with the __GFP_BC flag set, however, the current beancounter will be charged. An additional flag (__GFP_BC_LIMIT) specifies that the higher limit value is to be used. There is also a SLAB_BC flag which can cause all allocations from a given slab cache to be charged. Finally, there is a new vmalloc_bc() function which performs the appropriate accounting.
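The effect of the flags can be modeled in user space with a thin wrapper around `malloc()`. Everything in this C sketch is a stand-in: the `MOCK_GFP_BC` and `MOCK_GFP_BC_LIMIT` bit values are invented, the wrapper names are hypothetical, and it assumes the limit flag is used together with the charging flag rather than on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Invented bit values standing in for __GFP_BC and __GFP_BC_LIMIT. */
#define MOCK_GFP_BC        0x1u  /* charge the current beancounter */
#define MOCK_GFP_BC_LIMIT  0x2u  /* check against the limit, not the barrier */

struct mock_bc {
    unsigned long barrier, limit, held, failcnt;
};

/* Allocate `size` bytes, charging `bc` only when MOCK_GFP_BC is set. */
static void *mock_alloc(struct mock_bc *bc, size_t size, unsigned int flags)
{
    void *p;

    if (flags & MOCK_GFP_BC) {
        unsigned long ceiling =
            (flags & MOCK_GFP_BC_LIMIT) ? bc->limit : bc->barrier;

        if (bc->held > ceiling || size > ceiling - bc->held) {
            bc->failcnt++;
            return NULL;            /* denied before touching the allocator */
        }
        bc->held += size;
    }
    p = malloc(size);
    if (p == NULL && (flags & MOCK_GFP_BC))
        bc->held -= size;           /* undo the charge if malloc() failed */
    return p;
}

/* Free memory; the caller passes the size and flags used to allocate it. */
static void mock_free(struct mock_bc *bc, void *p, size_t size,
                      unsigned int flags)
{
    if (p != NULL && (flags & MOCK_GFP_BC)) {
        assert(size <= bc->held);
        bc->held -= size;
    }
    free(p);
}
```

An allocation made without `MOCK_GFP_BC` never touches the counters, mirroring the default case in which kernel allocations are charged to nobody.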

Needless to say, finding every allocation which should be tracked and charged to a beancounter would be a large task. The current patch does not even try; instead, it marks enough specific allocations to catch some of the larger uses of kernel memory and show how the whole system works. That may be as far as it gets; getting driver writers, for example, to think about whether their memory allocations should be charged seems like an uphill battle.

Whether this patch set will get any further than CKRM (sorry, "resource groups") remains to be seen. There are some concerns about how accounting for shared resources is handled - does the process group which first faults in the C library get charged for the whole thing, giving others a free ride? Beyond that, many developers will continue to see no real need for this sort of accounting structure. The growing use of virtualization techniques may just be the factor which pushes this kind of patch into the kernel, however.

Index entries for this article
Kernel	Beancounters
Kernel	Class-based resource management
Kernel	Virtualization


Resource beancounters

Posted Aug 31, 2006 14:28 UTC (Thu) by utoddl (guest, #1232) [Link] (1 responses)

Wow. They've reinvented another flavor of process groups. Kind of like the old AFS PAG, but it tracks resource allocations instead of tracking authentication tokens. And then there's the keyring based implementation of process authentication groups that OpenAFS is moving toward and that NFSv4+ (and any other externally hosted authenticated resource) is going to need. I'm sure there are others. How many different ways of grouping processes do we need, and does some of this code overlap?

Resource beancounters

Posted Sep 16, 2006 6:41 UTC (Sat) by devx (guest, #40551) [Link]

There is a small misunderstanding. Beancounters have nothing to do with tasks directly and don't do task grouping.
Look, there are a lot of resources which can be shared: pages, IPCs, files etc. Task grouping doesn't help at all, since the same file can belong to 2 different tasks in 2 different resource groups.
So tasks, if accounted, are just abstract objects like any others (files, sockets, ...).

And beancounters do not track and do not have a list of the objects.
Instead, beancounters do:
- accounting
- limiting
- beancounters are referenced _by_ all the charged objects (not the other way around) so that uncharging is done correctly (objects can be freed in arbitrary context).

Resource beancounters

Posted Aug 31, 2006 22:32 UTC (Thu) by giraffedata (guest, #1954) [Link] (1 responses)

There are some concerns about how accounting for shared resources is handled - does the process group which first faults in the C library get charged for the whole thing, giving others a free ride?

The few times I've tried to resolve the kernel memory accounting problem, I've had to quit because most of the resource is shared.

Also, kernel code usually can't tolerate having no memory available to it. The only reason the kernel works at all today is that things are done to make it unlikely that there isn't a single page of memory available; but with local limits like this, it would happen a lot.

Resource beancounters

Posted Sep 16, 2006 6:50 UTC (Sat) by devx (guest, #40551) [Link]

>> There are some concerns about how accounting for shared resources is
>> handled - does the process group which first faults in the C library get
>> charged for the whole thing, giving others a free ride?
FYI: BC accounting of user memory takes into account _fractions_ of pages. I.e. if 2 users share the same glibc and map the same page into their address spaces, then both will be charged 1/2 of the page.

> The few times I've tried to resolve the kernel memory accounting problem,
> I've had to quit because most of the resource is shared.
First, BC accounts only that kernel memory which is user triggerable, i.e. allocated on demand. This is required to prevent DoS. And it is not quite clear how and to whom to charge all the other memory allocations, so this looks reasonable enough. E.g. whom should we charge for memory allocated by interrupts?
Second, there can be different policies on how to account shared resources. Usually it is handled as "charge to the creator".

> Also, kernel code usually can't tolerate having no memory available to
> it. The only reason the kernel works at all today is that things are
> done to make it unlikely that there isn't a single page of memory
> available; but with local limits like this, it would happen a lot.
This is not 100% true.
Well, at least if we consider the objects which are allocated on user request. One of the OpenVZ stress tests is to run 100 VPSs with random BC limits. This triggers different error paths in the kernel, and we submit patches to mainstream in case of problems. So this works quite stably and is well tested.

Resource beancounters

Posted Sep 7, 2006 0:46 UTC (Thu) by mtrob (guest, #1404) [Link]

Hmmm, I've been using this system for many years now in Linux. The product is Virtuozzo from SWsoft, who recently put OpenVZ out there for the kernel folks to work with. By the way, it works exceedingly well. And for a situation where you can use a common kernel virtualization (para-virtualization) it just beats the pants off VMware and Xen.