

Leading items

Welcome to the LWN.net Weekly Edition for May 14, 2020

This edition contains the following feature content:

  • Subinterpreters for Python: PEP 554 would expose subinterpreters in the standard library, but questions remain.
  • Private loop devices with loopfs: a small virtual filesystem for container-private loop devices.
  • Blocking userfaultfd() kernel-fault handling: closing off a tool used to exploit kernel race conditions.
  • O_MAYEXEC — explicitly opening files for execution: tightening control over code run by interpreters.
  • Completing and merging core scheduling: the state of the core-scheduling patches, as discussed at OSPM.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


Subinterpreters for Python

By Jake Edge
May 13, 2020

A project that has been floating around in the Python world for a number of years is now working its way toward inclusion into the language—or not. "Subinterpreters", which are separate Python interpreters that can currently be created via the C API for extensions, are seen by some as a way to get a more Go-like concurrency model for Python. The first step toward that goal is to expose that API in the standard library. But there are questions about whether subinterpreters are actually a desirable feature for Python at all, as well as whether the hoped-for concurrency improvements will materialize.

PEP 554

Eric Snow's PEP 554 ("Multiple Interpreters in the Stdlib") would expose the existing subinterpreter support from the C API in the standard library. That would allow Python programs to use multiple separate interpreters; the PEP also proposes to add a way to share some data types between the instances. The eventual goal is to allow those subinterpreters to run in parallel, but the implementation is not there yet.

In particular, giving each subinterpreter its own global interpreter lock (GIL) is not (yet) on the table. The GIL prevents multiple threads from executing Python bytecode at the same time. It exists mainly because the CPython memory-management code and garbage collector are not thread-safe. But the existence of the GIL has meant that other features, C-based extensions for example, depend on it for proper functioning. There have been efforts to remove the GIL from Python along the way, including the Gilectomy project. Subinterpreters are seen by some as another way of addressing the "GIL problem".

The PEP proposes adding an interpreters module to the standard library that will allow the creation of subinterpreters as follows:

    interp = interpreters.create()

Interpreters can then run code passed as a string to the run() method. Data is not shared between these interpreters unless it is done explicitly by using "channels" created this way:

    recv, send = interpreters.create_channel()

As might be guessed, simple objects (e.g. bytes, strings, integers) can then be sent and received using the send() and recv() methods of the corresponding channel objects.
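The interpreters module is not in the standard library yet, so the proposed API cannot be exercised directly; the send/receive pattern it describes can, however, be previewed with threads and a queue. This is an analogy only — under the hypothetical names here, threads share all state, while subinterpreters would share nothing by default:

```python
import queue
import threading

# The queue plays the role of a PEP 554 channel pair.
channel = queue.Queue()

def worker():
    channel.put(b"pong")      # stands in for send()

t = threading.Thread(target=worker)
t.start()
msg = channel.get()           # stands in for recv(); blocks until data arrives
t.join()
print(msg)                    # b'pong'
```

Under PEP 554, the queue would be replaced by a channel pair and worker() would run in a separate interpreter rather than a thread.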

The run() method blocks until the subinterpreter completes, though it can be executed in a separate thread as an example from the PEP that uses the threading module shows:

    interp = interpreters.create()
    def run():
        interp.run('print("during")')
    t = threading.Thread(target=run)
    print('before')
    t.start()
    print('after')

Because the GIL is shared between all of the interpreters, however, the concurrency gains are minimal. In the most recent revisions, the PEP tries to make it clear that exposing the feature from the C API is worth doing regardless of what happens with the GIL:

To avoid any confusion up front: This PEP is unrelated to any efforts to stop sharing the GIL between subinterpreters. At most this proposal will allow users to take advantage of any results of work on the GIL. The position here is that exposing subinterpreters to Python code is worth doing, even if they still share the GIL.

PEP 554 has been around since 2017, but Snow thinks it is getting ready for "pronouncement" (a decision to accept or reject it) now. While he believes there is value to exposing the interface in its own right, the PEP has had trouble separating itself from the ongoing GIL work; PEP 554 could perhaps be added to Python 3.9, though the GIL changes are not complete. In mid-April, Snow posed a question to the python-dev mailing list, wondering if it made sense to hold off on the PEP until 3.10 because there is no per-interpreter GIL.

Many folks have conflated PEP 554 with having a per-interpreter GIL. In fact, I was careful to avoid any mention of parallelism or the GIL in the PEP. Nonetheless some are expecting that when PEP 554 lands we will reach multi-core nirvana.

While PEP 554 might be accepted and the implementation ready in time for 3.9, the separate effort toward a per-interpreter GIL is unlikely to be sufficiently done in time. That will likely happen in the next couple months (for 3.10).

So...would it be sufficiently problematic for users if we land PEP 554 in 3.9 without per-interpreter GIL?

His main concern is that users will be confused and frustrated by encountering subinterpreters with a shared GIL, which will have lots of limitations; that might lead them to write the feature off and not revisit it when those limitations are lifted for 3.10. He listed four options for proceeding: merging the PEP without the GIL changes; doing the same but marking it as a "provisional" module; not merging until the GIL changes are ready; or the same while adding a 3.9-only subinterpreters module to the Python Package Index (PyPI). He was in favor of the first or the second option.

C extensions

But others are concerned that adding subinterpreter support to the standard library will put additional burdens onto the developers of C-based extensions. Those extensions sometimes use global variables, which do not play well with subinterpreters—whether they are created via the existing C API or the proposed standard library interpreters module. That means that using subinterpreters could lead to strange, hard-to-find problems when combined with extensions.

CPython core developer Nathaniel Smith, who is also a core developer of the C-based extension NumPy, was particularly unhappy with the proposal:

I think you've been downplaying the impact of subinterpreter support on the existing extension ecosystem. All features have a cost, which is why PEPs always require substantial rationales and undergo intense scrutiny. But subinterpreters are especially expensive. Most features only affect a small group of modules (e.g. async/await affected twisted and tornado, but 99% of existing libraries didn't care); OTOH subinterpreters require updates to every C extension module. And if we start telling users that subinterpreters are a supported way to run arbitrary Python code, then we've effectively limited extension authors options to "update to support subinterpreters" or "explain to users why they aren't writing a proper Python module", which is an intense amount of pressure; for most features maintainers have the option of saying "well, that isn't relevant to me", but with subinterpreter support that option's been removed.

NumPy core developer Sebastian Berg chimed in as well. He suggested that it could take up to a solid year of work to support subinterpreters in NumPy. He also said that the proposal to raise an exception when subinterpreters import extensions that are not subinterpreter-ready is helpful, though it likely will still lead to bugs being filed against the extensions. The PEP proposes to raise ImportError for any extension that does not support PEP 489 ("Multi-phase extension module initialization"); multi-phase initialization eliminates the problems with global state variables for the extensions by moving them into their own module-specific dictionary object.

Both Smith and Berg are skeptical of the existing C-level subinterpreter support. Berg said: "I believe you must consider subinterpreters basically a non-feature at this time. It has neither users nor reasonable ecosystem support", while Smith said that he might write a PEP to propose that subinterpreters be completely eliminated from Python. Snow replied to Berg that there are existing users, however:

FWIW, at this point it would be hard to justify removing the existing public subinterpreters C-API. There are several large public projects using it and likely many more private ones we do not know about.

That's not to say that alone justifies exposing the C-API, of course. :)

Benefits?

Beyond the concerns about extensions, though, Smith is not convinced of the benefits for concurrency that could eventually come from subinterpreter support. PEP 554 is careful not to directly connect the interpreters module with the eventual plan to stop sharing the GIL between subinterpreters, though it is clearly the eventual goal for some. Smith is skeptical of that plan as well:

In talks and informal conversations, you paint a beautiful picture of all the wonderful things subinterpreters will do. Lots of people are excited by these wonderful things. I tried really hard to be excited too. (In fact I spent a few weeks trying to work out a subinterpreter-style proposal myself way back before you started working on this!) But the problem is, whenever I look more closely at the exciting benefits, I end up convincing myself that they're a mirage, and either they don't work at all (e.g. quickly sharing arbitrary objects between interpreters), or else end up being effectively a more complex, fragile version of things that already exist.

Berg concurred to a certain extent. He said that there is a need for a wider vision, beyond the PEP's smaller goals, to explain what the plans are for subinterpreters so that a fuller picture can be considered. Snow agreed that there was a need for better documentation, an informational PEP or other justification document, though that has not appeared as yet. Ultimately, the decision on the PEP rests with Antoine Pitrou, who is the delegate for the PEP. He is generally favorably inclined toward it:

Mostly, I hope that by making the subinterpreters functionality available to pure Python programmers (while it was formerly an advanced and arcane part of the C API), we will spur a bunch of interesting third-party experimentations, including possibilities that we on python-dev have not thought about.

He had some concrete suggestions on things to improve in the API and suggested that the feature be added provisionally (effectively option two in Snow's original message). He also explicitly solicited more feedback. Mark Shannon reviewed the PEP and said that he was in favor of the idea, but that it did not make sense to add the module to the standard library without showing that it would be beneficial for parallelism:

My main objection is that without per-[subinterpreter] GILs (SILs?) PEP 554 provides no value over threading or multi-processing. Multi-processing provides true parallelism and threads provide shared memory concurrency.

If per-[subinterpreter] GILs are possible then, and only then, sub-interpreters will provide true parallelism and (limited) shared memory concurrency.

The problem is that we don't know whether we can implement per-[subinterpreter] GILs without too large a negative performance impact. I think we can, but we can't say so for certain.

Snow disagreed, not surprisingly, but Shannon put together a table comparing different existing approaches to concurrency in Python with PEP 554 and an "ideal" communicating sequential processes (CSP) model. Go's concurrency model is roughly based around CSP; adding it to Python has also been tried along the way. Shannon said:

There are a lot of question marks in the PEP 554 column. The PEP needs to address those.

As it stands, multiprocessing is a better fit for CSP than PEP 554.

IMO, sub-interpreters only become a useful option for concurrency if they allow true parallelism and are not much more expensive than threads.

Snow sees concurrency as something of a side issue, but he is thinking of taking up the suggestion by Berg and others to more fully document the complete plan:

I really want to keep discussion focused on the proposed API in the PEP. Honestly I'm considering taking up the recommendation to add a new PEP about making subinterpreters official. I never meant for that to be more than a minor point for PEP 554.

There was plenty of other discussion, but Snow eventually deferred the PEP until the 3.10 time frame:

FYI, after consulting with the steering council I've decided to change the target release to 3.10, when we expect to have per-interpreter GIL landed. That will help maximize the impact of the module and avoid any confusion. I'm undecided on releasing a 3.9-only module on PyPI. If I do it will only be for folks to try it out early and I probably won't advertise it much.

It is an interesting feature and one that numerous core developers think could really help the performance of Python programs on multiple cores. But, without the GIL changes, it is difficult to know for sure whether it will be a substantial win. As Smith put it: "[...] the new concurrency model in PEP 554 has never actually been used, and it isn't even clear whether it's useful at all. Designing useful concurrency models is *stupidly* hard." We will have to wait to see if subinterpreters can clear that hurdle.


Private loop devices with loopfs

May 7, 2020

This article was contributed by Marta Rybczyńska

A loop device is a kernel abstraction that allows a file to be presented as if it were a physical block device. The typical use for a loop device is to mount a filesystem image stored in a file. Loop devices are global and shared between users, which causes a number of problems for container workloads where the instances are expected to be isolated from each other. Christian Brauner has been working on this problem; he has posted a patch set solving it by adding a small virtual filesystem called loopfs.

Loop devices typically appear under /dev with names like /dev/loopN. The special /dev/loop-control file can be used to create and destroy loop devices or to find the first available loop device. Associating a file with a specific device, or setting other parameters like offsets or block sizes, is done with ioctl() calls on the device itself. The loop(4) man page has the details on how it all works.

Users generally need not deal with specific devices, though; they can be managed behind the scenes with a special form of the mount command:

    mount /tmp/myimage.img /mnt/disk -o loop

This causes mount to locate an available loop device, associate it with /tmp/myimage.img, then mount that loop device onto /mnt/disk. Some administrators may prefer a different form of the same mount command that gives more control:

    mount /tmp/myimage.img /mnt/disk -o loop=/dev/loop1

In this mode, the administrator specifies the exact loop device to use. An administrator who needs more control over loop devices may also use the losetup command to query and set up loop-device properties.

As noted above, loop devices are global and shared between users; /dev/loop3 is the same device in all namespaces. If an application needs a private device, it has no way to request one. Loop devices are also, obviously, shared between containers, so one container can monitor the operations — or access the data — of the others.

A number of different use cases for loop devices were raised in the discussion of this patch set. Dmitry Vyukov gave one example: separating test processes from each other when they are using loop devices. He described the problems he has run into:

Currently all loop devices and loop-control are global and cause test processes to collide, which in turn causes non-reproducible coverage and non-reproducible crashes.

Brauner gave a number of examples from the container world. For example, systemd-nspawn does not support loop devices as they cannot be discovered dynamically and owned by a container. Chromium OS does not allow the use of loop devices. Kubernetes has also run into problems resulting from the global nature of loop devices: a file can remain bound to a device after its user has exited.

loopfs

Loopfs is a new, in-kernel, virtual filesystem that implements the loop devices and the loop-control file. This filesystem can be mounted multiple times; the loop devices in each instance are independent from all other loop devices in all other instances. This allows private loop devices for applications and containers. Both the loop devices and the loop-control file in loopfs accept the same operations as the legacy ones.

One use of loopfs is to provide compatibility with old-style applications, but with virtualized loop-device files. In this case, the administrator can mount the filesystem and then replace the default loop control files with those from loopfs. Consider the following example, adapted from the patch cover letter:

    # Mount a new loopfs instance in /dev/loopfs/
    mount -t loop loop /dev/loopfs/

    # Replace the standard loop control file with the ones from loopfs
    ln -sf /dev/loopfs/loop-control /dev/loop-control

    # Find the first available loop device
    loopdev=`losetup -f`     	  # will be something like /dev/loop0
    deventry=`basename $loopdev`  # now just "loop0"

    # Redirect that loop device to loopfs
    ln -sf /dev/loopfs/$deventry /dev/$deventry

    # mount an image
    mount -o loop /image.img /mnt/disk

There is a knob provided to control the maximum number of loop devices that can be created in any given loopfs instance; it can be found as /proc/sys/user/max_loop_devices.

Christoph Hellwig disagreed with the loopfs approach, stating that the code is too big for the benefit it provides. Brauner explained the additional use cases it allows, but the discussion stopped there. There have not been other substantial complaints about this proposal.

Loopfs doesn't just allow an independent loop-device pool, it also opens a way to allow unprivileged users to mount loop devices. This can be enabled by combining loopfs with Brauner's earlier work on system-call interception, which uses seccomp to establish a separate process to make decisions on which operations can be allowed. In such a setup, the unprivileged user can run mount as usual; the privileged process intercepting the system call will perform the actual operation.

Jann Horn outlined one possible problem with loop-device usage by unprivileged applications: most filesystem implementations are not prepared to deal with malicious filesystem images. While some work has been done, filesystem images are still generally treated as trusted data; that is why attempts to allow unprivileged filesystem mounting have run into opposition in the past. If an attacker has the ability to modify the image on the fly — as they would if they had access to the loop device providing that image — the problem would be compounded.

Stéphane Graber pointed out that an implementation based on system-call interception does not have to mount filesystems directly; a FUSE-based mount could be used instead. That would prevent any filesystem-level vulnerabilities from turning into kernel vulnerabilities. The LXD implementation allows both types of mount.

Next steps

Loopfs seems to solve a problem that users experience in practice. It has been through three revisions in a week's time, addressing the comments given during review. It may still take some time to find its way into the mainline kernel, but it is clear that there are numerous users waiting for a solution to the loop-device sharing issue.


Blocking userfaultfd() kernel-fault handling

By Jonathan Corbet
May 8, 2020
The userfaultfd() system call is a bit of a strange beast; it allows user space to take responsibility for the handling of page faults, which is normally a quintessential kernel task. It is thus perhaps not surprising that it has turned out to have some utility for those who would attack the kernel's security as well. A recent patch set from Daniel Colascione is small, but it makes a significant change that can help block at least one sort of attack using userfaultfd().

A call to userfaultfd() returns a file descriptor that can be used for control over memory management. By making a set of ioctl() calls, a user-space process can take responsibility for handling page faults in specific ranges of its address space. Thereafter, a page fault within that range will generate an event that can be read from the file descriptor; the process can read the event and take whatever action is necessary to resolve the fault. It should then write a response describing that resolution to the same file descriptor, after which the faulting code will resume execution.

This facility is normally intended to be used within a multi-threaded process, where one thread takes on the fault-handling task. There are a number of use cases for userfaultfd(); one of the original cases was handling live migration of a process from one machine to another. The process can be moved and restarted on the new system while leaving most of its memory behind; the pages it needs immediately can then be demand-faulted across the net, driven by userfaultfd() events. The result is less downtime while the process is being moved.

Since the kernel waits for a response from the user-space handler to resolve a fault, page faults can cause an indefinite delay in the execution of the affected process. That is always the case, of course; for example, a process generating a fault on memory backed by a file somewhere else on the network will come to an immediate halt for an unknown period of time. There is a difference with userfaultfd(), though: the time it takes to resolve the fault is under the process's direct control.

Normally, there are no problems that can result from that control; the process is simply slowing itself down, after all. But occasionally page faults will be generated in the kernel. Imagine, for example, just about any system call that results in the kernel accessing user-space memory. That can happen as the result of I/O, from a copy_from_user() call, or any of a number of other ways. Whenever the kernel accesses user-space memory, it has to be prepared for the relevant page(s) to not be present; the kernel has to incur and handle a page fault, in other words.

An attacker can take advantage of this behavior to cause execution in the kernel to block at a known point for a period of time that is under said attacker's control. In particular, the attacker can use userfaultfd() to take control of a specific range of memory; they then ensure that none of the pages in that range are resident in RAM. When the attacker makes a system call that tries to access memory in that range, they will get a userfaultfd() event helpfully telling them that the kernel has blocked and is waiting for that page.

Stopping the kernel in this way is useful if one is trying to take advantage of some sort of race condition or other issue. Assume, for example, that an attacker has identified a potential time-of-check-to-time-of-use vulnerability, where the ability to change a value in memory somewhere at the right time could cause the kernel to carry out some ill-advised action. Exploiting such a vulnerability requires hitting the window of time between when the kernel checks a value and when it acts on it; that window can be quite narrow. If the kernel can be made to block while that window is open, though, the attacker suddenly has all the time in the world. That can make a difficult exploit much easier.
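The shape of such an attack can be modeled in user space. In this sketch (an illustration of the timing problem, not an exploit), threading events stand in for the attacker-controlled page-fault stall, holding the window between the check and the use open for as long as the "attacker" likes:

```python
import threading

class Resource:
    def __init__(self, value):
        self.value = value

def vulnerable_use(res, window_open, window_close, log):
    if res.value >= 0:           # the "check"
        window_open.set()        # window opens here...
        window_close.wait()      # ...and stays open: models blocking on a fault
        log.append(res.value)    # the "use" — the value may have changed

res = Resource(10)
window_open = threading.Event()
window_close = threading.Event()
log = []
victim = threading.Thread(target=vulnerable_use,
                          args=(res, window_open, window_close, log))
victim.start()
window_open.wait()
res.value = -1                   # attacker flips the value inside the window
window_close.set()
victim.join()
print(log)                       # [-1]: the checked and used values differ
```

Without the ability to stall the victim, the attacker would have to win a race measured in nanoseconds; with it, the race disappears entirely.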

Attackers can be deprived of this useful tool by disallowing the handling in user space of faults incurred in kernel space. Simply changing the rules that way would almost certainly break existing code, though, so something else needs to be done. Colascione's patch addresses this problem in two steps, the first of which is to add a new flag (UFFD_USER_MODE_ONLY) for userfaultfd() which states that the resulting file descriptor can only be used for handling faults incurred in user space. Any descriptor created with this flag thus cannot be used for the sorts of attacks described above.

One could try politely asking attackers to add UFFD_USER_MODE_ONLY to their userfaultfd() calls, but we are dealing with people who are not known for their observance of polite requests. So the patch set adds a new sysctl knob, concisely called vm/unprivileged_userfaultfd_user_mode_only, to make the request somewhat less polite; if it is set to one, userfaultfd() calls from unprivileged users will fail if that flag is not provided. At that point, kernel-space fault handling will no longer be available to attackers attempting to gain root access. The default value has to be zero, though, to maintain compatibility with older kernels.

The only response to this patch set so far came from Peter Xu, who pointed out that the existing vm/unprivileged_userfaultfd knob could be extended instead. That knob can be used to disallow userfaultfd() entirely for unprivileged processes by setting it to zero, though its default value (one) allows such access. Xu suggested that setting it to two would allow unprivileged use, but for user-space faults only. This approach saves adding a new knob.

Beyond that, the suggested change seems uncontroversial. It's a small patch that has no risk of breaking things for existing users, so there does not appear to be any real reason to keep it out.


O_MAYEXEC — explicitly opening files for execution

By Jonathan Corbet
May 11, 2020
Normally, when a kernel developer shows up with a proposed option that doesn't do anything, a skeptical response can be expected. But there are exceptions. Mickaël Salaün is proposing the addition of a new flag (O_MAYEXEC) for the openat2() system call that, by default, will change nothing. But it does open a path toward tighter security in some situations.

Executing a file on a Unix-like system requires that said file have an applicable execute-permission bit set. The file must also not reside on a filesystem that has been mounted with the noexec option. These checks can prevent the execution of unwanted code on a tightly controlled system, but there is a significant hole in this protection: interpreters that will happily read and execute code found in a file. If a file contains Perl code, for example, it cannot be executed by typing its name if it fails either of the above two tests. If an attacker is able to pass that file as a parameter to a perl -e command, though, its contents will still be executed.
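The hole is easy to demonstrate with Python standing in for the interpreter; the temporary-file dance below is only for self-containment:

```python
import os
import subprocess
import sys
import tempfile

# Write a script and strip every execute bit; direct execution would fail.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write('print("code ran anyway")\n')
os.chmod(path, 0o644)                        # rw-r--r--
directly_executable = os.access(path, os.X_OK)

# The interpreter opens and runs the file regardless of the execute bit.
result = subprocess.run([sys.executable, path],
                        capture_output=True, text=True)
print(directly_executable, result.stdout.strip())
os.remove(path)
```

The same applies to a file on a noexec mount: the interpreter merely open()s it for reading, so none of the execution-time checks ever fire.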

The new O_MAYEXEC flag is a way for language interpreters (or other programs, such as dynamic linkers, that execute code) to indicate to the kernel that a file is being opened with the intent of executing its contents. This flag is totally ignored by open() which, because it never checked for invalid flags, is difficult to extend in general. The newer openat2() system call, instead, does fail when unknown flags are passed to it; it has been extended to recognize O_MAYEXEC. But, by default, nothing will change if that flag is present.

The patch set also adds a new sysctl knob called fs.open_mayexec_enforce that can bring out a change in behavior. Its default value (zero) naturally preserves current behavior so that nobody's system is broken by mistake. If, instead, bit 0 is set, an openat2() call with O_MAYEXEC will fail if the filesystem holding the target file was mounted with the noexec option. Bit 1 will cause such an open to fail if the file lacks execute permission. Setting both bits will thus cause O_MAYEXEC opens to fail in the situations where a direct attempt at execution would also fail.
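The described semantics amount to a simple bitmask check; the helper below is a user-space model for clarity, not the kernel's actual code:

```python
def mayexec_allowed(enforce, mount_noexec, has_exec_bit):
    """Model of fs.open_mayexec_enforce applied to an O_MAYEXEC open."""
    if enforce & 1 and mount_noexec:        # bit 0: honor noexec mounts
        return False
    if enforce & 2 and not has_exec_bit:    # bit 1: require execute permission
        return False
    return True

print(mayexec_allowed(0, True, False))   # True: default value, nothing enforced
print(mayexec_allowed(3, False, True))   # True: both checks pass
print(mayexec_allowed(3, True, True))    # False: noexec mount blocks the open
```

With enforce set to 3, an O_MAYEXEC open fails in exactly the cases where direct execution would fail.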

Integrity measurement is another subsystem that can benefit from O_MAYEXEC. The kernel's integrity-measurement subsystem can be configured to block the execution of files that do not meet the integrity criteria but, once again, passing a file directly to an interpreter can bypass this check. This patch set adds a hook by which files opened with O_MAYEXEC can be passed to the integrity-measurement code for vetting before an open is allowed to succeed.

Finally, as one might expect, security modules can also note the existence of this flag and respond accordingly. It would be relatively straightforward to write a policy for SELinux or Smack that prevents execution-by-interpreter of files that lack a certain label (or to prevent such execution entirely, of course).

The above discussion skips over one little detail, though: this mechanism will only work if the programs that execute code from files cooperate and provide the O_MAYEXEC flag. That would require getting patches into various language interpreters, linkers, etc. to properly mark the opening of any files that might lead to the execution of code. Actions such as opening files passed on the command line, importing code in modules, and more would need this flag. Getting all of the commonly installed interpreters patched is likely to be a project that takes some time, even if all of the relevant projects go along with the idea.

The good news is that some projects, at least, are aware of the issue. The Python project, for example, has been working since (at least) 2017 to provide audit information to the underlying operating system; that work is formalized as PEP 578 ("Python runtime audit hooks"), which was approved in May 2019 and shipped in the Python 3.8 release. Simply supporting O_MAYEXEC doesn't require the addition of an entire new subsystem, though, so adding this support to other interpreters need not be a multi-year project.

This patch set is in its fifth revision as of this writing. It has changed considerably as the result of review comments. The original version, posted at the end of 2018, predated openat2() and relied on the Yama security module for enforcement. Developers seem relatively happy with the current version, though, so this feature may be getting close to being ready to enter the mainline. Only then can the task of adding support to various interpreters begin.


Completing and merging core scheduling

By Jonathan Corbet
May 13, 2020

OSPM
Core scheduling is a proposed modification to the kernel's CPU scheduler that allows system administrators to control which processes can be running simultaneously on the same processor core. It was originally proposed as a security mechanism, but other use cases have shown up over time as well. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), a group of some 50 developers gathered online to discuss the current state of the core-scheduling patches and what is needed to get them into the mainline kernel.

Status update

Vineeth Pillai started off by noting that, while mitigations have been developed for the known Spectre vulnerabilities, they do not provide complete protection on systems where simultaneous multi-threading (SMT or "hyperthreading") is in use. SMT creates the illusion of multiple CPUs running on shared hardware, increasing performance. The sharing of the underlying processor, however, provides numerous opportunities for speculative-execution vulnerabilities and the creation of covert channels. The only way to truly protect a system against these vulnerabilities is to disable SMT, which comes at a high performance cost for some workloads.

Many Spectre vulnerabilities can be mitigated through selective cache flushing, Pillai said, but that is of limited utility when SMT is in use. Performing a cache flush will remove any information placed there before the flush, making it inaccessible afterward. To be effective, though, tasks that do not trust each other must not share the core between flushes. Without SMT, the kernel can perform a flush whenever it switches tasks, ensuring that this never happens. On an SMT core, though, each CPU schedules independently, so this kind of flushing discipline cannot be maintained.

The way to avoid this problem is to keep tasks that don't trust each other from running on the same core — which means disabling SMT on current kernels. The alternative is core scheduling, where tasks in a given trust domain can be explicitly grouped together. The scheduler maintains core-wide knowledge of these trust domains and ensures that only tasks in the same domain run simultaneously on any given core. If there is no trusted companion for the highest-priority task, sibling CPUs can be forced idle while that task runs rather than run an untrusted task there.

One open issue, Pillai said, is load balancing, which doesn't currently work well with core scheduling. This could perhaps be improved by selecting a single run queue to hold the shared information needed for core scheduling. When a scheduling event happens, the highest-priority task would be chosen as usual. Then any sibling processors can be populated with matching tasks from across the system, should any exist.

Core scheduling currently uses CPU control groups for grouping; there is a cpu.tag field that can be set to assign a "cookie" identifying the scheduling group a task belongs to. This was done for a quick and easy implementation, he said, and need not be how things will work in the end. There is a red-black tree in each run queue, ordered by cookie value, that is used to select tasks for sibling processors.

The patch series is up to version 5, which includes some load-balancing improvements. Earlier versions did not understand load balancing at all, so if a task was migrated to a CPU running (incompatible) tagged tasks, it could end up being starved for CPU time. A sixth revision is coming soon, he said.

One challenge that has to be dealt with is comparing the priority of tasks across siblings. Within a run queue, a task's vruntime value is used to determine whether it should run next. This value is a sort of virtual run time, indicating how much CPU time the task has received relative to others (though it is scaled by the process priority and adjusted in various other ways), but this value is specific to each run queue. A vruntime in one run queue cannot be directly compared to a vruntime in another queue.

One possible solution is to normalize these values. Each queue maintains a min_vruntime value, which is the lowest vruntime of any task in that queue. If a specific task's vruntime is normalized by subtracting the local min_vruntime, it can then be compared to a value in another run queue by adding that queue's min_vruntime. A solution based on this turned out to have starvation issues, though, leading to the creation of a core-wide vruntime instead; unfortunately, there are still starvation problems with this implementation, and discussions are ongoing.

Work for the sixth revision includes some attention to the process selection code, which currently only picks the highest-priority task to run. That can cause starvation issues as well. There are also problems with tasks that go into the kernel (to handle an interrupt, for example) then end up being co-scheduled with an untrusted task; this will be expensive to fix. Some thought is going into the API, and perhaps switching to prctl() to set a task's grouping.

Pillai concluded by asking whether this work should be merged into the mainline kernel. There are a number of arguments for that. Core scheduling shows better performance than just disabling SMT for a number of production use cases. It is controlled by a configuration option and is off by default even when configured in; there is no impact on performance when core scheduling is disabled. On the other hand, it's still only a partial mitigation for the problem, it has some fairness issues, there is code cleanup needed, and it still lacks a widely accepted API.

IRQ leak mitigation and accounting

Joel Fernandes then took over to talk about one of the remaining Spectre mitigation issues: interrupt leaks. Google (his employer) wants to use core scheduling with Chrome OS, since there are some tangible benefits. In tests with the camera running on a Chromebook, core scheduling reduced key-press latency considerably while improving the camera's frame rate by 3%. But developers there are concerned that interrupts (of both the hard and soft variety) can cause untrusted tasks to run simultaneously with the kernel's interrupt handlers, exposing the kernel to data leaks. Core scheduling currently only works at the task level, with no control over interrupt handling, so it cannot address this problem.

The proposed solution is the just-posted IRQ leak mitigation patch. It works by tracking whenever a CPU within a core calls irq_enter() — when it starts to handle an interrupt, in other words. The first such call within a core will, if another CPU is running an untrusted task, cause an inter-processor interrupt forcing the other CPU to go idle. The scheduler itself also must be modified so that, when it switches from one task to another, it checks to see if another CPU is handling interrupts at the moment. If so it will wait until the coast is clear and caches have been flushed.

There were some questions about how Chrome OS uses core scheduling. It seems that all "system tasks" are allowed to run together, outside of any core-scheduling group. Browser-based tasks (which are nearly all user tasks in Chrome OS) are each put into their own group, and thus run isolated. In other words, the untrusted tasks are specially marked by the system. Peter Zijlstra remarked that this means tasks default to the trusted state, which seems insecure; he suggested that the default be untrusted instead.

Juri Lelli asked about other scheduling classes; what happens if there is a realtime FIFO task in the system? Zijlstra answered that the usual ordering will be followed; the FIFO task, since it has the highest priority, will be picked first. Non-realtime tasks in the same group can then run on sibling processors, if they exist, though that would be a bit unusual since such tasks could interfere with the realtime task.

Dario Faggioli talked for a bit about SUSE's use case for core scheduling: making sure that accounting for virtualized guests is accurate. A typical host system is running a lot of tasks, many of which represent the virtual CPUs (vCPUs) of different virtual machines. The scheduler can mix up those vCPUs in any way it likes, regardless of how they correspond to the virtual machines they emulate.

Since tasks running on sibling CPUs are contending for the underlying hardware, they can affect each other's performance. Two vCPUs may appear to spend the same amount of time running on sibling CPUs, but one of those two may have actually consumed much more run time than the other. The result is unfair accounting of CPU time. Core scheduling can help by ensuring that vCPUs from the same virtual machine run on the same core, so that they only contend against each other; that makes the accounting more fair.

Things can be improved further by defining the virtual machines so that some of their vCPUs are set up as SMT siblings, allowing the guest operating system to optimize its scheduling accordingly. That only works, though, if the virtual machine's description bears some relationship to the physical reality. Once again, core scheduling can make that happen.

The security use case also applies to virtualization, Faggioli said. Core scheduling helps there, but does not yet appear to be a complete solution; the interrupt situation discussed by Fernandes is one place where work still needs to be done. A full solution is likely to need technologies like address-space isolation as well.

Performance

Tim Chen presented a number of performance benchmarks that emulate several different use cases. A set of virtualization tests showed the system running at 96% of the performance of an unmodified kernel with core scheduling enabled; the 4% performance hit hurts, but it's far better than the 87% performance result measured for this workload with SMT turned off entirely. Some tests with the sysbench benchmark gave similar results for core scheduling, but turning off SMT cut performance nearly in half. The all-important kernel-build benchmark showed almost no penalty with core scheduling, while turning off SMT cost 8%.

Julien Desfossez presented results from a MySQL benchmark showing performance dropping by 60% when core scheduling is used. Painful, but turning off SMT is far worse. A CPU-intensive benchmark based on Linpack showed core-scheduling performance that was slightly better than mainline, while turning off SMT incurs a 90% performance hit.

Faggioli ran a set of tests on a 256-CPU AMD system, which does not need core scheduling for Spectre mitigation at all. He was interested in the performance cost of having core scheduling built into the kernel but turned off. Kernbench ran at 98.6% of mainline with 128 jobs, up to 99.9% with 256 jobs. Various other tests yielded similar numbers.

There was some vague discussion of fairness testing — ensuring that all tasks get equal amounts of CPU time. The results were described as "messy" and hard to interpret, but the overall impression is that core scheduling yields less fair results.

Discussion

The final part of this three-hour session was given over to unstructured discussion. Zijlstra, who will probably make the decision over whether core scheduling will be merged or not, started by saying that he would like to see some better documentation. In particular, he wants clear information on where core scheduling helps and where it doesn't; in which situations will it be helpful to turn core scheduling on? There are some things to work out, he said, including fairness and some problems with CPU hotplug. That can be sorted out, but the documentation is necessary to be able to move forward with this work.

Dhaval Giani said that there are some cases that just don't work; not all problems are amenable to solution in the scheduler. Address-space isolation may also be needed to have a complete solution to the (known) Spectre problems. Zijlstra repeated that documentation covering what does work is needed. Then users can decide whether they care enough to turn it on for their specific situations.

Aaron Lu said that there are still problems around vruntime. If two tasks have the same tag but differing weights (priorities), the core vruntime will become that of the lower-weight task since that task will not run as much. The difference between the two can become large. Zijlstra answered that unbounded divergence of vruntime between tasks is not a good thing, but renormalization is expensive. It is also unneeded. Once a sibling CPU has been idled, there is only one run queue that matters; that would be a good time to synchronize vruntime values. Lu expressed a desire to see a patch implementing that; Zijlstra expressed a weary willingness to try to find time to create one.

Vincent Guittot raised the load-balancing issue; the results can be unfair, he said. If there are five tasks on a four-CPU system, one of those tasks will end up running slower than the others. He will be talking more about this issue later in the conference. In any case, Zijlstra said, the vruntime issues need to be worked out before load balancing can be resolved.

As the session wound down, Giani tried to put together a list of the things that need to be worked out; these included vruntime, CPU hotplug, and the fairness issues. Pillai added starvation of untagged tasks to the list. Zijlstra asked if that problem was for tasks with a cpu.tag value of zero, which means no tag at all; Pillai said yes, but the special zero tag means that the task does not go into the core-scheduling red-black tree. Zijlstra suggested adding those tasks to the tree, which would remove the exceptional case and make things work again.

Fernandes raised a related issue: that red-black tree contains task vruntime values that are used when selecting compatible tasks, but those values are not updated as the tasks run. Pillai said that this is a problem: old vruntime values can cause the scheduler to select the wrong task to run. Zijlstra said selecting a task for execution should remove it from this tree, as is done for the run-queue red-black tree. Doing this would slow things down, but it may be worth it; the penalty should be relatively small for virtualization workloads, since vCPUs are not rescheduled that often.

The session ended with Zijlstra saying that this work looks ready to proceed, and that the remaining issues can be worked out on the mailing list.

Comments (4 posted)

What's coming in Go 1.15

May 12, 2020

This article was contributed by Ben Hoyt

Go 1.15, the 16th major version of the Go programming language, is due out on August 1. It will be a release with fewer changes than usual, but many of the major changes are behind-the-scenes or in the tooling: for example, there is a new linker, which will speed up build times and reduce the size of binaries. In addition, there are performance improvements to the language's runtime, changes to the architectures supported, and some updates to the standard library. Overall, it should be a solid upgrade for the language.

Since the release of Go 1.0, the Go team has consistently shipped improvements to the tooling and the standard library with each version, but has always been conservative about language changes. Many other languages ship significant language features every release, but Go has only shipped a few minor ones in the versions since 1.0.

This is a conscious design choice: since the 1.0 release, the emphasis from the team has been stability and simplicity. The Go 1 compatibility promise guarantees that all programs written for Go 1.0 will continue to run correctly, unchanged, for all 1.x versions. Go programmers usually see this as a good thing — their programs continue to "just work", but generally get consistently faster.

In the upcoming 1.15 version, changes to the language specification are basically non-existent as expected; the improvements are in the tooling, the performance of the compiler, and in the standard library. As tech lead Russ Cox noted, the core developers are planning to be extra-conservative in 1.15 given the pandemic:

We don't know how difficult the next couple months will be, so let's be conservative and not give ourselves unnecessary stress by checking in last-minute subtle changes that we'll need to debug. Leave them for the start of the next cycle, where they'll get proper soak time.

[...] Go 1.15 is going to be a smaller release than usual, and that's okay.

On May 1, Go 1.15 entered feature freeze, and the Go team plans to make the final release on August 1, keeping to the regular six-month release cycle.

The Go development model is rather different than that of most open-source languages. The language was designed at Google and most of the core developers work there (so ongoing development is effectively sponsored by Google). The language has a permissive, BSD-style license, and development is done in the open, with general discussion on the golang-dev mailing list. Changes or new features are proposed and discussed in the GitHub repository's issues, and code review is done via comments on the Gerrit code changes (called "changelists" or "CLs").

A new linker

One of the largest tooling changes in 1.15 is the completely rewritten linker. The new linker design document, authored by Go core contributor Austin Clements in September 2019, details the motivation for the rewrite and the improvements it will bring. There are three major structural changes in the new linker:

  • Moving work from the linker to the compiler: this enables parallelization, as compiles are done in parallel across multiple CPUs (or machines), but the link step almost always has to be done in serial at the end of the build. Additionally, the results of the compiler are cached by the Go tooling.
  • Improving key data structures, primarily by avoiding strings. The current linker uses a big symbol table indexed by string; the new design avoids strings as much as possible by using a symbol-numbering technique.
  • Avoiding loading all input object files into memory at once: this makes the new linker use less memory for large programs, and allocate less memory overall (the current linker spends over 20% of its time in the garbage collector).

Now that Ken Thompson, author of the original linker, has retired, there's also the matter of maintainability. As Clements put it:

The original linker was also simpler than it is now and its implementation fit in one Turing award winner’s head, so there’s little abstraction or modularity. Unfortunately, as the linker grew and evolved, it retained its lack of structure, and our sole Turing award winner retired.

Given the sweeping long-term changes, this work is being done on a branch (dev.link) that is merged into master only at stable points. Than McIntosh, who is working on the new linker, described what has already been done for 1.15: most of the structural improvements in the design document have been completed, including the new object file format and tighter symbol representation. Builds are already faster and use less memory than in 1.14, but some features (for example, using the DWARF 5 debugging format) will have to wait for 1.16.

Clements added more detail on the parallelization efforts, as well as the gradual way the work is being phased in:

We [...] also made many other improvements along the way like parallelizing key phases and removing a lot of unnecessary I/O synchronization. In order to best build on all of the past work on the linker, we did this conversion as a "wavefront", with a phase that converted from the new representation to the old representation that we pushed further and further back in the linker. We're not done yet: that conversion phase is still there, though exactly when it happens and what it does depends on the platform. For amd64 ELF platforms, it's quite late and does relatively little. For other platforms, it's not quite as far back and does more, so the wins aren't as big yet. Either way, there's more to look forward to for 1.16.

For now, the linker still converts the output back to the old in-memory representation for the last part of the linking. Presumably in a future version of Go these last steps will be moved into the new linker and the conversion phase will be removed entirely, reducing link time and memory usage further.

Smaller binaries

Related are several improvements that reduce the size of executables built with Go 1.15. As Brad Fitzpatrick showed, the new linker eliminates a lot more unused code, bringing Fitzpatrick's (rather artificial) test program down from 8.2MB in Go 1.14 to 3.9MB in 1.15. For more realistic programs, binary sizes go down by 3.5% or as much as 22%. A web server program that I run went down from 21.4MB to 20.3MB, a reduction of 4.9%.

The biggest contributors to this are the unused code elimination in the new linker, along with several targeted improvements, such as Clements's CL 230544, which reduces the number of stack and register maps included in the executable. These maps are used by Go's garbage collector (GC) to determine what objects are alive, but are now only needed at call sites, instead of for every instruction. This change reduces the size of the go binary by 5.7%; it also speeds up compiles and links by a significant amount.

Due to Go's ability to inspect types at runtime (using the reflect package), Go binaries contain a significant amount of type information. CL 231397 by Cherry Zhang causes a symbol's type information to be included in the output only if the type is converted to an interface somewhere in the program (only values converted to an interface can be used with reflection). This change reduces the size of a hello-world program by 7.2%.

There are a few other minor improvements to binary size, such as Brad Fitzpatrick's CL 228111, which avoids including both the TLS client and server code in the output if only one of them is used, reducing the size of a TLS dial hello world program by 3.2%.

Performance improvements

Go 1.15 introduces many minor performance improvements, but two of the more notable ones are from prolific non-Google contributor Josh Bleecher Snyder. CL 216401 avoids allocating memory when converting small integers to an interface value, giving a 2% improvement in compile-to-assembly times. Converting a value to an interface is akin to "boxing" in other languages; the optimization is similar in spirit to Python's small-integer caching, though it happens in Go far less often due to static typing.

The second of Snyder's changes is CL 226367 in the internals of the compiler and runtime, which allows the compiler to use more x86 registers for the garbage collector's write-barrier calls. Go uses a write barrier (kind of like a lock) to maintain data integrity on the heap when the GC is running concurrently with user code (this detailed analysis of Go's GC has more information). This results in slightly smaller binaries and a 1% improvement in compile times.

Michael Knyszek significantly increased throughput of memory allocation for large blocks by redesigning the memory allocator's "mcentral" data structure to reduce lock contention. The new allocation code is more than twice as fast for blocks of 12KB or larger.

Tooling and ports

The Go "modules" feature (Go's dependency management system) was first introduced in Go 1.11, and support for a module mirror or "proxy" was added in 1.13. Version 1.15 adds support for a fallback proxy, allowing the go tool to fall back to a secondary host if the first one fails when downloading module source code. Fallbacks are specified using the GOPROXY environment variable's new "|" separator.

Go 1.15 removes two older ports: darwin/386 and darwin/arm, which provided 32-bit binaries on macOS and other Apple operating systems. Fitzpatrick notes that macOS Catalina doesn't support running 32-bit apps, so removing those ports will help free up macOS build machines as well as shrinking the compiler slightly. These ports were announced as deprecated in the Go 1.14 release.

On the other hand, the linux/arm64 port was upgraded to a "first class port", which means that broken builds for linux/arm64 will block releases; official binaries as well as install documentation are provided by the Go team. As Fitzpatrick noted, Linux 64-bit Arm is now at least as important as 32-bit Arm, which is already a first-class port.

On Windows, Go 1.15 now generates executables that use address-space layout randomization (ASLR) by default. ASLR uses position-independent code to randomize the addresses of various data areas on startup, making it harder for attackers to predict target addresses and create memory-corruption exploits.

Standard library additions

Go's standard library is large and fairly stable; in Go 1.15 only relatively minor features were added.

The standard library's testing package is quite minimalist — the Go philosophy is to avoid domain-specific languages for writing tests and assertions, and instead to just write plain Go, which the developer already knows. But the core developers found creating a temporary directory useful enough to approve adding a TempDir() method that lazily creates a temporary directory for the current test and deletes it automatically when the test is finished.

The net/url package adds a new URL.Redacted() method that returns the URL as a string, but with the password redacted (replaced by xxxxx). URLs with passwords such as https://username:password@example.com/ are not usually used in browsers anymore, but are still surprisingly common in scripts and tools. Redacted() can be used to log URLs more securely, in line with RFC 3986's guidelines to not render the part after the : as clear text.

A new time/tzdata package was added to allow embedding a static copy of the time zone database in executables. Because it adds about 800KB to the executable, it's opt-in: either by importing the time/tzdata package, or by compiling with the timetzdata build tag. The embedded database can make time zone database access more consistent and reliable on some systems (particularly Windows), and it may also be useful in virtualized environments like Docker containers and the Go playground.

Parting thoughts

Go uses GitHub issues to track all bugs and feature requests, so you can scan the list of closed issues in the Go 1.15 milestone for further exploration of what's in the release. The 1.15 final release is still over two months away, but you can easily test your own code against the latest version using the gotip tool, or wait for the binary beta release — scheduled for June 1. Bugs found now will almost certainly be fixed before the 1.15 final release.

Comments (31 posted)

Page editor: Jonathan Corbet


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds