
The stable-kernel process

By Jonathan Corbet
September 16, 2019

Maintainers Summit
The stable kernel process is a perennial topic of discussion at gatherings of kernel developers; the 2019 Linux Kernel Maintainers Summit was no exception. Sasha Levin ran a session there where developers could talk about the problems they have with stable kernels and ponder solutions.

Levin began by saying that he has been working on the complaints he received the year before. One of those was that the automatic patch-selection system "goes nuts" and picks the wrong things. It has been retrained twice in the last year and has gotten better at selecting only fixes. About 50% of recent stable releases have been made up of patches explicitly tagged for stable updates; the other half has come from the automated system.
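For reference, the explicit route is the stable tag documented in Documentation/process/stable-kernel-rules.rst: the developer adds a line along these lines to the patch's commit message (the version annotation after the "#" is optional, and the version shown here is just an illustration):

    Cc: stable@vger.kernel.org
    Cc: stable@vger.kernel.org # 4.19.x

Patches without such a tag can still end up in stable kernels by way of the automatic selection work described above.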

One ongoing problem, he said, is that a lot of patches tagged for stable are not being backported properly. If a simple backport effort fails, Greg Kroah-Hartman sends an email to the people involved, who then have an opportunity to do the backport. But, by the time that happens, developers have moved on and are often unwilling to revisit that old work. Peter Zijlstra said that he tends to ignore email about backport failures; he's not sure what else he should do with them. The answer, Levin said, is to send a working backport.

Dave Miller said that he does all the backports himself for the last two stable releases. But then people come back asking for backports to old kernels like 4.4. He just doesn't have the time to try to backport changes that far. As a result, a lot of poor work gets into those older kernels. Thomas Gleixner said that he had to give up on backporting many of the Spectre fixes to the 4.9 kernel. Even some of the more recent fixes for speculative-execution problems are nearly impossible to backport despite being much cleaner code. Kroah-Hartman said that there are people who are paid to do that sort of work; it's not something that kernel developers should have to worry about.

Levin said that he is trying to improve the backport process in general. He now gets alerts for patches that fix other patches that have been shipped in a stable update; those are earmarked for fast processing. He is also putting together a "stable-next" tree containing patches from linux-next that have been tagged for stable. It is intended to be an early-warning system for changes that will be headed toward the stable kernels in the near future.

Jan Kara complained that he recently applied a fix to the mainline that had a high possibility of creating user-space regressions. He had explicitly marked it as not being suitable for the stable updates, but it was included anyway. Levin replied that it is easy to miss those notes, along with other types of information like prerequisite patches for a given fix. There needs to be a better structure for that kind of information; he will be proposing some sort of tag to encapsulate it.
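For what it's worth, Documentation/process/stable-kernel-rules.rst does already describe a minimal way to express prerequisites in the stable tag itself; the example given there looks roughly like this, meaning "cherry-pick a1f84a3 first, then this commit":

    Cc: <stable@vger.kernel.org> # 3.3.x: a1f84a3: sched: Check for idle
    Cc: <stable@vger.kernel.org> # 3.3.x

That convention only covers ordering, though, not "do not backport" notes or the other kinds of caveats Kara described.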

That said, Levin made it clear that he would rather include even the patches that have been explicitly marked as being unsuitable for stable updates. If there are bugs in those patches, users will encounter them anyway once they upgrade. Holding the scarier patches in this way just trains users to fear version upgrades, which is counter to what the community would like to see.

Ted Ts'o asked about the test coverage for stable releases; Kroah-Hartman answered that it is probably more comprehensive than the testing that is applied to the mainline. There are a lot of companies running tests on stable release candidates and reporting any problems they find. This testing goes well beyond basic build-and-boot tests, he said.

The final topic covered was running subsystem tests on backports. The BPF subsystem, for example, has a lot of tests that are known not to work on older kernels, so nobody should be trying to do that. But fixes to tests are backported, so the tests shipped with a given kernel version should always run well with that kernel.
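As a concrete illustration (the specific command is just an example, not something prescribed at the session), the in-tree kselftest harness is the usual way to run one subsystem's tests against the kernel they ship with:

    make -C tools/testing/selftests TARGETS=bpf run_tests

Since the tests and the kernel come from the same stable tree, fixes backported to the tests keep that invocation working on that kernel.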

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting travel to this event.]

Index entries for this article
Kernel: Development model/Stable tree
Conference: Kernel Maintainers Summit/2019



The stable-kernel process

Posted Sep 17, 2019 14:15 UTC (Tue) by wtarreau (subscriber, #51152) (1 response)

That's interesting. When I was in charge of kernel 2.4 about a decade ago, this backport problem was a real concern. Not only did most developers no longer want, or have the energy, to work on 2.4 given how different it was from 2.6, but, because of the very long cooking time of 2.5, 2.4 was also deeply rooted in certain infrastructures that still had to support the evolving parts around them (typically the compilers on the admin's workstation).

We ended up causing quite a bit of breakage by trying to bring it up to support the compilers available in updated distros, with reports of gcc 4.x breaking certain drivers. And that was a mandatory step, if only to make sure that the rare developers occasionally willing to provide some help with their code had a chance to actually build their backport.

Then we faced the first difficult decisions regarding certain security issues: when the fix lacks all the infrastructure needed to be backported, should we invent something else at the risk of breaking everything, or should we just prominently document the feature as insecure? The places where this kernel was deployed at the end of its life were more like deeply embedded devices (displays in train stations, for example) than devices directly connected to the net or exposed to hostile local users. So it made sense to favor reliability and availability over a usually false sense of security (because addressing one visible issue and leaving many others open is not the most effective security strategy).

This experience taught me that when you start to lag on backports, the next ones become increasingly difficult. And when you're caught failing some backports and causing breakage in the field, you know you went too far and have probably already broken other things without knowing what. That's when I realized that while the reliability/stability curve of a stable kernel increases with time, that only lasts for a while, and it can start to decrease again. You'd rather not run such software anymore once its curve has dropped below what it was when you decided it was stable enough for your deployment. Luckily, hardware replacements added pressure on users to switch, though VMs can also remove that pressure.

This situation, in my opinion, is caused by the lack of a moral contract between developers and users: users expect all their bugs to be fixed for free and forever, while developers expect users to upgrade often. The reality should be more nuanced. Users need to be reasonable and accept that, as time goes on, more and more fixes will not be backported, including the security-related ones they value so much. And they need to weigh the risk of hitting such backport bugs against the cost of evaluating a new major version; at some point the latter wins. Developers, for their part, should simply make it clear that past two, three, or four years, bug fixes can take a very long time to appear, sometimes several weeks or even months, because effective stability with older kernels is far more important than fixing theoretical issues. This is already what happens with old kernels, which see releases every 4-6 months, but since it's not clearly stated, some users still hope that an urgent vulnerability will lead to an immediate release.

If I had still been in charge of 3.10 when Spectre/Meltdown were released, I would likely not even have tried to backport the fixes, and would instead have made it clear that those who cared about them on that old a system were probably running the wrong version, and that it was time for them to choose between living with a small risk and evaluating a more modern version.

It's much more difficult for distros, since they don't get to decide who their users are, but they still have to do the backports, possibly for nobody. On the other hand, distros are better placed to propose a smooth upgrade path, since their userland is more under their control. Perhaps making it clear that after three or four years some vulnerabilities could require a dist-upgrade would be acceptable to most users; conversely, these users know exactly what they need, and every time they can skip an update that's irrelevant to them, they do!

For my use cases, I always run the latest LTS branch on my laptop, which means a major upgrade once a year, and I let that kernel cool down for 3-12 months before deploying it to the systems I don't want to upgrade often (servers and embedded devices). I no longer want to see any machine running an unmaintained version and am actively chasing them down, for good reason: it usually takes less time to run make config, build, and flash a new kernel than it does to troubleshoot spurious bugs. For now this seems to be the approach that gives me the best results.
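Concretely (on an ordinary x86 box; embedded targets need their cross-compilers and flashing tools, so treat this only as a sketch of the common case), that rebuild usually amounts to little more than:

    make olddefconfig                    # reuse the existing .config, accept new defaults
    make -j"$(nproc)"                    # build the kernel and modules
    sudo make modules_install install    # install modules and the new kernel image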

The stable-kernel process

Posted Sep 18, 2019 12:12 UTC (Wed) by timtas (guest, #2815)

I use the same approach, going for the newest "longterm" kernel as soon as it is released. That way, I hope to receive as many fixes as possible for a rather recent but still well-tested kernel. Not all that adventurous, I agree, but under the given constraints (50 years old, not an experienced kernel developer, having to do other work for food), I have to settle for that. I would like to help more in the future, though; once I have upgraded my home server to 4.19, I'll try to run a VM on it with the latest stable kernel.

The stable-kernel process

Posted Oct 12, 2019 12:30 UTC (Sat) by nix (subscriber, #2304)

[very late, behind in LWNery]
That said, Levin made it clear that he would rather include even the patches that have been explicitly marked as being unsuitable for stable updates. If there are bugs in those patches, users will encounter them anyway once they upgrade. Holding the scarier patches in this way just trains users to fear version upgrades.
But... surely what this will actually do is introduce more instability into stable kernel upgrades, leading people to fear and delay stable upgrades (with the security and stability fixes they contain) as much as they currently fear and delay major-kernel updates? That seems to be precisely what we don't want to happen.

Again, a division should be drawn here between filesystem/blockdev-layer changes and everything else. With very few exceptions (very unlucky mm changes, etc.), changes in other parts of the kernel lead, at most, to instability that goes away on a reboot; filesystem breakage, by contrast, can damage persistent data structures and lose the user's data. When filesystem changes are marked unsuitable for -stable, it matters. (There is a reason xfs is currently excluded from the autosel machinery!)


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds