
Six years with the 4.9 kernel

By Jonathan Corbet
January 12, 2023
The release of the 4.9.337 stable kernel update on January 7 marked the end of an era: after just over six years of maintenance, the 4.9.x series will receive no more updates. This kernel saw a lot of change after Linus Torvalds made the "final" release and left the building; it's time for a look at the "stable" portion of this kernel's life to see what can be learned.

The development cycle that led up to the 4.9 release saw the addition of 16,214 non-merge changesets contributed by 1,719 developers (a record at the time) working for (at least) 228 companies. In the six years between 4.9 and 4.9.337, instead, it gained 23,391 non-merge changesets from 4,037 developers working for at least 503 companies. The 4.9.337 release contains 114,000 more lines of code than 4.9 did. Rather than being the end of a kernel's development life, the final release from Torvalds is really just the beginning of a new and longer phase — at least, for long-term-support kernels.
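
Numbers like these can be pulled straight out of a kernel Git tree. Here is a minimal sketch, assuming a tree in which the v4.8 and v4.9 mainline tags and the v4.9.337 stable tag are all reachable; the helper name is mine, and the article's actual statistics come from more elaborate tooling:

    # Count non-merge changesets in a revision range: a rough sketch of
    # how the changeset counts above can be reproduced.
    import subprocess

    def changeset_count(rev_range):
        # "git rev-list --no-merges --count" prints the number of
        # non-merge commits reachable in the given range.
        out = subprocess.run(
            ['git', 'rev-list', '--no-merges', '--count', rev_range],
            capture_output=True, text=True, check=True)
        return int(out.stdout)

    print(changeset_count('v4.8..v4.9'))      # the 4.9 development cycle
    print(changeset_count('v4.9..v4.9.337'))  # the whole 4.9.x stable series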

Contributors

The top contributors of fixes to 4.9.x were:

Top bug-fix contributors to 4.9.x
Developer             Changesets  Pct
Greg Kroah-Hartman           470  2.0%
Eric Dumazet                 395  1.7%
Johan Hovold                 356  1.5%
Dan Carpenter                326  1.4%
Takashi Iwai                 295  1.3%
Arnd Bergmann                286  1.2%
Thomas Gleixner              196  0.8%
Jason A. Donenfeld           171  0.7%
Eric Biggers                 159  0.7%
Colin Ian King               138  0.6%
Christophe JAILLET           134  0.6%
Nathan Chancellor            125  0.5%
Hans de Goede                120  0.5%
Geert Uytterhoeven           117  0.5%
Xin Long                     113  0.5%
Yang Yingliang               108  0.5%
Jan Kara                     102  0.4%
Randy Dunlap                 101  0.4%
Linus Torvalds                92  0.4%
Johannes Berg                 92  0.4%
Peter Zijlstra                91  0.4%
Al Viro                       90  0.4%
Florian Fainelli              89  0.4%
Theodore Ts'o                 88  0.4%

While Greg Kroah-Hartman shows as the top contributor of changesets, it is worth remembering that 337 of them are simply setting the version number for each stable release. His appearance there is thus an artifact of how the stable kernels are produced — not that he doesn't play a major role in this process, of course, as will be seen below.

The most active employers of contributors to 4.9.x were:

Employers supporting 4.9.x fixes
Company              Changesets  Pct
(Unknown)                  2177  9.3%
(None)                     2149  9.2%
Google                     1940  8.3%
Red Hat                    1911  8.2%
Intel                      1553  6.6%
SUSE                       1181  5.0%
Huawei Technologies        1050  4.5%
IBM                         834  3.6%
(Consultant)                767  3.3%
Linux Foundation            697  3.0%
Linaro                      625  2.7%
Arm                         434  1.9%
Oracle                      387  1.7%
Mellanox                    314  1.3%
Samsung                     286  1.2%
Broadcom                    260  1.1%
Linutronix                  234  1.0%
Facebook                    226  1.0%
Renesas Electronics         201  0.9%
NXP Semiconductors          196  0.8%

It can be interesting to compare these numbers to the statistics for the 4.9 release. There are many of the same names there, but the ordering is different. The biggest contributors of work for a mainline release may not be the biggest contributors of fixes after that release is made.

Backports

The stable rules require that changes appear in the mainline before being added to a stable update, so most (or all) of the patches counted above were written for the mainline. Backporting them to 4.9 is a different level of work on top of that. This task can be as simple as applying a patch unmodified to a different tree, or as complex as rewriting it altogether. Either way, there is clearly a lot of work involved in backporting over 23,000 patches to a different kernel.

One way to try to separate out that work was suggested by Srivatsa S. Bhat. A developer who backports a patch to an older kernel is essentially resubmitting it, and so must add a Signed-off-by tag to the patch changelog. Each patch in the stable kernel also contains the commit ID of the original in the mainline. Using that information, one can look at each stable patch and identify any Signed-off-by tags that were added since that patch was merged into the mainline. Those additional signoffs should indicate who backported each one.
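
As a rough illustration of that approach, consider the following sketch; the regular expressions and helper names are mine, and it relies on the two common forms ("commit <sha> upstream" and "[ Upstream commit <sha> ]") in which stable commits reference their mainline original:

    # Sketch: identify who backported a stable commit by finding the
    # Signed-off-by tags it carries beyond those on its mainline original.
    import re
    import subprocess

    UPSTREAM_RE = re.compile(r'commit ([0-9a-f]{40}) upstream'
                             r'|\[ Upstream commit ([0-9a-f]{40}) \]')
    SOB_RE = re.compile(r'^Signed-off-by:\s*(.+)$', re.MULTILINE)

    def commit_body(rev):
        out = subprocess.run(['git', 'log', '-1', '--format=%B', rev],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def backporters(stable_rev):
        # Return the signoffs on the stable commit that are absent from
        # the upstream original.
        body = commit_body(stable_rev)
        m = UPSTREAM_RE.search(body)
        if not m:
            return set()    # no upstream reference (version bumps, etc.)
        upstream = m.group(1) or m.group(2)
        return (set(SOB_RE.findall(body))
                - set(SOB_RE.findall(commit_body(upstream))))

Tallying the output of backporters() over every commit in v4.9..v4.9.337 is what produces numbers like those below.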

In the 4.9.x series, 21,495 of the commits have added Signed-off-by tags. The remaining ones will include the above-mentioned version-number changes, patches that should have gotten an additional tag but didn't, and (most probably) patches that were backported by their original author. The result is thus a picture that is not perfect, but which is clear enough:

Top 4.9.x backporters
Developer                      Changesets  Pct
Greg Kroah-Hartman                  15135  70.41%
Sasha Levin                          9208  42.84%
Ben Hutchings                         310  1.44%
David Woodhouse                       142  0.66%
Amit Pundir                            90  0.42%
Sudip Mukherjee                        83  0.39%
Jason A. Donenfeld                     73  0.34%
Mark Rutland                           71  0.33%
Lee Jones                              52  0.24%
Nathan Chancellor                      44  0.20%
Florian Fainelli                       42  0.20%
David A. Long                          40  0.19%
Nick Desaulniers                       36  0.17%
Alex Shi                               27  0.13%
Thomas Gleixner                        24  0.11%
James Morse                            24  0.11%
Giuliano Procida                       24  0.11%
Nobuhiro Iwamatsu                      23  0.11%
Thadeu Lima de Souza Cascardo          22  0.10%
Arnd Bergmann                          15  0.07%

The bulk of the backporting work is clearly being done by the two stable-kernel maintainers: Kroah-Hartman and Sasha Levin. In some cases, they have both added signoffs to the same patch, causing the percentages to add up to more than 100%. The work done by everybody else pales by comparison — especially if one only looks at the patch counts. Often, though, the reason for a developer other than the stable-kernel maintainers to backport a patch is that the backport is not trivial. So, while the other developers backported far fewer patches, many of those patches almost certainly required a lot more work.

Bug reports

In theory, almost every patch in the stable series is a bug fix, implying that somebody must have found and reported a bug. As it happens, only 4,236 of the commits in the 4.9.x series include a Reported-by tag — just 18% of the total. So most of the problems being fixed are either coming to light in some other way, or the report tags are not being included. For the patches that did include such tags, the results look like:

Top bug reporters in 4.9.x
Reporter            Reports  Pct
Syzbot                  901  18.8%
Hulk Robot              181  3.8%
kernel test robot       156  3.2%
Dmitry Vyukov           100  2.1%
Andrey Konovalov         80  1.7%
Dan Carpenter            79  1.6%
Jann Horn                34  0.7%
Guenter Roeck            29  0.6%
Jianlin Shi              27  0.6%
Ben Hutchings            26  0.5%
Fengguang Wu             26  0.5%
Al Viro                  21  0.4%
Arnd Bergmann            19  0.4%
Lars-Peter Clausen       19  0.4%
Xu, Wen                  19  0.4%
Eric Biggers             18  0.4%
Igor Zhbanov             18  0.4%
TOTE Robot               18  0.4%
Tetsuo Handa             17  0.4%
Linus Torvalds           16  0.3%

Bug reporting is clearly widely distributed, with the top three reporters (all robots) accounting for just over 25% of the total. Even so, it is clear that the bug-hunting robots are finding a lot of problems, hopefully before our users do.
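
Counts like these can be gathered by scanning each commit message in the series for Reported-by tags; here is a minimal sketch (the regular expression is mine, and it deliberately ignores everything after the reporter's name):

    # Tally Reported-by tags across the 4.9.x stable series: a sketch.
    import re
    import subprocess
    from collections import Counter

    # One "git log" over the whole range; %x00 puts a NUL byte between
    # commit bodies so they can be split apart reliably.
    log = subprocess.run(['git', 'log', '--format=%B%x00', 'v4.9..v4.9.337'],
                         capture_output=True, text=True, check=True).stdout
    reported_re = re.compile(r'^Reported-by:\s*([^<\n]+)', re.MULTILINE)

    reporters = Counter()
    tagged_commits = 0
    for body in log.split('\x00'):
        names = [name.strip() for name in reported_re.findall(body)]
        if names:
            tagged_commits += 1
            reporters.update(names)

    print(f'{tagged_commits} commits carry a Reported-by tag')
    for name, count in reporters.most_common(20):
        print(f'{count:5}  {name}')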

Bug introduction

Another thing one can look at is the source of the bugs that were fixed in 4.9.x. Some work mapping Fixes tags in 4.9.x commits to the original commits can shine a light on when bugs were introduced; the result is a plot that looks like this:

[4.9.x fixes]
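
The mapping behind that plot can be approximated with "git describe --contains", which names the first tag that contains a given commit; here is a minimal sketch, with the helper name and the -rc folding being assumptions of mine:

    # Map the commit named in a Fixes: tag to the release that
    # introduced it: a sketch of the analysis described above.
    import re
    import subprocess

    FIXES_RE = re.compile(r'^Fixes:\s*([0-9a-f]{8,40})', re.MULTILINE)

    def introduced_in(sha):
        # "git describe --contains" yields something like "v4.8-rc1~37^2~5";
        # the part before the first '~' or '^' is the earliest tag that
        # contains the commit.
        out = subprocess.run(['git', 'describe', '--contains', sha],
                             capture_output=True, text=True)
        if out.returncode != 0:
            return None    # unknown or malformed commit reference
        tag = re.split(r'[~^]', out.stdout.strip())[0]
        # Fold -rc tags into the release whose merge window they followed.
        return re.sub(r'-rc\d+$', '', tag)

Running every Fixes tag in v4.9..v4.9.337 through introduced_in() and counting per release yields the distribution shown in the plot.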

The 4.9 and 4.8 releases are, unsurprisingly, the source of many of the bugs fixed in the stable updates, with nearly 700 coming from each. After that comes the usual long tail: every pre-4.10 release since the Git era began at 2.6.12 is represented here. The least-fixed release is 2.6.17, released in 2006, with "only" 22 fixes.

That plot is not the whole story, though. Each of the 4.9.28, 4.9.34, 4.9.51, 4.9.75, 4.9.77, 4.9.78, 4.9.79, 4.9.94, 4.9.102, 4.9.187, 4.9.194, 4.9.195, 4.9.198, 4.9.207, 4.9.214, 4.9.219, 4.9.228, 4.9.253, 4.9.258, 4.9.259, 4.9.261, 4.9.265, 4.9.298, and 4.9.299 releases included a commit that was identified by a later Fixes tag; 4.9.81 and 4.9.218 had two, and 4.9.310 had three. Each of those, clearly, indicates a regression that was introduced into the stable kernel and later fixed. But even that is not the full picture; consider this:

[post-4.9.x fixes]

Every release made after 4.9 also introduced bugs that had to be fixed in the stable updates — over 1,500 fixes in all. That is a lot of buggy commits to have been introduced into a "stable" kernel. One should also not take the wrong message from the lower counts for more recent kernel releases: it is possible that our releases are getting less buggy, but a more plausible explanation is that the empty space in the upper-right half of that plot simply represents bugs that have not yet been found and fixed.

The 4.9 stable series was, thus, not perfect — not that anybody ever claimed that it was. It was, however, good enough to be the core of many deployed systems, including an unimaginable number of Android devices. The 4.9 kernel series is a testament to what the development community can accomplish when it sets its mind to it. It was a base that many users could rely on, and has well earned its retirement.

Index entries for this article
Kernel: Releases/4.9



Six years with the 4.9 kernel

Posted Jan 13, 2023 8:52 UTC (Fri) by error27 (subscriber, #8346) [Link] (11 responses)

In olden days we used to measure how good a kernel was by comparing uptimes.

I've heard distro developers say that some kernels are definitely better than others. They say there is an element of chance or something and it just works out that some kernels end up being headaches.

Another thing that happens is that if people know a kernel is going to be a long-term release, or something that the distros are going to support for a long time, then they try to get all their features merged into it. Maybe the features are not quite up to ideal standards, but people really want to get them merged instead of waiting for the next distro release. So that could impact the quality.

If you look at the top bug reporters chart, there are 1804 bugs. At least 1081 of those were Syzkaller bugs. Syzbot was created at about the time that 4.9 was released, so there was a backlog of bugs. New kernels will not inherit those bugs, and Syzbot finds a lot of bugs in new code before the kernels are released to users.

Another 460 bug reports came from static analysis. (Although potentially some are double-counted if the kbuild-bot and I both reported the same bug?) That means that static checkers are discovering new types of bugs and more people are using static analysis on their code. The static-checker fixes combined with the Syzbot fixes mean 1523 bug fixes are the result of new tools or process improvements.

New subsystems are always buggy at first. In olden days wifi or power management used to be problematic but these days they have improved and matured. The hardware has also matured.

So new kernels should be better than old kernels. We often used to talk about doing a "Bug Fix Only" release, but these days no one talks about that, which also suggests people are happier.

Six years with the 4.9 kernel

Posted Jan 13, 2023 9:21 UTC (Fri) by taladar (subscriber, #68407) [Link] (10 responses)

Hopefully you mean that the kernels with high uptimes were bad because they got insufficient attention for security fixes?

We really should get rid of that idea that a high uptime is something good and to strive for. The longer the system is up, the more known issues exist with the kernel, and the less tested the boot process becomes (as in, it is less likely to boot if you are forced into starting it by something like a power outage or a hardware problem).

Six years with the 4.9 kernel

Posted Jan 13, 2023 9:28 UTC (Fri) by error27 (subscriber, #8346) [Link]

Correct. Olden times were simpler times. :)

I think it would still be a fun challenge to maintain long uptimes by using live patching to address security issues. (I have never live-patched a kernel myself, so I don't know exactly how fun this is...)

Six years with the 4.9 kernel

Posted Jan 13, 2023 16:41 UTC (Fri) by cesarb (subscriber, #6266) [Link] (8 responses)

> We really should get rid of that idea that a high uptime is something good and to strive for.

No, we should instead get rid of the idea that constant churn is something good and to strive for. Our systems should be stable and secure enough that we shouldn't have to worry about them missing security or stability fixes, or about them failing to boot after a power outage.

That is, having high uptime is a proxy for the system being stable and secure enough that it *can* have a high uptime without worry. And that's something we *should* strive for.

Six years with the 4.9 kernel

Posted Jan 13, 2023 17:22 UTC (Fri) by farnz (subscriber, #17727) [Link] (7 responses)

The capacity for a high uptime is good, but actually having one is bad in practice. Not because of security or stability fixes, but rather because a high uptime correlates with being unable to get the system back into working state after a hardware failure.

A host with a low uptime is a host where everything it does is configured on boot - because anything else is too much pain to administer. A host with a high uptime can be one where important configuration is not persisted to permanent storage, but instead is an in-memory consequence of a command run by an admin who's not currently around and who didn't document what they did sufficiently to allow it to be redone after the power outage (or whatever failure causes the system to reboot).

Six years with the 4.9 kernel

Posted Jan 15, 2023 15:58 UTC (Sun) by ballombe (subscriber, #9523) [Link] (6 responses)

Except that low uptime increases the risk of hardware failure.

Six years with the 4.9 kernel

Posted Jan 15, 2023 17:38 UTC (Sun) by farnz (subscriber, #17727) [Link] (4 responses)

I don't see how a small increase in the risk of hardware failure due to a software restart (which, yes, puts a small amount of extra strain on the hardware) is an argument against being ready for an outage.

The point is that a high-uptime host is more likely to have a change that's critical to operation but not applied on host start-up. If something happens to cause the host to reboot (a mains power outage, perhaps), then someone has to find all of those changes. And the longer the uptime, the more such changes you have to track down.

If you can restore from backup and recover with not more than a day's work lost, that's a lot better than being unable to restore for a week because the person who knows what changes need to be made to get from booted to operational is out of contact. And if the person you need is no longer available at all (left company, or worse), you're in serious trouble.

It's a corollary of the observation that the more often you do something, the more likely you are to be able to do it successfully when you have to do it on an unscheduled basis.

Six years with the 4.9 kernel

Posted Jan 15, 2023 22:02 UTC (Sun) by ballombe (subscriber, #9523) [Link] (3 responses)

It depends on your goal.
If your goal is service availability you might be right,
but if the goal is computational throughput (as in HPC), then long uptimes are better.

Six years with the 4.9 kernel

Posted Jan 16, 2023 10:48 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

Even if your goal is computational throughput, long uptime can be worse because the resulting downtime after an adverse event (such as an electricity grid failure coincident with a UPS failure) is much longer than the downtime due to a single host failing during reboot, but then being trivial to replace.

The tradeoff you're making when you go for long uptimes is between having change management discipline enforced by reboots, having to have some other mechanism to enforce change management discipline, or having unpredictable long downtime where returning a host to operational state after downtime is a skilled process involving sysadmin interaction, rather than something you can delegate to "remote hands" and automation.

Key to this is that what actually matters is change management - regular reboots are merely one way to enforce good change management discipline, and if you're able to enforce it another way, you can have your long uptimes. Not having good change management is what causes unpredictably long downtimes - because a host's required state to become operational may be kept in RAM and one sysadmin's head, and lacking that sysadmin means you have to reverse-engineer the changes they made outside change management.

Six years with the 4.9 kernel

Posted Jan 16, 2023 21:55 UTC (Mon) by neggles (subscriber, #153254) [Link] (1 responses)

You're assuming that configuration changes are being made, and that it's even possible to make them, which is very much not the case for HPC-type systems; those nodes are almost always generic, non-persistent, network-booted machines that you *can't* log into, with their configuration pushed from a central management system, and they're generally isolated from the internet (thus mitigating the vast majority of security issues). High uptime is not inherently bad across all systems.

Six years with the 4.9 kernel

Posted Jan 17, 2023 9:34 UTC (Tue) by farnz (subscriber, #17727) [Link]

If it's impossible to make configuration changes, then it's impossible to get a new workload onto the system, or even new data to process. This surprises me - while I don't work in HPC, I was under the impression that HPC systems get new workloads and data provided on a fairly regular basis after construction.

Given this, regular reboots (which imply short uptimes) are one way to force change management to be done properly - because configuration that's not persisted is lost every time the system reboots. It's not the only way; if the system administrators are suitably disciplined, and the only way for non-administrators to get access is to submit jobs via tooling (e.g. running qsub) that takes care of ensuring that changes do not persist between jobs, then you've got change management that way, and don't need regular reboots.

Six years with the 4.9 kernel

Posted Jan 16, 2023 9:14 UTC (Mon) by jtaylor (subscriber, #91739) [Link]

Is there any evidence that this is really the case?

At work I maintain ~1000 hardware servers that are restarted weekly and a couple hundred that are not; we see no difference in hardware failure.

Privately, my home PCs are rebooted several times a day, and they have all lasted ~5-7 years before being replaced for being outdated, not for having failed.

