
Six years with the 4.9 kernel

By Jonathan Corbet
January 12, 2023
The release of the 4.9.337 stable kernel update on January 7 marked the end of an era: after just over six years of maintenance, the 4.9.x series will receive no more updates. This kernel saw a lot of change after Linus Torvalds made the "final" release and left the building; it's time for a look at the "stable" portion of this kernel's life to see what can be learned.

The development cycle that led up to the 4.9 release saw the addition of 16,214 non-merge changesets contributed by 1,719 developers (a record at the time) working for (at least) 228 companies. In the six years between 4.9 and 4.9.337, instead, it gained 23,391 non-merge changesets from 4,037 developers working for at least 503 companies. The 4.9.337 release contains 114,000 more lines of code than 4.9 did. Rather than being the end of a kernel's development life, the final release from Torvalds is really just the beginning of a new and longer phase — at least, for long-term-support kernels.
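
Numbers like these can be pulled straight out of a kernel Git tree. Here is a minimal sketch, assuming a tree in which the v4.8 and v4.9 mainline tags and the v4.9.337 stable tag are all reachable; the helper name is mine, and the article's actual statistics come from more elaborate tooling:

    # Count non-merge changesets in a revision range: a rough sketch of
    # how the changeset counts above can be reproduced.
    import subprocess

    def changeset_count(rev_range):
        # "git rev-list --no-merges --count" prints the number of
        # non-merge commits reachable in the given range.
        out = subprocess.run(
            ['git', 'rev-list', '--no-merges', '--count', rev_range],
            capture_output=True, text=True, check=True)
        return int(out.stdout)

    print(changeset_count('v4.8..v4.9'))      # the 4.9 development cycle
    print(changeset_count('v4.9..v4.9.337'))  # the whole 4.9.x stable series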

Contributors

The top contributors of fixes to 4.9.x were:

Top bug-fix contributors to 4.9.x
Developer             Changesets  Pct
Greg Kroah-Hartman           470  2.0%
Eric Dumazet                 395  1.7%
Johan Hovold                 356  1.5%
Dan Carpenter                326  1.4%
Takashi Iwai                 295  1.3%
Arnd Bergmann                286  1.2%
Thomas Gleixner              196  0.8%
Jason A. Donenfeld           171  0.7%
Eric Biggers                 159  0.7%
Colin Ian King               138  0.6%
Christophe JAILLET           134  0.6%
Nathan Chancellor            125  0.5%
Hans de Goede                120  0.5%
Geert Uytterhoeven           117  0.5%
Xin Long                     113  0.5%
Yang Yingliang               108  0.5%
Jan Kara                     102  0.4%
Randy Dunlap                 101  0.4%
Linus Torvalds                92  0.4%
Johannes Berg                 92  0.4%
Peter Zijlstra                91  0.4%
Al Viro                       90  0.4%
Florian Fainelli              89  0.4%
Theodore Ts'o                 88  0.4%

While Greg Kroah-Hartman shows as the top contributor of changesets, it is worth remembering that 337 of them are simply setting the version number for each stable release. His appearance there is thus an artifact of how the stable kernels are produced — not that he doesn't play a major role in this process, of course, as will be seen below.

The most active employers of contributors to 4.9.x were:

Employers supporting 4.9.x fixes
Company              Changesets  Pct
(Unknown)                  2177  9.3%
(None)                     2149  9.2%
Google                     1940  8.3%
Red Hat                    1911  8.2%
Intel                      1553  6.6%
SUSE                       1181  5.0%
Huawei Technologies        1050  4.5%
IBM                         834  3.6%
(Consultant)                767  3.3%
Linux Foundation            697  3.0%
Linaro                      625  2.7%
Arm                         434  1.9%
Oracle                      387  1.7%
Mellanox                    314  1.3%
Samsung                     286  1.2%
Broadcom                    260  1.1%
Linutronix                  234  1.0%
Facebook                    226  1.0%
Renesas Electronics         201  0.9%
NXP Semiconductors          196  0.8%

It can be interesting to compare these numbers to the statistics for the 4.9 release. There are many of the same names there, but the ordering is different. The biggest contributors of work for a mainline release may not be the biggest contributors of fixes after that release is made.

Backports

The stable rules require that changes appear in the mainline before being added to a stable update, so most (or all) of the patches counted above were written for the mainline. Backporting them to 4.9 is a different level of work on top of that. This task can be as simple as applying a patch unmodified to a different tree, or as complex as rewriting it altogether. Either way, there is clearly a lot of work involved in backporting over 23,000 patches to a different kernel.

One way to try to separate out that work was suggested by Srivatsa S. Bhat. A developer who backports a patch to an older kernel is essentially resubmitting it, and so must add a Signed-off-by tag to the patch changelog. Each patch in the stable kernel also contains the commit ID of the original in the mainline. Using that information, one can look at each stable patch and identify any Signed-off-by tags that were added since that patch was merged into the mainline. Those additional signoffs should indicate who backported each one.
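
As a rough illustration of that approach, consider the following sketch; the regular expressions and helper names are mine, and it relies on the two common forms ("commit <sha> upstream" and "[ Upstream commit <sha> ]") in which stable commits reference their mainline original:

    # Sketch: identify who backported a stable commit by finding the
    # Signed-off-by tags it carries beyond those on its mainline original.
    import re
    import subprocess

    UPSTREAM_RE = re.compile(r'commit ([0-9a-f]{40}) upstream'
                             r'|\[ Upstream commit ([0-9a-f]{40}) \]')
    SOB_RE = re.compile(r'^Signed-off-by:\s*(.+)$', re.MULTILINE)

    def commit_body(rev):
        out = subprocess.run(['git', 'log', '-1', '--format=%B', rev],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def backporters(stable_rev):
        # Return the signoffs on the stable commit that are absent from
        # the upstream original.
        body = commit_body(stable_rev)
        m = UPSTREAM_RE.search(body)
        if not m:
            return set()    # no upstream reference (version bumps, etc.)
        upstream = m.group(1) or m.group(2)
        return (set(SOB_RE.findall(body))
                - set(SOB_RE.findall(commit_body(upstream))))

Tallying the output of backporters() over every commit in v4.9..v4.9.337 is what produces numbers like those below.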

In the 4.9.x series, 21,495 of the commits have added Signed-off-by tags. The remaining ones will include the above-mentioned version-number changes, patches that should have gotten an additional tag but didn't, and (most probably) patches that were backported by their original author. The result is thus a picture that is not perfect, but which is clear enough:

Top 4.9.x backporters
Developer                      Changesets  Pct
Greg Kroah-Hartman                  15135  70.41%
Sasha Levin                          9208  42.84%
Ben Hutchings                         310  1.44%
David Woodhouse                       142  0.66%
Amit Pundir                            90  0.42%
Sudip Mukherjee                        83  0.39%
Jason A. Donenfeld                     73  0.34%
Mark Rutland                           71  0.33%
Lee Jones                              52  0.24%
Nathan Chancellor                      44  0.20%
Florian Fainelli                       42  0.20%
David A. Long                          40  0.19%
Nick Desaulniers                       36  0.17%
Alex Shi                               27  0.13%
Thomas Gleixner                        24  0.11%
James Morse                            24  0.11%
Giuliano Procida                       24  0.11%
Nobuhiro Iwamatsu                      23  0.11%
Thadeu Lima de Souza Cascardo          22  0.10%
Arnd Bergmann                          15  0.07%

The bulk of the backporting work is clearly being done by the two stable-kernel maintainers: Kroah-Hartman and Sasha Levin. In some cases, they have both added signoffs to the same patch, causing the percentages to add up to more than 100%. The work done by everybody else pales by comparison — especially if one only looks at the patch counts. Often, though, the reason for a developer other than the stable-kernel maintainers to backport a patch is that the backport is not trivial. So, while the other developers backported far fewer patches, many of those patches almost certainly required a lot more work.

Bug reports

In theory, almost every patch in the stable series is a bug fix, implying that somebody must have found and reported a bug. As it happens, only 4,236 of the commits in the 4.9.x series include a Reported-by tag — just 18% of the total. So most of the problems being fixed are either coming to light in some other way, or the report tags are not being included. For the patches that did include such tags, the results look like:

Top bug reporters in 4.9.x
Reporter            Reports  Pct
Syzbot                  901  18.8%
Hulk Robot              181  3.8%
kernel test robot       156  3.2%
Dmitry Vyukov           100  2.1%
Andrey Konovalov         80  1.7%
Dan Carpenter            79  1.6%
Jann Horn                34  0.7%
Guenter Roeck            29  0.6%
Jianlin Shi              27  0.6%
Ben Hutchings            26  0.5%
Fengguang Wu             26  0.5%
Al Viro                  21  0.4%
Arnd Bergmann            19  0.4%
Lars-Peter Clausen       19  0.4%
Xu, Wen                  19  0.4%
Eric Biggers             18  0.4%
Igor Zhbanov             18  0.4%
TOTE Robot               18  0.4%
Tetsuo Handa             17  0.4%
Linus Torvalds           16  0.3%

Bug reporting is clearly widely distributed, with the top three reporters (all robots) accounting for just over 25% of the total. Even so, it is clear that the bug-hunting robots are finding a lot of problems, hopefully before our users do.
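
Counts like these can be gathered by scanning each commit message in the series for Reported-by tags; here is a minimal sketch (the regular expression is mine, and it deliberately ignores everything after the reporter's name):

    # Tally Reported-by tags across the 4.9.x stable series: a sketch.
    import re
    import subprocess
    from collections import Counter

    # One "git log" over the whole range; %x00 puts a NUL byte between
    # commit bodies so they can be split apart reliably.
    log = subprocess.run(['git', 'log', '--format=%B%x00', 'v4.9..v4.9.337'],
                         capture_output=True, text=True, check=True).stdout
    reported_re = re.compile(r'^Reported-by:\s*([^<\n]+)', re.MULTILINE)

    reporters = Counter()
    tagged_commits = 0
    for body in log.split('\x00'):
        names = [name.strip() for name in reported_re.findall(body)]
        if names:
            tagged_commits += 1
            reporters.update(names)

    print(f'{tagged_commits} commits carry a Reported-by tag')
    for name, count in reporters.most_common(20):
        print(f'{count:5}  {name}')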

Bug introduction

Another thing one can look at is the source of the bugs that were fixed in 4.9.x. Some work mapping Fixes tags in 4.9.x commits to the original commits can shine a light on when bugs were introduced; the result is a plot that looks like this:

[4.9.x fixes]
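
The mapping behind that plot can be approximated with "git describe --contains", which names the first tag that contains a given commit; here is a minimal sketch, with the helper name and the -rc folding being assumptions of mine:

    # Map the commit named in a Fixes: tag to the release that
    # introduced it: a sketch of the analysis described above.
    import re
    import subprocess

    FIXES_RE = re.compile(r'^Fixes:\s*([0-9a-f]{8,40})', re.MULTILINE)

    def introduced_in(sha):
        # "git describe --contains" yields something like "v4.8-rc1~37^2~5";
        # the part before the first '~' or '^' is the earliest tag that
        # contains the commit.
        out = subprocess.run(['git', 'describe', '--contains', sha],
                             capture_output=True, text=True)
        if out.returncode != 0:
            return None    # unknown or malformed commit reference
        tag = re.split(r'[~^]', out.stdout.strip())[0]
        # Fold -rc tags into the release whose merge window they followed.
        return re.sub(r'-rc\d+$', '', tag)

Running every Fixes tag in v4.9..v4.9.337 through introduced_in() and counting per release yields the distribution shown in the plot.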

The 4.9 and 4.8 releases are, unsurprisingly, the source of many of the bugs fixed in the stable updates, with nearly 700 coming from each. After that comes the usual long tail: every pre-4.10 release since the Git era began at 2.6.12 is represented here. The least-fixed release is 2.6.17, released in 2006, with "only" 22 fixes.

That plot is not the whole story, though. Each of the 4.9.28, 4.9.34, 4.9.51, 4.9.75, 4.9.77, 4.9.78, 4.9.79, 4.9.94, 4.9.102, 4.9.187, 4.9.194, 4.9.195, 4.9.198, 4.9.207, 4.9.214, 4.9.219, 4.9.228, 4.9.253, 4.9.258, 4.9.259, 4.9.261, 4.9.265, 4.9.298, and 4.9.299 releases included a commit that was identified by a later Fixes tag; 4.9.81 and 4.9.218 had two, and 4.9.310 had three. Each of those, clearly, indicates a regression that was introduced into the stable kernel and later fixed. But even that is not the full picture; consider this:

[post-4.9.x fixes]

Every release made after 4.9 also introduced bugs that had to be fixed in the stable updates — over 1,500 fixes in all. That is a lot of buggy commits to have been introduced into a "stable" kernel. One should also not take the wrong message from the lower counts for more recent kernel releases: it is possible that our releases are getting less buggy, but a more plausible explanation is that the empty space in the upper-right half of that plot simply represents bugs that have not yet been found and fixed.

The 4.9 stable series was, thus, not perfect — not that anybody ever claimed that it was. It was, however, good enough to be the core of many deployed systems, including an unimaginable number of Android devices. The 4.9 kernel series is a testament to what the development community can accomplish when it sets its mind to it. It was a base that many users could rely on, and has well earned its retirement.

Index entries for this article
Kernel: Releases/4.9



Six years with the 4.9 kernel

Posted Jan 13, 2023 8:52 UTC (Fri) by error27 (subscriber, #8346) [Link] (11 responses)

In olden days we used to measure how good a kernel was by comparing uptimes.

I've heard distro developers say that some kernels are definitely better than others. They say there is an element of chance or something and it just works out that some kernels end up being headaches.

Another thing that happens is that if people know a kernel is going to be a long-term release, or something that the distros are going to support for a long time, then they try to get all their features merged into it. Maybe the features are not quite up to ideal standards, but people really want to get them merged instead of waiting for the next distro release. So that could impact the quality.

If you look at the top bug reporters chart, there are 1804 bugs. At least 1081 of those were Syzkaller bugs. Syzbot was created at about the time that 4.9 was released, so there was a backlog of bugs. New kernels will not inherit those bugs, and Syzbot finds a lot of bugs in new code before the kernels are released to users.

Another 460 bug reports came from static analysis. (Although potentially some are double-counted if the kbuild-bot and I both reported the same bug?) That means that static checkers are discovering new types of bugs and more people are using static analysis on their code. The static-checker fixes combined with the Syzbot fixes mean 1523 bug fixes are the result of new tools or process improvements.

New subsystems are always buggy at first. In olden days wifi or power management used to be problematic but these days they have improved and matured. The hardware has also matured.

So new kernels should be better than old kernels. We often used to talk about doing a "Bug Fix Only" release, but these days no one talks about that, which also suggests people are happier.

Six years with the 4.9 kernel

Posted Jan 13, 2023 9:21 UTC (Fri) by taladar (subscriber, #68407) [Link] (10 responses)

Hopefully you mean that the kernels with high uptimes were bad because they got insufficient attention for security fixes?

We really should get rid of that idea that a high uptime is something good and to strive for. The longer the system is up, the more known issues exist with the kernel, and the less tested the boot process becomes (as in, it is less likely to boot if you are forced into starting it by something like a power outage or a hardware problem).

Six years with the 4.9 kernel

Posted Jan 13, 2023 9:28 UTC (Fri) by error27 (subscriber, #8346) [Link]

Correct. Olden times were simpler times. :)

I think it would still be a fun challenge to maintain long uptimes by using live patching to address security issues. (I have never live-patched a kernel myself, so I don't know exactly how fun this is...)

Six years with the 4.9 kernel

Posted Jan 13, 2023 16:41 UTC (Fri) by cesarb (subscriber, #6266) [Link] (8 responses)

> We really should get rid of that idea that a high uptime is something good and to strive for.

No, we should instead get rid of the idea that constant churn is something good and to strive for. Our systems should be stable and secure enough that we shouldn't have to worry about them missing security or stability fixes, or about them failing to boot after a power outage.

That is, having high uptime is a proxy for the system being stable and secure enough that it *can* have a high uptime without worry. And that's something we *should* strive for.

Six years with the 4.9 kernel

Posted Jan 13, 2023 17:22 UTC (Fri) by farnz (subscriber, #17727) [Link] (7 responses)

The capacity for a high uptime is good, but actually having one is bad in practice. Not because of security or stability fixes, but rather because a high uptime correlates with being unable to get the system back into working state after a hardware failure.

A host with a low uptime is a host where everything it does is configured on boot - because anything else is too much pain to administer. A host with a high uptime can be one where important configuration is not persisted to permanent storage, but instead is an in-memory consequence of a command run by an admin who's not currently around and who didn't document what they did sufficiently to allow it to be redone after the power outage (or whatever failure causes the system to reboot).

Six years with the 4.9 kernel

Posted Jan 15, 2023 15:58 UTC (Sun) by ballombe (subscriber, #9523) [Link] (6 responses)

Except that low uptime increases the risk of hardware failure.

Six years with the 4.9 kernel

Posted Jan 15, 2023 17:38 UTC (Sun) by farnz (subscriber, #17727) [Link] (4 responses)

I don't see how a small increase in the risk of hardware failure due to a software restart (which, yes, puts a small amount of extra strain on the hardware) is an argument against being ready for an outage.

The point is that a high-uptime host is more likely to have a change that's critical to operation but not applied on host start-up. If something happens to cause the host to reboot (a mains power outage, perhaps), then someone has to find all of those changes. And the longer the uptime, the more such changes you have to track down.

If you can restore from backup and recover with not more than a day's work lost, that's a lot better than being unable to restore for a week because the person who knows what changes need to be made to get from booted to operational is out of contact. And if the person you need is no longer available at all (left company, or worse), you're in serious trouble.

It's a corollary of the observation that the more often you do something, the more likely you are to be able to do it successfully when you have to do it on an unscheduled basis.

Six years with the 4.9 kernel

Posted Jan 15, 2023 22:02 UTC (Sun) by ballombe (subscriber, #9523) [Link] (3 responses)

It depends on your goal.
If your goal is service availability you might be right,
but if the goal is computational throughput (as in HPC), then long uptimes are better.

Six years with the 4.9 kernel

Posted Jan 16, 2023 10:48 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

Even if your goal is computational throughput, long uptime can be worse because the resulting downtime after an adverse event (such as an electricity grid failure coincident with a UPS failure) is much longer than the downtime due to a single host failing during reboot, but then being trivial to replace.

The tradeoff you're making when you go for long uptimes is between having change management discipline enforced by reboots, having to have some other mechanism to enforce change management discipline, or having unpredictable long downtime where returning a host to operational state after downtime is a skilled process involving sysadmin interaction, rather than something you can delegate to "remote hands" and automation.

Key to this is that what actually matters is change management - regular reboots are merely one way to enforce good change management discipline, and if you're able to enforce it another way, you can have your long uptimes. Not having good change management is what causes unpredictably long downtimes - because a host's required state to become operational may be kept in RAM and one sysadmin's head, and lacking that sysadmin means you have to reverse-engineer the changes they made outside change management.

Six years with the 4.9 kernel

Posted Jan 16, 2023 21:55 UTC (Mon) by neggles (subscriber, #153254) [Link] (1 responses)

You're assuming that configuration changes are being made, and that it's even possible to make them, which is very much not the case for HPC-type systems; those nodes are almost always generic, non-persistent, network-booted machines that you *can't* log into, with their configuration pushed from a central management system, and they're generally isolated from the internet (thus mitigating the vast majority of security issues). High uptime is not inherently bad across all systems.

Six years with the 4.9 kernel

Posted Jan 17, 2023 9:34 UTC (Tue) by farnz (subscriber, #17727) [Link]

If it's impossible to make configuration changes, then it's impossible to get a new workload onto the system, or even new data to process. This surprises me - while I don't work in HPC, I was under the impression that HPC systems get new workloads and data provided on a fairly regular basis after construction.

Given this, regular reboots (which imply short uptimes) are one way to force change management to be done properly - because configuration that's not persisted is lost every time the system reboots. It's not the only way; if the system administrators are suitably disciplined, and the only way for non-administrators to get access is to submit jobs via tooling (e.g. running qsub) that takes care of ensuring that changes do not persist between jobs, then you've got change management that way, and don't need regular reboots.

Six years with the 4.9 kernel

Posted Jan 16, 2023 9:14 UTC (Mon) by jtaylor (subscriber, #91739) [Link]

Is there any evidence that this is really the case?

At work I maintain ~1000 hardware servers that are restarted weekly and a couple hundred that are not; we see no difference in hardware failure.

Privately, my home PCs are rebooted several times a day, and they have all lasted ~5-7 years before being replaced for being outdated, not for having failed.

