The intersection of modules, GKI, and rocket science

By Jonathan Corbet
October 11, 2021
One does not normally expect a lot of controversy around a patch series that makes changes to platform-specific configurations and drivers. The furor over some work on the Samsung Exynos platform may thus be surprising. When one looks into the discussion, things become clearer; it mostly has to do with disagreements over the best ways to get hardware vendors to cooperate with the kernel development community.

In mid-September, Will McVicker posted a brief series of changes for the Exynos configuration files; one week later, a larger, updated series followed. The purpose in both cases was to change a number of low-level system-on-chip (SoC) drivers to allow them to be built as loadable modules. That would seem like an uncontroversial change; it is normally expected that device drivers will be modular. But the situation is a little different for these low-level SoC drivers.

Generic kernels and essential drivers

In the distant past, kernels for Arm-based SoCs were specific to the target platform. While x86 kernels normally run on all x86 processors, Arm kernels were built for a small range of target platforms and would not boot on anything else. Over the years, the Arm developers have worked to make the 64-bit Arm kernel sufficiently generic that a single binary image can boot on a wide range of platforms. To a great extent, this portability has been achieved by building drivers as modules so that a kernel running on a given device need only load the drivers that are relevant to that device.

There is a bootstrapping problem to solve, though; before the kernel can load a single module, it must be able to boot to a point where it has the platform in a known, stable state and is able to mount a RAM-based root filesystem. That can only happen if the drivers needed to boot that far are built into the kernel itself. Thus, the generic kernel contains a long list of platform-specific drivers to configure clocks, pin controllers, and more; without them, the kernel would never boot. The maintainers' policy has long stated that any drivers which are essential for the boot process must be built into the kernel itself.
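
The bootstrapping constraint described above can be sketched as a small dependency walk. This is only a toy model, not anything taken from the kernel or from the patch set: the driver names, the dependency table, and the choice of what counts as boot-critical are all invented for illustration. The idea is that anything needed, directly or indirectly, before the RAM-based root filesystem is mounted has to be built in, and everything else is a candidate for a module.

```python
# Toy model of the bootstrapping problem. All driver names, dependencies,
# and the boot-critical set are hypothetical; real platforms differ.

def boot_essential(deps, needed_for_initramfs):
    """Return every driver reachable from the boot-critical set via dependencies."""
    essential = set()
    stack = list(needed_for_initramfs)
    while stack:
        drv = stack.pop()
        if drv in essential:
            continue
        essential.add(drv)
        # A driver's own dependencies must be available before it can probe.
        stack.extend(deps.get(drv, ()))
    return essential

# Hypothetical dependency table: driver -> drivers it needs.
DEPS = {
    "console_uart": ["clk", "pinctrl"],
    "mmc": ["clk", "pinctrl", "regulator"],
    "wifi": ["mmc"],
    "rtc": ["clk"],
}

# Assumption: this imaginary board only needs a console to reach the initramfs.
BOOT_CRITICAL = ["console_uart"]

all_drivers = set(DEPS) | {d for needs in DEPS.values() for d in needs}
built_in = boot_essential(DEPS, BOOT_CRITICAL)
modular = all_drivers - built_in

print("must be built in:", sorted(built_in))
print("may be modules:  ", sorted(modular))
```

In this made-up table the clock and pin-controller drivers end up built in because the console depends on them, which mirrors why such low-level drivers have traditionally been treated as essential.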

McVicker's patch set takes a number of these essential drivers and, for reasons to be discussed in detail below, removes them from the kernel image, making them into loadable modules instead. Ostensibly, this change does not violate the policy for these drivers, but only if it can be demonstrated that the drivers are, in fact, not essential for the kernel to boot on the affected hardware. Therein lies the first big problem for this patch set: McVicker made it clear in the cover letter that these changes had not actually been tested on the appropriate hardware. While he is optimistic that the systems should still boot with modular drivers, nobody has yet proved that.

Optimism only gets one so far in the kernel community. This lack of testing has caused Exynos platform maintainer Krzysztof Kozlowski to repeatedly push back on the patches; until he is sure that all Exynos systems can boot with those drivers as modules, he is unwilling to take the changes. He has offered to help with some of the needed testing. Meanwhile, Arnd Bergmann has backed up Kozlowski's reticence:

The "correctness-first" principle is not up for negotiation. If you are uncomfortable with the code or the amount of testing because you think it breaks something, you should reject the patches. Moving core platform functionality is fundamentally hard and it can go wrong in all possible ways where it used to work by accident because the init order was fixed.

The lack of testing seems like the kind of problem that should be amenable to a solution, though reaching the needed level of confidence may take a while. Some systems running a given SoC may boot without a specific clock driver (say) because the firmware initializes the clocks to a reasonable configuration at power-on. Counting on all firmware to have its act together in this way is risky, however. Even so, this testing, which should have been done before the patches were ever submitted, can still be filled in after the fact.
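
The testing question can be framed as a per-board check. The sketch below is a hypothetical illustration with invented board data, not a description of any real Exynos system: a driver is only safe to modularize if every board either does not need it before modules can load or has firmware that leaves the corresponding hardware in a usable state.

```python
# Hypothetical boards sharing one SoC; all data here is invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Board:
    name: str
    needs_before_initramfs: frozenset  # drivers required before modules can load
    firmware_configures: frozenset     # hardware the firmware leaves usable at power-on

def boards_blocking_modularization(driver, boards):
    """Names of boards that would fail to boot if `driver` were a module."""
    return [
        b.name
        for b in boards
        if driver in b.needs_before_initramfs
        and driver not in b.firmware_configures
    ]

BOARDS = [
    # Firmware sets up the clocks, so a modular clock driver is tolerable here.
    Board("board-a", frozenset({"clk"}), frozenset({"clk"})),
    # Firmware does nothing helpful; both drivers are needed early.
    Board("board-b", frozenset({"clk", "pinctrl"}), frozenset()),
    # Needs neither driver before the initramfs is mounted.
    Board("board-c", frozenset(), frozenset()),
]

blockers = boards_blocking_modularization("clk", BOARDS)
print("boards that block a modular clk driver:", blockers)
```

A single blocking board is enough to make the change unsafe for a generic kernel, which is why confidence has to come from testing across the whole range of affected hardware rather than from one system that happens to boot.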

Out-of-tree code

There is still the question of why one would want to make this possibly risky change. The obvious benefit is making the core kernel image smaller; this is especially appreciated on all of the platforms that don't use the drivers in question and thus see them as dead weight. But there is another motivation here that relates to a different kernel, also called "generic".

The kernels shipped on Android devices have notoriously contained vast amounts of out-of-tree code, to the point that such code sometimes outweighs the mainline code that is in use on the device. This has led to problems throughout the ecosystem, including a lack of cooperation with upstream kernel developers, the fragmentation of the Android kernel space, the inability to apply security updates when the vendor inevitably stops doing so, and the cost of maintaining all of those kernels. To address this problem, Google has been pushing vendors of Android-based devices toward its "generic kernel image" (GKI), which is a core kernel that must be shipped by all Android devices. Vendors are able to supply their own modules to load into that kernel, but they cannot replace the kernel itself.

This policy brings a number of benefits. Vendor changes are restricted to what can be done with the module API, and Google has been pushing to restrict that somewhat as well. The days of vendors replacing the CPU scheduler should be done now. Vendors, naturally, chafe at these restrictions, but they have little alternative to compliance. If they choose to run their own system, even if it is an Android fork, they lose access to many of the Google apps and services that make Android useful for their customers.

Code that is built into the GKI kernel thus cannot be changed by device vendors. Code that is loaded from modules, instead, can be shoved aside and replaced. Viewed in this light, the desire to modularize built-in drivers becomes rather easier to understand. Even so, there are two different aspects of this situation that are worth examining. One is that vendors want to hide their secret magic from competitors by shipping out-of-tree modules on their devices rather than upstreaming their drivers. As Lee Jones described it:

In order for vendors to work more closely with upstream, they need the ability to over-ride a *few* drivers to supplement them with some functionality which they believe provides them with a competitive edge (I think you called this "value-add" before) prior to the release of a device. This is a requirement that cannot be worked around.

As one might imagine, this position is seen as less than fully compelling by much of the kernel development community. It is also not entirely convincing; as Tomasz Figa put it:

Generally, the subsystems being mentioned here are so basic (clock, pinctrl, rtc), that I really can't imagine what kind of rocket science one might want to hide for competitive reasons.

Jones tried to de-emphasize this point of discussion later on, but it was a bit late; he had said (part of) the quiet part out loud.

The other piece of the puzzle is simpler to understand. Even if a set of clock drivers contains no real secrets of interest, the vendor may simply lack the desire to make the effort to get the drivers upstream. It is possible to get out-of-tree drivers built into the GKI, but Google would clearly rather not deal with that anymore, so there is a continual pressure to get drivers into the mainline. If the drivers can be supplied directly by the vendor as a module, instead, they disappear from the GKI and that pressure vanishes. With regard to the Exynos changes, a lack of desire to work upstream seems like a plausible explanation; as Kozlowski pointed out in the above-linked message, Samsung has only contributed a single change to the Exynos subsystem since 2017.

Jones has tried to characterize vendors' upstream reticence as temporary, saying "vendors are not able to upstream all functionality right away". Later, though, he said:

But [they have] no incentive to upstream code [for] old (dead) platforms that they no longer make money from. We're not talking about kind-hearted individuals here. These are business entities.

If neither new nor old code can be upstreamed, then it would appear that mainline support for these platforms is at a dead end.

Better or worse?

The natural reaction for many kernel developers is to make life harder for vendors that are seemingly looking for ways to avoid engaging with the development community. That would include rejecting the patch set under consideration here. Olof Johansson's opinion was:

This patchset shouldn't go in.

GKI is a fantastic effort, since it finally seems like Google has the backbone to put pressure on the vendors to upstream all their stuff.

This patch set dilutes and undermines all of that by opening up a truck-size loophole, reducing the impact of GKI, and overall removes leverage to get vendors to do the right thing.

McVicker, instead, argued that modularizing these drivers is a way to bring vendors closer to upstream and will improve the situation overall:

We believe that if we make it easier for SoC vendors to directly use the upstream kernel during bring-up and during the development stages of their project, then that will decrease the friction of working with upstream (less downstream changes) and increase the upstream contributions.

Which of these positions is closer to the truth is hard to say; each may hold water with respect to some vendors while falling down with others. Getting vendors to engage with upstream is a constant process requiring judicious use of both carrots and sticks.

That said, the outcome of this particular discussion is not in a great deal of doubt. Making life easier for uncooperative vendors is usually not, on its own, sufficient reason to keep a patch set out of the kernel. Bergmann described it well in the above-linked message:

I understand that it would be convenient for SoC vendors to never have to upstream their platform code again, and that Android would benefit from this in the short run.

From my upstream perspective, this is absolutely a non-goal. If it becomes easier as a side-effect of making the kernel more modular, that's fine.

So, in a sense, much of the discussion was irrelevant; if the patches can be shown to work properly (which has not yet happened), then they are consistent with many of the community's long-term goals and will likely find their way into the mainline sooner or later. Whether that will encourage vendors to work upstream or, instead, make it easier for them to stay away remains to be seen. But problems with uncooperative vendors have existed for as long as the Linux kernel has; they will not go away regardless of what happens here.


The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 16:16 UTC (Mon) by developer122 (guest, #152928) [Link] (1 responses)

>We believe that if we make it easier for SoC vendors to directly use the upstream kernel during bring-up and during the development stages of their project, then that will decrease the friction of working with upstream (less downstream changes) and increase the upstream contributions.

Let's take a step back here. Why does this make it easier to work with an upstream kernel during development? Why would this compel them to work with and contribute to upstream? Neither of these hold water to me at all.

These are open source kernels. They're free to modify them however they wish during development of bring-up code or of the device as a whole. The rest of the kernel community writes tons of upstreamable drivers and other code without first needing to prototype it in a separate kernel module...

...which these vendors could get away with never upstreaming. It would indeed reduce friction if they never had to interact with upstream ever again. But this doesn't lead to fewer downstream changes and provides NO incentive to upstream anything.

I'm not even sure I buy the argument of "well it makes it easier to work with these kernels, which happen to be mainline(ish), and so if we work with them enough we'll eventually see the light and be compelled by the spirit of Torvalds to contribute." This sounds exactly like an argument crafted to be pleasant to upstream developers. But it's rubbish. They have to work with these google-provided kernels whether they like it or not.

This also isn't a case of "how I feel about you will affect my behavior elsewhere". Businesses are psychopaths, not toddlers. The sticks and carrots work because companies weigh costs and benefits, not because they hold grudges or get cozy with people. This request is simply to have a cost removed, in the same way a polluter may request an exemption from the EPA because cleanup is expensive. I don't believe for a moment that allowing a corporation to upstream less will somehow create an emotional attachment that drives them to upstream more.

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 20:13 UTC (Mon) by IanKelling (subscriber, #89418) [Link]

Right. The quote is like: if you delete all that code we don't want to work with you on, then it will be easier to work with you.

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 16:41 UTC (Mon) by developer122 (guest, #152928) [Link] (9 responses)

>Jones has tried to characterize vendors' upstream reticence as temporary, saying "vendors are not able to upstream all functionality right away". Later, though, he said:
>>But [they have] no incentive to upstream code [for] old (dead) platforms that they no longer make money from. We're not talking about kind-hearted individuals here. These are business entities.
>If neither new or old code can be upstreamed, then it would appear that mainline support for these platforms is at a dead end.

Code doesn't need to be prototyped in a kernel module during development. After launch there's no reason not to leave modules in place.

Vendors always have a reason *not* to upstream code. It requires effort. That may be effort they can't spare during launch, or effort they don't want to spend on old products. The excuse will change with the situation, but the ironclad rule of accounting remains: the easiest way to increase profit is to reduce cost.

There is no point in a product's lifecycle where a vendor will naturally feel compelled to upstream anything because the action generates no revenue. The most open-source friendly companies merely use upstreaming as a tool to reduce *future* costs, but clearly this isn't a consideration to android vendors.

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 20:27 UTC (Mon) by IanKelling (subscriber, #89418) [Link] (1 responses)

> Vendors always have a reason *not* to upstream code. It requires effort. That may be effort they can't spare during launch, or effort they don't want to spend on old products.

Not a convincing argument. Everything requires effort, thus I'm justified in doing nothing. First I'm doing this, I can't bother with that. And after that, I'm doing something else, so I can't bother then either. \end sarcasm

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 2:13 UTC (Tue) by developer122 (guest, #152928) [Link]

Read my comment carefully. I'm not arguing in favor of it. I'm admonishing the fact that they always will have an excuse.

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 20:32 UTC (Mon) by IanKelling (subscriber, #89418) [Link]

> but clearly this isn't a consideration to android vendors.

That seems too broad a characterization.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 7:18 UTC (Tue) by nilsmeyer (guest, #122604) [Link] (4 responses)

> Vendors always have a reason *not* to upstream code. It requires effort. That may be effort they can't spare during launch, or effort they don't want to spend on old products. The excuse will change with the situation, but the ironclad rule of accounting remains: the easiest way to increase profit is to reduce cost.

There's another disincentive: If older hardware is supported for a longer time then users have less incentive to upgrade to newer devices.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 8:10 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (1 responses)

I seriously doubt this is an actual consideration when refusing to upstream.

Institutional inertia (we never had to do it therefore we will never do it), differences in corporate culture (people used to contribute to Linux because it was an escape from corporate unix fossilization; those folks OTOH consider upstreaming as an additional obligation on top of their internal corporate politics) weigh a lot more.

And all the technical debt produced by years of bad upstreaming practices means there is not even a fun aspect to the whole thing, it starts with clearing the debt, that both devs and their management will be reluctant to invest in.

Basic point: corporate devs don’t want to clean up the mess, and their management sees no benefit in doing so and will only weigh in favor of it in the presence of external pressure like GKI.

The years cloud giants spent in promoting open source (as opposed to free software) do not help either, Google is being hurt by the consequences of its own selfish advocacy.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 12:00 UTC (Tue) by smurf (subscriber, #17840) [Link]

… and they show no intent of stopping that practice any time soon.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 20:35 UTC (Tue) by marcH (subscriber, #57642) [Link] (1 responses)

> There's another disincentive: If older hardware is supported for a longer time then users have less incentive to upgrade to newer devices.

This is the so-called "planned obsolescence" but it's been very hard to demonstrate when simpler explanations abound. Cutting corners and not backporting security fixes help the bottom line *directly*, really no need for more complex and very indirect rationales.

Planned obsolescence makes absolutely zero sense in a competitive environment: in that case you're just hurting your brand and helping your competition.

I really wish the time and money spent lobbying against this very elusive "planned obsolescence" were all spent on Rights to Repair instead. The latter is a real problem with real impact (and is incidentally related to open-source, unlike the first one)

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 22:19 UTC (Tue) by developer122 (guest, #152928) [Link]

Comprehensive Right to Repair (including the unrestricted right to modify devices not just "restore them to working order") would do a lot to tackle planned obsolescence. Not all of it, that's for sure, but it's much easier to codify into laws and standards.

The intersection of modules, GKI, and rocket science

Posted Oct 13, 2021 16:26 UTC (Wed) by marcH (subscriber, #57642) [Link]

> Code doesn't need to be prototyped in a kernel module during development.

I guess it doesn't apply in this particular context but in general many driver changes don't require a full reboot.

> After launch there's no reason not to leave modules in place.

You must be a hardware engineer if you think there's no development after launch ;-)

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 19:06 UTC (Mon) by willy (subscriber, #9762) [Link] (2 responses)

Where's the rocket science part?! I was expecting to see something about a NewSpace launch company using Exynos in their engine controllers, or Ingenuity being based on an Exynos, or something.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 15:00 UTC (Tue) by immibis (subscriber, #105511) [Link] (1 responses)

The rocket science is whatever these companies are doing with their clocks and pin controllers that makes them want to prevent everyone else seeing how their clocks and pin controllers work :)

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 21:45 UTC (Tue) by Wol (subscriber, #4433) [Link]

The reality is those companies probably don't have a clue as to the provenance and legality of all that code, and therefore would rather it was not exposed to scrutiny.

Personally, I suspect they would be much better off if they just push-mirrored their development tree to github or the like, as that would discourage employees from "nicking" code they shouldn't, satisfy any GPL requirement, and encourage motivated Open Source developers to muck in and help/advise.

Surely it can't be THAT hard to devise a minimal-friction set-up that encourages source to be pushed "over-the-wall".

Cheers,
Wol

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 21:16 UTC (Mon) by tsoni.lwn (subscriber, #139617) [Link] (6 responses)

"The days of vendors replacing the CPU scheduler should be done now."

This is not a right understanding w.r.t GKI.

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 21:20 UTC (Mon) by corbet (editor, #1) [Link] (5 responses)

Care to explain what "a right understanding" would be, then?

Thank you.

The intersection of modules, GKI, and rocket science

Posted Oct 11, 2021 21:42 UTC (Mon) by tsoni.lwn (subscriber, #139617) [Link] (3 responses)

Hi,

GKI provides vendor tracehooks (including restricted tracehooks) that a vendor module can attach to. From that module, a vendor can override most of the paths of the upstream scheduler, or add its own value-adds that change the scheduler's behavior to improve power and performance.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 4:28 UTC (Tue) by Paf (subscriber, #91811) [Link] (2 responses)

So… it lets them replace most of the scheduler? That really sounds like what John said.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 16:31 UTC (Tue) by tsoni.lwn (subscriber, #139617) [Link] (1 responses)

The article suggests that you can't replace CPU scheduler behavior under GKI, and I tried to clarify that this is not true. One can still change the CPU scheduler's behavior, or replace it entirely, even with GKI.

The intersection of modules, GKI, and rocket science

Posted Oct 13, 2021 18:53 UTC (Wed) by ballombe (subscriber, #9523) [Link]

So GKI is a kind of loophole that allows changing the kernel without patching it while providing a semi-stable interface? Good to know...

The intersection of modules, GKI, and rocket science

Posted Oct 13, 2021 21:08 UTC (Wed) by karim (subscriber, #114) [Link]

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 2:59 UTC (Tue) by jhoblitt (subscriber, #77733) [Link] (3 responses)

I suspect that Google may have decided that hardware vendors, including Qualcomm, are likely to continue to be resistant to working with upstream well into the future. So much so that the only path forward for being able to ship major Android updates is to take matters into their own hands with "Tensor", even if it is more or less an Exynos SoC. I wonder if this will be a demonstration along the lines of "look, it is reasonable for you to be required to ship major OS updates for 5 years" or if Google is intending to start providing SoC chips or IP to phone manufacturers.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 16:37 UTC (Tue) by tsoni.lwn (subscriber, #139617) [Link] (2 responses)

As discussed in the email thread on the kernel mailing list, upstreaming entire SoC drivers at kernel.org is a pipe dream, particularly for mobile chipsets. There is a huge churn in SoCs due to the different market segments and the speed at which they get replaced or superseded. Please also note that hardware blocks do get changed based on market requirements, and it is hard to keep up with upstreaming everything. I see that multiple SoC vendors are trying to upstream their drivers, either through Linaro or some other external entity or through internal engineers. It is in the best interest of the community to develop the missing frameworks and drivers so that downstream users don't fork or come up with frameworks that make their upstreaming effort even harder. Please note that many, many frameworks and ideas came into kernel.org over the years due to the adoption of ARM and other architectures in embedded systems, and that has not happened without the engagement of SoC vendors. If you look at the Google Chrome laptop, most (if not all) drivers and low-level code are upstream.

The intersection of modules, GKI, and rocket science

Posted Oct 13, 2021 0:17 UTC (Wed) by pwsan (subscriber, #56604) [Link] (1 responses)

> As discussed in the email thread on the kernel mailing list, upstreaming entire SoC drivers at kernel.org is a pipe dream.

It's been done before. There's simply little incentive to do so as long as it's not a business requirement for the hardware company to do it.

It seems implausible that the GKI work is about encouraging hardware vendors to upstream their code, since as others on the lists and in these comments have already observed, GKI actually appears to create an incentive not to work with upstream. The most likely motivations are the top three "updates and upgrades" arguments mentioned at https://source.android.com/devices/architecture/kernel/ge... : "Security updates are labor intensive", "Difficult to merge Long-Term Supported updates", and "Inhibits Android platform release upgrades"

The intersection of modules, GKI, and rocket science

Posted Oct 20, 2021 23:36 UTC (Wed) by florianfainelli (subscriber, #61952) [Link]

> It's been done before. There's simply little incentive to do so as long as it's not a business requirement for the hardware company to do it.

If you are referring to the TI platforms, we can argue that not everything landed upstream, although it's definitely the example of how to do it. This was a few years ago when there was also less complexity to deal with.

> It seems implausible that the GKI work is about encouraging hardware vendors to upstream their code, since as others on the lists and in these comments have already observed, GKI actually appears to create an incentive not to work with upstream. The most likely motivations are the top three "updates and upgrades" arguments mentioned at https://source.android.com/devices/architecture/kernel/ge... : "Security updates are labor intensive", "Difficult to merge Long-Term Supported updates", and "Inhibits Android platform release upgrades"

Well, it encourages the vendors to get their changes upstream in that, if they cannot manage to put it in an out-of-tree/loadable module, then they need to get the symbols/services/functionality into the core kernel image, and because kernel developers won't accept a new feature/export/functionality without at least one user of it, this means vendors will have to submit their code along with it.

The intersection of modules, GKI, and rocket science

Posted Oct 12, 2021 20:55 UTC (Tue) by marcH (subscriber, #57642) [Link]

> There is still the question of why one would want to make this possibly risky change. The obvious benefit is making the core kernel image smaller; this is especially appreciated on all of the platforms that don't use the drivers in question and thus see them as dead weight.

So, long story short, someone had both a perfectly good, technical rationale and another, much less popular rationale for the same code change but they "forgot" to test the good rationale before sending the patches. That's funny, you could make a mailing-list show out of it. Maybe an LWN article even! Oh, wait...

On the clocking topic specifically

Posted Oct 20, 2021 23:49 UTC (Wed) by florianfainelli (subscriber, #61952) [Link]

TBH, SoC vendors that desire to hide their gory details should embrace ARM's SCMI or a similar firmware-based interface, preferably something for which drivers already exist in the client OS, but maybe you have good reasons to invent your own. Then it becomes largely OS agnostic (if you happen to care), and the underlying firmware implementation can remain proprietary, thus pleasing your lawyers.

Sure, it's an abstraction that you may not like, and sure, you may prefer to maintain your bare-metal driver, but from a kernel distribution perspective there is a single clock driver, a single cpufreq driver, a single hwmon driver, a single power domain that you need to maintain moving forward. Or more realistically there are fewer than 20 or so different clock drivers, but maybe 5 or 6 at most.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds