The effect of Meltdown and Spectre in our communities
A late-breaking development in the computing world led to a somewhat hastily arranged panel discussion at this year's linux.conf.au in Sydney. The embargo for the Meltdown and Spectre vulnerabilities broke on January 4; three weeks later, Jonathan Corbet convened representatives from five separate parts of our community, from cloud to kernel to the BSDs and beyond. As Corbet noted in the opening, the panel itself was organized much like the response to the vulnerabilities themselves, which is why it didn't even make it onto the conference schedule until a few hours before it started.
Introductions
Corbet is, of course, the executive editor here at LWN and has been "writing a lot of wrong stuff on the internet" about the vulnerabilities. He came up with the idea of a panel to discuss how the episode had impacted the free-software community and how our community responded to it, rather than to recount the "gory technical details". The panelists were Jessie Frazelle, Benno Rice, Kees Cook, Katie McLaughlin, and Andrew "bunnie" Huang, who each introduced themselves in turn.
Frazelle works at Microsoft and is on the security team for Kubernetes; she said she "has done a few things with containers", which elicited some laughter. Rice is a FreeBSD developer and core team member who works for iXsystems, which develops FreeNAS. Cook works for Google on upstream kernel security; he found out about Meltdown and Spectre before the embargo date, so he has some knowledge and insight about that part of the puzzle. McLaughlin is at Divio, which is a Django cloud provider; her company did not find out about the vulnerabilities until the embargo broke, so she has been "having fun" since then. Huang makes and breaks hardware, he said, which may give him some perspective on "the inside guts".
Corbet started by asking each panelist to recount their experiences in responding to the vulnerabilities and to comment on how it impacted their particular niche. There was a lot of confusion, Frazelle said, which led to questions about "whether containers would save you from a hardware bug—the answer is 'no'". Fixing this process moving forward would obviously involve "not having an absolute shitshow of an embargo", she said.
Rice said that the BSD community's main gripe is that it didn't get notified of the problem until very late in the embargo period despite having relationships with many of the vendors involved. The BSD world found out roughly eleven days before the embargo broke (which would put it right at Christmas); he didn't think that was due to any "malice or nastiness on anyone's part". He hopes that this is something that only happens once in a lifetime, though that statement caused a chuckle around the room—and from Rice himself. It is, he hopes, a chance to learn from the mistakes made this time, so that it can be done better in the future.
Even within Google (which discovered the vulnerabilities), knowledge about Meltdown and Spectre was pretty well contained, Cook said. Once he knew, he had to be careful who he talked with about it; those that needed to know had to go through a "somewhat difficult" approval process before they could be informed. He tried to make sure that the kernel maintainers were among the informed group around the time of the Kernel Summit in late October, the x86 maintainers in particular; once that happened, the process of working around the problem in the kernel seemed to accelerate.
As a platform provider, figuring out the right response was tricky, McLaughlin said, but her company's clients have been understanding. Divio does not host "really human-life-critical systems", she said, but that does not apply to everyone out there who has to respond. She and her colleagues were blindsided by the disclosure and things are not "100% back to where they were" on their platform.
Huang said that he feels for the engineers who work on these chips, as there is an arms race in the industry to get more and more performance; the feature that was exploited in this case is one that provides a large performance gain. It will be interesting to see how things play out in the legal arena, he said; there are already class-action lawsuits against Intel over this, and previous hardware bugs led to rather large settlements ($400 million for the Pentium FDIV bug, for example). He is concerned that lawsuits will cause hardware makers to get even more paranoid about sharing chip information; they will worry that more information equates to worse legal outcomes (and settlements).
Embargoes
Corbet then turned to the embargo process itself, wondering if it made things worse in this case, as has been asserted in various places. It is clear that there are haves and have-nots, with the latter completely left out in the cold without any kind of response ready even if the embargo had gone its full length.
Rice believes that embargoes do serve a purpose, but that this situation led to a lot of panic so that some were not informed because the hardware vendors were worried about premature disclosure. There is "a lovely bit of irony" that people worked out what was going on based on the merging of performance-degrading patches to Linux without Linus Torvalds's usual steadfast blocking of those kinds of patches. Without some kind of body that coordinates these kinds of embargoes (and disclosures), though, it is difficult to have any kind of consistency in how they are handled.
The embargo has been described as a "complete disaster", Cook said, but the real problems were with the notification piece; the embargo itself was relatively successful. The embargo only broke six days early after six months of being held under wraps. The parts that could be worked on in the open were, though that did lead to people realizing that something was going on: "Wow, Linus is happy with this, something is terribly wrong". The fact that it broke early made for a scramble, but he wondered what would have happened if it had broken in July; it would have been months before there were any fixes.
It is quite surprising that the embargo held for as long as it did without people at least starting rumors about the problems, Frazelle said. But it is clear that the small cloud providers were given no notice even though they purchase plenty of hardware from Intel. There is an "exclusive club" and the lines on who is in and out of the club are not at all clear. Corbet said he would hate to see a world where only the largest cloud providers get this kind of information, since it represents a competitive disadvantage for those who are not in the club. Rice noted that community projects are in an even worse position, since they don't directly have the vendor relationships that might allow them to get early notice of these hardware flaws.
An audience member directed a question at McLaughlin: how could the disclosure process have been handled better for smaller cloud providers like the one she works for? She said that her company is running on an operating system that still did not have all the fixes, but that once the embargo broke, the company was able to "flip some switches" to disable access to many things in its systems. The sites were still up, but in a read-only mode and "no one died". She is not sure that can really scale; other providers had large enough teams to build their own kernels, but many were left waiting for the upstream kernel (or their distribution's kernel) to incorporate fixes. It is a difficult problem, she said, and she doesn't really have a good answer on how to fix it.
Open hardware
Another audience question on alternatives to Intel got redirected into a question on open hardware. Corbet asked Huang if open hardware was "a path to salvation", but Huang was not optimistic on that front. "I have an opinion on that", he said, to laughter. All of the elements that make up Meltdown and Spectre (speculative execution and timing side channels) were already known, he said, and open hardware is not immune. He quoted Seymour Cray: "memory is like an orgasm, it's better when you don't have to fake it". Whenever we fake memory to get better performance, it opens up side channels. Open hardware has not changed the laws of physics, so it is just as vulnerable to these kinds of problems.
On the other hand, there is a class of hardware problems where being able to review the processor designs will help, he said. These problems have not been disclosed yet, but he seemed to indicate they will be coming to light. Being able to look at all of the bits of the registers and what they control, as well as various debug features and other things that are typically hidden inside the processor design, will make those kinds of bugs easier to find.
Corbet asked about the possibility of reproducible builds in the open hardware world, but Huang said that is a difficult problem to solve too. There are doping attacks that can change a chip in ways that cannot be detected even with visual inspection. So even being able to show that a particular design is embodied in the silicon is no real defense. We are building chips with gates a few atoms wide at this point, which makes for an "extremely hard problem". In software, we have the advantage that we can build our tools from source code, but we cannot build chip fabrication facilities (fabs) from source; until we can build fabs from source, we are not going to have that same level of transparency, he said.
Speculative execution has been warned about as an area of concern for a long time, an audience member said, but those warnings were either not understood or were ignored. There are likely warnings about other things today that are likewise going unheeded; how does the industry ensure that we don't ignore warnings of risky features going forward? Corbet noted that writing in C or taking input into web forms have been deemed risky; sometimes we have to do risky things to make any progress.
We probably should be paying more attention to the more paranoid among us, Cook said. Timing side channels have been around for a long time, but there were no practical attacks, so they were often ignored. Waiting for a practical attack may mean that someone malicious comes up with one; instead we should look at ways to mitigate these "impractical" problems if there is a way to do so with an acceptable impact on the code. Huang said that as a hardware engineer, he finds it "insane that you guys think you can run secrets with non-secrets on the same piece of hardware"; that was met with a round of applause.
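As a small illustration of the kind of low-cost mitigation Cook describes, the sketch below contrasts an early-exit byte comparison, whose running time depends on how many leading bytes match, with the standard library's `hmac.compare_digest`. It was not shown at the panel; the function names `leaky_equals` and `constant_time_equals` are hypothetical, and it covers only a simple software timing channel, not the speculative-execution attacks themselves.

```python
import hmac


def leaky_equals(secret: bytes, guess: bytes) -> bool:
    """Early-exit comparison (hypothetical example of what to avoid).

    The loop returns at the first mismatching byte, so the time taken
    reveals how long the matching prefix is.
    """
    if len(secret) != len(guess):
        return False
    for a, b in zip(secret, guess):
        if a != b:
            return False
    return True


def constant_time_equals(secret: bytes, guess: bytes) -> bool:
    """Comparison designed not to leak the position of the first mismatch.

    hmac.compare_digest is the standard-library helper intended for
    comparing secrets such as tokens or MACs.
    """
    return hmac.compare_digest(secret, guess)
```

Both functions return the same answers; the difference is only in how much the timing reveals, which is why this sort of change usually has an acceptable impact on the code.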
A question from Twitter was next: how should the recipients of early disclosure of these kinds of bugs be chosen? Corbet said that there is an existing process within the community for disclosure; even though it works reasonably well, it was not followed this time for unclear reasons. But Cook said that process is for software vulnerabilities, not for hardware bugs, which have not happened with the same kind of frequency as their software counterparts. The row hammer flaw was the most recent hardware vulnerability of this sort and that was handled awkwardly as well. Improving the process for hardware flaws is something we need to do, he said.
But Rice thinks that the process for hardware bugs should be similar to what is done for software. After all, the problems need to be fixed in software, Corbet agreed. The vulnerabilities also span multiple communities and an embargo is meant to allow a subset of the world to prepare fixes before "the stuff hits the rotating thing", Rice said, which did not work in this case.
Are containers able to help here? Corbet pointed to one of the early responses from Alan Cox who suggested that those affected should execute the plan (that they "most certainly have") to move their services when their cloud provider fails. He asked if that had happened: were people able to do that and did it help them?
One of the main selling points of Kubernetes and containers is that they avoid vendor lock-in, Frazelle said. She did not know if anyone used that to move providers. But many did like that the process of upgrading their kernel was easier using containers and orchestration. Services could be moved, the kernel upgraded and booted, then the services could be moved back, resulting in zero downtime.
McLaughlin agreed with that. Her company was able to migrate some of its services to other Amazon Web Services (AWS) regions that had kernel versions with the fixes applied. That is region-dependent, however, so it wasn't a complete fix, she said. In a bit of a comedic break, there was a brief exchange on hardware. McLaughlin: "containers are cool in some aspects but they need to run on hardware". Frazelle: "sadly". Corbet: "that hardware thing's a real pain". Huang: "sorry, guys". Audience: laughter.
The future
Next up was another audience question: "is Spectre and Meltdown just the first salvo of the next twenty years of our life being hell?" Cook was somewhat optimistic that the mitigation for Meltdown (kernel page-table isolation (KPTI) for Linux) would be a "gigantic hammer" that would fix a wide range of vulnerabilities in this area. With three variants (Meltdown plus two for Spectre), there is the expectation that more problems of this sort will be found as researchers focus on these areas. If KPTI doesn't mitigate some of the newer attacks, at least the kernel developers will have had some practice in working around these kinds of problems, he said.
Since timing side channels have been known for some time, the novelty with Meltdown and Spectre comes from combining that with speculative execution, Rice said. That means that people will be looking at all of these CPU performance optimizations closely to see what else can be shaken loose.
Corbet asked Huang if he thought that would lead hardware designers toward more defensive designs. The problem, of course, is that any such designs Huang can imagine are going to have a large impact on performance. Those purchasing hardware will have to decide if they are willing to take a "two to ten to twenty percent hit" for something that has rock-solid security. That kind of performance degradation impacts lots of things, including power budgets, maintainability, and so on, he said.
The last audience question asked if there were lessons that this hardware bug could teach us so that we do better when the inevitable next software vulnerability turns up. Rice said that, from his perspective, the software process is working better than the hardware one. It is not perfect, by any means, but is working reasonably well. Lessons are being learned, however, and he has had good conversations with those who did not alert the BSD community last time, so he is hopeful that situation will not repeat when the (inevitable) next hardware bug comes along.
Huang asked who embargoes are meant to protect against; is it script kiddies or state-level actors? Cook said "yes", but Corbet argued that script kiddies are last century's problem. They have been superseded by "political, commercial, or military interests". He said there have been rumors that these vulnerabilities were available for sale before the disclosure as well.
Huang pointed out that state-level actors are likely to have learned about the vulnerabilities at the same time as those inside the embargo did; those actors are monitoring various communication channels that were used to disclose the flaws. So opening the bugs up to the community to allow it to work together on fixes would have been a more powerful response, he said. That, too, was an applause-worthy line.
The fact that some of the fixes could not be worked on in the open (the Spectre fixes in particular) made things worse, Corbet said. Once the embargo broke, it was clear those patches were not in good shape and some did not even fix what they purported to. The problem was, as Cook pointed out, that once the words "speculative execution" were mentioned, the cat was pretty much fully out of the bag. While there was a plausible, if highly suspect, reason behind the KPTI patches (breaking kernel address-space layout randomization or KASLR), that was not possible for Spectre. Normally that kind of work all happens behind closed doors; Meltdown was the exception here.
In closing, several of the panelists shared their takeaways from this episode. McLaughlin reiterated the need to make sure that systems are up to date. Fixes are great, but they need to actually be rolled out in order to make a difference. Cook suggested that people should be running a recent kernel; it turns out that it is difficult to backport these fixes to, say, 2.6 kernels (or even 3.x kernels). He did also want to share his perspective that, contrary to the "OMG 2018 has started off as the worst year ever" view, 2017 was actually the year with all the badness. 2018 is the year where we are fixing all of the problems, which is a good thing.
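For readers who want to act on that advice, here is a minimal sketch of one way to check whether a running Linux kernel reports mitigations. It assumes a kernel recent enough to expose the `/sys/devices/system/cpu/vulnerabilities` directory, where each file holds a status line such as "Not affected", "Vulnerable", or "Mitigation: PTI". The helper names `read_vulnerability_status` and `unmitigated` are hypothetical, and nothing like this was presented at the panel.

```python
from pathlib import Path

# sysfs directory where newer kernels report per-vulnerability status.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_vulnerability_status(directory: Path = SYSFS_DIR) -> dict[str, str]:
    """Return {file name: status text} for each entry in the directory.

    An empty dict means the directory is absent, which suggests a kernel
    too old to report this information (or a non-Linux system).
    """
    if not directory.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(directory.iterdir())
        if entry.is_file()
    }


def unmitigated(status: dict[str, str]) -> list[str]:
    """List the entries whose status text contains "Vulnerable"."""
    return sorted(name for name, text in status.items() if "Vulnerable" in text)
```

Calling `unmitigated(read_vulnerability_status())` on a patched system should return an empty list; the check is a simple substring match, so it is a starting point for an audit rather than a complete one.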
Rice further beat the drum for learning from this event and making the next one work better for everyone. Similarly, Frazelle restated her thinking with a grin: "containers sadly won't save you, but they will ease your pain in upgrading". That was met with both panel and audience laughter, as might be guessed.
Corbet rounded things out with a plea to the industry for more information regarding the timeline and what was happening behind the scenes, which is normally released shortly after an embargo is lifted. In particular, he would like to understand what was happening for the roughly three months after the vulnerabilities were found but before groups like the Linux kernel developers started to get involved. There are people at the conference who have publicly stated that they are not allowed to even say the names of the vulnerabilities. That really needs to change so that we can all learn from this one and be better prepared for the next.
A YouTube video of the panel is available.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]
| Index entries for this article | |
|---|---|
| Security | Meltdown and Spectre |
| Conference | linux.conf.au/2018 |