The effect of Meltdown and Spectre in our communities

By Jake Edge
January 31, 2018

linux.conf.au

A late-breaking development in the computing world led to a somewhat hastily arranged panel discussion at this year's linux.conf.au in Sydney. The embargo for the Meltdown and Spectre vulnerabilities broke on January 4; three weeks later, Jonathan Corbet convened representatives from five separate parts of our community, from cloud to kernel to the BSDs and beyond. As Corbet noted in the opening, the panel itself was organized much like the response to the vulnerabilities themselves, which is why it didn't even make it onto the conference schedule until a few hours earlier.

Introductions

Corbet is, of course, the executive editor here at LWN and has been "writing a lot of wrong stuff on the internet" about the vulnerabilities. He came up with the idea of a panel to discuss how the episode had impacted the free-software community and how our community responded to it, rather than to recount the "gory technical details". The panelists were Jessie Frazelle, Benno Rice, Kees Cook, Katie McLaughlin, and Andrew "bunnie" Huang, who each introduced themselves in turn.

Frazelle works at Microsoft and is on the security team for Kubernetes; she said she "has done a few things with containers", which elicited some laughter. Rice is a FreeBSD developer and core team member who works for iXsystems, which develops FreeNAS. Cook works for Google on upstream kernel security; he found out about Meltdown and Spectre before the embargo date, so he has some knowledge and insight about that part of the puzzle. McLaughlin is at Divio, which is a Django cloud provider; her company did not find out about the vulnerabilities until the embargo broke, so she has been "having fun" since then. Huang makes and breaks hardware, he said, which may give him some perspective on "the inside guts".

Corbet started by asking each panelist to recount their experiences in responding to the vulnerabilities and to comment on how it impacted their particular niche. There was a lot of confusion, Frazelle said, which led to questions about "whether containers would save you from a hardware bug—the answer is 'no'". Fixing this process moving forward would obviously involve "not having an absolute shitshow of an embargo", she said.

Rice said that the BSD community's main gripe is that it didn't get notified of the problem until very late in the embargo period despite having relationships with many of the vendors involved. The BSD world found out roughly eleven days before the embargo broke (which would put it right at Christmas); he didn't think that was due to any "malice or nastiness on anyone's part". He hopes that this is something that only happens once in a lifetime, though that statement caused a chuckle around the room—and from Rice himself. It is, he hopes, a chance to learn from the mistakes made this time, so that it can be done better in the future.

Even within Google (which discovered the vulnerabilities), knowledge about Meltdown and Spectre was pretty well contained, Cook said. Once he knew, he had to be careful who he talked with about it; those that needed to know had to go through a "somewhat difficult" approval process before they could be informed. He tried to make sure that the kernel maintainers were among the informed group around the time of the Kernel Summit in late October, the x86 maintainers in particular; once that happened, the process of working around the problem in the kernel seemed to accelerate.

As a platform provider, figuring out the right response was tricky, McLaughlin said, but her company's clients have been understanding. Divio does not host "really human-life-critical systems", she said, but that does not apply to everyone out there who has to respond. She and her colleagues were blindsided by the disclosure and things are not "100% back to where they were" on their platform.

Huang said that he feels for the engineers who work on these chips, as there is an arms race in the industry to get more and more performance; the feature that was exploited in this case is one that provides a large performance gain. It will be interesting to see how things play out in the legal arena, he said; there are already class-action lawsuits against Intel for this and previous hardware bugs had rather large settlements ($400 million for the Pentium FDIV bug, for example). He is concerned that lawsuits will cause hardware makers to get even more paranoid about sharing chip information; they will worry that more information equates to worse legal outcomes (and settlements).

Embargoes

Corbet then turned to the embargo process itself, wondering if it made things worse in this case, as has been asserted in various places. It is clear that there are various haves and have-nots, with the latter completely left out in the cold without any kind of response ready even if the embargo had gone its full length.

Rice believes that embargoes do serve a purpose, but that this situation led to a lot of panic so that some were not informed because the hardware vendors were worried about premature disclosure. There is "a lovely bit of irony" that people worked out what was going on based on the merging of performance-degrading patches to Linux without Linus Torvalds's usual steadfast blocking of those kinds of patches. Without some kind of body that coordinates these kinds of embargoes (and disclosures), though, it is difficult to have any kind of consistency in how they are handled.

The embargo has been described as a "complete disaster", Cook said, but the real problems were with the notification piece; the embargo itself was relatively successful. The embargo only broke six days early after six months of being held under wraps. Those parts that could be, were worked on in the open, though that did lead to people realizing that something was going on: "Wow, Linus is happy with this, something is terribly wrong". The fact that it broke early made for a scramble, but he wondered what would have happened if it broke in July—it would have been months before there were any fixes.

It is quite surprising that the embargo held for as long as it did without people at least starting rumors about the problems, Frazelle said. But it is clear that the small cloud providers were given no notice even though they purchase plenty of hardware from Intel. There is an "exclusive club" and the lines on who is in and out of the club are not at all clear. Corbet said he would hate to see a world where only the largest cloud providers get this kind of information, since it represents a competitive disadvantage for those who are not in the club. Rice noted that community projects are in an even worse position, since they don't directly have the vendor relationships that might allow them to get early notice of these hardware flaws.

An audience member directed a question at McLaughlin: how could the disclosure process have been handled better for smaller cloud providers like the one she works for? She said that her company is running on an operating system that still did not have all the fixes, but that once the embargo broke, the company was able to "flip some switches" to disable access to many things in its systems. The sites were still up, but in a read-only mode and "no one died". She is not sure that can really scale; other providers had large enough teams to build their own kernels, but many were left waiting for the upstream kernel (or their distribution's kernel) to incorporate fixes. It is a difficult problem, she said, and she doesn't really have a good answer on how to fix it.

Open hardware

Another audience question on alternatives to Intel got redirected into a question on open hardware. Corbet asked Huang if open hardware was "a path to salvation", but Huang was not optimistic on that front. "I have an opinion on that", he said, to laughter. All of the elements that make up Meltdown and Spectre (speculative execution and timing side channels) were already known, he said, and open hardware is not immune. He quoted Seymour Cray: "memory is like an orgasm, it's better when you don't have to fake it". Whenever we fake memory to get better performance, it opens up side channels. Open hardware has not changed the laws of physics, so it is just as vulnerable to these kinds of problems.

On the other hand, there is a class of hardware problems where being able to review the processor designs will help, he said. These problems have not been disclosed yet, but he seemed to indicate they will be coming to light. Being able to look at all of the bits of the registers and what they control, as well as various debug features and other things that are typically hidden inside the processor design will make those kinds of bugs easier to find.

Corbet asked about the possibility of reproducible builds in the open hardware world, but Huang said that is a difficult problem to solve too. There are doping attacks that can change a chip in ways that cannot be detected even with visual inspection. So even being able to show that a particular design is embodied in the silicon is no real defense. We are building chips with gates a few atoms wide at this point, which makes for an "extremely hard problem". In software, we have the advantage that we can build our tools from source code, but we cannot build chip fabrication facilities (fabs) from source; until we can build fabs from source, we are not going to have that same level of transparency, he said.

Speculative execution has been warned about as an area of concern for a long time, an audience member said, but those warnings were either not understood or were ignored. There are likely warnings about other things today that are likewise going unheeded; how does the industry ensure that we don't ignore warnings of risky features going forward? Corbet noted that writing in C or taking input into web forms have been deemed risky; sometimes we have to do risky things to make any progress.

We probably should be paying more attention to the more paranoid among us, Cook said. Timing side channels have been around for a long time, but there were no practical attacks, so they were often ignored. Waiting for a practical attack may mean that someone malicious comes up with one; instead we should look at ways to mitigate these "impractical" problems if there is a way to do so with an acceptable impact on the code. Huang said that as a hardware engineer, he finds it "insane that you guys think you can run secrets with non-secrets on the same piece of hardware"; that was met with a round of applause.

A question from Twitter was next: how should the recipients of early disclosure of these kinds of bugs be chosen? Corbet said that there is an existing process within the community for disclosure; even though it works reasonably well, it was not followed this time, for reasons that remain unclear. But Cook said that process is for software vulnerabilities, not for hardware bugs, which have not happened with the same kind of frequency as their software counterparts. The row hammer flaw was the most recent hardware vulnerability of this sort and that was handled awkwardly as well. Improving the process for hardware flaws is something we need to do, he said.

But Rice thinks that the process for hardware bugs should be similar to what is done for software. After all, the problems need to be fixed in software, Corbet agreed. The vulnerabilities also span multiple communities and an embargo is meant to allow a subset of the world to prepare fixes before "the stuff hits the rotating thing", Rice said, which did not work in this case.

Are containers able to help here? Corbet pointed to one of the early responses from Alan Cox who suggested that those affected should execute the plan (that they "most certainly have") to move their services when their cloud provider fails. He asked if that had happened: were people able to do that and did it help them?

One of the main selling points of Kubernetes and containers is that they avoid vendor lock-in, Frazelle said. She did not know if anyone used that to move providers. But many did like that the process of upgrading their kernel was easier using containers and orchestration. Services could be moved, the kernel upgraded and booted, then the services could be moved back, resulting in zero downtime.

McLaughlin agreed with that. Her company was able to migrate some of its services to other Amazon Web Services (AWS) regions that had kernel versions with the fixes applied. That is region-dependent, however, so it wasn't a complete fix, she said. In a bit of a comedic break, there was a brief exchange on hardware. McLaughlin: "containers are cool in some aspects but they need to run on hardware". Frazelle: "sadly". Corbet: "that hardware thing's a real pain". Huang: "sorry, guys". Audience: laughter.

The future

Next up was another audience question: "is Spectre and Meltdown just the first salvo of the next twenty years of our life being hell?" Cook was somewhat optimistic that the mitigation for Meltdown (kernel page-table isolation (KPTI) for Linux) would be a "gigantic hammer" that would fix a wide range of vulnerabilities in this area. With three variants (Meltdown plus two for Spectre), there is the expectation that more problems of this sort will be found as researchers focus on these areas. If KPTI doesn't mitigate some of the newer attacks, at least the kernel developers will have had some practice in working around these kinds of problems, he said.

Since timing side channels have been known for some time, the novelty with Meltdown and Spectre comes from combining that with speculative execution, Rice said. That means that people will be looking at all of these CPU performance optimizations closely to see what else can be shaken loose.

Corbet asked Huang if he thought that would lead hardware designers toward more defensive designs. The problem, of course, is that any such designs Huang can imagine are going to have a large impact on performance. Those purchasing hardware will have to decide if they are willing to take a "two to ten to twenty percent hit" for something that has rock-solid security. That kind of performance degradation impacts lots of things, including power budgets, maintainability, and so on, he said.

The last audience question asked if there were lessons that this hardware bug could teach us so that we do better when the inevitable next software vulnerability turns up. Rice said that, from his perspective, the software process is working better than the hardware one. It is not perfect, by any means, but is working reasonably well. Lessons are being learned, however, and he has had good conversations with those who did not alert the BSD community last time, so he is hopeful that situation will not repeat when the (inevitable) next hardware bug comes along.

Huang asked who embargoes are meant to protect against; is it script kiddies or state-level actors? Cook said "yes", but Corbet argued that script kiddies are last century's problem. They have been superseded by "political, commercial, or military interests". He said there have been rumors that these vulnerabilities were available for sale before the disclosure as well.

Huang pointed out that state-level actors are likely to have learned about the vulnerabilities at the same time as those inside the embargo did; those actors are monitoring various communication channels that were used to disclose the flaws. So opening the bugs up to the community to allow it to work together on fixes would have been a more powerful response, he said. That, too, was an applause-worthy line.

The fact that some of the fixes could not be worked on in the open (the Spectre fixes in particular) made things worse, Corbet said. Once the embargo broke, it was clear those patches were not in good shape and some did not even fix what they purported to. The problem was, as Cook pointed out, that once the words "speculative execution" were mentioned, the cat was pretty much fully out of the bag. While there was a plausible, if highly suspect, reason behind the KPTI patches (breaking kernel address-space layout randomization or KASLR), that was not possible for Spectre. Normally that kind of work all happens behind closed doors; Meltdown was the exception here.

In closing, several of the panelists shared their takeaways from this episode. McLaughlin reiterated the need to make sure that systems are up to date. Fixes are great, but they need to actually be rolled out in order to make a difference. Cook suggested that people should be running a recent kernel; it turns out that it is difficult to backport these fixes to, say, 2.6 kernels (or even 3.x kernels). He did also want to share his perspective that, contrary to the "OMG 2018 has started off as the worst year ever" view, 2017 was actually the year with all the badness. 2018 is the year where we are fixing all of the problems, which is a good thing.

Rice further beat the drum for learning from this event and making the next one work better for everyone. Similarly, Frazelle restated her thinking with a grin: "containers sadly won't save you, but they will ease your pain in upgrading". That was met with both panel and audience laughter, as might be guessed.

Corbet rounded things out with a plea to the industry for more information regarding the timeline and what was happening behind the scenes, which is normally released shortly after an embargo is lifted. In particular, he would like to understand what was happening for the roughly three months after the vulnerabilities were found but before groups like the Linux kernel developers started to get involved. There are people at the conference who have publicly stated that they are not allowed to even say the names of the vulnerabilities. That really needs to change so that we can all learn from this one and be better prepared for the next.

A YouTube video of the panel is available.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]

The effect of Meltdown and Spectre in our communities

Posted Jan 31, 2018 21:50 UTC (Wed) by GhePeU (subscriber, #56133) [Link] (7 responses)

> Cook was somewhat optimistic that the mitigation for Meltdown (kernel page-table isolation (KPTI) for Linux) would be a "gigantic hammer" that would fix a wide range of vulnerabilities in this area.

I’ve been thinking about this. KPTI is already automatically disabled on all AMD CPUs, and I’ve seen patches (I’m not sure if they were merged yet) to disable it on specific classes of unaffected Intel and non-Intel CPUs and to honour a register that will allow future CPUs to indicate they’re safe too. So the thinking on that side seems to be that KPTI fixes a specific issue and should be avoided when not needed. But at the same time, Kees Cook here and others (including the people who developed KAISER) talk about it like something generally useful that will improve security for a larger class of problems. So which is it? Trade-offs, I know...

The effect of Meltdown and Spectre in our communities

Posted Jan 31, 2018 22:29 UTC (Wed) by davecb (subscriber, #1574) [Link]

I consider it part of a defense in depth.

Solaris x86 started out with user space being visible to the kernel, and people used that. To create bugs, as it happened. We later took that mapping out, and required the older, slower and more paranoid mechanisms be used.

Then we improved them so they stayed paranoid, but ran rather quickly. That gave us a smaller threat surface, making it somewhat harder to invent meltdown-like exploits without ever having to predict meltdown as a possible attack.

Didn't help with spectre, of course, but then only the Multicians worried about that kind of covert channel. Too bad!

The effect of Meltdown and Spectre in our communities

Posted Jan 31, 2018 23:28 UTC (Wed) by hansendc (subscriber, #7363) [Link] (5 responses)

There is no one-size-fits-all approach to security. I'd say that folks fit into three camps, generally:

If you are paranoid, you leave KPTI on no matter what (boot with pti=on).

If you are like most folks, you take the kernel's sane defaults (off on AMD, off on Intel with the RDCL_NO bit set in the ARCH_CAPABILITIES MSR, on everywhere else for now).

If you are entirely non-paranoid for some reason, you boot with pti=off.
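
The three camps above can be written down as a small decision function. This is only a sketch of the policy as described in this comment, not kernel code: the function name, its parameters, and the vendor strings are made up for illustration, and the bit position used for RDCL_NO (bit 0 of the ARCH_CAPABILITIES MSR) is the one documented by Intel.

```python
# Toy model of the KPTI policy described in the comment above.
# Illustrative only; this is not how the kernel itself is structured.

ARCH_CAP_RDCL_NO = 1 << 0  # bit 0 of ARCH_CAPABILITIES: CPU reports it is not affected


def kpti_enabled(vendor, arch_capabilities=0, pti=None):
    """Return True if page-table isolation would be on under this model.

    vendor: CPU vendor string such as "AMD" or "Intel" (hypothetical input)
    arch_capabilities: value of the ARCH_CAPABILITIES MSR, 0 if the CPU lacks it
    pti: "on", "off", or None when no pti= boot option was given
    """
    # An explicit boot option wins over any default.
    if pti == "on":
        return True
    if pti == "off":
        return False
    # Defaults as listed in the comment: off on AMD...
    if vendor == "AMD":
        return False
    # ...off on Intel parts that set RDCL_NO...
    if vendor == "Intel" and arch_capabilities & ARCH_CAP_RDCL_NO:
        return False
    # ...and on everywhere else for now.
    return True
```

Under this model, kpti_enabled("AMD", pti="on") is the "paranoid" case and stays on, while kpti_enabled("Intel") with no capabilities reported falls through to the default of on.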

The effect of Meltdown and Spectre in our communities

Posted Feb 5, 2018 4:55 UTC (Mon) by immibis (subscriber, #105511) [Link] (4 responses)

You can go even further in the performance vs. security tradeoff: I wonder if anyone's tried adding an option to disable all memory protection. A syscall is then just a call.

The effect of Meltdown and Spectre in our communities

Posted Feb 7, 2018 2:09 UTC (Wed) by nix (subscriber, #2304) [Link] (3 responses)

Ah, but that's also performance versus *stability*: every wild pointer bug can murder you. Spectre and Meltdown are unlikely to be triggered by accident. :)

The effect of Meltdown and Spectre in our communities

Posted Feb 8, 2018 6:18 UTC (Thu) by immibis (subscriber, #105511) [Link] (2 responses)

Well then here's an option in-between "no memory protection" and "normal pre-meltdown memory protection" - use a memory protection key for kernel space.

User-space code then can unlock the key at any time, but needs to use a special instruction to do so, so it's unlikely to happen by accident.

The effect of Meltdown and Spectre in our communities

Posted Feb 9, 2018 1:54 UTC (Fri) by nix (subscriber, #2304) [Link] (1 responses)

I'm fairly sure memory protection keys are for fairly small amounts of memory, and they protect *physical* memory, not chunks of the virtual address space and definitely not chunks of the virtual address space under specific privilege levels, magically autodecrypting at zero cost as you switch privilege levels: i.e. either kernel memory is encrypted, or it is not. If it's encrypted, kernel operations aren't exactly going to be very fast.

The effect of Meltdown and Spectre in our communities

Posted Feb 9, 2018 5:31 UTC (Fri) by immibis (subscriber, #105511) [Link]

Not those kinds of keys.
MPKs are tag bits associated with each page-table entry, which indirectly look up permissions in another processor register. See https://lwn.net/Articles/667156/

So you leave all user-space pages set to 0, for example, and set kernel pages to 1 (except for one containing the kernel entry point). Then you set the "MPK 1 permissions" register to write-disable, read-disable, execute-disable. Then when entering the kernel you clear those flags, and set them again when leaving. The "MPK 1 permissions" register is global, it's not part of the page-table entry.

Normally you wouldn't do this because the "set permissions register" instruction is not privileged, meaning any code can run it. But if you were trying to run a high-performance minimal-security still-somewhat-robust system, you might!
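
A toy simulation may make the scheme easier to follow. It models only what the comment describes (a key tag per page, one global permissions register, and an unprivileged instruction to change it); the class and method names are invented for this sketch, and it does not claim to match the exact semantics of any real processor's protection keys.

```python
# Toy model of the protection-key scheme sketched in the comment above.
# All names are hypothetical; this does not reproduce real hardware behavior.
from dataclasses import dataclass

USER_KEY = 0    # tag for user-space pages and the kernel entry page
KERNEL_KEY = 1  # tag for all other kernel pages


@dataclass
class Page:
    name: str
    key: int  # tag bits stored in the page-table entry


class ToyCPU:
    def __init__(self):
        # One global register: key -> set of access kinds currently denied.
        self.denied = {USER_KEY: set(), KERNEL_KEY: set()}

    def set_permissions(self, key, denied):
        # Unprivileged, as in the comment: any code may execute this.
        self.denied[key] = set(denied)

    def access(self, page, kind):
        """Return True if an access of kind "r", "w", or "x" is allowed."""
        return kind not in self.denied[page.key]

    def enter_kernel(self):
        # Clear the deny flags for kernel pages on the way in.
        self.set_permissions(KERNEL_KEY, ())

    def leave_kernel(self):
        # Deny read, write, and execute on kernel pages on the way out.
        self.set_permissions(KERNEL_KEY, ("r", "w", "x"))
```

After leave_kernel(), a stray access to a page tagged KERNEL_KEY fails, which is the robustness gain; but nothing stops user code from calling set_permissions() itself, which is why this is a guard against accidents rather than a security boundary.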

The effect of Meltdown and Spectre in our communities

Posted Jan 31, 2018 23:34 UTC (Wed) by joey (guest, #328) [Link] (1 responses)

Re open hardware, Bunnie didn't mention formal verification of CPUs. The difficulty with it (besides scaling to such complex CPUs as Intel chips) is perhaps that you can only use it to answer questions you think to ask. But, once something like side channel attacks are a known problem, it should be possible to detect specific instances such as meltdown using formal verification.

Intel has been using formal verification in some areas since the Pentium F00F bug. Some OpenRISC etc designs are being formally verified to some extent. Would open hardware have made it more likely that the right question was asked, so meltdown could be detected before vulnerable chips were deployed? (Would have been my question if I'd made it to Sydney.)

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 18:23 UTC (Thu) by joey (guest, #328) [Link]

https://www.bunniestudios.com/blog/?p=5127 might be the answer to that Q

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 0:09 UTC (Thu) by karkhaz (subscriber, #99844) [Link] (1 responses)

What a great idea it was to organise this panel---thanks, Jonathan! It's great to hear the insider experience from such a variety of people.

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 0:29 UTC (Thu) by corbet (editor, #1) [Link]

Thanks! The real credit belongs to the panelists, though, and to the LCA organizers who scrambled hard to find space for this session.

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 7:10 UTC (Thu) by rav (guest, #89256) [Link]

The 'comedic break' occurs in the video at 28:24: https://youtu.be/nlcXQWJALqQ?t=28m24s

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 7:14 UTC (Thu) by mjthayer (guest, #39183) [Link] (2 responses)

A few things which have been on my mind in this context. Please excuse me if they are naive.

1) KPTI seems to be a pretty bad performance hit, especially for old CPUs. It would obviously be possible to expand the kernel stub which is still mapped into all processes and reduce the hit that way. Are people looking at what can safely be done in that area? (I strongly assume they are.)

2) Sort of following on from some of the comments reported - "insane that you guys think you can run secrets with non-secrets on the same piece of hardware", the plan that they "most certainly have" - perhaps (we) software - and hardware - people are trying to be too self-important regarding security. We go to great (which may not equate to successful) efforts to keep those exploits we learn about secret until we have produced some sort of broken fix. Would it not be more sensible to spend more efforts educating those users who want to know about the risks associated with software and hardware so that they can prepare back-up plans when something hits, and put more effort into keeping things working when things need patching? The damage an exploit would cause is often (bad) guesswork, but the damage caused by bad patches, and by effort which could have gone into other things instead, is more measurable.

3) Regarding open hardware, is there any current effort into ISAs which are explicitly designed for Qemu-style recompilers? I know Transmeta failed at something like that in its day, but I could imagine that that could make some sense. Qemu is with all respect not a high performance tool, but it would probably still be a lot better with such an ISA. It seems to me that this would be a more realistic target for open CPUs than something like x86, but also not something that the likes of Intel would have an interest in doing in user-visible software.

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 13:09 UTC (Thu) by cesarb (subscriber, #6266) [Link] (1 responses)

> Regarding open hardware, is there any current effort into ISAs which are explicitly designed for Qemu-style recompilers?

While not explicitly designed for that, there have been good results with the RISC-V ISA: https://carrv.github.io/2017/papers/clark-rv8-carrv2017.pdf

The effect of Meltdown and Spectre in our communities

Posted Feb 2, 2018 8:56 UTC (Fri) by mjthayer (guest, #39183) [Link]

>> Regarding open hardware, is there any current effort into ISAs which are explicitly designed for Qemu-style recompilers?

> While not explicitly designed for that, there have been good results with the RISC-V ISA: https://carrv.github.io/2017/papers/clark-rv8-carrv2017.pdf

That looks like RV8 on x86. I was more thinking of the other way round. Some ISA which gave the recompiler some control over execution units, to replicate some of the out-of-order things which are normally done internally using software and basic hardware, and possibly more control over the various caches. For intellectual interest, does anyone else have thoughts about what would be possible or necessary, and whether any interesting level of performance would be achievable?

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 15:13 UTC (Thu) by fuhchee (guest, #40059) [Link] (5 responses)

> "he finds it "insane that you guys think you can run secrets with non-secrets on the same piece of hardware"; that was met with a round of applause. "

I don't understand the applause. The whole concept of timeshare systems implies just this mode of operation. Even a single-user system that runs code with different levels of trust/privilege is of this mode. If it's insanity, then the whole computing world is mad.

The effect of Meltdown and Spectre in our communities

Posted Feb 1, 2018 15:40 UTC (Thu) by bandrami (guest, #94229) [Link]

Maybe "Sisyphean" is a better word than "mad", but it's ultimately a losing game. The CPU *can* access any byte of memory, and no matter what walls you put up someone will find a way over, around, or through them eventually. (Even if you had a dedicated processor for each ring, which is a cool idea, you still have to have some kind of hyper/supervisor, which just kicks the same problem up a level.)

The effect of Meltdown and Spectre in our communities

Posted Feb 2, 2018 8:24 UTC (Fri) by mjthayer (guest, #39183) [Link] (3 responses)

> I don't understand the applause. The whole concept of timeshare systems implies just this mode of operation. Even a single-user system that runs code with different levels of trust/privilege is of this mode. If it's insanity, then the whole computing world is mad.

Security is not binary. Presumably for most of these single user systems the person in charge still has a certain minimum of trust in those processes not to do anything under the belt, and is prepared to deal with the fall-out if necessary. This already made sense before Spectre and Meltdown. I think that big cloud providers already offer clients with sensitive workloads physical isolation from other customers, as in that they are guaranteed that they will never share a physical PC in the data centre with someone else.

The effect of Meltdown and Spectre in our communities

Posted Feb 2, 2018 12:19 UTC (Fri) by fuhchee (guest, #40059) [Link] (2 responses)

> I think that big cloud providers already offer clients with sensitive workloads physical isolation from other customers,

Sure. But part of the cloud infrastructure is a "secret" being run on the same piece of hardware as the isolated customer's code.

The effect of Meltdown and Spectre in our communities

Posted Feb 2, 2018 14:09 UTC (Fri) by mjthayer (guest, #39183) [Link] (1 responses)

>> I think that big cloud providers already offer clients with sensitive workloads physical isolation from other customers,

> Sure. But part of the cloud infrastructure is a "secret" being run on the same piece of hardware as the isolated customer's code.

Even if we assume that the infrastructure in the individual machines is proprietary code, rather than "public" open source, or that that infrastructure has access to secrets about the bigger cloud set-up rather than just waiting for instructions from other machines (I do not know much about cloud infrastructure I must admit), I would expect the main protection for provider and customer from each other would be the trust based on their business relationship, backed up by laws, rather than technical capabilities of the CPUs.

The effect of Meltdown and Spectre in our communities

Posted Feb 2, 2018 14:28 UTC (Fri) by fuhchee (guest, #40059) [Link]

I'm not talking about "proprietary vs open source". I'm not even talking about "legal protection".

I'm talking about the simple bits & bytes level. A machine that participates in cloud infrastructure must necessarily have some secrets, for example those associated with authenticating itself to that infrastructure. Whether that's in the kernel, hypervisor, an administrative peer userspace VM, or some crypto coprocessor, doesn't matter. If that same piece of hardware also runs tenant code, "insanity" ensues.

Performance impacts of actually fixing the hardware

Posted Feb 2, 2018 17:30 UTC (Fri) by ecree (guest, #95790) [Link] (3 responses)

> more defensive designs. The problem, of course, is that any Huang can imagine are going to have a large impact on performance. Those purchasing hardware will have to decide if they are willing to take a "two to ten to twenty percent hit" for something that has rock-solid security.

So here's what I don't understand about hardware Spectre mitigations. Maybe I'm muddling different variants of the attacks together, but I thought all this happened because certain caches in the CPU (notably the branch target buffer) aren't properly 'tagged' with the address space and/or permission ring, instead simply using the virtual address as their key. At least in variant 2 (the BTB attack), the CPU is being 'trained' with indirect branches performed in the attacker's address space, and this then causes the CPU running in the victim address space to predict the trained virtual address.

But fixing this wouldn't harm performance, indeed it should improve it, because branches (and accesses, and anything else that can train/populate the CPU's architecturally-invisible state) from one address space (or context more broadly construed) are at best irrelevant to predictions in another address space, thus will pollute the statistics and make the prediction perform worse than if it were secure.
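
To make the tagging argument concrete, here is a toy model in C of a branch target buffer. It is a sketch, not a description of any real CPU: the struct layout, the 16-entry direct-mapped table, the `asid` field, and the names `btb_train` and `btb_predict` are all assumptions made for illustration. Real predictors index and tag on partial or hashed address bits, and their details are largely undocumented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16

/* One predictor entry. The asid field is only consulted in tagged mode. */
struct btb_entry {
    bool     valid;
    uint64_t branch_va;  /* virtual address of the indirect branch */
    uint64_t target_va;  /* last observed target of that branch */
    uint32_t asid;       /* address-space ID that trained the entry */
};

struct btb {
    bool tagged;         /* true: a lookup must also match the ASID */
    struct btb_entry e[BTB_ENTRIES];
};

/* Direct-mapped index; an assumption, real hardware hashes differently. */
static unsigned btb_index(uint64_t branch_va)
{
    return (unsigned)(branch_va % BTB_ENTRIES);
}

/* Record that the branch at branch_va, executed in address space asid,
 * went to target_va. Overwrites whatever was in that slot before. */
void btb_train(struct btb *b, uint32_t asid,
               uint64_t branch_va, uint64_t target_va)
{
    assert(target_va != 0);  /* 0 is reserved to mean "no prediction" */
    struct btb_entry *ent = &b->e[btb_index(branch_va)];
    ent->valid = true;
    ent->branch_va = branch_va;
    ent->target_va = target_va;
    ent->asid = asid;
}

/* Return the predicted target, or 0 if the model makes no prediction. */
uint64_t btb_predict(const struct btb *b, uint32_t asid, uint64_t branch_va)
{
    const struct btb_entry *ent = &b->e[btb_index(branch_va)];
    if (!ent->valid || ent->branch_va != branch_va)
        return 0;
    if (b->tagged && ent->asid != asid)
        return 0;            /* trained by someone else: ignore it */
    return ent->target_va;
}
```

With `tagged` set to false, an entry trained by address space 1 at branch address 0x4000 is handed back to address space 2 when it looks up the same virtual address, which is the variant 2 training step. With `tagged` set to true, that lookup misses. Even in the tagged model one address space can still overwrite another's slot, so the sketch says nothing about eviction-based leaks.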

So why is there an assumption that preventing the architecturally-invisible state from crossing privilege boundaries will necessarily reduce performance? I can see three possible answers:

1. Information could still leak out by detecting whether the victim has used enough space in the relevant cache to cause some of the attacker's entries to be evicted. Thus the only way to be sure would be to ~~take off and nuke the site from orbit~~ flush those caches on every context switch, kernel->user transition etc.
2. The scope of what Intel could do in their microcode patches was limited, but a proper fix in new hardware wouldn't actually imply a slowdown at all.
3. Everyone already knows this, v2 mitigation is easy, and it's fixes for Spectre v1 ('exploiting conditional branches') that would impact performance.

Though tbh I'm not entirely convinced that a hardware fix for v1 needs to be all that difficult either. The paper states (§7) that "Buffering speculatively-initiated memory transactions separately from the cache until speculative execution is committed is not a sufficient countermeasure, since the timing of speculative execution can also reveal information", but the attacks they describe there seem very (ahem) speculative, since the operations following the sensitive read in the speculated execution thread will surely 'see' the cache effects of that read (anything else would be a RAW data hazard), but operations in any other concurrent or subsequent speculated thread would not see those effects until the original read retires (anything else would be crazy).

As far as I can tell, treating cache updates as pipelined state (like forwarded registers etc.) means that cache effects of speculative execution can never leak information, and again, it prevents 'non-meaningful' accesses from polluting the cache with data that likely aren't needed. (Of course, the timing of speculative execution can leak information about cache state left by retired execution, but as long as we have caches this information leak necessarily exists and isn't unique to speculation.)

As for the talk at the end of the section about effects on bus, ALU and register file contention… is there even any sense in which hardware can plug those leaks without just disabling speculation? Execution uses CPU resources, and if you want to prevent that from leaking information (e.g. you're a crypto library) you already have to do blinding, perform fruitless work in your else-clauses, etc., so what does speculative execution add to that problem?

Performance impacts of actually fixing the hardware

Posted Feb 2, 2018 18:15 UTC (Fri) by excors (subscriber, #95769) [Link] (2 responses)

> Execution uses CPU resources, and if you want to prevent that from leaking information (e.g. you're a crypto library) you already have to do blinding, perform fruitless work in your else-clauses, etc. — so what does speculative execution add to that problem?

If you're writing a crypto library, you can put in the effort to ensure the instructions executed by your software are totally independent of any secret data, so the data cannot possibly be exposed through any side channels (timing, cache state, ALU utilisation, etc).
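
As a small illustration of that kind of effort, here is a sketch in C of a comparison whose instruction sequence does not depend on the data being compared. The function name `ct_equal` is made up for this example and is not taken from any particular library. A compiler is still free to transform the code, so this shows the technique and guarantees nothing about the generated machine code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two buffers of length len. Returns 1 if they are equal and
 * 0 otherwise. Every byte is always read, and there is no branch or
 * early exit that depends on the buffer contents. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    assert(a != NULL && b != NULL);

    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);  /* accumulate any difference */

    /* Map diff == 0 to 1 and any nonzero diff to 0 using arithmetic
     * only: (0 - 1) wraps to 0xFFFFFFFF, whose bit 8 is set, while
     * (1..255) - 1 stays below 256, so bit 8 is clear. */
    return (int)((((uint32_t)diff - 1u) >> 8) & 1u);
}
```

Compare this with a plain `memcmp()`-style loop that returns at the first mismatching byte: there, the running time reveals how long the matching prefix is.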

But with speculative execution, the CPU isn't actually going to execute the instructions the software told it to. It might execute a totally arbitrary different series of instructions, thinking that it can roll back afterwards and nobody will have noticed. (Spectre demonstrates that an attacker *can* notice.)

For example, you might have code like "key = test_mode ? dummy_key : secret_key; r = process_in_constant_time(key); if (test_mode) assert(r == process_in_variable_time(key));", where process_in_variable_time() is designed for clarity rather than security and leaks data through side channels; but you can trivially prove that it will only ever be called with the non-secret dummy_key and never with secret_key, so that's perfectly fine. Except, the CPU might mispredict the "if (test_mode)" and speculatively run process_in_variable_time(secret_key) and leak the secret data.
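
Expanded into something compilable, that snippet might look like the sketch below. The two `process_*` bodies are hypothetical placeholders (both just count set bits), the key values are invented, and the wrapper name `run` is an assumption; only the shape of the control flow comes from the example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented values standing in for real key material. */
static const uint32_t dummy_key  = 0x00000001u;
static const uint32_t secret_key = 0xC0FFEE42u;

/* Placeholder for the careful routine: a fixed 32 iterations and no
 * branch that depends on the key. */
uint32_t process_in_constant_time(uint32_t key)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        r += (key >> i) & 1u;
    return r;
}

/* Placeholder for the clear-but-leaky routine: both the loop count and
 * the inner branch depend on the key. */
uint32_t process_in_variable_time(uint32_t key)
{
    uint32_t r = 0;
    while (key != 0) {
        if (key & 1u)
            r++;
        key >>= 1;
    }
    return r;
}

uint32_t run(bool test_mode)
{
    uint32_t key = test_mode ? dummy_key : secret_key;
    uint32_t r = process_in_constant_time(key);

    /* Architecturally, this call only ever sees dummy_key. If the
     * branch is mispredicted while test_mode is false, the CPU may
     * run it speculatively with key == secret_key. */
    if (test_mode)
        assert(r == process_in_variable_time(key));

    return r;
}
```

Nothing in the source distinguishes the safe case from the unsafe one; the exposure exists only in what the processor does before the branch resolves.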

That does seem hard to fix without just disabling speculation.

Performance impacts of actually fixing the hardware

Posted Feb 2, 2018 18:59 UTC (Fri) by ecree (guest, #95790) [Link] (1 responses)

> Except, the CPU might mispredict the "if (test_mode)" and speculatively run process_in_variable_time(secret_key) and leak the secret data.

I think this is just a difference in degree rather than in kind — yes, it means otherwise-secure code can be vulnerable to this attack, but it's not necessary to devise a new way of programming to prevent it; merely applying existing techniques slightly more broadly than before. That is, the programmer has to not accept the "trivial proof" that a branch won't be followed. In your example, the solution is to call process_in_variable_time(dummy_key) instead, since if the result of the branch is _used_ then key == dummy_key.

> It might execute a totally arbitrary different series of instructions

Yes, in principle the CPU could arbitrarily decide to speculatively send secret_key in Morse over a power side-channel while whistling a hornpipe. Thus, for security against Spectre-type flaws, the "architecturally invisible" processor features are going to have to start having at least some documentation of their side-effects; it's no longer acceptable for Intel to say "if it doesn't commit when the insn retires, we don't have to tell you what it'll do".

Once we have documentation of the CPU's speculation behaviour (and of course Intel and other manufacturers will *scream* at that thought, imagine telling the world all their secret sauce...), only then will it be possible to determine whether a given piece of code is susceptible to 'generalised Spectre attacks'.

But none of this is unique to speculation, because the chip could just as easily decide to whistle secret_key in Morse *without* the test_mode stuff being there. (Side-channel attacks are nothing new, and what is process_in_constant_time() on one CPU/arch can easily be leaky on another.) At some point you have to be able to trust that your CPU isn't on drugs, or at least have a list of _which_ drugs it's on…

Performance impacts of actually fixing the hardware

Posted Feb 3, 2018 5:39 UTC (Sat) by foom (subscriber, #14868) [Link]

> In your example, the solution is to call process_in_variable_time(dummy_key) instead, since if the result of the branch is _used_ then key == dummy_key.

And since it's guaranteed to be equal in all actually possible executions, the compiler need not bother loading dummy_key the second time. It can know the value is already available in the register holding "key" whenever the branch could possibly be taken.

So your fixed version turns out to still be vulnerable, if you carefully examine the assembly code. Oops!

How did the 3 letter agencies gather information

Posted Feb 3, 2018 21:11 UTC (Sat) by NAR (subscriber, #1313) [Link]

"those actors are monitoring various communication channels that were used to disclose the flaws"

Am I naive to think that secure communication used to disclose and discuss these flaws is a solved problem from a technology point of view? I'm pretty sure the traffic can be sufficiently encrypted and even though setting up tools and keys might be complicated, "security professionals" should be able to do it, shouldn't they? I'd be more worried about human intelligence gathering: how many people needed to know about these flaws? Hundreds? I can imagine the three letter agencies are striving to put informants into circles like these...

What would happen if the next security flaw is found in a non-US-designed chip (e.g. in a GPU)?