Handling the Kubernetes symbolic link vulnerability
A year-old bug in Kubernetes was the topic of a talk given by Michelle Au and Jan Šafránek at KubeCon + CloudNativeCon North America, which was held in mid-December in Seattle. In the talk, they looked at the details of the bug and the response from the Kubernetes product security team (PST). While the bug was fairly straightforward, it was surprisingly hard to fix. The whole process also provided experience that will help improve vulnerability handling in the future.
The Kubernetes project first became aware of the problem from a GitHub issue that was created on November 30, 2017. It gave full details of the bug and was posted publicly. That is not the proper channel for reporting Kubernetes security bugs, Au stressed. Luckily, a security team member saw the report and scrubbed all of the details from it, moving them to a private issue tracker. There is a documented disclosure process for the project that anyone finding a security problem should follow, she said.
Background
In order to understand the bug, some background on volumes in Kubernetes is needed, much of which can also be found in a blog post by Au and Šafránek. They showed an example pod spec (which can be seen in the slides [PDF]) that used a volume. When that pod gets scheduled to a node, the volume will be mounted so that containers in the pod can access it. Each container can specify where the volume will be mounted inside the container. A directory is created on the host filesystem, and the volume is mounted there.
When the container starts, the kubelet node manager needs to tell the container runtime about the volume; both the host filesystem path and the location in the container where it should be bind mounted are needed. In addition, a container can use the subPath directive to mount just a subdirectory of the volume at the container location. But subPath was subject to a classic symbolic-link race condition vulnerability.
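For illustration, here is a minimal sketch of such a pod spec, built with the Go types from k8s.io/api/core/v1 rather than the YAML shown in the talk's slides; the volume, container, and subPath names are invented for the example:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A pod with an emptyDir volume; the container mounts only the
	// "sub1" subdirectory of the volume at /data via subPath.
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			Volumes: []corev1.Volume{{
				Name: "shared",
				VolumeSource: corev1.VolumeSource{
					EmptyDir: &corev1.EmptyDirVolumeSource{},
				},
			}},
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "busybox",
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "shared",
					MountPath: "/data",
					SubPath:   "sub1", // mount <volume>/sub1, not the whole volume
				}},
			}},
		},
	}
	mount := pod.Spec.Containers[0].VolumeMounts[0]
	fmt.Printf("container mounts %s/%s at %s\n", mount.Name, mount.SubPath, mount.MountPath)
}
```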
A container could symbolically link anything to a subPath name in its view of the volume. A subsequent container that used that subPath name would be following a link controlled by the owner of the pod. Those links are resolved on the host, so linking the subPath name to / would provide access to the root directory on the host; game over.
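A minimal sketch of the attack, assuming the first container has the whole volume mounted at /vol (the paths and names are hypothetical):

```go
package main

import "os"

func main() {
	// Runs inside a container that can write to the volume, which is
	// mounted at /vol here. The link target "/" is just a string; the
	// unpatched kubelet resolved it on the *host* when a later
	// container requested subPath "sub1", bind mounting the host's
	// root directory into that container.
	if err := os.Symlink("/", "/vol/sub1"); err != nil {
		panic(err)
	}
}
```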
Šafránek demonstrated the flaw. He showed that it allowed access to the root directory of the host, which means that the whole node is compromised. For example, it gave access to the Docker socket, so an attacker could run anything in the containers, access any secrets used by the containers, and more. All of that came about because Kubernetes did not check the subPath for symbolic links.
Working out a solution
The problem was reported just before KubeCon Austin, so the PST brainstormed on solutions at that gathering. The first, "naive", idea was simply to resolve the full path, then validate that it still points inside the volume. But there is a time of check to time of use (TOCTTOU) race condition in that scheme. The user can modify the directories after the check but before they are handed to the container runtime.
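A sketch of why the naive approach fails; the function name and paths are hypothetical. The resolve-then-validate sequence looks safe, but it leaves a window between the check and the use:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateSubPath sketches the rejected "naive" approach: resolve all
// symbolic links, then check that the result is still inside the
// volume.
func validateSubPath(volumeRoot, subPath string) (string, error) {
	resolved, err := filepath.EvalSymlinks(filepath.Join(volumeRoot, subPath))
	if err != nil {
		return "", err
	}
	if !strings.HasPrefix(resolved, volumeRoot+string(filepath.Separator)) {
		return "", fmt.Errorf("%q escapes the volume", subPath)
	}
	// TOCTTOU window: between this return and the moment the container
	// runtime bind mounts the path, the pod owner can replace a
	// directory component with a symbolic link to anywhere on the host.
	return resolved, nil
}

func main() {
	fmt.Println(validateSubPath("/var/lib/kubelet/vol", "sub1"))
}
```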
The next idea was to freeze the filesystem while validating the path and handing it to the container runtime. For Windows, CreateFile() can be used to lock each component of the path until after the runtime mounts it, but something different was needed for Linux. Bind mounting the volume into a Kubernetes-controlled directory, outside of user control, and then handing that off is a safe way to get it to the runtime, but race conditions remain: after any symbolic links are resolved, and again after the path has been validated to be inside the volume, the user could switch a path component to a symbolic link.
The /proc filesystem provided the clue that led to the solution actually implemented. The links under /proc/PID/fd can reliably be used to bind mount the file descriptor corresponding to the final component of the subPath. The volume directory is opened, then each component of the subPath is opened with openat() while disallowing symbolic links and validating that the path stays within the volume. The /proc file-descriptor link for the final component is then bind mounted to a safe location and handed off to the container runtime. That eliminates the races and does so in a way that is not dependent on the underlying filesystem type.
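A minimal, Linux-only sketch of that approach follows; the function and paths are hypothetical, and the real kubelet code handles many more details. Because O_NOFOLLOW makes openat() fail on any symbolic link, and "." and ".." components are rejected, the walk cannot escape the volume, and the /proc link pins the exact file that was validated:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/sys/unix"
)

// openSubPath walks the subPath one component at a time with openat(2)
// and O_NOFOLLOW so that no symbolic link is ever followed, then bind
// mounts the result via its /proc/self/fd entry.
func openSubPath(volumeRoot, subPath, safeTarget string) error {
	fd, err := unix.Open(volumeRoot, unix.O_DIRECTORY|unix.O_NOFOLLOW, 0)
	if err != nil {
		return err
	}
	defer func() { unix.Close(fd) }()

	for _, comp := range strings.Split(subPath, "/") {
		if comp == "" || comp == "." || comp == ".." {
			return fmt.Errorf("invalid component %q", comp)
		}
		// O_NOFOLLOW makes openat() fail with ELOOP if comp is a
		// symbolic link; O_PATH avoids needing read permission.
		next, err := unix.Openat(fd, comp, unix.O_PATH|unix.O_NOFOLLOW, 0)
		if err != nil {
			return err
		}
		unix.Close(fd)
		fd = next
	}

	// Bind mounting through /proc pins the validated file descriptor;
	// there is no window in which the pod owner can swap in a link.
	src := fmt.Sprintf("/proc/self/fd/%d", fd)
	return unix.Mount(src, safeTarget, "", unix.MS_BIND, "")
}

func main() {
	// Example invocation (requires root and real paths).
	if err := openSubPath("/var/lib/kubelet/vol", "sub1", "/var/lib/kubelet/safe/sub1"); err != nil {
		fmt.Println(err)
	}
}
```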
Making the fix
It took a fair amount of time to get the fix out there; there were lots of end-to-end tests that needed to be developed and run on both Linux and Windows. But, since Kubernetes is developed in the open, how could this fix be developed in secret? The answer is that a separate repository, kubernetes-security, was used. Only the PST can normally access it, but the PST can give temporary access to those working on the fix. Au and Šafránek lost their access after the fix was released; "we have no idea what's going on there now", Šafránek said.
The development and testing process is similar to that of the open kubernetes repository, but the logs of tests and such for kubernetes-security go to a private bucket that only Google employees can access. Šafránek works for Red Hat in Europe, so sometimes he had to wait for Au, who works for Google in the US, to wake up so that he could find out what went wrong for a test run.
The flaw was disclosed to third-party Kubernetes vendors on the closed kubernetes-distributors-announce mailing list one week before it was publicly disclosed. On March 12, 2018, CVE-2017-1002101 was announced, roughly four months after the bug was reported. Kubernetes 1.7, 1.8, and 1.9 were updated and released on that day. The timeline and more can be found in the public post-mortem document.
Au went over some "best practices" for avoiding these kinds of problems. To start with, do not run containers as the root user; containers running as another user only have the same access as that user. That can be enforced by using PodSecurityPolicy, though containers will still run in the root group; the upcoming RunAsGroup feature will address that shortcoming. The security policy can also be used to restrict volume access, though that would not have helped for this particular vulnerability.
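PodSecurityPolicy enforces such restrictions cluster-wide; the corresponding per-pod declaration is a security context. A rough sketch using the same Go client types (the UID is arbitrary):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	uid := int64(1000) // arbitrary unprivileged UID
	nonRoot := true
	// Pod-level security context forcing containers to run as a
	// non-root user; a PodSecurityPolicy can require these settings.
	sc := corev1.PodSecurityContext{
		RunAsUser:    &uid,
		RunAsNonRoot: &nonRoot,
		// RunAsGroup, the feature mentioned in the talk, similarly
		// sets the primary group once it is available.
	}
	fmt.Printf("runAsUser=%d runAsNonRoot=%v\n", *sc.RunAsUser, *sc.RunAsNonRoot)
}
```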
Using sandboxed containers is something that is being investigated for future Kubernetes releases. Using gVisor or Kata Containers will provide another security boundary. That is in keeping with a core principle that there should be at least two security barriers around untrusted code. For this vulnerability, a sandbox could have prevented access to the host filesystem. Au said she expects to see some movement on container sandboxes over the next year or so.
She started her talk summary with a reminder to follow the project's security disclosure process. She also suggested that Kubernetes and other projects be "extra cautious" when handling untrusted paths. Symbolic-link races and TOCTTOU are well-known dangers in path handling. In addition, she recommended setting restrictive security policies and using multiple security boundaries.
In answer to a question, Au said that most of the four months was taken up by development and testing, some of which was slowed down by the end-of-year holiday break. About two weeks were taken up with the actual release process. When asked about what could improve for the next CVE, Šafránek said that getting access to the private logs is important; Au said that it is being worked on. She also pointed to the post-mortem document as a good source for improvement ideas.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Seattle for KubeCon NA.]