Black lives matter.
We stand in solidarity with the Black community.
Racism is unacceptable.
It conflicts with the core values of the Kubernetes project and our community does not tolerate it.
On March 12, 2018, the Kubernetes Product Security team disclosed CVE-2017-1002101, which allowed containers using subpath volume mounts to access files outside of the volume. This means that a container could access any file available on the host, including volumes for other containers that it should not have access to.
The vulnerability has been fixed and released in the latest Kubernetes patch releases. We recommend that all users upgrade to get the fix. For more details on the impact and how to get the fix, please see the announcement. (Note: some functional regressions were found after the initial fix and are being tracked in issue #61563.)
This post presents a technical deep dive on the vulnerability and the solution.
To understand the vulnerability, one must first understand how volume and subpath mounting work in Kubernetes.
Before a container is started on a node, the kubelet volume manager locally mounts all the volumes specified in the PodSpec under a directory for that Pod on the host system. Once all the volumes are successfully mounted, it constructs the list of volume mounts to pass to the container runtime. Each volume mount contains information that the container runtime needs, the most relevant being:
- The path of the volume in the container
- The path of the volume on the host (/var/lib/kubelet/pods/<pod uid>/volumes/<volume type>/<volume name>)
When starting the container, the container runtime creates the path in the container root filesystem, if necessary, and then bind mounts it to the provided host path.
Subpath mounts are passed to the container runtime just like any other volume. The container runtime does not distinguish between a base volume and a subpath volume, and handles them the same way. Instead of passing the host path to the root of the volume, Kubernetes constructs the host path by appending the Pod-specified subpath (a relative path) to the base volume’s host path.
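The host path construction described above can be sketched in a few lines of Python. This is only an illustration of the string handling, not kubelet's actual code (kubelet is written in Go); the function name `subpath_host_path` and the placeholder pod UID `1234` are assumptions made for the example.

```python
import posixpath


def subpath_host_path(pod_uid, volume_type, volume_name, sub_path):
    """Build the host path for a subpath mount by plain path joining.

    Hypothetical helper for illustration only.
    """
    base = posixpath.join(
        "/var/lib/kubelet/pods", pod_uid, "volumes", volume_type, volume_name
    )
    # The relative subpath is simply appended to the base volume's host path.
    # Nothing here resolves symlinks or checks where the result points.
    return posixpath.join(base, sub_path)


host_path = subpath_host_path(
    "1234", "kubernetes.io~empty-dir", "my-volume", "dataset1"
)
print(host_path)
```

The point of the sketch is that the result is an ordinary host path; once it is handed to the container runtime, nothing marks it as a subpath.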
For example, here is a spec for a subpath volume mount:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    <snip>
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: dataset1
  volumes:
  - name: my-volume
    emptyDir: {}
In this example, when the Pod gets scheduled to a node, the system will:
- Set up an EmptyDir volume on the host: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
- Construct the host path for the subpath mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + dataset1
- Pass the following mount information to the container runtime:
  - Container path: /mnt/data
  - Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1
- Have the container runtime bind mount /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1 on the host.
The vulnerability with subpath volumes was discovered by Maxim Ivanov, by making a few observations:
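The basic example below demonstrates the vulnerability. It takes advantage of the observations outlined above by:
- Using an init container to set up the volume with a symlink.
- Using a regular container to mount that symlink as a subpath later.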
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  initContainers:
  - name: prep-symlink
    image: "busybox"
    command: ["/bin/sh", "-ec", "ln -s / /mnt/data/symlink-door"]
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "busybox"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: symlink-door
  volumes:
  - name: my-volume
    emptyDir: {}
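For this example, the system will:
- Set up an EmptyDir volume on the host: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
- Pass the following mount information for the init container to the container runtime:
  - Container path: /mnt/data
  - Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
- Have the container runtime bind mount /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume on the host.
- Run the init container, which creates a symlink inside the container, /mnt/data/symlink-door -> /, and then exits.
- Construct the host path for the regular container's subpath volume mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + symlink-door.
- Pass the following mount information to the container runtime:
  - Container path: /mnt/data
  - Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/symlink-door
- Have the container runtime bind mount /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/symlink-door.
- However, the bind mount resolves symlinks, and in this case the path resolves to / on the host! Now the container can see all of the host’s filesystem through its mount point /mnt/data.
This is a manifestation of a symlink race, where a malicious user program can gain access to sensitive data by causing a privileged program (in this case, kubelet) to follow a user-created symlink.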
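The escape can be reproduced harmlessly in a temporary directory. The sketch below is an assumption-laden stand-in: a temp directory plays the role of the host root, `os.path.realpath()` stands in for the symlink resolution that a bind mount performs on its source, and the helper name `resolved_mount_source` is made up for the example. Nothing is actually mounted.

```python
import os
import tempfile


def resolved_mount_source(volume_dir, sub_path):
    """Return the path that volume_dir/sub_path ends up referring to.

    Hypothetical helper: realpath() follows symlinks, which mimics how
    the source of a bind mount is resolved.
    """
    return os.path.realpath(os.path.join(volume_dir, sub_path))


with tempfile.TemporaryDirectory() as fake_host_root:
    fake_host_root = os.path.realpath(fake_host_root)
    volume_dir = os.path.join(fake_host_root, "my-volume")
    os.mkdir(volume_dir)

    # What the init container does: plant a symlink inside the volume
    # that points outside of it (here, at the fake host root).
    os.symlink(fake_host_root, os.path.join(volume_dir, "symlink-door"))

    source = resolved_mount_source(volume_dir, "symlink-door")
    # The resolved source is no longer underneath the volume directory.
    escaped = os.path.commonpath([source, volume_dir]) != volume_dir
    print("escaped the volume:", escaped)
```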
It should be noted that init containers are not always required for this exploit, depending on the volume type. An init container is used in the EmptyDir example because EmptyDir volumes cannot be shared with other Pods: they are created when a Pod is created and destroyed when the Pod is destroyed. For persistent volume types, this exploit can also be done across two different Pods sharing the same volume.
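The underlying issue is that the host path for a subpath is untrusted and can point anywhere in the system. The fix needs to ensure that this host path is both:
- Resolved and validated to point inside the base volume.
- Not changeable by the user between the time of validation and the time the container runtime bind mounts it.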
The Kubernetes product security team went through many iterations of possible solutions before finally agreeing on a design.
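Our first design was relatively simple. For each subpath mount in each container:
1. Resolve all the symlinks for the subpath.
2. Validate that the resolved path is within the volume.
3. Pass the resolved path to the container runtime.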
However, this design is prone to the classic time-of-check-to-time-of-use (TOCTTOU) problem. In between steps 2) and 3), the user could change the path back to a symlink. The proper solution needs some way to “lock” the path so that it cannot be changed in between validation and bind mounting by the container runtime. All the subsequent ideas use an intermediate bind mount by kubelet to achieve this “lock” step before handing it off to the container runtime. Once a bind mount is performed, the mount source is fixed and cannot be changed.
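To make the race concrete, here is a small Python sketch of a check-then-use validation and of the window it leaves open. It is not the code the team wrote; `validate_subpath` is a hypothetical helper, a temporary directory stands in for the host, and the "attacker" step is simulated in the same process rather than running concurrently.

```python
import os
import tempfile


def validate_subpath(volume_dir, sub_path):
    """Resolve symlinks, then check that the result is inside the volume.

    Hypothetical check-then-use validation, for illustration only.
    """
    volume_dir = os.path.realpath(volume_dir)
    resolved = os.path.realpath(os.path.join(volume_dir, sub_path))
    if os.path.commonpath([resolved, volume_dir]) != volume_dir:
        raise ValueError(f"subpath {sub_path!r} escapes the volume")
    return resolved


with tempfile.TemporaryDirectory() as fake_host_root:
    fake_host_root = os.path.realpath(fake_host_root)
    volume_dir = os.path.join(fake_host_root, "my-volume")
    os.makedirs(os.path.join(volume_dir, "dataset1"))

    # Time of check: the subpath is a plain directory, so validation passes.
    checked = validate_subpath(volume_dir, "dataset1")

    # Race window: the user swaps the directory for a symlink.
    os.rmdir(checked)
    os.symlink(fake_host_root, checked)

    # Time of use: the already-validated path now leads outside the volume.
    used = os.path.realpath(checked)
    race_won = os.path.commonpath([used, volume_dir]) != volume_dir
    print("validated path now escapes:", race_won)
```

The validation itself is correct at the moment it runs; the problem is that the path is just a name, and what it refers to can change before the container runtime acts on it.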
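We went a bit wild with this idea:
1. Create a working directory under the kubelet’s pod directory. Let’s call it dir1.
2. Bind mount the base volume under the working directory, at dir1/volume.
3. Chroot to the working directory dir1.
4. Inside the chroot, bind mount volume/subpath to subpath. This ensures that any symlinks get resolved to inside the chroot environment.
5. Exit the chroot.
6. On the host again, pass the bind mounted dir1/subpath to the container runtime.
While this design does ensure that the symlinks cannot point outside of the volume, it was ultimately rejected due to difficulties of implementing the chroot mechanism in 4) across all the various distros and environments that Kubernetes has to support, including containerized kubelets.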
Coming back to earth a little bit, our next idea was to:
In theory, this sounded pretty simple, but in reality, 2) was quite difficult to implement correctly. Many scenarios had to be handled where volumes (like EmptyDir) could be on a shared filesystem, on a separate filesystem, on the root filesystem, or not on the root filesystem. NFS volumes ended up handling all bind mounts as a separate mount, instead of as a child to the base volume. There was additional uncertainty about how out-of-tree volume types (that we couldn’t test) would behave.
Given the number of scenarios and corner cases that had to be handled with the previous design, we really wanted to find a solution that was more generic across all volume types. The final design that we ultimately went with was to:
- Starting with the base volume, open each path segment of the subpath one by one using the openat() syscall, and disallow symlinks. With each path segment, validate that the current path is within the base volume.
- Bind mount /proc/<kubelet pid>/fd/<final fd> to a working directory under the kubelet’s pod directory. The proc file is a link to the opened file. If that file gets replaced while kubelet still has it open, then the link will still point to the original file.
- Pass the bind mount to the container runtime.
Note that this solution is different for Windows hosts, where the mounting semantics are different from those on Linux. In Windows, the design is to:
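Both solutions are able to address all the requirements of:
- Resolving the subpath and validating that it points to a path inside the base volume.
- Ensuring that the subpath host path cannot be changed between the time of validation and the time the container runtime bind mounts it.
- Being generic enough to support all volume types.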
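For the Linux design, the segment-by-segment walk with openat() can be sketched in Python, where `os.open()` with the `dir_fd` argument uses openat() underneath. This is a simplified illustration and not the kubelet implementation: the function name `open_subpath_nofollow` is hypothetical, the sketch rejects `..` outright instead of validating each intermediate path against the base volume, and it only builds the /proc path string rather than performing a bind mount. It assumes a POSIX system.

```python
import os
import tempfile


def open_subpath_nofollow(volume_dir, sub_path):
    """Open sub_path under volume_dir one segment at a time, refusing symlinks.

    Hypothetical helper. Returns an open file descriptor that the caller
    must close.
    """
    current_fd = os.open(volume_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for segment in sub_path.split("/"):
            if segment in ("", "."):
                continue
            if segment == "..":
                raise ValueError("'..' is not allowed in this sketch")
            # dir_fd makes this an openat() relative to the previous segment;
            # O_NOFOLLOW makes the call fail if the segment is a symlink.
            next_fd = os.open(
                segment, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=current_fd
            )
            os.close(current_fd)
            current_fd = next_fd
        return current_fd
    except BaseException:
        os.close(current_fd)
        raise


with tempfile.TemporaryDirectory() as fake_host_root:
    volume_dir = os.path.join(fake_host_root, "my-volume")
    os.makedirs(os.path.join(volume_dir, "dataset1"))
    os.symlink("/", os.path.join(volume_dir, "symlink-door"))

    fd = open_subpath_nofollow(volume_dir, "dataset1")
    # On Linux, this is the path that would be bind mounted while fd is open.
    proc_path = f"/proc/{os.getpid()}/fd/{fd}"
    os.close(fd)

    try:
        os.close(open_subpath_nofollow(volume_dir, "symlink-door"))
        symlink_rejected = False
    except OSError:
        symlink_rejected = True
    print("symlink subpath rejected:", symlink_rejected)
```

Holding the file descriptor is what provides the "lock": the descriptor refers to the opened file itself, so renaming or replacing the path afterwards does not change what the /proc link points to.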
Special thanks to many folks involved with handling this vulnerability:
If you find a vulnerability in Kubernetes, please follow our responsible disclosure process and let us know; we want to do our best to make Kubernetes secure for all users.
-- Michelle Au, Software Engineer, Google; and Jan Šafránek, Software Engineer, Red Hat