1 of 13

User namespaces in k8s

Rodrigo Campos, @rata

2 of 13

User namespaces KEP

  • We started a KEP long ago
  • Incorporated the feedback in this new proposal
  • Want to share this new proposal (next slides)
  • Coordinate the next steps: iterate on the proposal or agree on it

3 of 13

Proposal - pod.spec changes

  • Names are preliminary; I’m using concrete field names only to simplify the explanation.
  • Pod spec changes:
    • pod.spec.useHostUsers: bool.
      • If true or not present, the pod uses the host user namespace (as today)
      • If false, a new userns is created for the pod.
      • This field will be used for phase 1, 2 and 3.
    • pod.spec.securityContext.userns.pod2podIsolation: bool/Enum (TBD)
      • If enabled, we will make the userns mappings be non-overlapping as much as possible.
      • This field will be used in phase 3.
  • Work divided in phases
    • Each phase makes this work with either more isolation or more workloads
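The opt-in semantics of the first field can be sketched as a tiny decision helper (a hypothetical function, not actual kubelet code; it only encodes the defaulting rule described above):

```python
# Sketch of the pod.spec.useHostUsers semantics described above.
# Hypothetical helper; the field defaulting is from the slide, not real code.

def creates_user_namespace(use_host_users):
    """Return True if the pod gets its own user namespace."""
    if use_host_users is None:    # field not present -> host userns (as today)
        return False
    return not use_host_users     # False -> a new userns is created for the pod

print(creates_user_namespace(None))   # -> False
print(creates_user_namespace(False))  # -> True
```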

4 of 13

Phase 1 - pods “without” volumes

  • Opt-in with: pod.spec.useHostUsers
  • Volumes supported: emptyDir, configmap, secret, downward API, projected
  • Mapping length: 64k
  • Mapping overlap: no - more isolation!
  • Works for most workloads: no, only workloads with these volume types
  • Mitigates known vulns: yes, all!

Phase 2 - pods with volumes

  • Opt-in with: pod.spec.useHostUsers
  • Volumes supported: any
  • Mapping length: 64k
  • Mapping overlap: yes, with any pods with volumes
  • Works for most workloads: yes
  • Mitigates known vulns: yes, all!
  • Notes: similar to phase 1, but we just return a fixed mapping (the same one) for pods with volumes

Phase 3 - pods with volumes and inter-pod isolation

  • Opt-in with: pod.spec.useHostUsers + pod.spec.securityContext.userns.pod2podIsolation
  • Volumes supported: any
  • Mapping length: < 64k (~4k?)
  • Mapping overlap: yes, but only within the same k8s namespace or Service Account (SA); same ns/SA → same mapping
  • Works for most workloads: no; only pods that need a small number of UIDs, and volumes can’t be shared between different k8s namespaces or SAs (maybe not very common today?)
  • Mitigates known vulns: yes, all!
  • Notes: uses heuristics to guess which UIDs to map; TBD if we will use per-SA or per-ns mappings. An improvement over phase 2, but with restrictions on the workloads that can use it

5 of 13

Summary

  • This new proposal tries to incorporate the feedback that came in the PR
  • What can we do to agree on the high-level idea, or iterate on it?

List of vulns, referenced in the previous slide, that userns makes not applicable or partially mitigates:

  • Azurescape: this vuln would not be possible as-is (it needs a runc breakout). It was the first cross-account container takeover in the public cloud
  • CVE-2021-25741: very recent kube vuln. Mitigated, as root in the container is not root on the host
  • CVE-2017-1002101: mitigated, same reason
  • CVE-2021-30465: mitigated, same reason
  • CVE-2016-8867: mitigated, same reason
  • CVE-2018-15664: mitigated, same reason
  • CVE-2019-5736: runc breakout. Not possible to exploit with userns in any phase proposed here

6 of 13

Thanks!

7 of 13

BACKLOG

8 of 13

Intro

  • What are user namespaces (userns)?
    • UID/GID in the container are mapped to different UID/GID in the host.
    • For simplicity, I will use UID. Something analogous is true for GID.
    • Create /proc/[pid]/uid_map with: container_id, host_id, len
      • E.g. 0, 1000, 1
    • Inside the container we are root (0), but outside of the container we are UID 1000
  • Why?
    • Security is greatly improved
      • Outside the userns, the capabilities don’t apply (DAC_OVERRIDE, etc. don’t apply!)
      • Outside the userns, the process can be an unprivileged user (so a container breakout lands unprivileged)
      • Some resources are not namespace-scoped
        • Pods can’t access those resources, no matter what capabilities they have inside a userns
    • Several vulns would have been unexploitable or mitigated
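The uid_map semantics above can be made concrete with a small translation helper (a sketch of what the kernel does when it applies a mapping; function name and sample mappings are made up):

```python
# Translate a container UID to the host UID using uid_map entries of the
# form (container_id, host_id, length), as written to /proc/[pid]/uid_map.
# Sketch only; the real translation is done by the kernel.

def to_host_uid(container_uid, uid_map):
    for container_id, host_id, length in uid_map:
        if container_id <= container_uid < container_id + length:
            return host_id + (container_uid - container_id)
    raise ValueError(f"UID {container_uid} is not mapped (unusable)")

# The example from the slide: container root (0) is UID 1000 on the host.
print(to_host_uid(0, [(0, 1000, 1)]))  # -> 1000
```

UIDs outside every mapped range raise here, mirroring the point later in the deck that unmapped UIDs are unusable.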

9 of 13

Goals and challenges

  • Goals
    • We want pods to use non-overlapping mappings whenever possible (more isolation)
      • IOW, pod A uses host UIDs 1000-1999, pod B uses host UIDs 2000-2999
    • Pods should be able to share volumes
    • Impose low restrictions, so as many users as possible can use userns
  • Challenges
    • Volumes
      • A file’s owner UID on the host is not runAsUser but the effective host UID (it varies with the mapping)
      • ⇒ Sharing volumes needs coordination between pods to use the same EUID
    • Non-overlapping mappings and sharing files contradict each other today
      • Mappings need to overlap to share volumes
      • Leaving id mapped mounts aside, as we can’t use that yet
    • Host has 2^32 UIDs possible
      • Today a process can use the whole range (2^32); POSIX expects ~2^16 (64k)
      • We need to give less than that to containers, otherwise they overlap
      • UIDs not mapped (/proc/[pid]/uid_map) are unusable!
      • Most tools today give 2^16 per userns
      • We can give less, trying to guess how many the container will use
    • Some others too, but let’s focus on these here

10 of 13

Challenges

  • Volumes
    • With userns, a process’s effective user ID (EUID) is not the one seen inside the container
    • To calculate the EUID we need the mapping
    • Files created by a container are owned by the EUID (seen by the host)
    • ⇒ Sharing volumes needs coordination between pods to use the same EUID
  • We want pods to use non-overlapping mappings whenever possible (more isolation)
    • IOW, pod A uses hosts UIDs 1000-1999, pod B uses host UIDs 2000-2999
  • Non-overlapping mappings and sharing files contradict each other today
    • Leaving id mapped mounts aside, as we can’t use that yet
  • Host has 2^32 UIDs possible
    • Today a process can use the whole range (2^32); POSIX expects 2^16 (64k)
    • We need to give less than that to containers, otherwise they overlap
    • UIDs not mapped (/proc/[pid]/uid_map) are unusable!
    • Most tools today give 2^16
    • We can give less, trying to guess how many the container will use
  • Some others too, but let’s focus on these here
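The volume-sharing conflict above can be made concrete: with non-overlapping mappings, a file created as container UID 0 by pod A is owned on the host by a UID that pod B cannot map back. A sketch with assumed mappings (names and values are illustrative, not from the KEP):

```python
# Two pods with non-overlapping (container_id, host_id, length) mappings.
pod_a = (0, 1000, 1000)   # container 0..999 -> host 1000..1999
pod_b = (0, 2000, 1000)   # container 0..999 -> host 2000..2999

def host_owner(container_uid, mapping):
    """Host UID that owns a file created by this container UID."""
    c0, h0, length = mapping
    assert c0 <= container_uid < c0 + length
    return h0 + (container_uid - c0)

def seen_in_container(host_uid, mapping):
    """Container UID a host-owned file appears as, or None if unmapped."""
    c0, h0, length = mapping
    if h0 <= host_uid < h0 + length:
        return c0 + (host_uid - h0)
    return None  # unmapped: the file shows up as the overflow UID (nobody)

file_owner = host_owner(0, pod_a)            # pod A's root creates a file
print(file_owner)                            # -> 1000 on the host
print(seen_in_container(file_owner, pod_b))  # -> None: pod B can't use it
```

This is exactly why phases 2 and 3 fall back to overlapping (shared) mappings for pods that share volumes.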

11 of 13

Proposal - phases

  • Phase 1 - pods “without” volumes
    • Used on pods with userns activated (pod.spec.useHostUsers=false)
    • Pod is using any of these volume types:
      • configmaps, secrets, downward API, projected, emptyDir
    • If any other type is used, this mode won't be used
    • Pods use non-overlapping mappings, chosen by the kubelet
    • 64K length, so no guessing needed.
  • Phase 2 - pods with volumes
    • Used on pods with userns activated (pod.spec.useHostUsers=false)
    • If the pod has any volume that is not the ones in phase 1, then this mode is automatically used instead.
      • Using something other than: configmaps, secrets, downward API, projected, emptyDir
    • 64K length mapping (no guessing needed), chosen by the kubelet
    • The mapping will be the same for all pods in this category.
      • Sharing files works
  • Phase 3 - pods with volume and inter-pod isolation
    • If Userns and pod2podIsolation fields are activated
    • Same mapping for pods in the same k8s ns or k8s service account. Heuristics TBD
    • Extra isolation for pods that can work with some restrictions
      • Need pods to use < 64k UIDs (might not work with all pods). We run out of UIDs cluster-wide otherwise
      • Need to limit sharing volumes per-sa or per-ns (NFS might be used across ns today)
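The phase selection above can be sketched as one helper (hypothetical function and return strings; the volume-type set uses the standard k8s volume field names for the types listed on the slide):

```python
# Sketch of choosing the userns mode for a pod per the phases above.
# Hypothetical helper, not actual kubelet logic.

PHASE1_VOLUMES = {"configMap", "secret", "downwardAPI", "projected", "emptyDir"}

def userns_mode(use_host_users, volume_types, pod2pod_isolation=False):
    if use_host_users in (None, True):
        return "host userns (as today)"
    if pod2pod_isolation:
        return "phase 3: per-ns/per-SA mapping, < 64k UIDs"
    if set(volume_types) <= PHASE1_VOLUMES:
        return "phase 1: non-overlapping 64k mapping"
    return "phase 2: shared fixed 64k mapping"

print(userns_mode(False, ["emptyDir", "secret"]))     # phase 1
print(userns_mode(False, ["persistentVolumeClaim"]))  # phase 2
```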

12 of 13

Intro

  • Volumes
    • With userns, a process’s effective UID (EUID) is not the one seen inside the container
    • To calculate the EUID we need the mapping in the userns
    • For simplicity, I will use UID. Something analogous is true for GID.
    • Create /proc/[pid]/uid_map with: container_id, host_id, len
      • E.g. 0, 1000, 1
    • Inside the container we are root (0), but outside of the container we are UID 1000
  • Why?
    • Security is greatly improved
      • Outside the userns, the capabilities don’t apply (DAC_OVERRIDE, etc. don’t apply!)
      • Outside the userns, the process can be an unprivileged user (so a container breakout lands unprivileged)
      • Some resources are not namespace-scoped
        • Pods can’t access those resources, no matter what capabilities they have inside a userns
    • Several vulns would have been unexploitable or mitigated

13 of 13

Proposal

Pod.spec changes:

  • pod.spec.useHostUsers: bool.
    • If true or not present, uses the host user namespace (as today). If false, a new userns is created for the pod. This field will be used for phases 1, 2 and 3 (more below)
  • pod.spec.securityContext.userns.pod2podIsolation: bool.
    • If enabled, we will make the userns mappings be non-overlapping. This field will be used in phase 3.

Work:

  • Phase 1 - pods “without” volumes
    • Used on pods with userns bool enabled in pod.spec and using any of these volume types: configmaps, secrets, downward API, projected, emptyDir. If any other type is used, this mode won't be used.
    • Pods use non-overlapping mappings, chosen by the kubelet
    • 64K length, so no guessing needed.
  • Phase 2 - pods with volumes
    • Used for pods with the userns bool enabled in the pod spec; if the pod has any volume type other than the phase 1 ones (configmaps, secrets, downward API, projected, emptyDir), this mode is automatically used instead.
    • 64K length mapping (no guessing needed), chosen by the kubelet
    • The mapping will be the same for all pods in this category.
  • Phase 3 - pods with volumes and inter-pod isolation
    • Activated when the “pod2podIsolation” field in the pod.spec is set to true and userns is activated too
    • Per-ns or per-SA mappings with far fewer than 64k UIDs exposed. Heuristics TBD