1 of 13

User namespaces in k8s

Rodrigo Campos, @rata

2 of 13

User namespaces KEP

  • We started a KEP long ago
  • Incorporated the feedback in this new proposal
  • Want to share this new proposal (next slides)
  • Coordinate the next steps: iterate on the proposal or agree on it

3 of 13

Proposal - pod.spec changes

  • Names are preliminary; I’m using concrete field names only to simplify the explanation.
  • Pod spec changes:
    • pod.spec.useHostUsers: bool.
      • If true or not present, the pod uses the host user namespace (as today)
      • If false, a new userns is created for the pod.
      • This field will be used for phase 1, 2 and 3.
    • pod.spec.securityContext.userns.pod2podIsolation: bool/Enum (TBD)
      • If enabled, we will make the userns mappings be non-overlapping as much as possible.
      • This field will be used in phase 3.
  • Work divided in phases
    • Each phase makes this work with either more isolation or more workloads
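The opt-in semantics of the first field can be sketched as a tiny decision helper (a hypothetical function, not actual kubelet code; it only encodes the defaulting rule described above):

```python
# Sketch of the pod.spec.useHostUsers semantics described above.
# Hypothetical helper; the field defaulting is from the slide, not real code.

def creates_user_namespace(use_host_users):
    """Return True if the pod gets its own user namespace."""
    if use_host_users is None:    # field not present -> host userns (as today)
        return False
    return not use_host_users     # False -> a new userns is created for the pod

print(creates_user_namespace(None))   # -> False
print(creates_user_namespace(False))  # -> True
```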

4 of 13

Phase 1 - pods “without” volumes

  • Opt-in with: pod.spec.useHostUsers
  • Volumes supported: emptyDir, configmap, secret, downward API, projected
  • Mapping length: 64k
  • Mapping overlap: no - more isolation!
  • Works for most workloads: no, only workloads with these volume types
  • Mitigates known vulns: yes, all!

Phase 2 - pods with volumes

  • Opt-in with: pod.spec.useHostUsers
  • Volumes supported: any
  • Mapping length: 64k
  • Mapping overlap: yes, with any pods with volumes
  • Works for most workloads: yes
  • Mitigates known vulns: yes, all!
  • Notes: similar to phase 1, but we just return a fixed mapping (the same one) for pods with volumes

Phase 3 - pods with volumes and inter-pod isolation

  • Opt-in with: pod.spec.useHostUsers + pod.spec.securityContext.userns.pod2podIsolation
  • Volumes supported: any
  • Mapping length: < 64k (~4k?)
  • Mapping overlap: yes, but only within the same k8s namespace or Service Account (SA); same ns/SA → same mapping
  • Works for most workloads: no; only pods that need a small number of UIDs, and volumes can’t be shared between different k8s namespaces or SAs (maybe not very common today?)
  • Mitigates known vulns: yes, all!
  • Notes: uses heuristics to guess which UIDs to map; TBD if we will use per-SA or per-ns mappings. An improvement over phase 2, but with restrictions on the workloads that can use it

5 of 13

Summary

  • This new proposal tries to incorporate the feedback that came in the PR
  • What can we do to agree on the high-level idea, or iterate on it?

List of vulns, referenced in the previous slide, that userns makes not applicable or partially mitigates:

  • Azurescape: this vuln would not be possible as-is (it needs a runc breakout). It was the first cross-account container takeover in the public cloud
  • CVE-2021-25741: very recent kube vuln. Mitigated, as root in the container is not root on the host
  • CVE-2017-1002101: mitigated, same reason
  • CVE-2021-30465: mitigated, same reason
  • CVE-2016-8867: mitigated, same reason
  • CVE-2018-15664: mitigated, same reason
  • CVE-2019-5736: runc breakout. Not possible to exploit with userns in any phase proposed here

6 of 13

Thanks!

7 of 13

BACKLOG

8 of 13

Intro

  • What are user namespaces (userns)?
    • UID/GID in the container are mapped to different UID/GID in the host.
    • For simplicity, I will use UID. Something analogous is true for GID.
    • Create /proc/[pid]/uid_map with: container_id, host_id, len
      • E.g. 0, 1000, 1
    • Inside the container we are root (0), but outside of the container we are UID 1000
  • Why?
    • Security is greatly improved
      • Outside the userns, the capabilities don’t apply (DAC_OVERRIDE, etc. don’t apply!)
      • Outside the userns, the process can be an unprivileged user (so a container breakout lands unprivileged)
      • Some resources are not namespace-scoped
        • Pods can’t access those resources, no matter what capabilities they have inside a userns
    • Several vulns would have been unexploitable or mitigated
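The uid_map semantics above can be made concrete with a small translation helper (a sketch of what the kernel does when it applies a mapping; function name and sample mappings are made up):

```python
# Translate a container UID to the host UID using uid_map entries of the
# form (container_id, host_id, length), as written to /proc/[pid]/uid_map.
# Sketch only; the real translation is done by the kernel.

def to_host_uid(container_uid, uid_map):
    for container_id, host_id, length in uid_map:
        if container_id <= container_uid < container_id + length:
            return host_id + (container_uid - container_id)
    raise ValueError(f"UID {container_uid} is not mapped (unusable)")

# The example from the slide: container root (0) is UID 1000 on the host.
print(to_host_uid(0, [(0, 1000, 1)]))  # -> 1000
```

UIDs outside every mapped range raise here, mirroring the point later in the deck that unmapped UIDs are unusable.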

9 of 13

Goals and challenges

  • Goals
    • We want pods to use non-overlapping mappings whenever possible (more isolation)
      • IOW, pod A uses host UIDs 1000-1999, pod B uses host UIDs 2000-2999
    • Pods should be able to share volumes
    • Impose low restrictions, so as many users as possible can use userns
  • Challenges
    • Volumes
      • A file’s owner UID on the host is not runAsUser but the effective host UID (it varies with the mapping)
      • ⇒ Sharing volumes needs coordination between pods to use the same EUID
    • Non-overlapping mappings and sharing files contradict each other today
      • Mappings need to overlap to share volumes
      • Leaving id mapped mounts aside, as we can’t use that yet
    • Host has 2^32 UIDs possible
      • Today a process can use the whole range (2^32); POSIX expects ~2^16 (64k)
      • We need to give less than that to containers, otherwise they overlap
      • UIDs not mapped (/proc/[pid]/uid_map) are unusable!
      • Most tools today give 2^16 per userns
      • We can give less, trying to guess how many the container will use
    • Some others too, but let’s focus on these here

10 of 13

Challenges

  • Volumes
    • With userns, a process’s effective user ID (EUID) is not the one seen inside the container
    • To calculate the EUID we need the mapping
    • Files created by a container are owned by the EUID (seen by the host)
    • ⇒ Sharing volumes needs coordination between pods to use the same EUID
  • We want pods to use non-overlapping mappings whenever possible (more isolation)
    • IOW, pod A uses hosts UIDs 1000-1999, pod B uses host UIDs 2000-2999
  • Non-overlapping mappings and sharing files contradict each other today
    • Leaving id mapped mounts aside, as we can’t use that yet
  • Host has 2^32 UIDs possible
    • Today a process can use the whole range (2^32); POSIX expects 2^16 (64k)
    • We need to give less than that to containers, otherwise they overlap
    • UIDs not mapped (/proc/[pid]/uid_map) are unusable!
    • Most tools today give 2^16
    • We can give less, trying to guess how many the container will use
  • Some others too, but let’s focus on these here
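The volume-sharing conflict above can be made concrete: with non-overlapping mappings, a file created as container UID 0 by pod A is owned on the host by a UID that pod B cannot map back. A sketch with assumed mappings (names and values are illustrative, not from the KEP):

```python
# Two pods with non-overlapping (container_id, host_id, length) mappings.
pod_a = (0, 1000, 1000)   # container 0..999 -> host 1000..1999
pod_b = (0, 2000, 1000)   # container 0..999 -> host 2000..2999

def host_owner(container_uid, mapping):
    """Host UID that owns a file created by this container UID."""
    c0, h0, length = mapping
    assert c0 <= container_uid < c0 + length
    return h0 + (container_uid - c0)

def seen_in_container(host_uid, mapping):
    """Container UID a host-owned file appears as, or None if unmapped."""
    c0, h0, length = mapping
    if h0 <= host_uid < h0 + length:
        return c0 + (host_uid - h0)
    return None  # unmapped: the file shows up as the overflow UID (nobody)

file_owner = host_owner(0, pod_a)            # pod A's root creates a file
print(file_owner)                            # -> 1000 on the host
print(seen_in_container(file_owner, pod_b))  # -> None: pod B can't use it
```

This is exactly why phases 2 and 3 fall back to overlapping (shared) mappings for pods that share volumes.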

11 of 13

Proposal - phases

  • Phase 1 - pods “without” volumes
    • Used on pods with userns activated (pod.spec.useHostUsers=false)
    • Pod is using any of these volume types:
      • configmaps, secrets, downward API, projected, emptyDir
    • If any other type is used, this mode won't be used
    • Pods use non-overlapping mappings, chosen by the kubelet
    • 64K length, so no guessing needed.
  • Phase 2 - pods with volumes
    • Used on pods with userns activated (pod.spec.useHostUsers=false)
    • If the pod has any volume that is not the ones in phase 1, then this mode is automatically used instead.
      • Using something other than: configmaps, secrets, downward API, projected, emptyDir
    • 64K length mapping (no guessing needed), chosen by the kubelet
    • The mapping will be the same for all pods in this category.
      • Sharing files works
  • Phase 3 - pods with volume and inter-pod isolation
    • If Userns and pod2podIsolation fields are activated
    • Same mapping for pods in the same k8s ns or k8s service account. Heuristics TBD
    • Extra isolation for pods that can work with some restrictions
      • Need pods to use < 64k UIDs (might not work with all pods). We run out of UIDs cluster-wide otherwise
      • Need to limit sharing volumes per-sa or per-ns (NFS might be used across ns today)
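The phase selection above can be sketched as one helper (hypothetical function and return strings; the volume-type set uses the standard k8s volume field names for the types listed on the slide):

```python
# Sketch of choosing the userns mode for a pod per the phases above.
# Hypothetical helper, not actual kubelet logic.

PHASE1_VOLUMES = {"configMap", "secret", "downwardAPI", "projected", "emptyDir"}

def userns_mode(use_host_users, volume_types, pod2pod_isolation=False):
    if use_host_users in (None, True):
        return "host userns (as today)"
    if pod2pod_isolation:
        return "phase 3: per-ns/per-SA mapping, < 64k UIDs"
    if set(volume_types) <= PHASE1_VOLUMES:
        return "phase 1: non-overlapping 64k mapping"
    return "phase 2: shared fixed 64k mapping"

print(userns_mode(False, ["emptyDir", "secret"]))     # phase 1
print(userns_mode(False, ["persistentVolumeClaim"]))  # phase 2
```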

12 of 13

Intro

  • Volumes
    • With userns, a process’s effective UID (EUID) is not the one seen inside the container
    • To calculate the EUID we need the mapping in the userns
    • For simplicity, I will use UID. Something analogous is true for GID.
    • Create /proc/[pid]/uid_map with: container_id, host_id, len
      • E.g. 0, 1000, 1
    • Inside the container we are root (0), but outside of the container we are UID 1000
  • Why?
    • Security is greatly improved
      • Outside the userns, the capabilities don’t apply (DAC_OVERRIDE, etc. don’t apply!)
      • Outside the userns, the process can be an unprivileged user (so a container breakout lands unprivileged)
      • Some resources are not namespace-scoped
        • Pods can’t access those resources, no matter what capabilities they have inside a userns
    • Several vulns would have been unexploitable or mitigated

13 of 13

Proposal

Pod.spec changes:

  • pod.spec.useHostUsers: bool.
    • If true or not present, uses the host user namespace (as today). If false, a new userns is created for the pod. This field will be used for phases 1, 2 and 3 (more below)
  • pod.spec.securityContext.userns.pod2podIsolation: bool.
    • If enabled, we will make the userns mappings be non-overlapping. This field will be used in phase 3.

Work:

  • Phase 1 - pods “without” volumes
    • Used on pods with userns bool enabled in pod.spec and using any of these volume types: configmaps, secrets, downward API, projected, emptyDir. If any other type is used, this mode won't be used.
    • Pods use non-overlapping mappings, chosen by the kubelet
    • 64K length, so no guessing needed.
  • Phase 2 - pods with volumes
    • Used for pods with the userns bool enabled in the pod spec; if the pod has any volume type other than the phase 1 ones (configmaps, secrets, downward API, projected, emptyDir), this mode is automatically used instead.
    • 64K length mapping (no guessing needed), chosen by the kubelet
    • The mapping will be the same for all pods in this category.
  • Phase 3 - pods with volumes and inter-pod isolation
    • Activated when the “pod2podIsolation” field in the pod.spec is set to true and userns is activated too
    • Per-ns or per-SA mappings with far fewer than 64k UIDs exposed. Heuristics TBD