1 of 20

Linux Containers

Basic Concepts

Lucian Carata

FRESCO Talklet, 3 Oct 2014

2 of 20

Underlying kernel mechanisms

cgroups

namespaces

seccomp

capabilities

CRIU

manage resources for groups of processes

per process resource isolation

limit available system calls

limit available privileges

checkpoint/restore (with kernel support)
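Two of these mechanisms can be observed per process through /proc/<pid>/status: the CapEff line holds the effective capability set as a hexadecimal bitmask, and the Seccomp line holds the seccomp mode. A minimal Python sketch that decodes both from status text follows. The helper name parse_status and the deliberately shortened capability table are assumptions made for illustration, not part of the talk.

```python
# Sketch: decode capability and seccomp state from /proc/<pid>/status text.
# Only three capability bits are listed here (an illustrative subset);
# the full table lives in linux/capability.h.
CAP_BITS = {0: "CAP_CHOWN", 12: "CAP_NET_ADMIN", 21: "CAP_SYS_ADMIN"}
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(text):
    """Return (set of known effective capabilities, seccomp mode name)."""
    caps, seccomp = set(), None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "CapEff":
            mask = int(value, 16)
            caps = {name for bit, name in CAP_BITS.items() if (mask >> bit) & 1}
        elif key == "Seccomp":
            seccomp = SECCOMP_MODES.get(int(value), "unknown")
    return caps, seccomp
```

On a Linux machine, parse_status(open("/proc/self/status").read()) inspects the current process. Inside a container the capability set is typically smaller than for a root shell on the host.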

3 of 20

cgroups - user space view

low-level filesystem interface similar to sysfs (/sys) and procfs (/proc)

new filesystem type “cgroup”, default location in /sys/fs/cgroup

cgroup hierarchies

subsystems (controllers)

cpuset

cpu

cpuacct

memory

devices

blkio

net_cls

freezer

hugetlb

perf_event

net_prio

built as kernel module

/sys/fs/cgroup
    /cpu   [TL; subsystems: cpu, cpuacct]
        /high-priority
        /normal
        /experiment_1
    /mem   [TL; subsystem: memory]
        /opus
        /experiment_1
        /normal

TL = top level cgroup (mount)

each subsystem can be used at most once*
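From user space, the hierarchy membership of a process can be read from /proc/<pid>/cgroup, where each line has the form hierarchy-ID:subsystems:path. The sketch below parses that text into a mapping from subsystem to cgroup path. The function name parse_proc_cgroup is an assumption for illustration, and the sample paths in the usage note mirror the example tree rather than a real system.

```python
def parse_proc_cgroup(text):
    """Map each subsystem name to the cgroup path the process is in.

    Because a subsystem is attached to at most one hierarchy, a plain
    dict keyed by subsystem name is enough to hold the result.
    """
    membership = {}
    for line in text.splitlines():
        if not line:
            continue
        # Format: <hierarchy id>:<comma-separated subsystems>:<path>
        _hierarchy_id, subsystems, path = line.split(":", 2)
        for subsystem in subsystems.split(","):
            if subsystem:
                membership[subsystem] = path
    return membership
```

For the text "3:cpu,cpuacct:/high-priority" followed by "2:memory:/opus", the result maps cpu and cpuacct to /high-priority and memory to /opus.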

4 of 20

cgroups - user space view

files in a cgroup of the /cpu hierarchy (subsystems: cpu, cpuacct)

common
  • tasks
  • cgroup.procs
  • release_agent
  • notify_on_release
  • cgroup.clone_children
  • cgroup.sane_behavior

cpuacct
  • cpuacct.stat
  • cpuacct.usage
  • cpuacct.usage_percpu

cpu
  • cpu.stat
  • cpu.shares
  • cpu.cfs_period_us
  • cpu.cfs_quota_us
  • cpu.rt_period_us
  • cpu.rt_runtime_us
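Since the interface is a filesystem, managing a cgroup amounts to creating directories and writing values into the files listed on this slide. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the hierarchy root is passed in as a parameter, and on a real system the calls need root privileges and a mounted cpu hierarchy such as /sys/fs/cgroup/cpu.

```python
import os


def create_cgroup(hierarchy_root, name):
    """Create a child cgroup; on a cgroup mount the kernel fills in its files."""
    path = os.path.join(hierarchy_root, name)
    os.makedirs(path, exist_ok=True)
    return path


def write_value(cgroup_dir, filename, value):
    """Write one value into a control file of the cgroup."""
    with open(os.path.join(cgroup_dir, filename), "w") as f:
        f.write(str(value))


def limit_cpu(cgroup_dir, cpus, period_us=100000):
    """Cap the group at `cpus` CPUs' worth of run time per CFS period."""
    quota_us = int(cpus * period_us)
    write_value(cgroup_dir, "cpu.cfs_period_us", period_us)
    write_value(cgroup_dir, "cpu.cfs_quota_us", quota_us)
    return quota_us


def attach_task(cgroup_dir, pid):
    """Move one task into the cgroup by writing its id to the tasks file."""
    write_value(cgroup_dir, "tasks", pid)
```

With these helpers, limit_cpu(create_cgroup("/sys/fs/cgroup/cpu", "experiment_1"), 0.5) would restrict the group to half a CPU: a quota of 50000 us in every 100000 us period.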

5 of 20

cgroups - kernel space view

task_struct (several tasks may point to the same css_set)
  • css_set *cgroups
  • list_head cg_list

css_set (include/linux/cgroup.h)
  • list_head tasks: list of all tasks using the same css_set
  • cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT]

kernel code for attaching/detaching a task to/from a css_set: init/main.c, fork(), exit()
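The sharing relationship between tasks and css_set objects can be illustrated with a toy model. This is a Python sketch of the idea only, not kernel code: the class names Task, CssSet and Registry are assumptions, and the dictionary lookup stands in for however the kernel finds an existing css_set.

```python
class CssSet:
    """Toy stand-in for css_set: one object per distinct cgroup combination."""

    def __init__(self, subsys):
        self.subsys = subsys  # e.g. {"cpu": "/normal", "memory": "/opus"}
        self.tasks = []       # mirrors list_head tasks


class Task:
    """Toy stand-in for task_struct."""

    def __init__(self, pid):
        self.pid = pid
        self.cgroups = None   # mirrors css_set *cgroups


class Registry:
    """Hands out one shared CssSet per distinct set of cgroup memberships."""

    def __init__(self):
        self._sets = {}

    def attach(self, task, subsys):
        key = tuple(sorted(subsys.items()))
        if key not in self._sets:
            self._sets[key] = CssSet(dict(subsys))
        css = self._sets[key]
        if task.cgroups is not None:
            task.cgroups.tasks.remove(task)  # detach from the old css_set
        css.tasks.append(task)
        task.cgroups = css
        return css
```

Two tasks attached with the same memberships end up pointing at the same CssSet object, so per-group state is stored once rather than once per task.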

6 of 20

cgroups - kernel space view

task_struct (several tasks may point to the same css_set)
  • css_set *cgroups
  • list_head cg_list

css_set (include/linux/cgroup.h)
  • list_head tasks: list of all tasks using the same css_set
  • cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT]
  • ...

cgroup_subsys
  • int (*attach)(...)
  • void (*fork)(...)
  • void (*exit)(...)
  • void (*bind)(...)
  • ...
  • const char *name
  • cgroupfs_root *root
  • cftype *base_cftypes

subsystem instances (include/linux/cgroup_subsys.h)
  • cgroup_subsys cpuset_subsys
  • cgroup_subsys freezer_subsys
  • cgroup_subsys mem_cgroup_subsys

7 of 20

cgroups - kernel space view

cgroup_subsys
  • int (*attach)(...)
  • void (*fork)(...)
  • void (*exit)(...)
  • void (*bind)(...)
  • ...
  • const char *name
  • cgroupfs_root *root
  • cftype *base_cftypes

subsystem instance (include/linux/cgroup_subsys.h)
  • cgroup_subsys cpuset_subsys: .base_cftypes = files
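The role of cgroup_subsys can be sketched as a record holding a name, a list of control files and optional callbacks. The toy Python model below is an illustration of that pattern, not kernel code: the table contents, the file names chosen per subsystem and the recorded callback are assumptions for demonstration.

```python
COMMON_FILES = ["tasks", "cgroup.procs"]
events = []  # records which callbacks ran, for demonstration only


class Subsys:
    """Toy stand-in for cgroup_subsys."""

    def __init__(self, name, base_cftypes, **callbacks):
        self.name = name
        self.base_cftypes = base_cftypes  # control files this subsystem adds
        self.callbacks = callbacks        # e.g. {"fork": fn, "exit": fn}


SUBSYSTEMS = {
    "cpuset": Subsys("cpuset", ["cpuset.cpus", "cpuset.mems"]),
    "freezer": Subsys(
        "freezer",
        ["freezer.state"],
        fork=lambda pid: events.append(("freezer", "fork", pid)),
    ),
}


def directory_files(attached):
    """Files visible in a cgroup directory: common ones plus per-subsystem ones."""
    files = list(COMMON_FILES)
    for name in attached:
        files.extend(SUBSYSTEMS[name].base_cftypes)
    return files


def dispatch(event, pid):
    """Call the matching callback of every subsystem that defines one."""
    for subsys in SUBSYSTEMS.values():
        callback = subsys.callbacks.get(event)
        if callback is not None:
            callback(pid)
```

In this model, the files exposed in a cgroup directory follow from which subsystems are attached to the hierarchy, and events such as fork are fanned out only to subsystems that registered a handler.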

8 of 20

cgroups - summary

cgroup hierarchies

subsystems (controllers)

cpuset

cpu

cpuacct

memory

devices

blkio

net_cls

freezer

hugetlb

perf_event

net_prio

built as kernel module

/sys/fs/cgroup
    /cpu   [TL; subsystems: cpu, cpuacct]
        /high-priority
        /normal
        /experiment_1
    /mem   [TL; subsystem: memory]
        /opus
        /experiment_1
        /normal

TL = top level cgroup (mount)

each subsystem can be used at most once*

9 of 20

namespaces - user space view

Namespaces limit the scope of kernel-side names and data structures at process granularity

mnt (mount points, filesystems) CLONE_NEWNS

pid (processes) CLONE_NEWPID

net (network stack) CLONE_NEWNET

ipc (System V IPC) CLONE_NEWIPC

uts (UNIX Time-sharing System: hostname, domain name) CLONE_NEWUTS

user (UIDs) CLONE_NEWUSER

10 of 20

namespaces - user space view

Namespaces limit the scope of kernel-side names and data structures at process granularity

Three system calls for management

clone(): new process, new namespace, attach process to ns

unshare(): new namespace, attach current process to it

setns(int fd, int nstype): join an existing namespace
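The namespace system calls can be reached from Python through ctypes. The sketch below combines CLONE_NEW* flags and wraps unshare() and setns(). The flag values are the ones defined in linux/sched.h, while the helper names are assumptions for illustration. Calling unshare or join_namespace normally requires CAP_SYS_ADMIN (or a user namespace), so only the flag arithmetic is safe to run unprivileged.

```python
import ctypes
import os

CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the CLONE_NEW* flags for the given namespace types."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def unshare(names):
    """Create new namespaces of the given types for the calling process.

    For "pid", the new namespace applies to children created afterwards.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def join_namespace(pid, name):
    """Join the namespace of type `name` that process `pid` belongs to."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(f"/proc/{pid}/ns/{name}", os.O_RDONLY)
    try:
        if libc.setns(fd, CLONE_FLAGS[name]) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```

For example, unshare(["uts", "net"]) run as root would give the process its own hostname and an empty network stack while leaving its mounts and pids shared with the parent.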

11 of 20

namespaces - user space view

  • each namespace is identified by an inode (unique)
  • six entries (inodes), one per namespace type (ipc, mnt, net, pid, user, uts), added to /proc/<pid>/ns/

  • two processes are in the same namespace if they see the same inode for equivalent namespace types (mnt, net, user, ...)
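The inode comparison described above can be sketched in a few lines of Python. The function names and the proc_root parameter are assumptions for illustration (the parameter only exists so the code can be pointed at a different directory). Reading the ns entries of another user's process may require extra privileges.

```python
import os


def namespace_id(pid, ns_type, proc_root="/proc"):
    """Return the (device, inode) pair identifying a process's namespace."""
    st = os.stat(os.path.join(proc_root, str(pid), "ns", ns_type))
    return (st.st_dev, st.st_ino)


def same_namespace(pid_a, pid_b, ns_type, proc_root="/proc"):
    """True if both processes see the same inode for this namespace type."""
    return namespace_id(pid_a, ns_type, proc_root) == namespace_id(
        pid_b, ns_type, proc_root
    )
```

On a Linux host, same_namespace(1, os.getpid(), "net") tells whether the current process shares the network namespace of init, which is one quick way to check whether code is running inside a network-isolated container.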

User space utilities

* iproute2 (ip netns add, etc)

* unshare, nsenter (part of util-linux)

* shadow, shadow-utils (for user ns)

12 of 20

namespaces - kernel space view

task_struct
  • struct nsproxy *nsproxy
  • struct cred *cred

nsproxy (include/linux/nsproxy.h)
  • atomic_t count
  • struct uts_namespace *uts_ns
  • struct ipc_namespace *ipc_ns
  • struct mnt_namespace *mnt_ns
  • struct pid_namespace *pid_ns_for_children
  • struct net *net_ns

cred (include/linux/cred.h)
  • ...
  • struct user_namespace *user_ns

accessor (include/linux/nsproxy.h): struct nsproxy *task_nsproxy(struct task_struct *tsk)

  • For each namespace type, a default namespace exists (the global namespace)
  • struct nsproxy is shared by all tasks with the same set of namespaces

13 of 20

namespaces - kernel space view

Example for uts namespace

task_struct
  • struct nsproxy *nsproxy
  • ...

nsproxy (include/linux/nsproxy.h)
  • struct uts_namespace *uts_ns
  • ...

new_utsname (include/uapi/linux/utsname.h)
  • char sysname[]
  • char nodename[]
  • char release[]
  • char version[]
  • char machine[]
  • char domainname[]

  • global access to hostname: system_utsname.nodename
  • namespace-aware access to hostname: current->nsproxy->uts_ns->name.nodename
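From user space, the fields of new_utsname for the caller's uts namespace are returned by the uname() system call. A short Python sketch follows. The helper name uts_fields is an assumption, and os.uname() does not expose domainname, so that field is left out.

```python
import os


def uts_fields():
    """Return the uts fields visible to this process as a dict."""
    u = os.uname()
    return {
        "sysname": u.sysname,
        "nodename": u.nodename,  # the hostname of the current uts namespace
        "release": u.release,
        "version": u.version,
        "machine": u.machine,
    }
```

A process moved into a new uts namespace can change nodename with sethostname() and this function will report the new value there, while processes in the original namespace keep seeing the old hostname.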

14 of 20

namespaces - kernel space view

Example for net namespace

task_struct
  • struct nsproxy *nsproxy
  • ...

nsproxy (include/linux/nsproxy.h)
  • struct net *net_ns
  • ...

net (include/net/net_namespace.h): logical copy of the network stack
  • loopback device
  • all network tables (routing, etc)
  • all sockets
  • procfs and sysfs entries

  • a network device belongs to exactly one network namespace
  • a socket belongs to exactly one network namespace
  • a new network namespace only includes the loopback device
  • communication between namespaces using veth or unix sockets
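Because each network namespace has its own set of devices, listing interfaces from inside a process shows only what that namespace contains. A minimal Python sketch, with helper names that are assumptions for illustration:

```python
import socket


def visible_interfaces():
    """Names of the network interfaces in the caller's network namespace."""
    return [name for _index, name in socket.if_nameindex()]


def only_loopback(names):
    """True if the interface list looks like a freshly created namespace."""
    return set(names) == {"lo"}
```

Run inside a new network namespace before any veth pair has been moved into it, only_loopback(visible_interfaces()) would be expected to return True. On a typical host it returns False because physical or virtual devices are present.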

15 of 20

namespaces - summary

Namespaces limit the scope of kernel-side names and data structures at process granularity

mnt (mount points, filesystems)

pid (processes)

net (network stack)

ipc (System V IPC)

uts (UNIX Time-sharing System: hostname, domain name)

user (UIDs)

16 of 20

Containers

  • A light form of resource virtualization based on kernel mechanisms
  • A container is a user-space construct

  • Multiple containers run on top of the same kernel
    • illusion that they are the only ones using resources (cpu, memory, disk, network)

  • some implementations offer support for
    • container templates
    • deployment / migration
    • union filesystems

[figure: taken from the Docker documentation]

17 of 20

Container solutions

Mainline

Google containers (lmctfy)

  • uses cgroups only, offers CPU & memory isolation
  • no isolation for: disk I/O, network, filesystem, checkpoint/restore
  • adds some cgroup files: cpu.lat, cpuacct.histogram

LXC: user-space containerisation tools

Docker

systemd-nspawn

Forks

Vserver, OpenVZ

18 of 20

Container solutions - LXC

An LXC container is a userspace process created with the clone() system call

  • with its own pid namespace
  • with its own mnt namespace
  • net namespace (configurable) - lxc.network.type

Offers container templates in /usr/share/lxc/templates

  • shell scripts
  • lxc-create -t ubuntu -n containerName
    • also creates cgroup /sys/fs/cgroup/<controller>/lxc/containerName
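The command line and the cgroup locations mentioned above can be assembled programmatically. The sketch below only builds the strings and does not invoke LXC. The helper names and the default cgroup root are assumptions for illustration, and the path layout follows the pattern given on this slide.

```python
import os


def lxc_create_command(container, template):
    """Argument list for creating a container from a template."""
    return ["lxc-create", "-t", template, "-n", container]


def lxc_cgroup_paths(container, controllers, cgroup_root="/sys/fs/cgroup"):
    """Per-controller cgroup directory expected for the container."""
    return {
        controller: os.path.join(cgroup_root, controller, "lxc", container)
        for controller in controllers
    }
```

The argument list from lxc_create_command("containerName", "ubuntu") could be passed to subprocess.run on a machine with LXC installed, and lxc_cgroup_paths("containerName", ["cpu", "memory"]) gives the directories to inspect afterwards.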

19 of 20

Container solutions - Docker

A Linux container engine

  • multiple backend drivers
  • application rather than machine-centric
  • app build tools
  • diff-based deployment of updates (AUFS)
  • versioning (git-like) and reuse

  • links (tunnels) between containers

[figure: taken from the Docker documentation]

20 of 20

Questions?

Thank you!

Lucian Carata

lc525@cam.ac.uk