Linux Containers
Basic Concepts
Lucian Carata
FRESCO Talklet, 3 Oct 2014
Underlying kernel mechanisms
cgroups: manage resources for groups of processes
namespaces: per-process resource isolation
seccomp: limit available system calls
capabilities: limit available privileges
CRIU: checkpoint/restore (with kernel support)
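Two of these mechanisms can be observed from user space through `/proc/<pid>/status`. The sketch below decodes the `CapEff` (effective capabilities) and `Seccomp` fields. The helper names `parse_status` and `known_caps` are made up for illustration, and the capability table is only a small subset of the kernel's list.

```python
# Subset of capability bit numbers (include/uapi/linux/capability.h).
CAP_BITS = {0: "CAP_CHOWN", 12: "CAP_NET_ADMIN", 21: "CAP_SYS_ADMIN"}
# Values of the Seccomp field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}

def parse_status(text):
    """Return (effective capability mask, seccomp mode) from status text."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    cap_eff = int(fields.get("CapEff", "0"), 16)   # hex bitmask
    seccomp = int(fields.get("Seccomp", "0"))      # decimal mode
    return cap_eff, seccomp

def known_caps(mask):
    """Names of the capabilities from CAP_BITS that are set in mask."""
    return sorted(name for bit, name in CAP_BITS.items() if mask & (1 << bit))
```

On a Linux host, `parse_status(open("/proc/self/status").read())` gives the values for the current process.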
cgroups - user space view
low-level filesystem interface similar to sysfs (/sys) and procfs (/proc)
new filesystem type “cgroup”, default location in /sys/fs/cgroup
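Besides the cgroup filesystem itself, the kernel reports a task's cgroup membership in `/proc/<pid>/cgroup`, one line per hierarchy in the form `hierarchy-ID:controller-list:path`. That file is not mentioned on the slide; the parser below is a minimal sketch, and `CgroupEntry` and `parse_proc_cgroup` are illustrative names.

```python
from typing import List, NamedTuple

class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str

def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into CgroupEntry records."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice: the cgroup path may itself contain ':'.
        hid, controllers, path = line.split(":", 2)
        names = [c for c in controllers.split(",") if c]
        entries.append(CgroupEntry(int(hid), names, path))
    return entries
```

For the current process, pass it `open("/proc/self/cgroup").read()`.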
cgroup hierarchies
subsystems (controllers)
cpuset
cpu
cpuacct
memory
devices
blkio
net_cls
freezer
hugetlb
perf_event
net_prio
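some subsystems are built as kernel modules
hierarchies mounted under /sys/fs/cgroup (TL = top level cgroup, one per mount):
/cpu (TL; subsystems cpu, cpuacct) with child cgroups /high-priority, /normal, /experiment_1
/mem (TL; subsystem memory) with child cgroups /opus, /experiment_1, /normal
each subsystem can be used at most once* (in a single hierarchy)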
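Working with a hierarchy comes down to ordinary file operations: `mkdir` creates a child cgroup, and writing a PID to the `tasks` file moves that task into it. The sketch below assumes a mounted hierarchy and root privileges for real use. The function names are hypothetical, and `demo_in_scratch_dir` only rehearses the calls against a temporary directory.

```python
import os
import tempfile

def create_cgroup(hierarchy_root, name):
    """Create a child cgroup; on a cgroup mount, mkdir is all it takes."""
    path = os.path.join(hierarchy_root, name)
    os.makedirs(path, exist_ok=True)
    return path

def attach_task(cgroup_path, pid):
    """Move a task into the cgroup by writing its PID to the 'tasks' file."""
    with open(os.path.join(cgroup_path, "tasks"), "w") as f:
        f.write(str(pid))

def demo_in_scratch_dir():
    """Dry run against a temporary directory instead of a real cgroup mount."""
    root = tempfile.mkdtemp()
    path = create_cgroup(root, "experiment_1")
    attach_task(path, os.getpid())
    return path
```

On a real system the first call would be something like `create_cgroup("/sys/fs/cgroup/cpu", "experiment_1")`, after which the kernel populates the new directory with the subsystem's control files.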
cgroups - user space view
files in a cgroup directory, grouped by origin:
common: tasks, cgroup.procs, release_agent, notify_on_release, cgroup.clone_children, cgroup.sane_behavior
cpuacct: cpuacct.stat, cpuacct.usage, cpuacct.usage_percpu
cpu: cpu.stat, cpu.shares, cpu.cfs_period_us, cpu.cfs_quota_us, cpu.rt_period_us, cpu.rt_runtime_us
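To make the cpu and cpuacct files concrete, here is a small sketch of how their values are usually interpreted: `cpu.cfs_quota_us / cpu.cfs_period_us` is the hard CPU limit (a quota of -1 means unlimited), `cpu.shares` is a weight relative to sibling cgroups, and `cpuacct.stat` holds `user` and `system` counters. The function names are illustrative, not part of any library.

```python
def cpu_limit(cfs_quota_us, cfs_period_us):
    """CPUs' worth of run time allowed per period, or None if unlimited."""
    if cfs_quota_us < 0:
        return None
    return cfs_quota_us / cfs_period_us

def share_fraction(own_shares, all_sibling_shares):
    """Fraction of CPU a cgroup gets when all siblings are busy.

    all_sibling_shares includes the cgroup's own cpu.shares value.
    """
    return own_shares / sum(all_sibling_shares)

def parse_cpuacct_stat(text):
    """Parse cpuacct.stat ('user N' / 'system N' lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats
```

With a period of 100000 us, a quota of 50000 us caps the group at half a CPU regardless of how idle the machine is, whereas shares only matter under contention.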
cgroups - kernel space view
task_struct: css_set *cgroups; list_head cg_list
css_set (include/linux/cgroup.h): list_head tasks (list of all tasks using the same css_set); cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT]
several task_structs can point to the same css_set
init/main.c, fork(), exit(): kernel code for attaching a task to, or detaching it from, a css_set
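The relationship between tasks and css_sets can be pictured with a toy user-space model. This is not kernel code: the classes below only mirror the field names from the slide, `subsys` is simplified to a dict from subsystem name to cgroup path, and `new_task`, `fork` and `move` are hypothetical helpers.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)          # identity comparison, as with kernel pointers
class CssSet:
    subsys: dict              # stands in for cgroup_subsys_state *subsys[]
    tasks: list = field(default_factory=list)   # stands in for list_head tasks

@dataclass(eq=False)
class Task:
    pid: int
    cgroups: CssSet           # stands in for css_set *cgroups

def new_task(pid, css_set):
    """Create a task and link it into the css_set's task list."""
    task = Task(pid, css_set)
    css_set.tasks.append(task)
    return task

def fork(parent, child_pid):
    """A child starts out in the same css_set as its parent."""
    return new_task(child_pid, parent.cgroups)

def move(task, css_set):
    """Detach the task from its current css_set and attach it to another."""
    task.cgroups.tasks.remove(task)
    task.cgroups = css_set
    css_set.tasks.append(task)
```

The point of the indirection is that many tasks share one css_set, so a fork only has to copy one pointer and add one list entry.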
cgroups - kernel space view
cgroup_subsys: int (*attach)(...); void (*fork)(...); void (*exit)(...); void (*bind)(...); ...; const char *name; cgroupfs_root *root; cftype *base_cftypes
include/linux/cgroup_subsys.h: cgroup_subsys cpuset_subsys; cgroup_subsys freezer_subsys; cgroup_subsys mem_cgroup_subsys
example: cgroup_subsys cpuset_subsys sets .base_cftypes = files (the cftype array describing the control files the subsystem exposes)
cgroups - summary
cgroup hierarchies: /cpu (cpu, cpuacct) and /mem (memory) are mounted as top level cgroups (TL) under /sys/fs/cgroup, each with child cgroups such as /normal and /experiment_1
subsystems (controllers): cpuset, cpu, cpuacct, memory, devices, blkio, net_cls, freezer, hugetlb, perf_event, net_prio; some are built as kernel modules
each subsystem can be used at most once* (in a single hierarchy)
namespaces - user space view
Namespaces limit the scope of kernel-side names and data structures
at process granularity
mnt (mount points, filesystems) CLONE_NEWNS
pid (processes) CLONE_NEWPID
net (network stack) CLONE_NEWNET
ipc (System V IPC) CLONE_NEWIPC
uts (UNIX time-sharing: hostname, domain name) CLONE_NEWUTS
user (UIDs, GIDs) CLONE_NEWUSER
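The `CLONE_NEW*` names are bit flags that can be OR-ed together to request several namespaces at once. The table below uses the values from the kernel's `sched.h`; `namespace_flags` is an illustrative helper, not an existing API.

```python
# Flag values as defined in include/uapi/linux/sched.h.
CLONE_FLAGS = {
    "mnt":  0x00020000,  # CLONE_NEWNS
    "uts":  0x04000000,  # CLONE_NEWUTS
    "ipc":  0x08000000,  # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid":  0x20000000,  # CLONE_NEWPID
    "net":  0x40000000,  # CLONE_NEWNET
}

def namespace_flags(kinds):
    """Combine the flags for the given namespace kinds into one bitmask."""
    mask = 0
    for kind in kinds:
        mask |= CLONE_FLAGS[kind]
    return mask
```

For example, `namespace_flags(["uts", "net"])` yields the mask for a new UTS and a new network namespace.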
Three system calls for management
clone(): create a new process and place it in new namespaces
unshare(): create new namespaces and move the calling process into them
setns(int fd, int nstype): join an existing namespace
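As a rough sketch, two of these calls can be reached from a script through the C library. The wrappers below go through `ctypes`, assume Linux with glibc (or another libc exporting `unshare` and `setns`), and usually need root or the relevant capability to succeed. The wrapper names simply mirror the system calls; `clone()` is left out because it is awkward to call this way.

```python
import ctypes
import os

CLONE_NEWUTS = 0x04000000  # from include/uapi/linux/sched.h

def _libc():
    # Load the C library lazily so importing this module never fails.
    return ctypes.CDLL(None, use_errno=True)

def _check(result):
    """Raise OSError with the C errno if the call returned non-zero."""
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

def unshare(flags):
    """Move the calling process into new namespaces selected by flags."""
    _check(_libc().unshare(flags))

def setns(ns_path, nstype=0):
    """Join the namespace referred to by a /proc/<pid>/ns/<kind> path."""
    fd = os.open(ns_path, os.O_RDONLY)
    try:
        _check(_libc().setns(fd, nstype))
    finally:
        os.close(fd)
```

A typical privileged use would be `unshare(CLONE_NEWUTS)` followed by changing the hostname, which then stays invisible to the rest of the system.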
User space utilities
* iproute2 (ip netns add, etc.)
* unshare, nsenter (part of util-linux)
* shadow, shadow-utils (for user ns)
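Tools such as nsenter locate namespaces through the symbolic links in `/proc/<pid>/ns`, whose targets look like `uts:[4026531838]`. Two processes are in the same namespace when the numbers match. The following is a minimal sketch of reading those links; `parse_ns_link` and `namespaces_of` are illustrative names.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")

def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % target)
    return match.group(1), int(match.group(2))

def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result
```

Comparing `namespaces_of()` with `namespaces_of(1)` (readable as root) shows which namespaces the current process shares with init.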
namespaces - kernel space view
task_struct: struct nsproxy *nsproxy; struct cred *cred
include/linux/nsproxy.h: struct nsproxy *task_nsproxy(struct task_struct *tsk)
nsproxy (include/linux/nsproxy.h): atomic_t count; struct uts_namespace *uts_ns; struct ipc_namespace *ipc_ns; struct mnt_namespace *mnt_ns; struct pid_namespace *pid_ns_for_children; struct net *net_ns
cred (include/linux/cred.h): ...; struct user_namespace *user_ns
Example for uts namespace
task_struct: struct nsproxy *nsproxy; ...
nsproxy (include/linux/nsproxy.h): struct uts_namespace *uts_ns; ...
new_utsname (include/uapi/linux/utsname.h): char sysname[]; char nodename[]; char release[]; char version[]; char machine[]; char domainname[]
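The first five `new_utsname` fields are what the `uname()` system call returns, so a process can inspect the values of its own uts namespace directly. A minimal sketch using Python's standard `os.uname()` follows; it does not expose `domainname`, and `utsname_fields` is an illustrative name.

```python
import os

def utsname_fields():
    """Return the uname() fields visible in this process's uts namespace."""
    u = os.uname()
    return {
        "sysname": u.sysname,
        "nodename": u.nodename,   # the hostname, private to each uts namespace
        "release": u.release,
        "version": u.version,
        "machine": u.machine,
    }
```

Inside a container with its own uts namespace, `nodename` can differ from the host's while the kernel-derived fields such as `release` stay the same.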
Example for net namespace
task_struct: struct nsproxy *nsproxy; ...
nsproxy (include/linux/nsproxy.h): struct net *net_ns; ...
net (include/net/net_namespace.h): logical copy of the network stack
namespaces - summary
Namespaces limit the scope of kernel-side names and data structures
at process granularity
mnt (mount points, filesystems)
pid (processes)
net (network stack)
ipc (System V IPC)
uts (UNIX time-sharing: hostname, domain name)
user (UIDs, GIDs)
Containers
(figure taken from the Docker documentation)
Container solutions
Mainline
Google containers (lmctfy)
LXC: user-space containerisation tools
Docker
systemd-nspawn
Forks
Vserver, OpenVZ
Container solutions - LXC
An LXC container is a userspace process created with the clone() system call
Offers container templates in /usr/share/lxc/templates
Container solutions - Docker
A Linux container engine
(figure taken from the Docker documentation)
Questions?
Thank you!
Lucian Carata
lc525@cam.ac.uk
More details
cgroups: http://media.wix.com/ugd/295986_d73d8d6087ed430c34c21f90b0b607fd.pdf
namespaces: http://lwn.net/Articles/531114/ (and series)