Rethinking the Linux kernel
Thomas Graf
Cilium Project, Co-Founder & CTO, Isovalent
Cameron Askin: Cameron’s World
Remember GeoCities?
What enabled this evolution?
3
Markup Only (HTML)
Programmable Platform
Programmability Essentials
- Safety: Untrusted code runs in the user's browser. → Sandboxing
- Continuous Delivery: Logic must be able to evolve without constantly shipping new browser versions. → Deploy anytime with seamless upgrades
- Performance: Programmability must be provided with minimal overhead. → Native execution (JIT compiler)
Kernel Architecture
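[Diagram: In user space, processes and the admin enter the Linux kernel through syscalls. Processes use sockets (recvmsg(), sendmsg()) and file descriptors (read(), write()); the admin uses configuration interfaces (sysfs, netlink, procfs, ...). Inside the kernel, sockets sit on top of TCP/IP and the network device, and file descriptors on top of VFS and the block device. Below the kernel are the network hardware and the storage hardware.]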
Kernel Development 101
Cons:
Cons:
Option 1: Native Support
Option 2: Kernel Module
How about we add JavaScript-like capabilities to the Linux Kernel?
[Diagram: a process issues the execve() syscall into the Linux kernel; the scheduler is shown inside the kernel.]
eBPF Runtime
[Diagram: a controller loads BPF program bytecode into the Linux kernel through the bpf() syscall. The verifier approves the program, the JIT compiler translates it to native code (x86_64 in the example), and BPF programs are attached along the path a process uses via recvmsg() and sendmsg(): sockets, TCP/IP, network device.]
- Safety & Security: The verifier rejects any unsafe program and provides a sandbox.
- Continuous Delivery: Programs can be exchanged without disrupting workloads.
- Performance: The JIT compiler ensures native execution performance.
eBPF Hooks
[Diagram: processes enter the Linux kernel through syscalls: recvmsg() and sendmsg() on sockets, TCP/IP and the network device; read() and write() on file descriptors, VFS and the block device. Network hardware and storage hardware sit below the kernel.]
Where can you hook? Kernel functions (kprobes), userspace functions (uprobes), system calls, fentry/fexit, tracepoints, network devices (tc/XDP), network routes, TCP congestion algorithms, sockets (data level)
eBPF Maps
What are Maps used for?
[Diagram: a BPF map inside the Linux kernel, next to the socket, TCP/IP and network device path used by a process (recvmsg(), sendmsg()). The controller and the admin reach the map through syscalls.]
Map Types:
- Hash tables, Arrays
- LRU (Least Recently Used)
- Ring Buffer
- Stack Trace
- LPM (Longest Prefix Match)
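To make the LPM (Longest Prefix Match) map type more concrete, here is a minimal user-space sketch in Python of the lookup semantics such a map provides: among all stored prefixes that contain the key, the longest one wins. This is only a model. Real eBPF maps live in the kernel and are accessed through helpers and the bpf() syscall; the class name `LpmModel` and the sample prefixes and values are made up for illustration.

```python
import ipaddress


class LpmModel:
    """User-space model of longest-prefix-match lookup (hypothetical, not a kernel API)."""

    def __init__(self):
        self._entries = {}  # ip_network -> value

    def update(self, prefix, value):
        # Store or overwrite the value for a CIDR prefix such as "10.0.0.0/8".
        self._entries[ipaddress.ip_network(prefix)] = value

    def lookup(self, address):
        # Return the value of the most specific prefix containing the address,
        # or None when no prefix matches.
        addr = ipaddress.ip_address(address)
        best = None
        for net, value in self._entries.items():
            if net.version != addr.version or addr not in net:
                continue
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, value)
        return None if best is None else best[1]


policy = LpmModel()
policy.update("0.0.0.0/0", "world")
policy.update("10.0.0.0/8", "cluster")
policy.update("10.1.0.0/16", "team-a")

print(policy.lookup("10.1.2.3"))   # matched by /0, /8 and /16; the /16 wins
print(policy.lookup("10.9.9.9"))   # matched by /0 and /8; the /8 wins
print(policy.lookup("192.0.2.1"))  # only the default /0 matches
```

The linear scan keeps the sketch short; it says nothing about how the kernel implements the lookup.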
eBPF Helpers
[Diagram: a process enters the Linux kernel through a syscall (recvmsg(), sendmsg()) and passes sockets, TCP/IP and the network device; a BPF program snippet calls a helper.]
What helpers exist?
[...]
num = bpf_get_prandom_u32();
[...]
eBPF Tail and Function Calls
What are Tail Calls used for?
What are Function Calls used for?
Community
287 contributors (Jan 2016 to Jan 2020)
eBPF Projects
- High-performance L4 load balancer (facebookincubator/katran)
- Android & Security: kernel runtime security instrumentation (KRSI), Android BPF loader, eBPF traffic monitor
- bcc, bpftrace: performance troubleshooting & profiling (iovisor/bcc)
- Traffic Optimization: DDoS mitigation, QoS, traffic optimization, load balancer (cloudflare/bpftools)
- Falco: container runtime security, behavior analysis (falcosecurity/falco)
- Cilium: networking, security and load-balancing for k8s (cilium/cilium)
et al.
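Tracing & Profiling with BCC
[Diagram: a Python program uses BCC to load a BPF program into the Linux kernel through a syscall, the verifier and the JIT compiler. The program attaches to the socket and TCP/IP path that a process uses via recvmsg() and sendmsg(), and data is shared with user space through BPF maps.]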
# tcptop
Tracing... Output every 1 secs. Hit Ctrl-C to end
<screen clears>
19:46:24 loadavg: 1.86 2.67 2.91 3/362 16681
PID COMM LADDR RADDR RX_KB TX_KB
16648 16648 100.66.3.172:22 100.127.69.165:6684 1 0
16647 sshd 100.66.3.172:22 100.127.69.165:6684 0 2149
14374 sshd 100.66.3.172:22 100.127.69.165:25219 0 0
14458 sshd 100.66.3.172:22 100.127.69.165:7165 0 0
bpftrace - DTrace for Linux
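[Diagram: bpftrace loads a bpftrace program into the Linux kernel through a syscall, the verifier and the JIT compiler. The program attaches to the open() path between a process's syscall, file descriptors and VFS, and data is shared with user space through BPF maps.]
bpftrace: github.com/iovisor/bpftrace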
# bpftrace -e 'kprobe:do_sys_open { printf("%s: %s\n", comm, str(arg1)) }'
Attaching 1 probe...
git: .git/objects/da
git: .git/objects/pack
git: /etc/localtime
systemd-journal: /var/log/journal/72d0774c88dc4943ae3d34ac356125dd
DNS Res~ver #15: /etc/hosts
^C
Networking, load-balancing and security for Kubernetes
[Diagram: Cilium, connected to Kubernetes, loads BPF programs into the Linux kernel through syscalls, the verifier and the JIT compiler, and shares state with them through BPF maps. Containers enter the kernel through syscalls and pass sockets, TCP/IP and network devices down to the network hardware.]
- Container Networking
- Service Load Balancing
- Container Security
- Visibility
- Service Mesh
Hubble: eBPF Visibility for Kubernetes
# hubble observe --since=1m -t l7 -j \
| jq 'select(.l7.dns.rcode==3) | .destination.namespace + "/" + .destination.pod_name' \
| sort | uniq -c | sort -r
42 "starwars/jar-jar-binks-6f5847c97c-qmggv"
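The pipeline above selects flows whose DNS response code is 3 (NXDOMAIN) and counts them per destination pod. As a rough sketch, the jq, sort and uniq -c stages could be written in Python as follows. It assumes one JSON object per line with the field names used in the jq expression (`l7.dns.rcode`, `destination.namespace`, `destination.pod_name`); the function name `count_nxdomain` and the sample flows are hypothetical and only serve the illustration.

```python
import json
from collections import Counter


def count_nxdomain(lines):
    """Count flows with DNS rcode 3 per "namespace/pod_name" (illustrative helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        flow = json.loads(line)
        # Missing or null intermediate fields simply do not match, as in jq.
        dns = (flow.get("l7") or {}).get("dns") or {}
        if dns.get("rcode") != 3:
            continue
        dest = flow.get("destination") or {}
        key = f'{dest.get("namespace", "")}/{dest.get("pod_name", "")}'
        counts[key] += 1
    # Highest count first, similar in spirit to `sort | uniq -c | sort -r`.
    return counts.most_common()


# Made-up sample input, not real Hubble output.
sample = [
    '{"l7": {"dns": {"rcode": 3}}, "destination": {"namespace": "starwars", "pod_name": "pod-a"}}',
    '{"l7": {"dns": {"rcode": 0}}, "destination": {"namespace": "starwars", "pod_name": "pod-c"}}',
    '{"l7": {"dns": {"rcode": 3}}, "destination": {"namespace": "starwars", "pod_name": "pod-b"}}',
    '{"l7": {"dns": {"rcode": 3}}, "destination": {"namespace": "starwars", "pod_name": "pod-a"}}',
    '{"destination": {"namespace": "starwars", "pod_name": "pod-a"}}',
]

for pod, n in count_nxdomain(sample):
    print(n, pod)
```

In practice the lines would come from the JSON output of `hubble observe`, for example by reading `sys.stdin`.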
Go Development Toolchain
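[Diagram: Development: the C source of a BPF program is compiled with clang -target bpf into bytecode. Runtime: the Go library loads the BPF program and its BPF map into the Linux kernel through syscalls; the program passes the verifier and the JIT compiler and runs along the socket and TCP/IP path that a process uses via recvmsg() and sendmsg().]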
Go Library: https://github.com/cilium/ebpf
Outlook: Future of eBPF
eBPF is turning the Linux kernel into a microkernel.
Example: The Linux kernel is not aware of containers and microservices (it only knows about namespaces). Cilium is making the Linux kernel container- and Kubernetes-aware.
eBPF could enable the Linux kernel hotpatching we always dreamed about.
Problem:
[Diagram: functions inside the Linux kernel, with a hotfix applied.]
Thank You
eBPF Maintainers: Daniel Borkmann, Alexei Starovoitov
Cilium Team: André Martins, Jarno Rajahalme, Joe Stringer, John Fastabend, Maciej Kwiek, Martynas Pumputis, Paul Chaignon, Quentin Monnet, Ray Bejjani, Tobias Klauser
Facebook Team: Andrii Nakryiko, Andrey Ignatov, Jakub Kicinski, Martin KaFai Lau, Roman Gushchin, Song Liu, Yonghong Song
Google Team: Chenbo Feng, KP Singh, Lorenzo Colitti, Maciej Żenczykowski, Stanislav Fomichev
BCC & bpftrace: Alastair Robertson, Brendan Gregg, Brenden Blanco
Kernel Team: Björn Töpel, David S. Miller, Edward Cree, Jesper Brouer, Toke Høiland-Jørgensen
All images: Pixabay