1 of 24

Rethinking the Linux kernel

Thomas Graf
Cilium Project, Co-Founder & CTO, Isovalent

2 of 24


Cameron Askin: Cameron’s World

Remember GeoCities?

3 of 24

What enabled this evolution?


Markup Only (HTML)

Programmable Platform

4 of 24

Programmability Essentials

Safety: Untrusted code runs in the browser of the user. → Sandboxing

Continuous Delivery: Allow logic to evolve without constantly shipping new browser versions. → Deploy anytime with seamless upgrades

Performance: Programmability must be provided with minimal overhead. → Native execution (JIT compiler)

5 of 24

Kernel Architecture

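[Diagram: processes in user space enter the Linux kernel through syscalls. On the networking side, recvmsg() and sendmsg() reach sockets, TCP/IP, and the network device, which drives the network hardware. On the storage side, read() and write() reach file descriptors, VFS, and the block device, which drives the storage hardware. An admin configures the kernel through sysfs, netlink, procfs, ...]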

6 of 24

Kernel Development 101

Option 1: Native Support

  • Change kernel source code
  • Expose configuration API
  • Wait 5 years for your users to upgrade

Option 2: Kernel Module

  • Write kernel module
  • Every kernel release will break it

Cons:

  • You likely need to ship a different module for each kernel version
  • Might crash your kernel

7 of 24


How about we add JavaScript-like capabilities to the Linux Kernel?

8 of 24


9 of 24

[Diagram: a process issues the execve() syscall into the Linux kernel; the scheduler is shown inside the kernel.]

10 of 24

eBPF Runtime

[Diagram: a controller in user space loads BPF program bytecode with the bpf() syscall. The verifier approves the program, and the JIT compiler translates it to native code (x86_64). The resulting BPF programs attach inside the Linux kernel along a process's recvmsg()/sendmsg() path: sockets, TCP/IP, network device.]

Safety & Security: The verifier rejects any unsafe program and provides a sandbox.

Continuous Delivery: Programs can be exchanged without disrupting workloads.

Performance: The JIT compiler ensures native execution performance.
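To make the verifier's role concrete, here is a minimal user-space sketch of load-time checking. Everything in it is hypothetical: the `toy_` instruction set is not the eBPF encoding, and the real verifier checks far more than jump targets. The sketch only shows the principle that a program is inspected before it is allowed to run, and that rejecting backward and out-of-range jumps is enough to guarantee that this toy program terminates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction set for illustration only; NOT the real eBPF encoding. */
enum toy_op { TOY_ADD, TOY_JMP, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int arg; /* TOY_ADD: value to add; TOY_JMP: jump offset relative to the next instruction */
};

/*
 * Accept a program only if it is non-empty, ends with TOY_EXIT, and every
 * jump goes forward to an instruction inside the program. With forward-only
 * jumps the program counter always increases, so execution must terminate.
 */
bool toy_verify(const struct toy_insn *prog, size_t len)
{
    if (prog == NULL || len == 0 || prog[len - 1].op != TOY_EXIT)
        return false;

    for (size_t i = 0; i < len; i++) {
        if (prog[i].op != TOY_JMP)
            continue;
        if (prog[i].arg < 0)
            return false; /* backward jump: could loop forever */
        /* Jump target is i + 1 + arg and must stay below len. */
        if ((size_t)prog[i].arg >= len - i - 1)
            return false; /* jump past the end of the program */
    }
    return true;
}
```

A loader built on this idea would call `toy_verify()` first and refuse to run anything it rejects, which is the same order of operations as the verifier and JIT compiler stages on this slide.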

11 of 24

eBPF Hooks

[Diagram: hook points across the kernel architecture. Networking side: the syscall boundary (recvmsg(), sendmsg()), sockets, TCP/IP, the network device, and the network hardware. Storage side: the syscall boundary (read(), write()), file descriptors, VFS, the block device, and the storage hardware.]

Where can you hook? Kernel functions (kprobes), userspace functions (uprobes), system calls, fentry/fexit, tracepoints, network devices (tc/XDP), network routes, TCP congestion algorithms, sockets (data level)

12 of 24

eBPF Maps

What are Maps used for?

  • Program state
  • Program configuration
  • Share data between programs
  • Share state, metrics, and statistics with user space

[Diagram: BPF programs attached at the sockets, TCP/IP, and network device layers read and write a BPF map inside the Linux kernel; a controller and an admin in user space access the same map through syscalls.]

Map Types:

  • Hash tables, Arrays
  • LRU (Least Recently Used)
  • Ring Buffer
  • Stack Trace
  • LPM (Longest Prefix Match)
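Of the map types listed, longest prefix match is the least self-explanatory, so a small user-space sketch may help. The `toy_route` table and `toy_lpm_lookup()` below are hypothetical names for illustration; the sketch uses a linear scan for clarity, whereas the kernel's LPM map is trie-based. It only demonstrates the lookup rule: among all prefixes that contain the address, the most specific one wins.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_route {
    uint32_t prefix;     /* IPv4 prefix in host byte order */
    uint8_t  prefix_len; /* number of significant leading bits, 0..32 */
    int      value;      /* payload stored for this prefix */
};

/* Build a mask with the top `len` bits set; len == 0 yields 0. */
static uint32_t toy_mask(uint8_t len)
{
    return len == 0 ? 0 : UINT32_MAX << (32 - len);
}

/*
 * Return the entry with the longest prefix that contains `addr`,
 * or NULL if no entry matches. Entries with an invalid length are skipped.
 */
const struct toy_route *toy_lpm_lookup(const struct toy_route *tbl,
                                       size_t n, uint32_t addr)
{
    const struct toy_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].prefix_len > 32)
            continue;
        uint32_t mask = toy_mask(tbl[i].prefix_len);
        if ((addr & mask) != (tbl[i].prefix & mask))
            continue;
        if (best == NULL || tbl[i].prefix_len > best->prefix_len)
            best = &tbl[i];
    }
    return best;
}
```

With entries for 10.0.0.0/8 and 10.1.0.0/16, a lookup of 10.1.2.3 returns the /16 entry, while 10.2.3.4 falls back to the /8 entry.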

13 of 24

eBPF Helpers

What helpers exist?

  • Random numbers
  • Get current time
  • Map access
  • Get process/cgroup context
  • Manipulate network packets and forwarding
  • Access socket data
  • Perform tail call
  • Access process stack
  • Access syscall arguments
  • ...

[Diagram: a BPF program on a process's recvmsg()/sendmsg() path (sockets, TCP/IP, network device) calling a helper inside the Linux kernel.]

[...]
num = bpf_get_prandom_u32();
[...]
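One common use for a random-number helper such as `bpf_get_prandom_u32()` is sampling: handling only roughly one event in N to keep overhead low. The sketch below is a user-space approximation under stated assumptions: `toy_prandom_u32()` is a hypothetical xorshift stand-in for the kernel helper, and `toy_should_sample()` and `toy_count_sampled()` are illustrative names, not part of any BPF API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for bpf_get_prandom_u32(): a xorshift32 generator. */
static uint32_t toy_state = 2463534242u;

uint32_t toy_prandom_u32(void)
{
    toy_state ^= toy_state << 13;
    toy_state ^= toy_state >> 17;
    toy_state ^= toy_state << 5;
    return toy_state;
}

/* Keep roughly one event out of every `rate`; a rate of 0 disables sampling. */
bool toy_should_sample(uint32_t random_value, uint32_t rate)
{
    return rate != 0 && random_value % rate == 0;
}

/* Count how many of `events` would be sampled at the given rate. */
uint32_t toy_count_sampled(uint32_t events, uint32_t rate)
{
    uint32_t hits = 0;

    for (uint32_t i = 0; i < events; i++) {
        if (toy_should_sample(toy_prandom_u32(), rate))
            hits++;
    }
    return hits;
}
```

In an actual BPF program the decision would be made on the value returned by the helper itself; passing the random value in as a parameter here just keeps the decision function deterministic and easy to test.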

14 of 24

eBPF Tail and Function Calls

What are Tail Calls used for?

  • Chain programs together
  • Split programs into independent logical components
  • Make BPF programs composable

What are Function Calls used for?

  • Reuse functionality inside of a program
  • Reduce program size (avoid inlining)
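The chaining idea behind tail calls can be sketched in user space with a table of function pointers. This is an analogy, not the kernel mechanism: in real BPF a program invokes the `bpf_tail_call()` helper against a program-array map and control never returns to the caller, whereas here a small dispatcher loop follows the chain. The `toy_` names and the cap of 4 are assumptions made for this sketch.

```c
#include <stddef.h>

#define TOY_MAX_PROGS      8
#define TOY_MAX_TAIL_CALLS 4 /* hypothetical cap chosen for this sketch */

struct toy_ctx {
    int value; /* state carried along the chain */
    int next;  /* slot to continue with, or -1 to stop */
};

typedef void (*toy_prog_fn)(struct toy_ctx *ctx);

/*
 * Run the program in slot `start`, then keep following ctx->next until a
 * program stops the chain, the slot is empty or out of range, or the cap
 * is reached. Returns the number of programs that ran.
 */
int toy_run_chain(toy_prog_fn const progs[TOY_MAX_PROGS], int start,
                  struct toy_ctx *ctx)
{
    int ran = 0;
    int idx = start;

    while (idx >= 0 && idx < TOY_MAX_PROGS && progs[idx] != NULL &&
           ran < TOY_MAX_TAIL_CALLS) {
        ctx->next = -1; /* a program must opt in to continuing */
        progs[idx](ctx);
        ran++;
        idx = ctx->next;
    }
    return ran;
}

/* Stage 0: do some work, then hand over to slot 1. */
void toy_parse(struct toy_ctx *ctx) { ctx->value += 1; ctx->next = 1; }

/* Stage 1: final stage, ends the chain. */
void toy_count(struct toy_ctx *ctx) { ctx->value *= 10; }

/* Stage 2: keeps calling itself, which the cap cuts short. */
void toy_spin(struct toy_ctx *ctx) { ctx->value += 1; ctx->next = 2; }
```

Because each stage only names the slot it wants to continue with, a slot can be repointed at a different function without touching the other stages, which is the composability the bullets above describe.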


15 of 24


Community

287 contributors (Jan 2016 to Jan 2020):

  • 466 Daniel Borkmann (Cilium; maintainer)
  • 290 Andrii Nakryiko (Facebook)
  • 279 Alexei Starovoitov (Facebook; maintainer)
  • 217 Jakub Kicinski (Facebook)
  • 173 Yonghong Song (Facebook)
  • 168 Martin KaFai Lau (Facebook)
  • 159 Stanislav Fomichev (Google)
  • 148 Quentin Monnet (Cilium)
  • 148 John Fastabend (Cilium)
  • 118 Jesper Dangaard Brouer (Red Hat)
  • [...]

16 of 24


eBPF Projects

Katran: High-performance L4 load balancer (facebookincubator/katran)

Android & Security: Kernel runtime security instrumentation (KRSI), Android BPF loader, eBPF traffic monitor

bcc, bpftrace: Performance troubleshooting & profiling (iovisor/bcc)

Traffic Optimization: DDoS mitigation, QoS, traffic optimization, load balancer (cloudflare/bpftools)

Falco: Container runtime security, behavior analysis (falcosecurity/falco)

Cilium: Networking, security and load-balancing for k8s (cilium/cilium)

et al.

17 of 24

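Tracing & Profiling with BCC

[Diagram: a Python program built on BCC loads a BPF program through the verifier and JIT compiler. The program instruments sockets and TCP/IP on a process's recvmsg()/sendmsg() path and reports back to user space through BPF maps.]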

# tcptop
Tracing... Output every 1 secs. Hit Ctrl-C to end
<screen clears>
19:46:24 loadavg: 1.86 2.67 2.91 3/362 16681

PID    COMM   LADDR            RADDR                 RX_KB  TX_KB
16648  16648  100.66.3.172:22  100.127.69.165:6684       1      0
16647  sshd   100.66.3.172:22  100.127.69.165:6684       0   2149
14374  sshd   100.66.3.172:22  100.127.69.165:25219      0      0
14458  sshd   100.66.3.172:22  100.127.69.165:7165       0      0

18 of 24

bpftrace - DTrace for Linux

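[Diagram: bpftrace turns a bpftrace program into a BPF program that passes the verifier and JIT compiler and attaches where a process's syscalls reach file descriptors and VFS; results flow back to bpftrace through BPF maps.]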

# bpftrace -e 'kprobe:do_sys_open { printf("%s: %s\n", comm, str(arg1)) }'

Attaching 1 probe...

git: .git/objects/da

git: .git/objects/pack

git: /etc/localtime

systemd-journal: /var/log/journal/72d0774c88dc4943ae3d34ac356125dd

DNS Res~ver #15: /etc/hosts

^C


19 of 24


Networking, load-balancing and security for Kubernetes

[Diagram: Cilium, driven by Kubernetes, loads BPF programs through the verifier and JIT compiler and keeps state in BPF maps. The programs attach at the sockets, TCP/IP, and network device layers that each container's syscalls traverse on the way to the network hardware.]

20 of 24

Container Networking

  • Highly efficient and flexible networking
  • Routing, Overlay, Cloud-provider native
  • IPv4, IPv6, NAT46
  • Multi-cluster routing

Service Load balancing:

  • Highly scalable L3-L4 load balancing
  • Kubernetes services (replaces kube-proxy)
  • Multi-cluster
  • Service affinity (prefer zones)

Container Security

  • Identity-based network security
  • API-aware security (HTTP, gRPC, Kafka, Cassandra, memcached, ..)
  • DNS-aware policies
  • Encryption
  • SSL data visibility via kTLS

Visibility

  • Service topology map & live visualization
  • Advanced network metrics & alerting

Service mesh:

  • Minimize overhead when injecting service mesh sidecar proxies
  • Istio integration


21 of 24

Hubble: eBPF Visibility for Kubernetes


# hubble observe --since=1m -t l7 -j \
    | jq 'select(.l7.dns.rcode==3) | .destination.namespace + "/" + .destination.pod_name' \
    | sort | uniq -c | sort -r

42 "starwars/jar-jar-binks-6f5847c97c-qmggv"

22 of 24

Go Development Toolchain

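[Diagram: Development: C source is compiled with clang -target bpf into BPF program bytecode. Runtime: a Go library loads the program and its maps through syscalls; the BPF program passes the verifier and JIT compiler and attaches at sockets and TCP/IP on a process's recvmsg()/sendmsg() path, sharing data through a BPF map.]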

23 of 24

Outlook: Future of eBPF

eBPF is turning the Linux kernel into a microkernel.

  • An increasing amount of new kernel functionality is implemented with eBPF.
  • 100% modular and composable.
  • New additions can evolve at a rapid pace. Much quicker than normal kernel development.

Example: The Linux kernel is not aware of containers and microservices (it only knows about namespaces). Cilium is making the Linux kernel container- and Kubernetes-aware.

eBPF could enable the Linux kernel hotpatching we always dreamed about.

Problem:

  • A Linux kernel vulnerability requires patching the kernel.
  • Rebooting 20,000 servers without risking extensive downtime takes a very long time.

[Diagram: several functions inside the Linux kernel, with a hotfix applied to one of them.]
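The hotpatching idea, replacing a faulty function while the system keeps running, can be illustrated with a user-space analogy. This is not how eBPF attaches to kernel functions; it is a minimal sketch, with hypothetical `toy_` names, of callers going through an atomically swappable pointer so that a fix takes effect without a restart.

```c
#include <stdatomic.h>

typedef int (*toy_handler_fn)(int len);

/* Original logic with a bug: it accepts negative lengths. */
int toy_check_v1(int len) { return len <= 1500; }

/* Hotfix: also rejects negative lengths. */
int toy_check_v2(int len) { return len >= 0 && len <= 1500; }

/* Callers always go through this pointer, so it can be swapped at run time. */
static _Atomic(toy_handler_fn) toy_active = toy_check_v1;

/* Call whichever handler is currently installed. */
int toy_call(int len)
{
    toy_handler_fn fn = atomic_load(&toy_active);
    return fn(len);
}

/* Install a replacement handler atomically; returns the previous one. */
toy_handler_fn toy_hotfix(toy_handler_fn replacement)
{
    return atomic_exchange(&toy_active, replacement);
}
```

Before the swap `toy_call(-5)` is accepted by the buggy check; after `toy_hotfix(toy_check_v2)` the same call is rejected, and no caller had to be restarted in between.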

24 of 24


Thank You

eBPF Maintainers: Daniel Borkmann, Alexei Starovoitov

Cilium Team: André Martins, Jarno Rajahalme, Joe Stringer, John Fastabend, Maciej Kwiek, Martynas Pumputis, Paul Chaignon, Quentin Monnet, Ray Bejjani, Tobias Klauser

Facebook Team: Andrii Nakryiko, Andrey Ignatov, Jakub Kicinski, Martin KaFai Lau, Roman Gushchin, Song Liu, Yonghong Song

Google Team: Chenbo Feng, KP Singh, Lorenzo Colitti, Maciej Żenczykowski, Stanislav Fomichev

BCC & bpftrace: Alastair Robertson, Brendan Gregg, Brenden Blanco

Kernel Team: Björn Töpel, David S. Miller, Edward Cree, Jesper Brouer, Toke Høiland-Jørgensen

  • Twitter: @ciliumproject
  • Contact the speaker: @tgraf__

All images: Pixabay