1 of 16

Replacing TPROXY with eBPF

Joe Stringer, Lorenz Bauer

2 of 16

TPROXY use case: Cilium

  • Send incoming/outgoing connections through a userspace proxy for L7
    • If I have a rule for this pod that only allows TCP port 80 `GET /public`, send all traffic to L7 proxy (which we configure with SO_REUSEADDR, IP_TRANSPARENT on that port)
    • From-pod policy, to-pod policy (to other pods or to world)
    • TC ingress (egress can probably be skipped)
    • Implemented in Cilium master via a bunch of ugliness
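
The proxy-side socket setup mentioned above can be sketched as follows. This is a minimal sketch, not Cilium's actual code: the function name `open_transparent_listener` is hypothetical, and setting `IP_TRANSPARENT` needs CAP_NET_ADMIN (or CAP_NET_RAW), so the call is expected to fail with EPERM when unprivileged.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_TRANSPARENT
#define IP_TRANSPARENT 19	/* Linux value, for libcs that do not expose it */
#endif

/* Hypothetical helper: open a TCP listener that can accept connections
 * addressed to non-local IPs. Returns the fd, or -1 with errno set. */
int open_transparent_listener(uint16_t port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
	    setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &one, sizeof(one)) < 0 ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, SOMAXCONN) < 0) {
		int saved = errno;	/* keep the first error across close() */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

With the iptables rule on the next slide, the proxy would call this with the `--on-port` value.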

3 of 16

Current approach with iptables (Cilium)

  • From-pod:
    • Traffic hits BPF prog on TC ingress in host netns
    • BPF prog determines that L7 policy must apply, sets `skb->mark` (and packet_type)
    • Pass up stack
    • Pre-routing:
      • -A CILIUM_PRE_mangle -i lxc+ -p tcp -m mark --mark 0x93830200 -m comment --comment "cilium: TPROXY to host cilium-dns-egress proxy on lxc+" -j TPROXY --on-port 33683 --on-ip 0.0.0.0 --tproxy-mark 0x200/0xffffffff

    • Rules:
      • 9: from all fwmark 0x200/0xf00 lookup 2004
    • Routing:
      • local default dev lo scope host
  • To-pod:
    • Very similar, but from TC ingress on other device (eg, vxlan encap dev or net access dev)
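
The mark handling in the from-pod path can be traced with a little arithmetic. The sketch below models it in plain C, assuming the usual semantics: a mark match compares only the masked bits, and `--tproxy-mark value/mask` replaces the masked bits. The function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Match used by `-m mark --mark value[/mask]` and by
 * `ip rule ... fwmark value/mask`: compare only the masked bits. */
bool mark_matches(uint32_t mark, uint32_t value, uint32_t mask)
{
	return (mark & mask) == value;
}

/* Effect of `--tproxy-mark value/mask`: replace the masked bits. */
uint32_t tproxy_set_mark(uint32_t mark, uint32_t value, uint32_t mask)
{
	return (mark & ~mask) ^ value;
}

/* Walk the example: returns true if a packet carrying `mark` is
 * redirected by the TPROXY rule and then selects routing table 2004. */
bool routed_to_local_table(uint32_t mark)
{
	/* --mark 0x93830200 with no mask means all 32 bits must match */
	if (!mark_matches(mark, 0x93830200, 0xffffffff))
		return false;

	/* --tproxy-mark 0x200/0xffffffff overwrites the whole mark */
	mark = tproxy_set_mark(mark, 0x200, 0xffffffff);

	/* rule 9: from all fwmark 0x200/0xf00 lookup 2004 */
	return mark_matches(mark, 0x200, 0xf00);
}
```

Table 2004 then holds the single `local default dev lo` route, which delivers the packet locally.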

4 of 16

TPROXY use case: Cloudflare Spectrum

  • Proxy accepts TCP / UDP traffic, forwards it to origin server
  • Forward traffic to a specific subnet to a local socket
  • Roughly: -A PREROUTING -p tcp -m set --match-set spectrum/v4/h:n dst -j TPROXY --on-port 2438 --on-ip 127.0.0.1

                 Single IP    Many subnets
  Single Port    bind()       SO_BINDTOPREFIX*
  Many Ports     TPROXY       TPROXY

5 of 16

Cloudflare Goals

  • Get rid of out-of-tree SO_BINDTOPREFIX
  • Make tproxy sockets visible to the load balancer (via inet_lookup)
  • Far future: very flexible socket dispatch

6 of 16

TPROXY magic sauce

ip_rcv
  ip_rcv_core
    skb_orphan: skb->sk = NULL
  NF_HOOK(NF_INET_PRE_ROUTING) <- TPROXY added and removed here, skb->sk = sk
    ip_rcv_finish
      dst_input
        ...
          tcp_v4_rcv
            __inet_lookup_skb
              skb_steal_sock <- returns skb->sk from TPROXY
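
The ordering above matters: anything assigned to skb->sk before ip_rcv_core is dropped, while an assignment made in PRE_ROUTING survives until the socket lookup. A toy userspace model of that sequence follows. It is only a sketch; the `*_model` functions and the simplified structs are stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct sock { int id; };
struct sk_buff { struct sock *sk; };

/* ip_rcv_core: drop any socket reference attached below the IP layer. */
void skb_orphan_model(struct sk_buff *skb)
{
	skb->sk = NULL;
}

/* __inet_lookup_skb: use a pre-assigned socket if present
 * (skb_steal_sock), otherwise fall back to the table lookup result. */
struct sock *inet_lookup_model(struct sk_buff *skb, struct sock *table_result)
{
	struct sock *sk = skb->sk;

	if (sk) {
		skb->sk = NULL;	/* the reference is "stolen" from the skb */
		return sk;
	}
	return table_result;
}

/* Full receive path. `proxy` is the socket TPROXY assigns in
 * PRE_ROUTING, or NULL if no TPROXY rule matched. */
struct sock *receive_model(struct sk_buff *skb, struct sock *proxy,
			   struct sock *table_result)
{
	skb_orphan_model(skb);		/* ip_rcv_core */
	if (proxy)
		skb->sk = proxy;	/* NF_INET_PRE_ROUTING: TPROXY */
	return inet_lookup_model(skb, table_result);
}
```

In this model a socket attached to the skb before `receive_model()` never reaches the lookup, which is the problem a set_sk helper called from tc would run into.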

7 of 16

Idea #1: set_sk(skb) helper

  • Like TPROXY, assign to skb->sk

8 of 16

Idea #2: Hook socket dispatch

  • Add fallback to eBPF in inet_lookup / udp_lib_lookup
  • If the lookup for ESTABLISHED / LISTENING sockets fails
  • Run eBPF which selects socket according to business logic
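
The fallback order can be modelled in userspace C as below. This is a sketch of the control flow only: every name here is hypothetical, and the function pointer `bpf_select` stands in for the attached eBPF program.

```c
#include <stddef.h>

struct tuple {
	unsigned int saddr, daddr;
	unsigned short sport, dport;
};
struct sock { int id; };

typedef struct sock *(*lookup_fn)(const struct tuple *);

/* Try the existing tables first; only consult the program on a miss. */
struct sock *lookup_with_fallback(const struct tuple *t,
				  lookup_fn established,
				  lookup_fn listening,
				  lookup_fn bpf_select)
{
	struct sock *sk = established(t);

	if (!sk)
		sk = listening(t);
	if (!sk && bpf_select)
		sk = bpf_select(t);	/* "business logic" picks a socket */
	return sk;
}

/* Example callbacks, purely illustrative. */
struct sock ssh_sock = { 1 };
struct sock proxy_sock = { 2 };

struct sock *no_match(const struct tuple *t)
{
	(void)t;
	return NULL;
}

struct sock *find_ssh_listener(const struct tuple *t)
{
	return t->dport == 22 ? &ssh_sock : NULL;
}

/* Send a whole port range to one proxy socket. */
struct sock *select_proxy(const struct tuple *t)
{
	return (t->dport >= 8000 && t->dport <= 8999) ? &proxy_sock : NULL;
}
```

Because the program only runs after both regular lookups miss, existing listeners keep priority in this model.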

9 of 16

Idea #3: Early demux from BPF

  • sk = bpf_sk_lookup_tcp(...)
  • return bpf_redirect_sk(skb, sk)
    • direct call to tcp_rcv / udp_rcv
    • “steal” skb from the rest of the stack

10 of 16

sk_lookup_* with SK_REUSEPORT

  • Unrelated, but needs to be fixed before set_sk can work
  • Helpers don’t pass skb to inet_lookup, so REUSEPORT programs can’t run
    • XDP doesn’t have skb

11 of 16

set_sk question: skb_orphan() in ip_rcv_core

  • This kills our `bpf_set_sk()` from tc hooks
  • From: Florian Westphal <fw@strlen.de>: "Without the skb_orphan udp/tcp might steal tunnel/ppp etc. socket instead of tproxy assigned tcp/udp socket."

12 of 16

set_sk: what kind of validation?

  • bpf_sk_lookup_tcp() can return a socket unrelated to the packet (different socket or different protocol)
    • `bpf_set_sk()` checks that the packet’s L3/L4 protocols match the socket upon set? Then need to ensure they aren’t modified afterwards
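
One possible shape for the check-on-set is sketched below in plain C. The structs and `set_sk_allowed` are hypothetical, and the sketch deliberately ignores dual-stack sockets (an AF_INET6 socket that accepts IPv4 traffic), which a real check would have to decide on.

```c
#include <netinet/in.h>		/* IPPROTO_TCP, IPPROTO_UDP */
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>		/* AF_INET, AF_INET6 */

/* Hypothetical summaries of what the helper would compare. */
struct pkt_info { int family; int l4proto; };
struct sock_info { int family; int protocol; };

/* Allow the assignment only if L3 family and L4 protocol both match. */
bool set_sk_allowed(const struct pkt_info *pkt, const struct sock_info *sk)
{
	return sk != NULL &&
	       pkt->family == sk->family &&
	       pkt->l4proto == sk->protocol;
}
```

The check is only valid at the time of the call, which is why the headers would need to stay unmodified afterwards.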

13 of 16

socket dispatch: how to prevent infinite loop?

  • Could enforce LISTENER_ONLY flag for TCP, but UDP?
  • SK_REUSEPORT also at risk, no access to socket lookup though

14 of 16

socket dispatch: tied to network namespace?

  • No other similar hook exists
  • Only a single user per namespace?

15 of 16

redirect_sk: duplicate work

  • What checks need to be made for this to be safe?
  • How much of the usual receive path does this duplicate?

16 of 16

Question: what should ctx be?

  • skb is powerful, but wastes work
    • problematic from non-skb context
  • new type with 5-tuple
    • could add packet buffers a la XDP ctx later
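
For discussion, a 5-tuple context could look roughly like the struct below. This is a sketch under assumptions: the struct name, field names, and layout are invented here, not a proposed ABI.

```c
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>		/* AF_INET */

/* Hypothetical context: just the 5-tuple, no packet data. */
struct dispatch_ctx {
	uint8_t  family;	/* AF_INET or AF_INET6 */
	uint8_t  protocol;	/* IPPROTO_TCP or IPPROTO_UDP */
	uint16_t sport;		/* host byte order */
	uint16_t dport;		/* host byte order */
	uint8_t  saddr[16];	/* IPv4 uses the first 4 bytes */
	uint8_t  daddr[16];
};

/* Fill the context for an IPv4 packet; unused address bytes are zeroed. */
void dispatch_ctx_from_v4(struct dispatch_ctx *ctx, uint8_t protocol,
			  const uint8_t saddr[4], uint16_t sport,
			  const uint8_t daddr[4], uint16_t dport)
{
	memset(ctx, 0, sizeof(*ctx));
	ctx->family = AF_INET;
	ctx->protocol = protocol;
	ctx->sport = sport;
	ctx->dport = dport;
	memcpy(ctx->saddr, saddr, 4);
	memcpy(ctx->daddr, daddr, 4);
}
```

A context like this can be built without an skb, so it would also work from callers that only have the addresses and ports.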