1 of 31

xTCP

By Dave Seddon

2024 September 11th

2 of 31

Disclaimer

  • This is not investment advice ;)
  • Thoughts are Dave Seddon’s only

3 of 31

Agenda

  • ss --tcp --info
  • The dodgy cluster
  • xTCP Overview Diagram
  • xTCP Overview
  • Netlink TCP_Diag
  • Marshalling
  • Exporting ( pub/sub )
  • Data Store ( Clickhouse )

4 of 31

[das@t:~]$ ss --tcp --info -n | more

State Recv-Q Send-Q Local Address:Port Peer Address:Port

ESTAB 0 0 172.16.50.141:53731 3.167.192.41:443

cubic wscale:9,9 rto:223 rtt:22.878/17.427 ato:40 mss:1428 pmtu:1500 rcvmss:1428 advmss:1448 cwnd:10 bytes_sent:931 bytes_acked:932 bytes_received:6219 segs_out:10 segs_in:12 data_segs_out:4 data_segs_in:8 send 4993443bps lastsnd:11540 lastrcv:11539 lastack:11469 pacing_rate 9986720bps delivery_rate 1641608bps delivered:5 app_limited busy:101ms rcv_space:14480 rcv_ssthresh:498552 minrtt:12.874 snd_wnd:68608 rcv_wnd:494080

ESTAB 0 0 172.16.50.141:29649 69.173.154.8:443

cubic wscale:2,9 rto:226 rtt:25.949/5.148 ato:40 mss:1448 pmtu:1500 rcvmss:1400 advmss:1448 cwnd:10 bytes_sent:1904 bytes_acked:1905 bytes_received:5938 segs_out:19 segs_in:14 data_segs_out:4 data_segs_in:6 send 4464141bps lastsnd:64684 lastrcv:64657 lastack:3239 pacing_rate 8928104bps delivery_rate 518968bps delivered:5 app_limited busy:100ms rcv_rtt:25 rcv_space:14480 rcv_ssthresh:498552 minrtt:22.321 snd_wnd:37648 rcv_wnd:495104

ESTAB 0 0 172.16.50.141:16561 15.197.213.252:443

cubic wscale:8,9 rto:302 rtt:101.812/9.064 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:7324 bytes_acked:7325 bytes_received:71738 segs_out:356 segs_in:259 data_segs_out:132 data_segs_in:255 send 1137783bps lastsnd:471940 lastrcv:113188 lastack:113188 pacing_rate 2275560bps delivery_rate 356920bps delivered:133 app_limited busy:13956ms rcv_rtt:128.862 rcv_space:27027 rcv_ssthresh:498552 minrtt:84.642 rcv_ooopack:1 snd_wnd:30976 rcv_wnd:442368

ESTAB 0 0 172.16.50.141:10257 52.12.0.49:443

cubic wscale:7,9 rto:244 rtt:43.9/7.904 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:18 bytes_sent:16505 bytes_acked:16506 bytes_received:2987 segs_out:22 segs_in:15 data_segs_out:14 data_segs_in:6 send 4749704bps lastsnd:15867 lastrcv:15826 lastack:5783 pacing_rate 9499376bps delivery_rate 2511280bps delivered:15 app_limited busy:201ms rcv_space:14480 rcv_ssthresh:498552 minrtt:36.643 snd_wnd:56576 rcv_wnd:498176

ESTAB 0 0 172.16.50.141:21431 35.71.131.137:443

cubic wscale:8,9 rto:219 rtt:18.568/8.373 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:1005 bytes_acked:1006 bytes_received:4446 segs_out:10 segs_in:10 data_segs_out:4 data_segs_in:6 send 6238690bps lastsnd:6726 lastrcv:6697 lastack:6697 pacing_rate 12476872bps delivery_rate 2152152bps delivered:5 app_limited busy:49ms rcv_space:14480 rcv_ssthresh:498552 minrtt:9.584 snd_wnd:68352 rcv_wnd:496128

ESTAB 0 0 172.16.50.141:36785 3.210.219.37:443

cubic wscale:8,9 rto:687 rtt:132.234/86.63 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:4 ssthresh:4 bytes_sent:114169 bytes_retrans:797 bytes_acked:113373 bytes_received:151562 segs_out:3106 segs_in:1567 data_segs_out:1558 data_segs_in:1550 send 350409bps lastsnd:4032 lastrcv:3950 lastack:3950 pacing_rate 420488bps delivery_rate 159248bps delivered:1555 app_limited busy:131157ms retrans:0/11 dsack_dups:7 rcv_space:14480 rcv_ssthresh:498552 minrtt:75.752 snd_wnd:48384 rcv_wnd:495104 rehash:4

ESTAB 0 0 127.0.0.1:61755 127.0.0.1:19092

cubic wscale:9,9 rto:201 rtt:0.11/0.086 ato:40 mss:1448 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:313908140 bytes_acked:313908141 bytes_received:15087312 segs_out:521060 segs_in:369098 data_segs_out:376464 data_segs_in:269412 send 1053090909bps lastsnd:2411 lastrcv:2411 lastack:2411 pacing_rate 2106181816bps delivery_rate 2106181816bps delivered:376465 app_limited busy:63599ms rcv_rtt:85059.4 rcv_space:458996 rcv_ssthresh:434517 minrtt:0.031 snd_wnd:738304 rcv_wnd:458752

Horrible to parse!! ( parse = pain in the … )

5 of 31

The dodgy cluster

cron

ss --tcp --info -n

rsync

python

The initial implementation of the socket collection was Python and Elasticsearch, running on the “dodgy cluster”

6 of 31

xTCP Overview Diagram

Request all TCP diag

Kernel response with TCP diag data

Streaming client

NSQ

Linux Kernel

pub/sub queue

xTCP

xTCP deserializes the netlink TCP_diag messages into a protobuf

( it can encode to protobuf, prototext, protojson, msgpack )

xTCP streams the records to a pub/sub system

( easy to add more )

Clickhouse reads from pub/sub and inserts into tables

7 of 31

xTCP

xTCP is a golang tool to extract Linux kernel TCP_DIAG socket data

Steps

  1. Open netlink socket
  2. Send netlink TCP_DIAG request
  3. Read TCP_DIAG response
  4. Deserialize to protobuf
  5. Marshal protobuf to bytes
  6. Publish the bytes to pub/sub system
  7. Clickhouse reads from pub/sub system
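Steps 1 and 2 can be sketched as building the request bytes. This is a minimal sketch, not xTCP's actual code: the function name buildDiagRequest is hypothetical, the constants are taken from the Linux uapi headers (netlink.h, sock_diag.h, inet_diag.h), and a little-endian host is assumed because netlink uses host byte order. The bytes would then be written to an AF_NETLINK socket opened with the NETLINK_SOCK_DIAG protocol.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Values from the Linux uapi headers.
const (
	sockDiagByFamily = 20    // SOCK_DIAG_BY_FAMILY: netlink message type
	nlmFRequest      = 0x1   // NLM_F_REQUEST
	nlmFDump         = 0x300 // NLM_F_DUMP = NLM_F_ROOT | NLM_F_MATCH
	afInet           = 2     // AF_INET
	ipprotoTCP       = 6     // IPPROTO_TCP
	nlMsgHdrLen      = 16    // sizeof(struct nlmsghdr)
	inetDiagReqV2Len = 56    // sizeof(struct inet_diag_req_v2)
)

// buildDiagRequest (hypothetical name) returns one netlink message asking
// the kernel to dump IPv4 TCP sockets. ext is the idiag_ext bitmask of
// optional attributes, states is the bitmask of TCP states to include.
func buildDiagRequest(seq uint32, ext uint8, states uint32) []byte {
	b := make([]byte, nlMsgHdrLen+inetDiagReqV2Len)
	le := binary.LittleEndian // assumption: little-endian host

	// struct nlmsghdr
	le.PutUint32(b[0:4], uint32(len(b)))       // Length
	le.PutUint16(b[4:6], sockDiagByFamily)     // Type
	le.PutUint16(b[6:8], nlmFRequest|nlmFDump) // Flags
	le.PutUint32(b[8:12], seq)                 // Sequence
	le.PutUint32(b[12:16], 0)                  // Pid (0 = kernel picks)

	// struct inet_diag_req_v2
	b[16] = afInet     // sdiag_family
	b[17] = ipprotoTCP // sdiag_protocol
	b[18] = ext        // idiag_ext
	b[19] = 0          // pad
	le.PutUint32(b[20:24], states)
	// b[24:72] is the socket id, left zeroed for a full dump.
	return b
}

func main() {
	req := buildDiagRequest(1, 0, 0xffffffff)
	fmt.Printf("%d bytes, header+request start: % x\n", len(req), req[:24])
}
```

Keeping the request as a plain byte slice means the same buffer can be reused for every poll, with only the sequence number changing.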

8 of 31

Netlink TCP_DIAG

NlMsgHdr

type NlMsgHdr struct {
    Length   uint32
    Type     uint16
    Flags    uint16
    Sequence uint32
    Pid      uint32
}

InetDiagMsg

Nlattr

type InetDiagMsg struct {
    Family   uint8
    State    uint8
    Timer    uint8
    Retrans  uint8
    SocketID SocketID
    Expires  uint32
    Rqueue   uint32
    Wqueue   uint32
    UID      uint32
    Inode    uint32
}

type Nlattr struct {
    NlaLen  uint16
    NlaType uint16
}

<attribute payload>

type Nlattr struct {
    NlaLen  uint16
    NlaType uint16
}

repeated

Types

20 socket diag

3 = End of a dump

repeated

NlMsgHdr end of a dump

Pcap file header

Pcap packet header

Netlink cooked header

type Header struct {
    Magic        uint32
    VersionMajor uint16
    VersionMinor uint16
    ThisZone     int32
    SigFigs      uint32
    SnapLen      uint32
    LinkType     uint32
}

type RecordHeader struct {
    TsSec  uint32
    TsUsec uint32
    CapLen uint32
    Len    uint32
}

type ?? struct {

16 bytes

}

The end of dump comes in its own netlink packet
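Walking the repeated NlMsgHdr messages in a received packet might look like the sketch below. The names walkMessages, msg and align4 are hypothetical, a little-endian host is assumed, and only the two message types from the slide are handled (20 = socket diag, 3 = end of dump). Attribute parsing inside each payload is left out.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	nlMsgHdrLen      = 16 // sizeof(struct nlmsghdr)
	nlMsgDone        = 3  // NLMSG_DONE: end of a dump
	sockDiagByFamily = 20 // SOCK_DIAG_BY_FAMILY: socket diag
)

// align4 rounds n up to the 4-byte netlink alignment (NLMSG_ALIGN).
func align4(n int) int { return (n + 3) &^ 3 }

// walkMessages calls fn with the payload of every socket diag message in
// one received packet. done is true when the end-of-dump message is seen.
func walkMessages(pkt []byte, fn func(payload []byte)) (done bool, err error) {
	for len(pkt) > 0 {
		if len(pkt) < nlMsgHdrLen {
			return false, errors.New("short netlink header")
		}
		length := int(binary.LittleEndian.Uint32(pkt[0:4]))
		typ := binary.LittleEndian.Uint16(pkt[4:6])
		if length < nlMsgHdrLen || length > len(pkt) {
			return false, errors.New("bad netlink message length")
		}
		switch typ {
		case nlMsgDone:
			return true, nil
		case sockDiagByFamily:
			fn(pkt[nlMsgHdrLen:length]) // InetDiagMsg followed by attributes
		}
		next := align4(length)
		if next > len(pkt) {
			next = len(pkt)
		}
		pkt = pkt[next:]
	}
	return false, nil
}

// msg builds a fake netlink message for the demo (header fields other
// than Length and Type are left zero).
func msg(typ uint16, payload []byte) []byte {
	n := nlMsgHdrLen + len(payload)
	b := make([]byte, align4(n))
	binary.LittleEndian.PutUint32(b[0:4], uint32(n))
	binary.LittleEndian.PutUint16(b[4:6], typ)
	copy(b[nlMsgHdrLen:], payload)
	return b
}

func main() {
	sockets := 0
	packets := [][]byte{
		append(msg(sockDiagByFamily, make([]byte, 72)), msg(sockDiagByFamily, make([]byte, 72))...),
		msg(nlMsgDone, nil), // end of dump arrives in its own packet
	}
	for _, pkt := range packets {
		done, err := walkMessages(pkt, func(p []byte) { sockets++ })
		fmt.Println("done:", done, "err:", err, "sockets so far:", sockets)
	}
}
```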

9 of 31

Marshalling

xTCP reads the kernel structs into golang structs

xTCP can then marshal into: protobuf, prototext, protojson, msgpack

Does NOT compress. Leave the compression to the pub/sub system. E.g. Kafka has zstd.
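One way to keep the marshalling step pluggable is a small registry of marshal functions keyed by name. This is only a sketch of the idea: Record, Marshaller and the registry are hypothetical, and the standard library's JSON encoder plus a plain text format stand in for the real protobuf and msgpack encoders, which need external packages.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is a hypothetical, heavily reduced socket record.
type Record struct {
	Hostname string `json:"hostname"`
	RTT      uint32 `json:"tcp_info_rtt"`
}

// Marshaller turns a record into bytes ready for the pub/sub system.
type Marshaller func(*Record) ([]byte, error)

// marshallers is the registry. Adding a format is one more map entry.
var marshallers = map[string]Marshaller{
	"json": func(r *Record) ([]byte, error) { return json.Marshal(r) },
	"text": func(r *Record) ([]byte, error) {
		return []byte(fmt.Sprintf("hostname=%s tcp_info_rtt=%d", r.Hostname, r.RTT)), nil
	},
}

// marshal looks up the named format and applies it. No compression is
// applied here, that is left to the pub/sub system.
func marshal(format string, r *Record) ([]byte, error) {
	m, ok := marshallers[format]
	if !ok {
		return nil, fmt.Errorf("unknown marshaller %q", format)
	}
	return m(r)
}

func main() {
	r := &Record{Hostname: "t", RTT: 7833}
	for _, format := range []string{"json", "text"} {
		b, err := marshal(format, r)
		fmt.Println(format, string(b), err)
	}
}
```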

10 of 31

Exporting ( pub/sub )

xTCP can then export the marshalled bytes to a pub/sub system, e.g. NSQ or Kafka/Redpanda

Of course, you can also use a tool like Benthos to translate the message into a wide range of other types

( Unfortunately, for kafka/redpanda xTCP does not yet use the schema registry, because I couldn’t work out how )

11 of 31

Data Store ( Clickhouse )

xTCP ultimately streams the records into Clickhouse

Clickhouse is a very high performance column data store that leverages log-structured-merge trees

12 of 31

xTCP2 improvements since xTCP1

  • More kernel struct support
    • BBR2 and Prague struct deserialization is already in place for the kernel merge
    • Handles kernels > 4.19
  • Tests && Benchmarking
    • xtcp2 has LOTS of tests and benchmarks for all the deserialization
  • No Reflection
    • ~50-100x faster
  • Simplified and faster without go channels
    • Does NOT use channels. Channels were definitely the bottleneck in xtcp1
    • Benchmarking showed that this dramatically improved performance
  • Sync.Pool
    • Reduced garbage collection via sync.Pool
  • Marshalling
    • Supports multiple marshalling formats and is easy to extend
  • Pub/Sub
    • Supports multiple pub/sub systems and is easy to extend
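The sync.Pool point can be illustrated with a pooled read buffer. This is a generic sketch of the technique, not xTCP's code: packetSize, bufPool and handlePacket are hypothetical names, and fill stands in for the netlink read.

```go
package main

import (
	"fmt"
	"sync"
)

const packetSize = 32 * 1024 // hypothetical read buffer size

// bufPool hands out reusable packet buffers. Pointers to slices are
// stored so that Put does not allocate.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, packetSize)
		return &b
	},
}

// handlePacket borrows a buffer, lets fill write into it (a stand-in for
// reading from the netlink socket), and returns the buffer to the pool.
func handlePacket(fill func([]byte) int) int {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)

	n := fill(*bp)
	// ... deserialize (*bp)[:n] here, before the buffer is returned ...
	return n
}

func main() {
	for i := 0; i < 3; i++ {
		n := handlePacket(func(b []byte) int { return copy(b, "fake netlink bytes") })
		fmt.Println("handled", n, "bytes")
	}
}
```

Anything that must outlive the call has to be copied out of the buffer before it goes back to the pool.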

13 of 31

xTCP[2] does NOT use reflection

BenchmarkDeserializeInetDiagMsg-16 100000000 11.20 ns/op

BenchmarkDeserializeInetDiagMsgReflection-16 1848198 633.7 ns/op

BenchmarkDeserializeInetDiagSockID-16 352283820 3.439 ns/op

BenchmarkDeserializeInetDiagSockIDReflection-16 2102726 539.6 ns/op

BenchmarkDeserializeBBRInfo-16 386265880 3.227 ns/op

BenchmarkDeserializeBBRInfoReflection-16 5569983 200.3 ns/op

BenchmarkDeserializeDCTCPInfoReflection-16 5931864 200.9 ns/op

BenchmarkDeserializeMemInfo-16 476499230 2.534 ns/op

BenchmarkDeserializeMemInfoReflection-16 6394890 187.6 ns/op

BenchmarkDeserializeSkMemInfo-16 284344549 4.241 ns/op

BenchmarkDeserializeSkMemInfoReflection-16 4660838 265.3 ns/op

BenchmarkDeserializeSockOpt-16 469690822 2.524 ns/op

BenchmarkDeserializeSockOptReflection-16 12774944 101.1 ns/op

BenchmarkDeserializeTrafficClass-16 535579550 1.958 ns/op

BenchmarkDeserializeTrafficClassReflection-16 12078608 103.6 ns/op

BenchmarkDeserializeTCPInfo-16 68040174 17.39 ns/op <----- New non-reflection

BenchmarkDeserializeTCPInfoReflection-16 1270778 942.7 ns/op <--------------- Reflection !!

The first implementation of xTCP used reflection

The new xtcp2 implementation is ~50-100 times faster!

Tcp_info struct without reflection is ~17ns

Tcp_info struct with reflection is ~942ns

942 / 17 ~= 55.4 times faster
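The difference between the two approaches can be sketched with InetDiagMsg. The hand-written version reads each field at a fixed offset, while binary.Read on a struct goes through the reflect package. Assumptions in this sketch: a little-endian host, the function names are hypothetical, and SocketID is kept as 48 raw bytes rather than decoded into its own struct.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

const inetDiagMsgLen = 72 // 4 + 48 + 5*4 bytes

// InetDiagMsg follows the struct on the netlink slide. SocketID is kept
// as raw bytes here to keep the sketch short.
type InetDiagMsg struct {
	Family   uint8
	State    uint8
	Timer    uint8
	Retrans  uint8
	SocketID [48]byte
	Expires  uint32
	Rqueue   uint32
	Wqueue   uint32
	UID      uint32
	Inode    uint32
}

// deserialize reads every field at its fixed offset, with no reflection.
func deserialize(b []byte, m *InetDiagMsg) error {
	if len(b) < inetDiagMsgLen {
		return errors.New("short inet_diag_msg")
	}
	le := binary.LittleEndian // assumption: little-endian host
	m.Family, m.State, m.Timer, m.Retrans = b[0], b[1], b[2], b[3]
	copy(m.SocketID[:], b[4:52])
	m.Expires = le.Uint32(b[52:56])
	m.Rqueue = le.Uint32(b[56:60])
	m.Wqueue = le.Uint32(b[60:64])
	m.UID = le.Uint32(b[64:68])
	m.Inode = le.Uint32(b[68:72])
	return nil
}

// deserializeReflection lets encoding/binary walk the struct via reflect.
func deserializeReflection(b []byte, m *InetDiagMsg) error {
	return binary.Read(bytes.NewReader(b), binary.LittleEndian, m)
}

func main() {
	b := make([]byte, inetDiagMsgLen)
	b[0], b[1] = 2, 1                              // Family AF_INET, State 1
	binary.LittleEndian.PutUint32(b[68:72], 12345) // Inode

	var direct, reflected InetDiagMsg
	fmt.Println(deserialize(b, &direct), deserializeReflection(b, &reflected))
	fmt.Println("same result:", direct == reflected, "inode:", direct.Inode)
}
```

Both functions fill the struct identically, so the choice only affects speed and how much code has to be written per kernel struct.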

14 of 31

xTCP on a single machine

xTCP

xTCPView

SQL queries

FreeBSD/MacOS on the roadmap

xTCP extracts socket data and streams it to Clickhouse

15 of 31

xTCP on many machines

xTCP

xTCPView

SQL queries

Kafka Cluster

ClickHouse Cluster

Edgio has ~20k machines streaming xTCP data

16 of 31

xTCPView

xTCPView is a dashboard system for visualizing the TCP socket data

  • Golang HTTP web server
    • Serves a Flutter javascript website
  • Web browser runs the Flutter website
  • Flutter makes GRPC web calls to xTCPView
  • xTCPView serves the data to the Flutter website
  • Flutter website renders the data into Flutter charts

xTCPView

GRPC web calls to xTCPView

Time (t)

TCP socket RTT and RTT variance

17 of 31

RTT data

SELECT
    sec,
    nsec,
    hostname,
    tcp_info_rtt,
    tcp_info_rtt_var,
    tcp_info_min_rtt,
    tcp_info_rcv_rtt
FROM xtcp.xtcp_records
ORDER BY
    sec DESC,
    nsec DESC
LIMIT 20

Query id: 342acea9-6f6c-4fd8-b857-ed671bbdb919

┌─────────────────────sec─┬──────nsec─┬─hostname─┬─tcp_info_rtt─┬─tcp_info_rtt_var─┬─tcp_info_min_rtt─┬─tcp_info_rcv_rtt─┐

1. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 7833 │ 6921 │ 1448 │ 1382016 │

2. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 18027 │ 3534 │ 13633 │ 0 │

3. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 2587 │ 794 │ 1412 │ 4690 │

4. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 16107 │ 3324 │ 9170 │ 0 │

5. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 42688 │ 9736 │ 41653 │ 45000 │

6. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 42688 │ 9736 │ 41653 │ 45000 │

7. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 45487 │ 17085 │ 39866 │ 74000 │

8. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 76163 │ 5443 │ 72130 │ 81000 │

9. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 76163 │ 5443 │ 72130 │ 81000 │

10. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 76289 │ 1110 │ 70076 │ 0 │

11. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 56418 │ 25976 │ 37191 │ 0 │

12. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 17004 │ 3325 │ 13239 │ 0 │

13. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 53406 │ 20141 │ 33923 │ 0 │

14. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 20400 │ 8471 │ 12834 │ 0 │

15. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 17310 │ 4137 │ 14244 │ 0 │

16. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 5850 │ 1616 │ 1000 │ 0 │

17. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 22963 │ 6171 │ 18357 │ 0 │

18. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 23946 │ 2099 │ 19000 │ 0 │

19. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 23946 │ 2099 │ 19000 │ 0 │

20. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 81945 │ 17515 │ 68786 │ 81000 │

└─────────────────────────┴───────────┴──────────┴──────────────┴──────────────────┴──────────────────┴──────────────────┘

20 rows in set. Elapsed: 0.009 sec. Processed 84.77 thousand rows, 2.80 MB (9.62 million rows/s., 317.53 MB/s.)

Peak memory usage: 10.80 MiB.

The current table has ~200 fields

18 of 31

Reference

19 of 31

FreeBSD

20 of 31

xTCP for Kubernetes

By Dave Seddon

2024 December 6th

21 of 31

xTCP for kubernetes?

  • Background
    • Xtcp was originally designed for exporting socket data from the “default” network namespace
  • Objective
    • Allow organizations to monitor the tcp_diag socket data across their entire kubernetes (k8s) cluster
  • Why?
    • Tcp_diag socket data is a very rich source of information about performance
    • Allows for tuning and optimization
    • Allows for security-related monitoring. E.g. Are there sockets that should not exist?
  • Challenge
    • Netlink sockets need to be opened within each k8s PoD, to query the tcp_diag data for each PoD
    • How can we efficiently perform cluster monitoring?

22 of 31

xTCP for kubernetes?

Pod X

xtcp

Pod Y

Container X

Container Y

xtcp

Netlink sockets to query the tcp_diag

xtcp

Xtcp runs as a side-car in each PoD and each connects to Kafka

Simple solution is to run xtcp as a side-car which is injected into each Pod

The downside is that there’s xTCP running in RAM in all these places (lots of duplication), and there are lots of Kafka sockets. The TCP kafka sockets are a convenient way to get the data out of each unique Pod network namespace. The processing of the netlink socket data into protobuf will happen in each instance.

Often the PoDs are pretty light weight and only have a few TCP sockets, so running an entire xTCP process seems wasteful.

23 of 31

xTCP for kubernetes?

Pod X

Pod Y

Xtcp Daemonset

Container X

Container Y

It would be amazing if xTCP could run as a daemonSet (single instance per machine) and then open netlink sockets into each PoD


However, a single process can’t belong to multiple network namespaces, so this is impossible :(

This solution is attractive because there would be a single instance of xTCP running, so this would be quite efficient. The problem is you can't do this.

24 of 31

Challenge

So how can we minimize the number of xTCP instances running, and still get access to all the network namespaces?

25 of 31

xTCP for kubernetes?

Pod X

Pod Y

Xtcp Daemonset

Container X

Container Y

Unix domain sockets

Netlink sockets to query the tcp_diag

Lightweight sidecar domain-socket-to-netlink proxy

Domain sockets can be mounted between a daemonset container and each of the PoDs. This can be used as a lightweight way to request and return the data from each PoD

Unix domain sockets mounted between the xtcp Daemonset and each Pod

Lightweight UDS/netlink proxy

The processing of the netlink socket data will happen in the daemonset, so this will move most of the CPU intensive work to the daemonset container, rather than happening within each PoD.
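A sketch of what such a proxy could look like is below. It is hypothetical throughout: the length-prefixed framing, the function names, and the query callback (standing in for "write the request to the Pod's netlink socket and read the reply") are all assumptions, and a real dump returns several packets rather than one. The point is that the Pod side only copies bytes, and all parsing stays in the daemonset.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// writeFrame sends a 4-byte big-endian length followed by the payload.
func writeFrame(w io.Writer, p []byte) error {
	var l [4]byte
	binary.BigEndian.PutUint32(l[:], uint32(len(p)))
	if _, err := w.Write(l[:]); err != nil {
		return err
	}
	_, err := w.Write(p)
	return err
}

// readFrame reads one length-prefixed frame.
func readFrame(r io.Reader) ([]byte, error) {
	var l [4]byte
	if _, err := io.ReadFull(r, l[:]); err != nil {
		return nil, err
	}
	p := make([]byte, binary.BigEndian.Uint32(l[:]))
	_, err := io.ReadFull(r, p)
	return p, err
}

// serveProxy is the Pod side. It forwards each request to query and
// returns the raw reply without parsing it.
func serveProxy(ln net.Listener, query func(req []byte) []byte) {
	for {
		c, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go func(c net.Conn) {
			defer c.Close()
			for {
				req, err := readFrame(c)
				if err != nil {
					return
				}
				if writeFrame(c, query(req)) != nil {
					return
				}
			}
		}(c)
	}
}

// roundTrip is the daemonset side: one request, one raw response.
func roundTrip(path string, req []byte) ([]byte, error) {
	c, err := net.Dial("unix", path)
	if err != nil {
		return nil, err
	}
	defer c.Close()
	if err := writeFrame(c, req); err != nil {
		return nil, err
	}
	return readFrame(c)
}

// demo runs a proxy on a temporary socket with a fake kernel behind it.
func demo() ([]byte, error) {
	dir, err := os.MkdirTemp("", "xtcp")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "xtcp.sock")
	ln, err := net.Listen("unix", path)
	if err != nil {
		return nil, err
	}
	defer ln.Close()
	fakeKernel := func(req []byte) []byte { return append([]byte("reply:"), req...) }
	go serveProxy(ln, fakeKernel)
	return roundTrip(path, []byte("dump"))
}

func main() {
	resp, err := demo()
	fmt.Printf("%q %v\n", resp, err)
}
```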

26 of 31

xTCP for kubernetes?

Xtcp Daemonset

volumes:
  - name: xtcp-volume
    emptyDir:
      medium: Memory

27 of 31

Poller

ns

xTCP

ns opens netlink sockets into each network namespace, and adds and removes the netlink sockets as network namespaces come and go

GetNetlinkSocketFDs() (fds []int)

For each network namespace

netlinkers
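The add/remove bookkeeping can be sketched as a pure function that compares the namespaces seen on the latest scan with the sockets already open. The name reconcile and the map layout are assumptions. Actually opening a socket inside another namespace (for example via setns) needs privileges and is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile compares the namespaces seen on this scan with the netlink
// sockets currently open (keyed by namespace name, value is the fd).
// It returns which namespaces need a new socket and which sockets
// belong to namespaces that have gone away.
func reconcile(open map[string]int, seen []string) (toOpen, toClose []string) {
	present := make(map[string]bool, len(seen))
	for _, ns := range seen {
		if present[ns] {
			continue // ignore duplicates in the scan
		}
		present[ns] = true
		if _, ok := open[ns]; !ok {
			toOpen = append(toOpen, ns)
		}
	}
	for ns := range open {
		if !present[ns] {
			toClose = append(toClose, ns)
		}
	}
	// Sort for deterministic output, map iteration order is random.
	sort.Strings(toOpen)
	sort.Strings(toClose)
	return toOpen, toClose
}

func main() {
	open := map[string]int{"pod-a": 3, "pod-b": 4}
	toOpen, toClose := reconcile(open, []string{"pod-b", "pod-c"})
	fmt.Println("open:", toOpen, "close:", toClose)
}
```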

28 of 31

Some io_uring thinking follows

29 of 31

32bits

24bits

struct io_uring_cqe {
    __u64 user_data; /* sqe->data submission passed back */
    __s32 res;       /* result code for this event */
    __u32 flags;
};

8bits

operation

requestID

netID
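Packing three fields into the 64-bit user_data could look like the sketch below. The slide gives the widths (8, 24 and 32 bits) and the names (operation, requestID, netID), but the mapping of width to field and the bit positions used here are an assumption: operation in the top 8 bits, requestID in the next 24, netID in the low 32.

```go
package main

import "fmt"

// pack builds a user_data value. Assumed layout (not confirmed by the
// slide): bits 63..56 operation, bits 55..32 requestID, bits 31..0 netID.
// requestID is truncated to 24 bits.
func pack(operation uint8, requestID uint32, netID uint32) uint64 {
	return uint64(operation)<<56 | uint64(requestID&0xffffff)<<32 | uint64(netID)
}

// unpack reverses pack, as would be done when a completion (CQE) comes
// back carrying the same user_data.
func unpack(userData uint64) (operation uint8, requestID uint32, netID uint32) {
	return uint8(userData >> 56), uint32(userData>>32) & 0xffffff, uint32(userData)
}

func main() {
	u := pack(2, 0x123456, 0xdeadbeef)
	op, req, ns := unpack(u)
	fmt.Printf("user_data=%#016x op=%d requestID=%#x netID=%#x\n", u, op, req, ns)
}
```

With this, the poller can tell from a completion alone which namespace and which request it belongs to, without a lookup table.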

30 of 31

Batch of Request all TCP diag

Kernel response with TCP diag data

Linux Kernel

xTCP

xTCP deserializes the netlink TCP_diag messages into a protobuf

( it can encode to protobuf, prototext, protojson, msgpack )

Poller will queue up the netlink write requests and send them as a batch

31 of 31

Options

Option: Simple
  • Description: Inject xtcp as a sidecar. Xtcp opens netlink and Kafka sockets
  • Pros: Each Pod is independent. Pretty simple
  • Cons: Each Pod has a Kafka socket open, so that would mean a lot of Kafka sockets. ~20MB RAM per Pod

Option: Daemonset
  • Description: Single instance of xtcp per machine
  • Pros: Small amount of RAM on each machine
  • Cons: Can’t access each network namespace