xTCP
By Dave Seddon
2024 September 11th
Disclaimer
Agenda
[das@t:~]$ ss --tcp --info -n | more
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 172.16.50.141:53731 3.167.192.41:443
cubic wscale:9,9 rto:223 rtt:22.878/17.427 ato:40 mss:1428 pmtu:1500 rcvmss:1428 advmss:1448 cwnd:10 bytes_sent:931 bytes_acked:932 bytes_received:6219 segs_out:10 segs_in:12 data_segs_out:4 data_segs_in:
8 send 4993443bps lastsnd:11540 lastrcv:11539 lastack:11469 pacing_rate 9986720bps delivery_rate 1641608bps delivered:5 app_limited busy:101ms rcv_space:14480 rcv_ssthresh:498552 minrtt:12.874 snd_wnd:68608 rcv_wn
d:494080
ESTAB 0 0 172.16.50.141:29649 69.173.154.8:443
cubic wscale:2,9 rto:226 rtt:25.949/5.148 ato:40 mss:1448 pmtu:1500 rcvmss:1400 advmss:1448 cwnd:10 bytes_sent:1904 bytes_acked:1905 bytes_received:5938 segs_out:19 segs_in:14 data_segs_out:4 data_segs_in
:6 send 4464141bps lastsnd:64684 lastrcv:64657 lastack:3239 pacing_rate 8928104bps delivery_rate 518968bps delivered:5 app_limited busy:100ms rcv_rtt:25 rcv_space:14480 rcv_ssthresh:498552 minrtt:22.321 snd_wnd:37
648 rcv_wnd:495104
ESTAB 0 0 172.16.50.141:16561 15.197.213.252:443
cubic wscale:8,9 rto:302 rtt:101.812/9.064 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:7324 bytes_acked:7325 bytes_received:71738 segs_out:356 segs_in:259 data_segs_out:132 data_s
egs_in:255 send 1137783bps lastsnd:471940 lastrcv:113188 lastack:113188 pacing_rate 2275560bps delivery_rate 356920bps delivered:133 app_limited busy:13956ms rcv_rtt:128.862 rcv_space:27027 rcv_ssthresh:498552 min
rtt:84.642 rcv_ooopack:1 snd_wnd:30976 rcv_wnd:442368
ESTAB 0 0 172.16.50.141:10257 52.12.0.49:443
cubic wscale:7,9 rto:244 rtt:43.9/7.904 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:18 bytes_sent:16505 bytes_acked:16506 bytes_received:2987 segs_out:22 segs_in:15 data_segs_out:14 data_segs_i
n:6 send 4749704bps lastsnd:15867 lastrcv:15826 lastack:5783 pacing_rate 9499376bps delivery_rate 2511280bps delivered:15 app_limited busy:201ms rcv_space:14480 rcv_ssthresh:498552 minrtt:36.643 snd_wnd:56576 rcv_
wnd:498176
ESTAB 0 0 172.16.50.141:21431 35.71.131.137:443
cubic wscale:8,9 rto:219 rtt:18.568/8.373 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:1005 bytes_acked:1006 bytes_received:4446 segs_out:10 segs_in:10 data_segs_out:4 data_segs_in
:6 send 6238690bps lastsnd:6726 lastrcv:6697 lastack:6697 pacing_rate 12476872bps delivery_rate 2152152bps delivered:5 app_limited busy:49ms rcv_space:14480 rcv_ssthresh:498552 minrtt:9.584 snd_wnd:68352 rcv_wnd:4
96128
ESTAB 0 0 172.16.50.141:36785 3.210.219.37:443
cubic wscale:8,9 rto:687 rtt:132.234/86.63 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:4 ssthresh:4 bytes_sent:114169 bytes_retrans:797 bytes_acked:113373 bytes_received:151562 segs_out:3106 se
gs_in:1567 data_segs_out:1558 data_segs_in:1550 send 350409bps lastsnd:4032 lastrcv:3950 lastack:3950 pacing_rate 420488bps delivery_rate 159248bps delivered:1555 app_limited busy:131157ms retrans:0/11 dsack_dups:
7 rcv_space:14480 rcv_ssthresh:498552 minrtt:75.752 snd_wnd:48384 rcv_wnd:495104 rehash:4
ESTAB 0 0 127.0.0.1:61755 127.0.0.1:19092
cubic wscale:9,9 rto:201 rtt:0.11/0.086 ato:40 mss:1448 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:313908140 bytes_acked:313908141 bytes_received:15087312 segs_out:521060 segs_in:369098 data_se
gs_out:376464 data_segs_in:269412 send 1053090909bps lastsnd:2411 lastrcv:2411 lastack:2411 pacing_rate 2106181816bps delivery_rate 2106181816bps delivered:376465 app_limited busy:63599ms rcv_rtt:85059.4 rcv_space
:458996 rcv_ssthresh:434517 minrtt:0.031 snd_wnd:738304 rcv_wnd:458752
Horrible to parse!! ( parse = pain in the … )
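To make the pain concrete, here is a minimal sketch of a parser for one `ss --info` detail line. The function name `parseSSInfo` and the list of space-separated keys are assumptions for illustration, not part of xTCP; the sketch only covers the three token shapes visible in the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSSInfo splits one `ss --info` detail line into a key/value map.
// It handles the three token shapes visible in the output above:
//
//	key:value   (rtt:22.878/17.427, busy:101ms)
//	key value   (send 4993443bps, pacing_rate ..., delivery_rate ...)
//	bare flag   (cubic, app_limited)
//
// The spaceKeys set is an assumption; a real parser would need every such key.
func parseSSInfo(line string) map[string]string {
	spaceKeys := map[string]bool{"send": true, "pacing_rate": true, "delivery_rate": true}
	out := make(map[string]string)
	tokens := strings.Fields(line)
	for i := 0; i < len(tokens); i++ {
		tok := tokens[i]
		if k, v, ok := strings.Cut(tok, ":"); ok {
			out[k] = v // values stay strings: units and "a/b" pairs still need parsing
			continue
		}
		if spaceKeys[tok] && i+1 < len(tokens) {
			out[tok] = tokens[i+1]
			i++ // consume the value token
			continue
		}
		out[tok] = "" // bare flag, present with no value
	}
	return out
}

func main() {
	line := "cubic wscale:9,9 rto:223 rtt:22.878/17.427 cwnd:10 send 4993443bps app_limited busy:101ms"
	m := parseSSInfo(line)
	fmt.Println(m["rtt"], m["send"], m["busy"])
}
```

Even after this, every value is still a string with mixed units (`ms`, `bps`, `a/b` pairs), which is the motivation for reading the kernel structs directly.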
The dodgy cluster
cron
ss --tcp --info -n
rsync
python
The initial implementation of the socket collection used Python and Elasticsearch, running on the “dodgy cluster”
xTCP Overview Diagram
Request all TCP diag
Kernel response with TCP diag data
Streaming client
NSQ
Linux Kernel
pub/sub queue
xTCP
xTCP deserializes the netlink TCP_diag messages into a protobuf
( it can encode to protobuf, prototext, protojson, msgpack )
xTCP streams the records to a pub/sub system
( easy to add more )
Clickhouse reads from pub/sub and inserts into tables
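The "request all TCP diag" step in the diagram can be sketched as building one netlink message: a 16-byte netlink header followed by the kernel's 56-byte `inet_diag_req_v2`. This is a sketch, not xTCP's actual code: the function name `buildDumpRequest` is hypothetical, and the bytes are written little-endian on the assumption of a little-endian host (netlink uses host byte order).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	sockDiagByFamily = 20    // SOCK_DIAG_BY_FAMILY message type
	nlmFRequest      = 0x1   // NLM_F_REQUEST
	nlmFDump         = 0x300 // NLM_F_DUMP: return all matching sockets
	afInet           = 2     // AF_INET
	ipprotoTCP       = 6     // IPPROTO_TCP
	inetDiagInfo     = 2     // INET_DIAG_INFO extension (tcp_info)
	nlMsgHdrLen      = 16    // netlink message header
	inetDiagReqV2Len = 56    // family, protocol, ext, pad, states, 48-byte socket ID
)

// buildDumpRequest returns the bytes of a "dump all TCP sockets" request
// for one address family. ext is the idiag_ext bitmask of wanted attributes.
func buildDumpRequest(family uint8, ext uint8, seq uint32) []byte {
	b := make([]byte, nlMsgHdrLen+inetDiagReqV2Len)
	le := binary.LittleEndian // assumption: little-endian host

	// Netlink header; the trailing pid field stays 0.
	le.PutUint32(b[0:4], uint32(len(b)))
	le.PutUint16(b[4:6], sockDiagByFamily)
	le.PutUint16(b[6:8], nlmFRequest|nlmFDump)
	le.PutUint32(b[8:12], seq)

	// inet_diag_req_v2; the socket ID is left zeroed for a dump.
	b[16] = family
	b[17] = ipprotoTCP
	b[18] = ext
	le.PutUint32(b[20:24], 0xffffffff) // idiag_states: every TCP state
	return b
}

func main() {
	req := buildDumpRequest(afInet, 1<<(inetDiagInfo-1), 1)
	fmt.Printf("%d bytes, header: % x\n", len(req), req[:nlMsgHdrLen])
}
```

The resulting slice would be written to a netlink socket opened with the sock diag protocol; opening that socket is left out to keep the sketch portable.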
xTCP
xTCP is a golang tool to extract Linux kernel TCP_DIAG socket data
Steps
Netlink TCP_DIAG
NlMsgHdr
type NlMsgHdr struct {
Length uint32
Type uint16
Flags uint16
Sequence uint32
Pid uint32
}
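As a sketch of how the 16-byte header can be read into this struct without reflection, the fields can be decoded at fixed offsets. The function name `DeserializeNlMsgHdr` is an assumption modelled on the benchmark names later in this deck, and little-endian byte order is assumed (netlink uses host byte order).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// NlMsgHdr mirrors the struct shown above.
type NlMsgHdr struct {
	Length   uint32
	Type     uint16
	Flags    uint16
	Sequence uint32
	Pid      uint32
}

const nlMsgHdrSize = 16

// DeserializeNlMsgHdr decodes the first 16 bytes of b at fixed offsets.
// Assumption: little-endian host byte order.
func DeserializeNlMsgHdr(b []byte) (NlMsgHdr, error) {
	if len(b) < nlMsgHdrSize {
		return NlMsgHdr{}, errors.New("buffer shorter than netlink header")
	}
	le := binary.LittleEndian
	return NlMsgHdr{
		Length:   le.Uint32(b[0:4]),
		Type:     le.Uint16(b[4:6]),
		Flags:    le.Uint16(b[6:8]),
		Sequence: le.Uint32(b[8:12]),
		Pid:      le.Uint32(b[12:16]),
	}, nil
}

func main() {
	raw := []byte{88, 0, 0, 0, 20, 0, 2, 0, 7, 0, 0, 0, 0, 0, 0, 0}
	h, err := DeserializeNlMsgHdr(raw)
	fmt.Printf("%+v %v\n", h, err)
}
```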
InetDiagMsg
Nlattr
type InetDiagMsg struct {
Family uint8
State uint8
Timer uint8
Retrans uint8
SocketID SocketID
Expires uint32
Rqueue uint32
Wqueue uint32
UID uint32
Inode uint32
}
type Nlattr struct {
NlaLen uint16
NlaType uint16
}
<attribute payload>
type Nlattr struct {
NlaLen uint16
NlaType uint16
}
repeated
Types
20 socket diag
3 = End of a dump
repeated
NlMsgHdr end of a dump
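The layout above (header, InetDiagMsg, repeated attributes, then an end-of-dump header) can be walked with a pair of loops. This is a sketch under assumptions: the function names are hypothetical, little-endian byte order is assumed, the 72-byte size is the kernel's `inet_diag_msg`, and lengths are padded to 4 bytes as netlink requires.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	nlMsgHdrSize    = 16 // netlink message header
	inetDiagMsgSize = 72 // InetDiagMsg, including the 48-byte SocketID
	nlattrSize      = 4  // NlaLen + NlaType
	typeSockDiag    = 20 // socket diag record
	typeDone        = 3  // end of a dump
)

// align4 rounds n up to the 4-byte boundary that netlink pads to.
func align4(n int) int { return (n + 3) &^ 3 }

// walkAttrs calls fn once per attribute found in b.
func walkAttrs(b []byte, fn func(typ uint16, value []byte)) error {
	for len(b) >= nlattrSize {
		l := int(binary.LittleEndian.Uint16(b[0:2])) // NlaLen includes the 4-byte header
		t := binary.LittleEndian.Uint16(b[2:4])
		if l < nlattrSize || l > len(b) {
			return errors.New("bad attribute length")
		}
		fn(t, b[nlattrSize:l])
		next := align4(l)
		if next > len(b) {
			next = len(b)
		}
		b = b[next:]
	}
	return nil
}

// walkDump walks every netlink message in buf and reports whether the
// end-of-dump message (type 3) was seen.
func walkDump(buf []byte, fn func(typ uint16, value []byte)) (done bool, err error) {
	for len(buf) >= nlMsgHdrSize {
		msgLen := int(binary.LittleEndian.Uint32(buf[0:4]))
		msgType := binary.LittleEndian.Uint16(buf[4:6])
		if msgLen < nlMsgHdrSize || msgLen > len(buf) {
			return false, errors.New("bad message length")
		}
		switch msgType {
		case typeDone:
			return true, nil
		case typeSockDiag:
			if msgLen < nlMsgHdrSize+inetDiagMsgSize {
				return false, errors.New("socket diag message too short")
			}
			if err := walkAttrs(buf[nlMsgHdrSize+inetDiagMsgSize:msgLen], fn); err != nil {
				return false, err
			}
		}
		next := align4(msgLen)
		if next > len(buf) {
			next = len(buf)
		}
		buf = buf[next:]
	}
	return false, nil
}

func main() {
	// Synthetic buffer: one socket diag message with a 5-byte attribute, then "done".
	msg := make([]byte, nlMsgHdrSize+inetDiagMsgSize+12)
	binary.LittleEndian.PutUint32(msg[0:4], uint32(len(msg)))
	binary.LittleEndian.PutUint16(msg[4:6], typeSockDiag)
	binary.LittleEndian.PutUint16(msg[88:90], 9) // NlaLen: 4-byte header + 5-byte payload
	binary.LittleEndian.PutUint16(msg[90:92], 2) // NlaType
	copy(msg[92:], "hello")
	end := make([]byte, 20)
	binary.LittleEndian.PutUint32(end[0:4], 20)
	binary.LittleEndian.PutUint16(end[4:6], typeDone)

	done, err := walkDump(append(msg, end...), func(typ uint16, v []byte) {
		fmt.Printf("attribute type %d, %d bytes\n", typ, len(v))
	})
	fmt.Println(done, err)
}
```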
Pcap file header
Pcap packet header
Netlink cooked header
type Header struct {
Magic uint32
VersionMajor uint16
VersionMinor uint16
ThisZone int32
SigFigs uint32
SnapLen uint32
LinkType uint32
}
type RecordHeader struct {
TsSec uint32
TsUsec uint32
CapLen uint32
Len uint32
}
type ?? struct {
16 bytes
}
The end of dump comes in its own Netlink packet
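A minimal sketch of stepping through such a capture: read the 24-byte file header, then for each record skip the 16-byte record header and the 16-byte netlink cooked header to reach the netlink bytes. Assumptions here: the function names are hypothetical, the file is a little-endian microsecond pcap (magic `0xa1b2c3d4`), and the link type is 253 (LINKTYPE_NETLINK). The cooked header is skipped rather than decoded.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

type Header struct {
	Magic        uint32
	VersionMajor uint16
	VersionMinor uint16
	ThisZone     int32
	SigFigs      uint32
	SnapLen      uint32
	LinkType     uint32
}

type RecordHeader struct {
	TsSec  uint32
	TsUsec uint32
	CapLen uint32
	Len    uint32
}

const (
	pcapMagic       = 0xa1b2c3d4
	headerSize      = 24
	recordSize      = 16
	cookedSize      = 16  // netlink cooked header, skipped here
	linkTypeNetlink = 253 // LINKTYPE_NETLINK
)

func parseHeader(b []byte) (Header, error) {
	if len(b) < headerSize {
		return Header{}, errors.New("short pcap file header")
	}
	le := binary.LittleEndian
	h := Header{
		Magic:        le.Uint32(b[0:4]),
		VersionMajor: le.Uint16(b[4:6]),
		VersionMinor: le.Uint16(b[6:8]),
		ThisZone:     int32(le.Uint32(b[8:12])),
		SigFigs:      le.Uint32(b[12:16]),
		SnapLen:      le.Uint32(b[16:20]),
		LinkType:     le.Uint32(b[20:24]),
	}
	if h.Magic != pcapMagic {
		return h, errors.New("not a little-endian microsecond pcap file")
	}
	return h, nil
}

func parseRecordHeader(b []byte) (RecordHeader, error) {
	if len(b) < recordSize {
		return RecordHeader{}, errors.New("short pcap record header")
	}
	le := binary.LittleEndian
	return RecordHeader{le.Uint32(b[0:4]), le.Uint32(b[4:8]), le.Uint32(b[8:12]), le.Uint32(b[12:16])}, nil
}

// netlinkPayloads returns the netlink bytes of every record in a pcap file image.
func netlinkPayloads(file []byte) ([][]byte, error) {
	h, err := parseHeader(file)
	if err != nil {
		return nil, err
	}
	if h.LinkType != linkTypeNetlink {
		return nil, errors.New("capture is not LINKTYPE_NETLINK")
	}
	var out [][]byte
	rest := file[headerSize:]
	for len(rest) > 0 {
		rh, err := parseRecordHeader(rest)
		if err != nil {
			return nil, err
		}
		end := recordSize + int(rh.CapLen)
		if rh.CapLen < cookedSize || end > len(rest) {
			return nil, errors.New("bad captured length")
		}
		out = append(out, rest[recordSize+cookedSize:end])
		rest = rest[end:]
	}
	return out, nil
}

func main() {
	// Synthetic one-record capture with a 4-byte payload.
	file := make([]byte, headerSize+recordSize+cookedSize+4)
	binary.LittleEndian.PutUint32(file[0:4], pcapMagic)
	binary.LittleEndian.PutUint32(file[20:24], linkTypeNetlink)
	binary.LittleEndian.PutUint32(file[headerSize+8:], cookedSize+4) // CapLen
	copy(file[headerSize+recordSize+cookedSize:], "abcd")
	payloads, err := netlinkPayloads(file)
	fmt.Println(len(payloads), err)
}
```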
Marshalling
xTCP reads the kernel structs into golang structs
xTCP can then marshal into:
Does NOT compress. Leave the compression to the pub/sub system. E.g. Kafka has zstd.
Exporting ( pub/sub )
xTCP can then export the marshalled bytes to pub/sub:
Of course, you can also use a tool like Benthos to translate the message into a wide range of other types
( Unfortunately, for Kafka/Redpanda, xTCP does not yet use the schema registry, because I couldn’t work out how )
Data Store ( Clickhouse )
xTCP ultimately streams the records into Clickhouse
Clickhouse is a very high-performance column-oriented data store whose MergeTree table engines work much like log-structured merge trees
xTCP2 improvements since xTCP1
xTCP[2] does NOT use reflection
BenchmarkDeserializeInetDiagMsg-16 100000000 11.20 ns/op
BenchmarkDeserializeInetDiagMsgReflection-16 1848198 633.7 ns/op
BenchmarkDeserializeInetDiagSockID-16 352283820 3.439 ns/op
BenchmarkDeserializeInetDiagSockIDReflection-16 2102726 539.6 ns/op
BenchmarkDeserializeBBRInfo-16 386265880 3.227 ns/op
BenchmarkDeserializeBBRInfoReflection-16 5569983 200.3 ns/op
BenchmarkDeserializeDCTCPInfoReflection-16 5931864 200.9 ns/op
BenchmarkDeserializeMemInfo-16 476499230 2.534 ns/op
BenchmarkDeserializeMemInfoReflection-16 6394890 187.6 ns/op
BenchmarkDeserializeSkMemInfo-16 284344549 4.241 ns/op
BenchmarkDeserializeSkMemInfoReflection-16 4660838 265.3 ns/op
BenchmarkDeserializeSockOpt-16 469690822 2.524 ns/op
BenchmarkDeserializeSockOptReflection-16 12774944 101.1 ns/op
BenchmarkDeserializeTrafficClass-16 535579550 1.958 ns/op
BenchmarkDeserializeTrafficClassReflection-16 12078608 103.6 ns/op
BenchmarkDeserializeTCPInfo-16 68040174 17.39 ns/op <----- New non-reflection
BenchmarkDeserializeTCPInfoReflection-16 1270778 942.7 ns/op <--------------- Reflection !!
The first implementation of xTCP used reflection
The new xtcp2 implementation is roughly 40-150 times faster, depending on the struct (per the benchmarks above)
Tcp_info struct without reflection is ~17ns
Tcp_info struct with reflection is ~942ns
942 / 17 ~= 55.4 times faster
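The difference can be reproduced in miniature. The sketch below decodes the same 16 bytes two ways: with `encoding/binary.Read`, which walks the struct via reflection, and with fixed-offset reads. Using `binary.Read` as the reflection path is an assumption made for illustration (the slides do not say how xTCP1 used reflection), and the `MemInfo` field names are hypothetical labels for the four uint32 counters in the kernel's `inet_diag_meminfo`.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

// MemInfo holds four uint32 counters (field names are illustrative).
type MemInfo struct {
	Rmem, Wmem, Fmem, Tmem uint32
}

// decodeReflect lets encoding/binary walk the struct via reflection.
func decodeReflect(b []byte) (MemInfo, error) {
	var m MemInfo
	err := binary.Read(bytes.NewReader(b), binary.LittleEndian, &m)
	return m, err
}

// decodeManual reads each field at its fixed offset, with no reflection.
func decodeManual(b []byte) (MemInfo, error) {
	if len(b) < 16 {
		return MemInfo{}, errors.New("short buffer")
	}
	le := binary.LittleEndian
	return MemInfo{le.Uint32(b[0:4]), le.Uint32(b[4:8]), le.Uint32(b[8:12]), le.Uint32(b[12:16])}, nil
}

// sink keeps results alive so the compiler cannot discard the decode calls.
var sink MemInfo

// timePerOp runs f n times and returns the mean wall-clock time per call.
func timePerOp(n int, f func()) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		f()
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	raw := []byte{1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0}
	const n = 1_000_000
	reflectTime := timePerOp(n, func() { sink, _ = decodeReflect(raw) })
	manualTime := timePerOp(n, func() { sink, _ = decodeManual(raw) })
	fmt.Printf("reflection: %v/op, manual: %v/op\n", reflectTime, manualTime)
}
```

This timing loop is cruder than `go test -bench`, so the numbers will not match the benchmark table; it only shows the direction of the gap.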
xTCP on a single machine
xTCP
xTCPView
SQL queries
FreeBSD/macOS on the roadmap
xTCP extracts socket data and streams it to Clickhouse
xTCP on many machines
xTCP
xTCPView
SQL queries
Kafka Cluster
ClickHouse Cluster
Edgio has ~20k machines streaming xTCP data
xTCPView
xTCPView is a dashboard system for visualizing the TCP socket data
xTCPView
gRPC-Web calls to xTCPView
Time (t)
TCP socket RTT and RTT variance
RTT data
SELECT
sec,
nsec,
hostname,
tcp_info_rtt,
tcp_info_rtt_var,
tcp_info_min_rtt,
tcp_info_rcv_rtt
FROM xtcp.xtcp_records
ORDER BY
sec DESC,
nsec DESC
LIMIT 20
Query id: 342acea9-6f6c-4fd8-b857-ed671bbdb919
┌─────────────────────sec─┬──────nsec─┬─hostname─┬─tcp_info_rtt─┬─tcp_info_rtt_var─┬─tcp_info_min_rtt─┬─tcp_info_rcv_rtt─┐
1. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 7833 │ 6921 │ 1448 │ 1382016 │
2. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 18027 │ 3534 │ 13633 │ 0 │
3. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 2587 │ 794 │ 1412 │ 4690 │
4. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 16107 │ 3324 │ 9170 │ 0 │
5. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 42688 │ 9736 │ 41653 │ 45000 │
6. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 42688 │ 9736 │ 41653 │ 45000 │
7. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 45487 │ 17085 │ 39866 │ 74000 │
8. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 76163 │ 5443 │ 72130 │ 81000 │
9. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 76163 │ 5443 │ 72130 │ 81000 │
10. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 76289 │ 1110 │ 70076 │ 0 │
11. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 56418 │ 25976 │ 37191 │ 0 │
12. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 17004 │ 3325 │ 13239 │ 0 │
13. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 53406 │ 20141 │ 33923 │ 0 │
14. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 20400 │ 8471 │ 12834 │ 0 │
15. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 17310 │ 4137 │ 14244 │ 0 │
16. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 5850 │ 1616 │ 1000 │ 0 │
17. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 22963 │ 6171 │ 18357 │ 0 │
18. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 23946 │ 2099 │ 19000 │ 0 │
19. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 23946 │ 2099 │ 19000 │ 0 │
20. │ 2024-09-14 19:21:52.000 │ 825331283 │ t │ 81945 │ 17515 │ 68786 │ 81000 │
└─────────────────────────┴───────────┴──────────┴──────────────┴──────────────────┴──────────────────┴──────────────────┘
20 rows in set. Elapsed: 0.009 sec. Processed 84.77 thousand rows, 2.80 MB (9.62 million rows/s., 317.53 MB/s.)
Peak memory usage: 10.80 MiB.
The current table has ~200 fields
Reference
FreeBSD
xTCP for Kubernetes
By Dave Seddon
2024 December 6th
xTCP for kubernetes?
Pod X
xtcp
Pod Y
Container X
Container Y
xtcp
Netlink sockets to query the tcp_diag
xtcp
xtcp runs as a sidecar in each Pod, and each instance connects to Kafka
The simple solution is to run xtcp as a sidecar that is injected into each Pod
The downside is that xTCP is running in RAM in all these places (lots of duplication), and there are lots of Kafka sockets. The Kafka TCP sockets are a convenient way to get the data out of each Pod's unique network namespace. The netlink socket data is processed into protobuf in each instance.
Pods are often lightweight and have only a few TCP sockets, so running an entire xTCP process in each one seems wasteful.
xTCP for kubernetes?
Pod X
Pod Y
Xtcp Daemonset
Container X
Container Y
It would be amazing if xTCP could run as a DaemonSet (a single instance per machine) and then open netlink sockets into each Pod
However, a single process can’t belong to multiple network namespaces, so this is impossible :(
This solution is attractive because there would be a single instance of xTCP running, so this would be quite efficient. The problem is you can't do this.
Challenge
So how can we minimize the number of xTCP instances running and still get access to all the network namespaces?
xTCP for kubernetes?
Pod X
Pod Y
Xtcp Daemonset
Container X
Container Y
Unix domain sockets
Netlink sockets to query the tcp_diag
Sidecar lightweight domain-socket to netlink link proxy
Domain sockets can be mounted between a DaemonSet container and each of the Pods. This gives a lightweight way to request and return the data from each Pod
Unix domain sockets mounted between the xtcp Daemonset and each Pod
Lightweight UDS/netlink proxy
The netlink socket data is processed in the DaemonSet, so most of the CPU-intensive work moves to the DaemonSet container rather than happening within each Pod.
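A minimal sketch of the request/return path over a Unix domain socket. Everything here is an assumption for illustration: the one-byte request, the 4-byte big-endian length prefix, and the function names are hypothetical, and the `query` callback stands in for the sidecar's real netlink dump so the sketch runs anywhere.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// serve plays the sidecar: for each connection it reads a one-byte request,
// calls query (a stand-in for the netlink dump), and replies with a
// 4-byte length followed by the raw bytes.
func serve(l net.Listener, query func() ([]byte, error)) {
	for {
		c, err := l.Accept()
		if err != nil {
			return // listener closed
		}
		go func(c net.Conn) {
			defer c.Close()
			var req [1]byte
			if _, err := io.ReadFull(c, req[:]); err != nil {
				return
			}
			data, err := query()
			if err != nil {
				return
			}
			var hdr [4]byte
			binary.BigEndian.PutUint32(hdr[:], uint32(len(data)))
			if _, err := c.Write(hdr[:]); err != nil {
				return
			}
			c.Write(data)
		}(c)
	}
}

// fetch plays the DaemonSet side: send a request, read one framed reply.
func fetch(path string) ([]byte, error) {
	c, err := net.Dial("unix", path)
	if err != nil {
		return nil, err
	}
	defer c.Close()
	if _, err := c.Write([]byte{1}); err != nil {
		return nil, err
	}
	var hdr [4]byte
	if _, err := io.ReadFull(c, hdr[:]); err != nil {
		return nil, err
	}
	data := make([]byte, binary.BigEndian.Uint32(hdr[:]))
	_, err = io.ReadFull(c, data)
	return data, err
}

// roundTrip wires both sides together over a socket in a temporary directory.
func roundTrip(payload []byte) ([]byte, error) {
	dir, err := os.MkdirTemp("", "xtcp")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "x.sock")
	l, err := net.Listen("unix", path)
	if err != nil {
		return nil, err
	}
	defer l.Close()
	go serve(l, func() ([]byte, error) { return payload, nil })
	return fetch(path)
}

func main() {
	data, err := roundTrip([]byte("raw netlink bytes would go here"))
	fmt.Printf("%d bytes, err=%v\n", len(data), err)
}
```

The sidecar only forwards bytes, so the deserialization cost stays on the DaemonSet side of the socket.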
xTCP for kubernetes?
Xtcp Daemonset
volumes:
  - name: xtcp-volume
    emptyDir:
      medium: Memory
Poller
ns
xTCP
ns opens netlink sockets into each network namespace, and adds and removes netlink sockets as network namespaces come and go
GetNetlinkSocketFDs() (fds []int)
For each network namespace
netlinkers
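The add/remove bookkeeping behind `GetNetlinkSocketFDs()` might look like the sketch below. Only the `GetNetlinkSocketFDs() (fds []int)` signature comes from the slide; the `tracker` type, `Reconcile`, and the injected `open`/`closeFn` callbacks are hypothetical, and the real work of entering a namespace and opening the netlink socket is left to the `open` callback.

```go
package main

import (
	"fmt"
	"sort"
)

// tracker keeps one netlink socket fd per network namespace.
type tracker struct {
	fds     map[string]int               // namespace name -> netlink socket fd
	open    func(ns string) (int, error) // hypothetical: opens a netlink socket in ns
	closeFn func(fd int)                 // hypothetical: closes the socket
}

func newTracker(open func(ns string) (int, error), closeFn func(fd int)) *tracker {
	return &tracker{fds: make(map[string]int), open: open, closeFn: closeFn}
}

// Reconcile opens sockets for namespaces that appeared and closes sockets
// for namespaces that went away.
func (t *tracker) Reconcile(current []string) error {
	seen := make(map[string]bool, len(current))
	for _, ns := range current {
		seen[ns] = true
		if _, ok := t.fds[ns]; ok {
			continue // already tracked
		}
		fd, err := t.open(ns)
		if err != nil {
			return err
		}
		t.fds[ns] = fd
	}
	for ns, fd := range t.fds {
		if !seen[ns] {
			t.closeFn(fd)
			delete(t.fds, ns) // deleting during range is safe in Go
		}
	}
	return nil
}

// GetNetlinkSocketFDs returns the tracked fds in ascending order.
func (t *tracker) GetNetlinkSocketFDs() (fds []int) {
	for _, fd := range t.fds {
		fds = append(fds, fd)
	}
	sort.Ints(fds)
	return fds
}

func main() {
	next := 10
	tr := newTracker(
		func(ns string) (int, error) { next++; return next, nil }, // fake opener
		func(fd int) { fmt.Println("closed fd", fd) },
	)
	tr.Reconcile([]string{"pod-x", "pod-y"})
	tr.Reconcile([]string{"pod-y", "pod-z"}) // pod-x went away, pod-z arrived
	fmt.Println(tr.GetNetlinkSocketFDs())
}
```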
Some io_uring thinking follows
32bits
24bits
struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
__s32 res; /* result code for this event */
__u32 flags;
};
8bits
operation
requestID
netID
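One way to use the 64-bit `user_data` field is to pack the three labels into it so each completion can be matched to its request. The slide does not say which width belongs to which field or in what order they sit, so this sketch assumes: operation in the top 8 bits, requestID in the next 24 bits, netID in the low 32 bits. The function names are hypothetical.

```go
package main

import "fmt"

// Assumed layout of the 64-bit user_data value (not confirmed by the slides):
//
//	bits 63..56  operation  (8 bits)
//	bits 55..32  requestID  (24 bits)
//	bits 31..0   netID      (32 bits)
const requestIDMask = 0xFFFFFF

// packUserData combines the three fields; requestID is truncated to 24 bits.
func packUserData(op uint8, requestID uint32, netID uint32) uint64 {
	return uint64(op)<<56 | uint64(requestID&requestIDMask)<<32 | uint64(netID)
}

// unpackUserData reverses packUserData.
func unpackUserData(u uint64) (op uint8, requestID uint32, netID uint32) {
	return uint8(u >> 56), uint32(u>>32) & requestIDMask, uint32(u)
}

func main() {
	u := packUserData(2, 0xABCDEF, 0xDEADBEEF)
	op, req, net := unpackUserData(u)
	fmt.Printf("%#016x -> op=%d requestID=%#x netID=%#x\n", u, op, req, net)
}
```

With this layout the poller could read the namespace and request straight out of each completion entry without a lookup table.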
Batch of Request all TCP diag
Kernel response with TCP diag data
Linux Kernel
xTCP
xTCP deserializes the netlink TCP_diag messages into a protobuf
( it can encode to protobuf, prototext, protojson, msgpack )
Poller will queue up the netlink write requests and send them as a batch
Options
Option | Description | Pros | Cons | Comment |
Simple | Inject xtcp as a sidecar. xtcp opens netlink and Kafka sockets. Each Pod is independent. | Pretty simple | Each Pod has a Kafka socket open, which means a lot of Kafka sockets. ~20MB RAM per Pod. | |
Daemonset | Single instance of xtcp per machine | Small amount of RAM on each machine | Can’t access each network namespace | |