1 of 66

CS 31204: Computer Networks – The Internet Transport Protocols

INDIAN INSTITUTE OF TECHNOLOGY

KHARAGPUR

Department of Computer Science and Engineering

Sandip Chakraborty

sandipc@cse.iitkgp.ac.in

Abhijnan Chakraborty

abhijnan@cse.iitkgp.ac.in

2 of 66

Transmission Control Protocol (TCP)

  • TCP was specifically designed to provide a reliable, end-to-end byte stream over an unreliable internetwork.

  • Internetwork – different parts may have widely different topologies, bandwidths, delays, packet sizes and other parameters

  • TCP dynamically adapts to properties of the internetwork and is robust in the face of many kinds of failures.

  • RFC 793 (September 1981) – Base protocol
    • RFC 1122 (clarifications and bug fixes), RFC 1323 (High performance), RFC 2018 (SACK), RFC 2581 (Congestion Control), RFC 3168 (Explicit Congestion Notification)


3 of 66

TCP Service Model

  • Uses sockets to define an end-to-end connection – identified by the pair (Source IP, Source Port) and (Destination IP, Destination Port); each side also picks an initial sequence number when the connection is established

  • Unix model of socket implementation:
    • A single daemon process, the Internet daemon (inetd), runs all the time, listens on the well-known ports, and waits for incoming connections
    • When the first connection to a port arrives, inetd forks a new process and starts the corresponding daemon (for example, httpd for port 80, ftpd for port 21, etc.)

  • All TCP connections are full-duplex and point-to-point. TCP does not support multicasting or broadcasting.


4 of 66

TCP Service Model

  • A TCP connection is a byte stream, not a message stream

  • Message boundaries are not preserved end-to-end

  • Example:
    • The sending process does four 512-byte writes to a TCP stream – four write() calls to the TCP socket
    • These data may be delivered as four 512-byte chunks, two 1024-byte chunks, one 2048-byte chunk, or some other combination (see the demonstration below)
    • There is no way for the receiver to detect the unit(s) in which the data were written by the sending process.
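
A quick, self-contained demonstration of this behaviour (not from the slides; assumes Python and a working loopback interface): four separate 512-byte writes may come back from recv() as one or two larger chunks, depending on timing.

```python
# Four 512-byte write()s on a TCP socket may be read back as fewer, larger
# chunks -- message boundaries are not preserved by the byte stream.
import socket, threading

def sender(conn):
    for _ in range(4):
        conn.sendall(b"x" * 512)   # four separate 512-byte writes
    conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
peer, _ = srv.accept()

threading.Thread(target=sender, args=(cli,)).start()

received = []
while True:
    chunk = peer.recv(4096)        # read whatever TCP delivers
    if not chunk:
        break
    received.append(len(chunk))
print("read sizes:", received)     # e.g. [2048] or [1024, 1024] -- not necessarily four 512s
```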


5 of 66

The TCP Protocol – The Header

Source: Computer Networks (5th Edition) by Tanenbaum, Wetherall


6 of 66

TCP Sequence Number and Acknowledgement Number

  • 32-bit sequence number and acknowledgement number fields

  • Every byte on a TCP connection has its own 32-bit sequence number – it is a byte-stream-oriented connection

  • TCP uses sliding-window-based flow control – the acknowledgement number carries the next byte expected in order, cumulatively acknowledging all the bytes the receiver has received so far.
    • ACK number 31245 means that the receiver has correctly received all bytes up to and including byte 31244 and is expecting byte 31245


7 of 66

TCP Segments

  • The sending and receiving TCP entities exchange data in the form of segments.

  • A TCP segment consists of a fixed 20 byte header (plus an optional part) followed by zero or more data bytes.

  • TCP can accumulate data from several write() calls into one segment, or split data from one write() into multiple segments

  • The segment size is restricted by two parameters
    • The maximum IP payload (65,515 bytes)
    • The Maximum Transmission Unit (MTU) of the links on the path


8 of 66

How a TCP Segment is Created

  • write() calls from the application place data into the TCP send buffer.
  • The sender maintains a dynamic window size based on the flow control and congestion control algorithms

  • Modern TCP implementations use path MTU discovery (which relies on ICMP) to determine the MTU of the end-to-end path, and set the Maximum Segment Size (MSS) during connection establishment
    • May also depend on other parameters (e.g., the buffer implementation).

  • Check the sender window after receiving an ACK. If the window size is less than the MSS, construct a single segment; otherwise construct multiple segments, each equal to the MSS (see the sketch below)
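
A minimal sketch of the segmentation rule described above, with assumed names (build_segments, window, mss are illustrative, not part of any real TCP stack):

```python
# Send min(window, buffered) bytes, cut into MSS-sized segments.
def build_segments(send_buffer: bytes, window: int, mss: int) -> list[bytes]:
    """Return the segments the sender may transmit right now."""
    allowed = min(window, len(send_buffer))   # cannot exceed the sender window
    segments = []
    offset = 0
    while offset < allowed:
        segments.append(send_buffer[offset:offset + mss])  # each segment <= MSS
        offset += mss
    return segments

# Example: 5000 buffered bytes, window 4000, MSS 1460 -> segments of 1460, 1460, 1080
print([len(s) for s in build_segments(b"a" * 5000, 4000, 1460)])
```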


9 of 66

Challenges in TCP Design

  • Segments are constructed dynamically, so a retransmission is not guaranteed to carry the same bytes as the original segment – it may contain more or less data

  • Segments may arrive out of order. The TCP receiver should handle out-of-order segments properly so that data wastage is minimized.


10 of 66

Window Size field in the TCP Segment Header

  • Flow control in TCP is handled using a variable sized sliding window.

  • The window size field tells how many more bytes the receiver can accept, based on the free space currently available in its buffer.

  • What is meant by window size 0?

  • A TCP acknowledgement is therefore a combination of the acknowledgement number and the window size


11 of 66

TCP Connection Establishment

  • How to choose the initial sequence number?
    • To protect against delayed duplicates, do not start the initial sequence number of every connection from 0
    • The original implementation of TCP used a clock-based approach: a clock ticks every 4 microseconds and cycles from 0 to 2^32 − 1; the current clock value gives the initial sequence number

  • TCP SYN flood attack
    • Solution: use a cryptographic function to generate initial sequence numbers (see the sketch below)
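
An illustrative sketch of the two ISN schemes mentioned above (the exact RFC 6528 construction differs in details; SECRET, isn_clock and isn_crypto are hypothetical names introduced here):

```python
import time, hmac, hashlib, os

def isn_clock() -> int:
    # original scheme: a counter that ticks every 4 microseconds, modulo 2^32
    return int(time.time() * 1_000_000 / 4) % 2**32

SECRET = os.urandom(16)  # per-host secret key (assumption: any unguessable value)

def isn_crypto(src_ip, src_port, dst_ip, dst_port) -> int:
    # RFC 6528 style: clock component plus a keyed hash of the connection 4-tuple
    tag = hmac.new(SECRET, f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode(),
                   hashlib.sha256).digest()
    return (isn_clock() + int.from_bytes(tag[:4], "big")) % 2**32
```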


12 of 66

TCP Connection Release


13 of 66

TCP State Transition Diagram – Connection Modeling

(Diagram legend – transitions labelled Event/Action; dashed path: server, solid path: client)

Source: Computer Networks (5th Edition) by Tanenbaum, Wetherall



15 of 66

TCP Sliding Window

Source: Computer Networks (5th Edition) by Tanenbaum, Wetherall


16 of 66

Delayed Acknowledgements

  • Consider a telnet connection that reacts to every keystroke.

  • In the worst case, whenever a character arrives at the sending TCP entity, TCP creates a 21-byte segment: 20 bytes of header and 1 byte of data. The receiver acknowledges it, and sends another window update when the application reads that 1 byte. This wastes a huge amount of bandwidth.

  • Delayed acknowledgements: delay acknowledgements and window updates for up to 500 msec, in the hope of receiving a few more data packets within that interval so that one ACK can cover them all.

  • However, the sender can still send multiple short data segments.


17 of 66

Nagle’s Algorithm

  • When data come into the sender in small pieces, just send the first piece and buffer all the rest until the first piece is acknowledged.

  • Then send all the buffered data in one TCP segment and start buffering again until that segment is acknowledged (see the sketch after this list).
    • Only one short packet can be outstanding at any time.

  • Do we want Nagle’s Algorithm all the time?

  • Nagle’s Algorithm and Delayed Acknowledgement
    • The receiver delays its acknowledgement (waiting for more data) while the sender, following Nagle's algorithm, waits for that acknowledgement before sending more – a temporary deadlock that adds needless delay
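
A minimal sketch of Nagle's buffering rule, assuming a callback-style transmit function and ignoring sequence numbers and retransmission (NagleSender is an illustrative name, not a real API):

```python
class NagleSender:
    def __init__(self, mss: int):
        self.mss = mss
        self.unacked = 0       # bytes sent but not yet acknowledged
        self.pending = b""     # small pieces buffered while waiting for an ACK

    def write(self, data: bytes, transmit):
        self.pending += data
        # send if a full MSS is available, or nothing is outstanding
        if len(self.pending) >= self.mss or self.unacked == 0:
            self._flush(transmit)

    def on_ack(self, nbytes: int, transmit):
        self.unacked -= nbytes
        if self.unacked == 0 and self.pending:
            self._flush(transmit)          # ACK arrived: release the buffered data

    def _flush(self, transmit):
        segment, self.pending = self.pending[:self.mss], self.pending[self.mss:]
        self.unacked += len(segment)
        transmit(segment)

# usage: s = NagleSender(mss=1460); s.write(b"h", print); s.write(b"i", print); s.on_ack(1, print)
```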


18 of 66

Silly Window Syndrome

  • Data are passed to the sending TCP entity in large blocks, but an interactive application on the receiver side reads data only 1 byte at a time.

Source: Computer Networks (5th Edition) by Tanenbaum, Wetherall

  • Clark's solution: do not send a window update for just 1 byte of free space; wait until a decent amount of space (e.g., one full MSS, or half the buffer) is available at the receiver buffer.


19 of 66

Handling Short Segments – Sender and Receiver Together

  • Nagle’s algorithm and Clark’s solution to silly window syndrome are complementary

  • Nagle's algorithm: solves the problem caused by the sending application delivering data to TCP a byte at a time

  • Clark's solution: solves the problem caused by the receiving application fetching data from TCP a byte at a time

  • Exception: the PSH (push) flag is used to tell TCP to create and send a segment immediately, without waiting for more data to accumulate


20 of 66

Handling Out of Order in TCP

  • TCP buffers out-of-order segments and forwards a duplicate acknowledgement to the sender.

  • Acknowledgement in TCP – Cumulative acknowledgement

  • Receiver has received bytes 0, 1, 2, _, 4, 5, 6, 7
    • TCP sends a cumulative acknowledgement with ACK number 3, acknowledging everything up to byte 2
    • Once byte 4 is received, a duplicate ACK with ACK number 3 (the next expected byte) is sent – repeated duplicate ACKs are what later trigger congestion control
    • After a timeout (or three duplicate ACKs), the sender retransmits byte 3
    • Once byte 3 is received, the receiver can send a cumulative ACK with ACK number 8 (the next expected byte)
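
A small sketch of the receiver-side behaviour in the example above – cumulative ACKs plus buffering of out-of-order data, byte at a time for simplicity (CumulativeReceiver is an illustrative name):

```python
class CumulativeReceiver:
    def __init__(self):
        self.next_expected = 0
        self.out_of_order = {}          # seq -> byte, buffered until the gap is filled

    def receive(self, seq: int, byte: bytes) -> int:
        """Accept one byte with sequence number `seq`; return the ACK number."""
        if seq == self.next_expected:
            self.next_expected += 1
            # drain any buffered bytes that are now in order
            while self.next_expected in self.out_of_order:
                del self.out_of_order[self.next_expected]
                self.next_expected += 1
        elif seq > self.next_expected:
            self.out_of_order[seq] = byte   # hole before it: buffer, ACK stays the same
        return self.next_expected           # cumulative ACK: next byte expected

r = CumulativeReceiver()
for seq in [0, 1, 2, 4, 5, 6, 7]:
    print(seq, "-> ACK", r.receive(seq, b"x"))   # ACKs 1, 2, 3, 3, 3, 3, 3 (duplicates)
print(3, "-> ACK", r.receive(3, b"x"))           # gap filled -> cumulative ACK 8
```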


21 of 66

TCP Timer Management

  • TCP Retransmission Timeout (RTO): When a segment is sent, a retransmission timer is started
    • If the segment is acknowledged before the timer expires, the timer is stopped
    • If the timer expires before the acknowledgement comes, the segment is retransmitted

  • What can be an ideal value of RTO ?

  • Possible solution: estimate the RTT, and set the RTO to some positive multiple of the RTT

  • RTT estimation is difficult at the transport layer – why?


22 of 66

RTT at Data Link Layer vs RTT at Transport Layer

(Figure: probability density of ACK arrival (round-trip) times – at the data link layer vs. at the transport layer)


23 of 66

RTT Estimation at the Transport Layer

  • TCP keeps a smoothed RTT estimate (SRTT) as an exponentially weighted moving average (EWMA) of the measured samples: SRTT = α × SRTT + (1 − α) × R, where R is the latest RTT measurement and α is typically 7/8.


24 of 66

Problem with EWMA

  • Even given a good value of SRTT, choosing a suitable RTO is nontrivial.

  • The initial implementation of TCP used RTO = 2 × SRTT

  • Experience showed that a constant multiplier was too inflexible, because it failed to respond when the variance went up (i.e., when the RTT fluctuates heavily) – which normally happens at high load

  • Solution: take the variance of the RTT into account when estimating the RTO.


25 of 66

RTO Estimation

  • Jacobson's algorithm: also track the mean deviation of the RTT, RTTVAR = β × RTTVAR + (1 − β) × |SRTT − R|, with β typically 3/4, and set RTO = SRTT + 4 × RTTVAR.


26 of 66

Karn’s Algorithm

  • How do you obtain an RTT sample when a segment is lost and retransmitted? An ACK that arrives afterwards is ambiguous – it may be for the original transmission or for the retransmission.

  • Karn's algorithm:
    • Do not update the estimates using any segment that has been retransmitted
    • Instead, the timeout is doubled on each successive retransmission until segments start getting through on the first attempt (see the sketch below)
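
A sketch of the resulting RTO estimator in the style of RFC 6298, combining the EWMA of SRTT, the RTT variance term, and Karn's rule (RtoEstimator and its method names are illustrative):

```python
class RtoEstimator:
    def __init__(self):
        self.srtt = None       # smoothed RTT
        self.rttvar = None     # smoothed RTT variance
        self.rto = 1.0         # initial RTO in seconds (RFC 6298 default)

    def on_rtt_sample(self, r: float, retransmitted: bool):
        if retransmitted:
            return                     # Karn: ignore ambiguous samples
        if self.srtt is None:
            self.srtt, self.rttvar = r, r / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r
        self.rto = max(self.srtt + 4 * self.rttvar, 1.0)

    def on_timeout(self):
        self.rto *= 2                  # exponential backoff until a segment gets through
```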


27 of 66

Other TCP Timers

  • Persist timer: avoids deadlock when the receiver advertises a window of zero
    • When the timer goes off, the sender sends a probe packet to the receiver to obtain the updated window size

  • Keepalive timer: probes the other side when a connection has been idle for a long duration; if there is no response, the connection is closed

  • TIME_WAIT: wait for twice the maximum packet lifetime before fully closing a connection, so that delayed segments from it die out


28 of 66

TCP Congestion Control

  • Based on an implementation of AIMD (additive increase, multiplicative decrease) using a window, with packet loss as the binary congestion signal

  • TCP maintains a Congestion Window (CWnd) – number of bytes the sender may have in the network at any time

  • Sending Rate = Congestion Window / RTT

  • Sender Window (SWnd) = Min (CWnd, RWnd)

  • RWnd – Receiver advertised window size
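
A quick arithmetic check of these relations with illustrative numbers (a 64 KB congestion window, a larger receiver window, and a 50 ms RTT):

```python
cwnd = 64 * 1024          # 64 KB congestion window
rwnd = 128 * 1024         # receiver-advertised window
rtt  = 0.05               # 50 ms round-trip time
swnd = min(cwnd, rwnd)    # sender window = min(CWnd, RWnd)
rate = swnd / rtt         # bytes per second the sender can sustain
print(f"SWnd = {swnd} bytes, rate ≈ {rate/1e6:.1f} MB/s ≈ {rate*8/1e6:.1f} Mbit/s")
```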


29 of 66

1986 Congestion Collapse

  • In 1986, the growing popularity of the Internet led to the first occurrence of congestion collapse – a prolonged period during which goodput dropped precipitously (by more than a factor of 100)

  • Early TCP congestion control algorithm – effort by Van Jacobson (1988)

  • Challenge for Jacobson – implement congestion control without making much change to the protocol (which made it instantly deployable)

  • Packet loss is a suitable signal for congestion – use timeouts to detect packet loss, and tune CWnd based on the observed losses


30 of 66

Adjust CWnd based on AIMD

  • One of the most interesting ideas – use ACK for clocking

  • ACKs return to the sender at about the rate at which packets can be sent over the slowest link in the path.

  • Trigger CWnd adjustments based on the rate at which ACKs are received.


31 of 66

Increase Rate Exponentially at the Beginning – The Slow Start

  • AIMD rule will take a very long time to reach a good operating point on fast networks if the CWnd is started from a small size.

  • A 10 Mbps link with 100 ms RTT
    • Appropriate CWnd = BDP = 1 Mbit
    • 1250 byte packets -> 100 packets to reach BDP
    • CWnd starts at 1 packet, and increased 1 packet at every RTT
    • 100 RTTs, i.e. about 10 seconds, are required before the connection reaches a moderate rate (see the check below)

  • Slow Start - Exponential increase of rate to avoid slow convergence
    • Rate is not slow at all ! 😃
    • CWnd is doubled at every RTT
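
A small check of the arithmetic above: the number of RTTs needed to reach a 100-packet window when growing linearly, versus doubling every RTT as slow start does.

```python
# RTTs to reach a 100-packet window: linear (+1 per RTT) vs doubling (slow start).
target = 100                                  # packets (= BDP / packet size)
linear_rtts = target - 1                      # start at 1 packet, +1 each RTT
cwnd, doubling_rtts = 1, 0
while cwnd < target:
    cwnd *= 2                                 # slow start: double every RTT
    doubling_rtts += 1
print(linear_rtts, "RTTs linearly vs", doubling_rtts, "RTTs with slow start")
# -> 99 RTTs (≈10 s at 100 ms RTT) linearly vs 7 RTTs (≈0.7 s) with slow start
```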


32 of 66

TCP Slow Start

  • Every ACK segment allows two more segments to be sent

  • For each segment that is acknowledged before the retransmission timer goes off, the sender adds one segment’s worth of bytes to the congestion window.


33 of 66

Slow Start Threshold

  • Slow start causes exponential growth; eventually the sender will be pushing too many packets into the network too quickly.

  • To keep slow start under control, the sender keeps a threshold for the connection called the slow start threshold (ssthresh).

  • Initially ssthresh is set to BDP (or arbitrarily high), the maximum that a flow can push to the network.

  • Whenever a packet loss is detected by an RTO, ssthresh is set to half of the current congestion window


34 of 66

Additive Increase (Congestion Avoidance)

  • Once CWnd exceeds ssthresh, TCP switches from slow start to additive increase (congestion avoidance): CWnd grows by roughly one MSS per RTT. The per-ACK approximation is CWnd = CWnd + MSS × (MSS / CWnd), as sketched below.
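
A tiny per-ACK sketch of this rule (illustrative names; cwnd in bytes): after roughly one window's worth of ACKs, cwnd has grown by about one MSS.

```python
# Per-ACK approximation of additive increase: each ACK grows cwnd by MSS*MSS/cwnd.
def on_ack_congestion_avoidance(cwnd: float, mss: int) -> float:
    return cwnd + mss * mss / cwnd

cwnd = 10 * 1460.0
for _ in range(10):                      # ~one window's worth of ACKs ≈ one RTT
    cwnd = on_ack_congestion_avoidance(cwnd, 1460)
print(round(cwnd / 1460, 2), "MSS")      # ≈ 11 MSS: grew by about one MSS
```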


35 of 66

Additive Increase – Packet Wise Approximation


36 of 66

Triggering a Congestion

  • Two ways to trigger a congestion notification in TCP – (1) RTO, (2) Duplicate ACK

  • RTO: a sure indication of congestion, but time-consuming – the sender sits idle until the timer expires

  • Duplicate ACK: the receiver sends a duplicate ACK when it receives an out-of-order segment
    • A loose way of indicating congestion
    • TCP arbitrarily assumes that THREE duplicate ACKs (DUPACKs) imply that a packet has been lost – triggers congestion control mechanism
    • The identity of the lost packet can be inferred – the very next packet in sequence
    • Retransmit the lost packet and trigger congestion control


37 of 66

Fast Retransmission – TCP Tahoe

  • Use three duplicate ACKs (DUPACKs) as the sign of congestion

  • Once 3 DUPACKs have been received,
    • Retransmit the lost packet (fast retransmission) – takes one RTT
    • Set ssthresh as half of the current CWnd
    • Set CWnd to 1 MSS


38 of 66

Fast Recovery – TCP Reno

  • Once congestion is detected through 3 DUPACKs, does TCP really need to set CWnd = 1 MSS?

  • DUPACK means that some segments are still flowing in the network – a signal for temporary congestion, but not a prolonged one

  • Immediately transmit the lost segment (fast retransmit), then transmit additional segments based on the DUPACKs received (fast recovery)


39 of 66

Fast Recovery – TCP Reno

  • Fast recovery:
    1. Set ssthresh to one-half of the current congestion window. Retransmit the missing segment.
    2. Set cwnd = ssthresh + 3 (the three duplicate ACKs indicate that three segments have already left the network).
    3. Each time another duplicate ACK arrives, set cwnd = cwnd + 1. Then send a new data segment if allowed by the value of cwnd.
    4. Once a new ACK is received (one that acknowledges all the segments sent between the lost packet and the receipt of the first duplicate ACK), exit fast recovery: set cwnd to ssthresh (the value from step 1) and continue with the linear increase of the congestion avoidance algorithm (see the sketch below).
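
A simplified sketch of this reaction to duplicate ACKs – not a full TCP implementation: cwnd and ssthresh are kept in MSS units and the actual (re)transmissions are omitted (RenoSender is an illustrative name):

```python
class RenoSender:
    def __init__(self):
        self.cwnd, self.ssthresh = 1.0, 64.0
        self.dupacks = 0
        self.in_fast_recovery = False

    def on_ack(self, is_duplicate: bool):
        if is_duplicate:
            self.dupacks += 1
            if self.in_fast_recovery:
                self.cwnd += 1                       # step 3: inflate per extra dupACK
            elif self.dupacks == 3:
                self.ssthresh = self.cwnd / 2        # step 1: halve ssthresh
                # (retransmit the missing segment here -- fast retransmit)
                self.cwnd = self.ssthresh + 3        # step 2: account for 3 dupACKs
                self.in_fast_recovery = True
        else:                                        # new ACK
            self.dupacks = 0
            if self.in_fast_recovery:
                self.cwnd = self.ssthresh            # step 4: deflate, exit recovery
                self.in_fast_recovery = False
            elif self.cwnd < self.ssthresh:
                self.cwnd += 1                       # slow start
            else:
                self.cwnd += 1 / self.cwnd           # congestion avoidance

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2                # RTO: fall back to slow start
        self.cwnd, self.dupacks, self.in_fast_recovery = 1.0, 0, False
```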


40 of 66

Fast Recovery – TCP Reno


41 of 66

User Datagram Protocol (UDP)

  • Just a wrapper on top of IP
  • Provides unreliable datagram service
    • Packets may be lost or delivered out of order
    • Users exchange datagrams (not streams)
    • Connectionless
    • Not buffered -- UDP accepts data and transmits immediately (no buffering before transmission)
    • Full duplex -- concurrent transfers can take place in both directions
    • No congestion control


42 of 66

UDP Datagram Format


43 of 66

UDP versus TCP

  • Choice of UDP versus TCP is based on
    • Functionality
    • Performance

  • Performance
    • TCP’s window-based flow control scheme leads to bursty bulk transfers (not rate based)
    • TCP’s “slow start” algorithm can reduce throughput
    • TCP has extra overhead per segment
    • UDP lets the application send small datagrams at its own pace (e.g., constant-bit-rate traffic), even though this is less efficient per packet


44 of 66

UDP versus TCP

  • Reliability
    • TCP provides reliable, in-order transfers
    • UDP provides unreliable service -- application must accept or deal with (a) packet loss due to overflows and errors, (b) out-of-order datagrams

  • Application complexity
    • Application-level framing can be difficult using TCP because of the Nagle algorithm
    • The Nagle algorithm controls when TCP segments are sent so as to use IP datagrams efficiently
    • Data may be received and read by the application in different units than how it was sent (message boundaries are not preserved in TCP)


45 of 66

TCP and UDP Checksum Calculation

  • The pseudo header contains
    • Source IP address
    • Destination IP address
    • Protocol number (e.g., 6 for TCP, 17 for UDP)
    • Segment or datagram length
    • 8 reserved bits (set to zero)

  • An additional layer of verification that the packet has reached the intended destination (IP also has its own header checksum)


46 of 66

UDP Checksum

Goal: detect errors (i.e., flipped bits) in the transmitted segment

Example: the sender transmits the numbers 5 and 6, together with their checksum 11. The segment arrives as 4, 6 with checksum 11. The receiver-computed checksum of the two numbers (4 + 6 = 10) is not equal to the sender-computed checksum as received (11), so the receiver detects the error.

Source: Computer Networking: A Top-Down Approach (8th Ed) by Jim Kurose, Keith Ross


47 of 66

UDP Checksum

Sender:

  • Treat contents of UDP segment (including UDP header fields and Pseudo header) as sequence of 16-bit integers
  • Checksum: addition (one’s complement sum) of segment content
  • Checksum value put into UDP checksum field

Receiver:

  • Compute checksum of received segment
  • Check if computed checksum equals checksum field value:
    • Not equal - error detected
    • Equal - no error detected. But maybe errors nonetheless? More later ….

Goal: detect errors (i.e., flipped bits) in transmitted segment

Source: Computer Networking: A Top-Down Approach (8th Ed) by Jim Kurose, Keith Ross


48 of 66

Internet Checksum: An Example

Example: add two 16-bit integers

      1110011001100110
    + 1101010101010101
    -----------------
  1 1011101110111011    (carryout from the most significant bit)

  wraparound: add the carry back into the result

    1011101110111100    sum
    0100010001000011    checksum (one's complement of the sum)

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result

Source: Computer Networking: A Top-Down Approach (8th Ed) by Jim Kurose, Keith Ross
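
A short sketch of the 16-bit one's-complement checksum computation, applied to the two words from the example above:

```python
def internet_checksum(words):
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)   # wrap the carry back in
    return ~total & 0xFFFF                         # one's complement of the sum

words = [0b1110011001100110, 0b1101010101010101]
print(f"{internet_checksum(words):016b}")          # 0100010001000011, as on the slide
```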


49 of 66

Internet Checksum: Not the Best Protection

Example: the same two 16-bit integers, but with two bits flipped in transit (e.g., a 0 -> 1 in one number and a 1 -> 0 at the corresponding position of the other). The column-wise sum is unchanged, so the computed sum and checksum are exactly the same as before:

    1011101110111100    sum
    0100010001000011    checksum

Even though the numbers have changed (bit flips), there is no change in the checksum!

Source: Computer Networking: A Top-Down Approach (8th Ed) by Jim Kurose, Keith Ross


50 of 66

Middleware Model


51 of 66

Quick UDP Internet Connections (QUIC)

  • Experimental protocol -- deployed at Google starting in 2014
    • Between Google services and Chrome
    • Improved page-load latency and video rebuffer rate
    • ~35% of Google's egress traffic (~7% of Internet traffic)
    • Akamai deployed it in 2016

  • A transport service (middleware) on top of UDP


52 of 66

What Does a Web Page Look Like?


53 of 66

HTTP/1.0 -> HTTP/1.1

  • Connection setup
    • 1 round-trip to set up a TCP connection
    • 2 round-trips to set up a TLS 1.2 connection

  • After the connection setup, HTTP requests/responses flow over the connection
    • Persistent HTTP -- all the embedded objects are transferred over the same TCP/TLS connection


54 of 66

HTTP/1.1 - Head of Line (HoL) Blocking

  • TCP congestion control blocks the sender side when there is a segment loss
    • The receiver keeps on sending the duplicate acknowledgements
    • The sender-side window is blocked until the lost packet is recovered, and a fresh ACK is received


55 of 66

HTTP/1.1: HOL despite Pipelining

HTTP 1.1: client requests 1 large object (e.g., video file) and 3 smaller objects

(Figure: timeline – the client sends GET O1, GET O2, GET O3, GET O4; the server returns the objects in the order requested, so O2, O3 and O4 wait behind the large O1.)

Source: Computer Networking: A Top-Down Approach (8th Ed) by Jim Kurose, Keith Ross


56 of 66

HTTP/1.1 Parallel Connections

  • Open parallel TCP connections – one for each HTTP request-response pair
  • Too much overhead!


57 of 66

HTTP/1.1 -> HTTP/2

  • Use parallel streams -- multiplex the HTTP request-response streams over a single TCP connection (SPDY by Google -- the precursor of HTTP/2)
    • Congestion control operates on the single underlying TCP connection that all the streams share (see the toy illustration below)
    • How is this different from HTTP/1.1 (parallel connections)?
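
A toy illustration (not real HTTP/2 framing) of why interleaving frames of the four objects over one connection lets the small objects finish early:

```python
objects = {"O1": 8, "O2": 2, "O3": 2, "O4": 2}       # object sizes in frames

def sequential(objs):
    return [name for name, n in objs.items() for _ in range(n)]

def interleaved(objs):
    remaining, order = dict(objs), []
    while any(remaining.values()):
        for name in remaining:                        # round-robin one frame each
            if remaining[name]:
                remaining[name] -= 1
                order.append(name)
    return order

def finish_times(order):
    seen = {}
    for t, name in enumerate(order, start=1):
        seen[name] = t                                # time of each object's last frame
    return seen

print(finish_times(sequential(objects)))   # O2-O4 finish only after all of O1
print(finish_times(interleaved(objects)))  # O2-O4 finish early, O1 slightly later
```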


58 of 66

HTTP/2: Mitigating HOL Blocking

HTTP/2: Objects divided into frames, frame transmission interleaved

(Figure: timeline – the client sends GET O1–O4; the server divides the objects into frames and interleaves their transmission, so O2, O3 and O4 are delivered quickly while O1 is only slightly delayed.)

Source: Computer Networking: A Top-Down Approach (8th Ed) by Jim Kurose, Keith Ross


59 of 66

Problems with TCP

  • Implementation Entrenchment
    • TCP is implemented in the OS kernel -- kernel modifications are needed for every update -- less control for the application
    • Application-specific tuning is difficult -- changes have to be pushed into the TCP stack -- requires an OS upgrade

  • Handshake Delay
    • 3 round-trips are required for establishing a TCP/TLS connection
    • Most transfers on the Internet are short -- a 3-RTT handshake is a significant overhead


60 of 66

QUIC Streams


61 of 66

QUIC Protocol Stack


62 of 66

QUIC Handshaking -- Connection Establishment

  • QUIC client caches information about the origin
    • On subsequent connections to the same origin, the client can establish an encrypted connection with no additional round trips
    • 0-RTT connection to a known server

  • Connection establishment to an unknown server
    • 1-RTT if crypto keys are not new -- QUIC embeds the key-exchange protocol within the transport protocol itself -- no separate handshake as with TCP plus TLS
    • 2 RTTs if QUIC version negotiation is needed


63 of 66

QUIC Connection Establishment

  • The REJ message contains
    • a server config that contains the server's long-term Diffie-Hellman public value
    • a certificate chain authenticating the server
    • a signature of the server config
    • a source-address token: an authenticated-encryption block that contains the client's IP address and a timestamp, generated by the server


64 of 66

QUIC Performance


65 of 66

QUIC Standardization

  • RFC 9000: QUIC: A UDP-Based Multiplexed and Secure Transport
    • May 2021
  • RFC 9369: QUIC Version 2
    • May 2023
  • Various other RFCs exist
    • RFC 9114: HTTP/3
    • RFC 9308: Applicability of the QUIC Transport Protocol


66 of 66

This completes a broad discussion of the Transport Layer … next we'll move on to the Network Layer and the IP Protocol
