1 of 44

2G Network Issues

observed while in India

Bryan McQuade

NOTE: Many slides include speaker notes with additional information.

2 of 44

Capture overview

Captured ~100 client-side traces while on Airtel 2G and 3G in India
• 100MB of tcpdump data
• 250K packets

Captured from various locations: hotels in Pune and Delhi, vehicles in Pune

Captured page loads for popular pages in India: zeenews.india.com, Amazon, OLX, IRCTC, ESPNCricinfo, google.com

Traces include video, tcpdump, netlog, and other data

3 of 44

High level observations

Generally good connectivity on 3G
• Minimal network errors
• ~70ms latency
• 6Mbps down, 2Mbps up

Consistent but slow connectivity on 2G
• Elevated network error rates
• 200ms to multi-second latency
• ~100kbps up/down

The remainder of this presentation focuses on 2G.

4 of 44

5 of 44


6 of 44

Round trip times

When the network is otherwise idle, RTTs are typically ~200ms.

When making use of the network (such as loading web pages), round trip times increase to multiple seconds.

We infer that:
• ~200ms is the true network latency of the 2G connection
• Multi-second RTTs are caused by queuing delays (queue bloat) in the cell tower

Observations:
• Frequent retransmission of SYNs at 1s and 3s.
• Actual BDP is 0.2s × 100kbps = 20kbit ≈ 2.5kB, i.e. fewer than 2 full-size packets (see the sketch below).
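As a quick sanity check on that bandwidth-delay-product claim, here is a minimal back-of-the-envelope sketch in Python. The 200ms RTT and 100kbps rate are the figures measured above; the 1500-byte MTU is an assumption for a typical full-size TCP segment.

```python
# Back-of-the-envelope BDP for the observed 2G link.
rtt_s = 0.2            # observed idle-network round trip time (seconds)
bandwidth_bps = 100e3  # observed 2G throughput (bits per second)
mtu_bytes = 1500       # assumed maximum segment size

bdp_bits = rtt_s * bandwidth_bps            # 20,000 bits
bdp_bytes = bdp_bits / 8                    # 2,500 bytes
packets_in_flight = bdp_bytes / mtu_bytes   # ~1.7 full-size packets

print(f"BDP: {bdp_bits:.0f} bits = {bdp_bytes:.0f} bytes "
      f"(~{packets_in_flight:.1f} packets)")
```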

7 of 44

8 of 44

SYN retransmit at 1s
Successful SYN/ACK after 1.4 seconds
Dropped packet after 2.336422s (previous segment not captured)
Receipt of retransmitted dropped packet 3.8 seconds later
Additional delays during TLS handshake
Due to drops and delays, TLS handshake never successfully completes

9 of 44

Packet loss, reordering, retransmits

Periodic loss of UDP (e.g. DNS) and TCP (both SYN and data) packets.

For TCP data packets, drops of multiple successive packets in a stream are not uncommon.

Latency introduced by queue bloat, combined with drops in carrier networks, leads to retransmissions, out of order delivery, and many-second delays in delivering data to the application layer.
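One way to quantify this from the tcpdump captures is to count likely retransmissions and out-of-order arrivals per TCP flow. A rough sketch, assuming the dpkt library and a hypothetical trace file name (no SACK awareness or sequence-wraparound handling):

```python
import dpkt

def scan_pcap(path):
    seen = {}      # flow -> set of data sequence numbers already observed
    highest = {}   # flow -> highest data sequence number observed so far
    retrans, reorder = 0, 0
    with open(path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            ip = dpkt.ethernet.Ethernet(buf).data
            if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.tcp.TCP):
                continue
            tcp = ip.data
            if len(tcp.data) == 0:
                continue  # ignore pure ACKs
            flow = (ip.src, ip.dst, tcp.sport, tcp.dport)
            if tcp.seq in seen.setdefault(flow, set()):
                # Same data offset captured twice: likely a retransmit / duplicate delivery.
                retrans += 1
            elif tcp.seq < highest.get(flow, 0):
                # Arrived behind data already seen: out-of-order, or a retransmit of a
                # segment that was dropped before the capture point.
                reorder += 1
            seen[flow].add(tcp.seq)
            highest[flow] = max(highest.get(flow, 0), tcp.seq)
    return retrans, reorder

print(scan_pcap("airtel_2g_trace.pcap"))  # hypothetical trace file name
```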

10 of 44

tcptrace for espncricinfo.com JS

32 second delay

11 of 44

12 of 44

Example of how buffering can lead to persistent queuing, high latency, and packet drops.

[Diagram: client, cell tower (with packet buffer), and server; the link between client and cell tower is low bandwidth with ~200ms latency, while the link between cell tower and server is high bandwidth and low latency]

13 of 44

Client sends initial HTTP request.

[Diagram: HTTP GET traveling from the client toward the server; packet buffer in the cell tower]

14 of 44

Packet buffer in cell tower is empty. First response bytes are received by the client in 200ms.

[Diagram: server response (icwnd=10) flowing through the empty packet buffer to the client]

15 of 44

Data arrives from the server faster than it can be consumed by the client.

[Diagram: response packets (icwnd=10) arriving at the packet buffer faster than the client link drains them]

16 of 44

Packet buffer in the cell tower absorbs data burst from server.

[Diagram: response burst (icwnd=10) accumulating in the packet buffer]

17 of 44

Client sends second HTTP request.

[Diagram: second HTTP GET from the client; packet buffer in the cell tower]

18 of 44

Second HTTP response is buffered behind existing packets. Delivery of second response delayed in buffer. Multi-second latency.

[Diagram: packets A and B of the response (icwnd=10) queued in the packet buffer]

19 of 44

Additional packet arrives, but packet buffer is full. Packet dropped at cell tower.

[Diagram: packet buffer full, holding A and B; newly arriving packet C is dropped]

20 of 44

Unaware of drops in the cell tower, server continues to send packets. Buffer continues to fill. End-to-end latency remains high.

[Diagram: packet buffer holding A, B, D, E (C missing); response (icwnd=10) still arriving]

21 of 44

[Diagram: packet buffer now holding A, B, D, E, F, G as the server keeps sending (icwnd=10)]

22 of 44

Multiple seconds later, client receives packets A, B, D.

[Diagram: packet buffer holding E, F, G, H, I, J after A, B, D are delivered to the client]

23 of 44

Client notices dropped packet C. Client sends duplicate ACK to notify server of missing packet.

[Diagram: client sends DUP ACK (B); packet buffer holding E, F, G, H, I, J]

24 of 44

Server retransmits dropped packet C. If the buffer is full, the packet may be dropped again! (A toy simulation of this buildup follows below.)

[Diagram: retransmitted packet C arriving at the packet buffer, which holds F, G, H, I, J]
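To tie the walkthrough together, here is a toy drop-tail queue model of the scenario above. All parameters (30-packet buffer, 1500-byte packets, server sending 4x faster than the 100kbps downlink drains) are illustrative assumptions, not measured values.

```python
LINK_BPS = 100e3
PKT_BITS = 1500 * 8
DRAIN_INTERVAL = PKT_BITS / LINK_BPS     # 0.12 s to forward one packet downlink
ARRIVAL_INTERVAL = DRAIN_INTERVAL / 4    # assumed: server packets arrive 4x as fast
BUFFER_PKTS = 30                         # assumed buffer depth at the tower
BASE_RTT = 0.2                           # idle-network RTT measured earlier

queue = 0.0   # packets waiting in the buffer (fractional, for simplicity)
drops = 0
for pkt in range(1, 201):
    # Drain whatever the downlink could forward since the last arrival.
    queue = max(0.0, queue - ARRIVAL_INTERVAL / DRAIN_INTERVAL)
    if queue >= BUFFER_PKTS:
        drops += 1            # buffer full: tail drop, like packet C above
    else:
        queue += 1
    if pkt % 25 == 0:
        latency = BASE_RTT + queue * DRAIN_INTERVAL
        print(f"packet {pkt:3d}: queue ~{queue:4.1f} pkts, "
              f"e2e latency ~{latency:.1f}s, drops so far: {drops}")
```

The output shows latency climbing from ~0.3s toward several seconds as the buffer fills, then staying there while tail drops accumulate, which is the persistent-queue behavior described on the next slide.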

25 of 44

Observations and questions

Packet buffers in cell towers stay persistently full. This leads to:
• significantly higher latency: can be 10 seconds rather than 200ms
• packet drops, retransmits, and significant delays in delivery to the app layer
• no improvement in throughput!

What can Chrome do to avoid bloating buffers in the cell network?
• reduce the advertised receive window size? (see the sketch below)
• reduce the number of concurrent requests in flight?
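On the receive-window question, a minimal sketch of the idea (not a Chrome change; just a plain socket with a small SO_RCVBUF, using an assumed target of a few kB based on the BDP estimate earlier):

```python
# Cap the client-side receive buffer so the advertised TCP receive window
# stays near the ~2.5 kB bandwidth-delay product estimated earlier.
# The kernel derives the advertised window from SO_RCVBUF (with its own
# minimums and scaling), so treat this as an illustration, not a tuning recipe.
import socket

TARGET_RCVBUF = 4 * 1024   # a few kB; assumed value, roughly the 2G BDP

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Set before connect() so the negotiated window scale reflects the small buffer.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, TARGET_RCVBUF)
sock.connect(("example.com", 80))   # placeholder host
sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
# With a small advertised window, the server can keep only a few packets in
# flight, which limits how much of the cell-tower buffer this flow can occupy.
```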

26 of 44

Chrome Data Saver tcptrace

27 of 44

Chrome Data Saver tcptrace

33 second delay

28 of 44

29 of 44

Bandwidth

Median bandwidth on 2G is about 100kbps.
With Chrome Data Saver enabled, bandwidth is higher: 150-200kbps.

Throughput is highly variable during the course of a page load, despite the load being heavily bandwidth-bound.

• Are we able to achieve higher throughput by using fewer network connections?
• Are we failing to fully utilize network capacity, introducing further delays?

More investigation needed.
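As a starting point, throughput over time can be bucketed directly from the tcpdump captures. A rough sketch, assuming the dpkt library and a hypothetical trace file name; "throughput" here is simply all captured bytes per bucket, with no per-direction or per-connection breakdown:

```python
from collections import Counter
import dpkt

def throughput_kbps(path, bucket_s=1.0):
    buckets = Counter()   # bucket index -> bytes captured in that interval
    start = None
    with open(path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            if start is None:
                start = ts
            buckets[int((ts - start) / bucket_s)] += len(buf)
    for b in sorted(buckets):
        kbps = buckets[b] * 8 / 1000 / bucket_s
        print(f"t={b * bucket_s:6.1f}s  {kbps:7.1f} kbps")

throughput_kbps("zeenews_2g_load.pcap")  # hypothetical trace file
```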

30 of 44

Throughput is highly variable during the lifetime of a page load.

Throughput peaks at 160kbps, drops, partially recovers, drops, recovers.

No connectivity issues. Network is active during idle periods.

What’s going on here? Traffic shaping? Move from dedicated to shared channel? TCP issues?

31 of 44

DNS, SYN, and ACK latencies increase significantly during periods of low throughput.

Delays consistently upwards of 10 seconds.

At peak, we encounter a 62 second delay between the time a client sends a packet, and the time it is acknowledged by the server.
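These handshake delays can be pulled from the same captures. For example, a sketch of the TCP SYN delay measurement (time from the first SYN to the matching SYN/ACK), again assuming dpkt and a hypothetical trace file; retransmitted SYNs share one entry, so this measures first-SYN-to-SYN/ACK, which is what the page load experiences:

```python
import dpkt

def syn_delays(path):
    pending = {}   # (src, dst, sport, dport) -> time the first SYN was sent
    delays = []
    with open(path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            ip = dpkt.ethernet.Ethernet(buf).data
            if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.tcp.TCP):
                continue
            tcp = ip.data
            syn = tcp.flags & dpkt.tcp.TH_SYN
            ack = tcp.flags & dpkt.tcp.TH_ACK
            if syn and not ack:
                pending.setdefault((ip.src, ip.dst, tcp.sport, tcp.dport), ts)
            elif syn and ack:
                key = (ip.dst, ip.src, tcp.dport, tcp.sport)  # reverse direction
                if key in pending:
                    delays.append(ts - pending.pop(key))
    return delays

print(sorted(syn_delays("airtel_2g_trace.pcap"))[-5:])  # worst five handshakes
```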

32 of 44

33 of 44

34 of 44

35 of 44

36 of 44

37 of 44

TCP delay percentiles for heavy pages (seconds)

Percentile  | TCP ACK delay (server-side) | TCP ACK delay (client-side) | DNS lookup delay | TCP SYN delay
50          | 1.3                         | 0.0                         | 1.5              | 1.1
90          | 4.9                         | 0.3                         | 6.5              | 3.6
95          | 6.4                         | 1.8                         | 20.0             | 5.5
97          | 8.0                         | 3.9                         | 28.8             | 7.7
99          | 14.2                        | 11.5                        | 55.7             | 12.3
99.5        | 16.5                        | 19.3                        | 57.5             | 14.4
99.9        | 52.1                        | 46.6                        | 57.7             | 72.9
99.99       | 126.7                       | 73.4                        | 57.7             | 85.1
Delays/page | hundreds                    | hundreds+                   | tens             | tens
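These percentile rows are tail statistics over per-event delay lists (for example, the syn_delays() output sketched earlier). A minimal example of the computation, with numpy assumed and synthetic data used purely so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake delay samples in seconds, standing in for delays extracted from traces.
syn_delay_samples = rng.exponential(scale=2.0, size=5000)

for p in (50, 90, 95, 97, 99, 99.5, 99.9, 99.99):
    print(f"p{p}: {np.percentile(syn_delay_samples, p):.1f}s")
```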

38 of 44

TCP delay percentiles for google.com (seconds)

Percentile | TCP ACK delay (server-side) | TCP ACK delay (client-side) | DNS lookup delay | TCP SYN delay
50         | 1.0                         | 0.0                         | 1.5              | 0.7
90         | 3.0                         | 0.0                         | 17.3             | 3.0
95         | 4.5                         | 0.9                         | 20.5             | 6.1
97         | 5.8                         | 1.7                         | 23.8             | 11.5
99         | 15.5                        | 4.1                         | 28.0             | 19.4
99.5       | 20.1                        | 6.1                         | 28.4             | 19.8
99.9       | 24.6                        | 10.3                        | 28.9             | 23.2
99.99      | 26.8                        | 25.1                        | 29.5             | 23.6

39 of 44

Error rates

As network round trip times increase, the likelihood of page load errors also increases.

When the phone was making heavy use of the network (such as downloading large files in the background), page load error rates were significantly higher.

This is consistent with the earlier example hypothesizing the effects of queue bloat and buffer overflows in the cell tower.

40 of 44

Next steps

How do we improve throughput and reduce latency, drops, and retransmissions?

Additional trace analysis needed:
• Why is throughput so variable?
• How can we better utilize the network?
• How can we reduce queue bloat/latency/drops?
• Need insight into radio state (QXDM)

Additional experimentation needed:
• Can we achieve consistent high throughput by pacing packets at line rate? (see the sketch below)
• Can we reduce buffering in the cell tower and end-to-end latency by reducing window size / concurrency?

Talk to Airtel and other providers to better understand their network configurations
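For the pacing experiment, one low-effort starting point is the Linux-only SO_MAX_PACING_RATE socket option, which is enforced by the fq qdisc or the kernel's internal TCP pacing. A hedged sketch: the option number is defined manually in case the Python socket module does not expose it, the host and rate are assumptions, and a real test would need this on the sending (server) side, since the client socket only paces request bytes.

```python
import socket

SO_MAX_PACING_RATE = 47               # Linux value (asm-generic/socket.h); assumed not
                                      # exposed as a constant by the socket module
LINE_RATE_BYTES_PER_S = 100_000 // 8  # ~100 kbps, from the measurements above

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Cap this socket's send rate so bursts never pile up in downstream buffers.
sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, LINE_RATE_BYTES_PER_S)
sock.connect(("example.com", 80))     # placeholder origin
sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
```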

41 of 44

Extra slides

42 of 44

tcptrace for large file download

43 of 44

tcptrace for large file download

44 of 44

Drilling down on high ACK latency