2G Network Issues
observed while in India
Bryan McQuade
NOTE: Many slides include speaker notes with additional information.
Capture overview
Captured ~100 client-side traces while on Airtel 2G and 3G in India
• 100MB of tcpdump data
• 250K packets
Captured from various locations: hotels in Pune and Delhi, vehicles in Pune
Captured page loads for popular pages in India: zeenews.india.com, Amazon, OLX, IRCTC, ESPNCricinfo, google.com
Traces include video, tcpdump, netlog, and other data
High level observations
Generally good connectivity on 3G
• Minimal network errors
• ~70ms latency
• 6Mbps down, 2Mbps up
Consistent but slow connectivity on 2G
• Elevated network error rates
• 200ms to multi-second latency
• 100kbps up/down
The remainder of this presentation focuses on 2G.
Round trip times
When not otherwise using the network, RTTs typically ~200ms.
When making use of the network (such as loading web pages), round trip times increase to multiple seconds.
We infer that:
• ~200ms is the true network latency of the 2G connection
• Multi-second RTTs are caused by delays due to queue bloat in the cell tower
Observations:
• Frequent retransmission of SYNs at 1s and 3s.
• Actual BDP is 0.2s * 100kbps = 20kb = 2.5kB = <2 packets.
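The bandwidth-delay product above can be checked with a few lines of arithmetic (a sketch; the 1500-byte MTU is an assumption):

```python
# Bandwidth-delay product of the observed 2G link.
rtt_s = 0.2              # ~200ms true network latency
bandwidth_bps = 100_000  # ~100kbps up/down

bdp_bits = rtt_s * bandwidth_bps  # 20,000 bits = 20kb
bdp_bytes = bdp_bits / 8          # 2,500 bytes = 2.5kB
packets = bdp_bytes / 1500        # assuming a 1500-byte MTU

print(bdp_bytes, packets)  # 2500.0 bytes, fewer than 2 full-size packets
```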
• SYN retransmit at 1s
• successful SYN/ACK after 1.4 seconds
• dropped packet after 2.336422s (previous segment not captured)
• receipt of retransmitted dropped packet 3.8 seconds later
• additional delays during TLS handshake
• due to drops and delays, TLS handshake never successfully completes
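The 1s and 3s SYN retransmits are consistent with a 1-second initial timeout doubled on each retry. A sketch of that schedule (the attempt count is an illustrative assumption):

```python
def syn_retransmit_times(initial_timeout=1.0, attempts=5):
    """Cumulative times at which an unanswered SYN is retransmitted,
    assuming exponential backoff that doubles the timeout each retry."""
    times, t, timeout = [], 0.0, initial_timeout
    for _ in range(attempts):
        t += timeout
        times.append(t)
        timeout *= 2  # exponential backoff
    return times

print(syn_retransmit_times())  # [1.0, 3.0, 7.0, 15.0, 31.0]
```

The first two entries match the retransmits observed in the traces.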
Packet loss, reordering, retransmits
Periodic loss of UDP (e.g. DNS) and TCP (both SYN and data) packets.
For TCP data packets, drops of multiple successive packets in a stream are not uncommon.
Latency introduced by queue bloat, combined with drops in carrier networks, leads to retransmissions, out of order delivery, and many-second delays in delivering data to the application layer.
tcptrace for espncricinfo.com JS
32 second delay
An example of how buffering can lead to persistent queuing, high latency, and packet drops.
(Diagram: server connects to the cell tower's packet buffer over a low-latency, high-bandwidth link; the tower delivers to the client over a 200ms-latency, low-bandwidth radio link.)
Client sends initial HTTP request.
Packet buffer in cell tower is empty.
First response bytes are received by the client in 200ms.
Data arrives from the server faster than it can be consumed by the client.
Packet buffer in the cell tower absorbs data burst from server.
Client sends second HTTP request.
Second HTTP response is buffered behind existing packets.
Delivery of the second response is delayed in the buffer. Multi-second latency.
Additional packet arrives, but packet buffer is full. Packet dropped at cell tower.
Unaware of drops in the cell tower, the server continues to send packets.
Buffer continues to fill. End-to-end latency remains high.
Multiple seconds later, client receives packets A, B, D.
Client notices dropped packet C.
Client sends a duplicate ACK to notify the server of the missing packet.
Server retransmits dropped packet C.
If the buffer is full, the packet may be dropped again!
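The sequence above can be sketched as a drop-tail FIFO queue (a toy model; the six-slot capacity and the packet letters are illustrative assumptions, not measured values):

```python
from collections import deque

BUFFER_SLOTS = 6  # assumed capacity of the cell tower's packet buffer

buffer = deque()
dropped = []

def arrive(pkt):
    """A packet arrives from the fast server-side link."""
    if len(buffer) >= BUFFER_SLOTS:
        dropped.append(pkt)  # drop-tail: a full buffer discards new arrivals
    else:
        buffer.append(pkt)

def drain(n=1):
    """The slow radio link delivers up to n packets to the client."""
    return [buffer.popleft() for _ in range(min(n, len(buffer)))]

# Server bursts packets A-G; the buffer absorbs six, G is dropped at the tower.
for pkt in "ABCDEFG":
    arrive(pkt)

delivered = drain(3)  # multiple seconds later, the client receives A, B, C
arrive("G")           # server retransmits G; this time there is room
print(dropped, delivered, list(buffer))
```

The key point the model captures: the server keeps the buffer full, so latency stays high and retransmitted packets race against the same full queue.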
Observations and questions
Packet buffers in cell towers stay persistently full. Leads to:
• significantly higher latency: can be 10 seconds rather than 200ms
• packet drops, retransmits, and significant delays in delivery to app layer
• no improvements in throughput!
What can Chrome do to avoid bloating buffers in the cell network?
• reduce advertised receiver window size?
• reduce number of concurrent requests in flight?
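The first lever, a smaller advertised receive window, can be approximated from userspace by shrinking the socket receive buffer, which caps how much unacknowledged data the server can keep in flight. A minimal sketch (the 4kB value is an illustrative assumption, and the kernel may round or double it):

```python
import socket

# Shrinking SO_RCVBUF limits the window the kernel advertises to the peer,
# bounding the amount of data that can pile up in the tower's buffer.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # illustrative 4kB cap

# Read back the effective value (Linux doubles it for bookkeeping overhead).
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective)
sock.close()
```

Whether this helps in practice depends on how the radio network schedules per-flow buffering, which is one of the open questions here.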
Chrome Data Saver tcptrace
33 second delay
Bandwidth
Median bandwidth on 2G is about 100kbps.
With Chrome Data Saver enabled, bandwidth is higher: 150-200kbps.
Throughput highly variable during the course of a page load, despite being heavily bandwidth-bound.
• Are we able to achieve higher throughput by using fewer network connections?
• Are we failing to fully utilize network capacity, introducing further delays?
More investigation needed.
Throughput is highly variable during the lifetime of a page load.
Throughput peaks at 160kbps, drops, partially recovers, drops, recovers.
No connectivity issues. Network is active during idle periods.
What’s going on here? Traffic shaping? Move from dedicated to shared channel? TCP issues?
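This kind of variability can be quantified from a trace by bucketing packet (timestamp, size) pairs into fixed windows (a sketch; the sample trace is made up for illustration):

```python
from collections import defaultdict

def throughput_kbps(packets, window_s=1.0):
    """Bucket (timestamp_s, size_bytes) pairs into fixed windows,
    returning the throughput in kbps for each window."""
    buckets = defaultdict(int)
    for ts, size in packets:
        buckets[int(ts // window_s)] += size
    last = int(max(ts for ts, _ in packets) // window_s)
    return [buckets[w] * 8 / 1000 / window_s for w in range(last + 1)]

# Illustrative trace: steady arrivals, a stall, then partial recovery.
trace = [(0.1, 1500), (0.5, 1500), (1.2, 1500), (2.9, 500), (3.1, 1500)]
print(throughput_kbps(trace))  # [24.0, 12.0, 4.0, 12.0] kbps per window
```

Applied to the real tcpdump data, the dips in the resulting series are the idle periods in question.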
DNS, SYN, and ACK latencies increase significantly during periods of low throughput.
Delays consistently upwards of 10 seconds.
At peak, we encounter a 62 second delay between the time a client sends a packet, and the time it is acknowledged by the server.
TCP delay percentiles for heavy pages (seconds)
Percentile | TCP ACK delay (server-side) | TCP ACK delay (client-side) | DNS lookup delay | TCP SYN delay |
50 | 1.3 | 0.0 | 1.5 | 1.1 |
90 | 4.9 | 0.3 | 6.5 | 3.6 |
95 | 6.4 | 1.8 | 20.0 | 5.5 |
97 | 8.0 | 3.9 | 28.8 | 7.7 |
99 | 14.2 | 11.5 | 55.7 | 12.3 |
99.5 | 16.5 | 19.3 | 57.5 | 14.4 |
99.9 | 52.1 | 46.6 | 57.7 | 72.9 |
99.99 | 126.7 | 73.4 | 57.7 | 85.1 |
delays / page | hundreds | hundreds+ | tens | tens |
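Percentile tables like the one above can be produced from raw per-packet delays with a nearest-rank computation (a sketch; the sample delays are illustrative, not the measured dataset):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative ACK-delay samples in seconds.
delays = [0.1, 0.3, 1.1, 1.3, 1.5, 2.0, 4.9, 6.4, 8.0, 14.2]
for p in (50, 90, 95, 99):
    print(p, percentile(delays, p))
```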
TCP delay percentiles for google.com (seconds)
Percentile | TCP ACK delay (server-side) | TCP ACK delay (client-side) | DNS lookup delay | TCP SYN delay |
50 | 1.0 | 0.0 | 1.5 | 0.7 |
90 | 3.0 | 0.0 | 17.3 | 3.0 |
95 | 4.5 | 0.9 | 20.5 | 6.1 |
97 | 5.8 | 1.7 | 23.8 | 11.5 |
99 | 15.5 | 4.1 | 28.0 | 19.4 |
99.5 | 20.1 | 6.1 | 28.4 | 19.8 |
99.9 | 24.6 | 10.3 | 28.9 | 23.2 |
99.99 | 26.8 | 25.1 | 29.5 | 23.6 |
Error rates
As network round trip times increase, the likelihood of page load errors also increases.
When the phone was making heavy use of the network (such as downloading large files in the background), page load error rates were significantly higher.
This is consistent with the earlier example hypothesizing the effects of queue bloat and buffer overflows in the cell tower.
Next steps
How do we improve throughput and reduce latency, drops, and retransmissions?
Additional trace analysis needed:
• Why is throughput so variable?
• How can we better utilize the network?
• How can we reduce queue bloat/latency/drops?
• Need insight into radio state (QXDM)
Additional experimentation needed:
• can we achieve consistent high throughput by pacing packets at line rate?
• can we reduce buffering in the cell tower and e2e latency by reducing window size / concurrency?
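The pacing experiment above could be prototyped with a simple send-time scheduler that spaces packets at the estimated line rate (a sketch; the 100kbps rate comes from the measurements, everything else is an assumption):

```python
def pace_schedule(packet_sizes, line_rate_bps=100_000, start=0.0):
    """Return a send time for each packet so the link is never oversubscribed:
    each packet departs only after the previous one has fully serialized."""
    times, t = [], start
    for size in packet_sizes:
        times.append(t)
        t += size * 8 / line_rate_bps  # serialization delay of this packet
    return times

# Ten 1500-byte packets at 100kbps: one packet roughly every 120ms,
# instead of a burst that would pile up in the tower's buffer.
print(pace_schedule([1500] * 10))
```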
Talk to Airtel and other providers to better understand their network configurations
extra slides
tcptrace for large file download
Drilling down on high ACK latency