
Network Troubleshooting: Techniques and Approaches

Eli Dart, Network Engineer

ESnet Science Engagement

Lawrence Berkeley National Laboratory

CaRCC Systems-Facing Track meeting

Virtual (Pandemic)

October 21, 2021


Outline

  • Context and Framing

  • Science Networks
    • Structure and relationship to the rest of the Internet
    • Commodity vs. R&E
    • The importance of topology

  • Common problems and how to spot them

  • Framing a troubleshooting task

  • Note: command syntax is out of scope for this talk.

2

10/20/21


The Internet

  • The Internet is composed of a large number of individual networks
    • Each is run by some entity for its own reasons
      • Google
      • US Department of Defense
      • Ford Motor Company
      • US Department of Energy
      • AT&T
    • Each network connects to others for its own reasons
  • In general, networks are more valuable when connected to each other
    • But remember – this connectivity happens for selfish reasons
    • Not all networks are the same – each exists for its own reasons


Selected networks and their missions


Notes about different networks

  • The previous diagram is a drastic simplification
  • Key points:
    • All networks exist for a specific reason
      • Some networks provide connectivity between networks
      • Some networks primarily serve their own users
      • Some networks provide services to users who access them via different networks (e.g. Google)
    • These lines are blurry, but it’s a useful way to think about it
  • Network mission influences engineering, policy, reliability, etc.
    • Not all networks are built the same way
    • Not all networks can support all use models
    • Science networks have a different traffic profile than commercial networks


Topology Matters

  • I always start with traceroute
    • No hard and fast rule about this, but I always seem to start with traceroute
    • Understand who/what is involved
    • Remember to check both directions (perfSONAR is valuable here)
  • Is this going over R&E or commodity?
    • Many shops are losing their BGP expertise
    • R&E paths often have more bandwidth and a longer AS path
      • Without policy action (e.g. localpref), commodity is often preferred
      • Commodity often comes with PMTUD breakage, packet loss, etc.
  • Invisible exchange points (Starlight, MANLAN, PacWave, Equinix, etc.)
    • Typically layer2 only, but it’s Someone Else in the path
    • When one network connects to another, it’s often at an exchange
    • Exchange points aren’t usually the problem (except when they are)
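Annotating traceroute output with who operates each hop is a quick way to see whether a path is R&E or commodity. A minimal sketch in Python – the suffix-to-organization table and the sample hops are illustrative, not real paths:

```python
import re

# Illustrative DNS-suffix-to-operator table; extend for the paths you see.
ORG_SUFFIXES = {
    "es.net": "ESnet (R&E)",
    "internet2.edu": "Internet2 (R&E)",
    "att.net": "AT&T (commodity)",
}

def orgs_in_path(traceroute_output: str) -> list[str]:
    """Extract reverse-DNS hop names from traceroute output and label
    each with the organization that (by DNS suffix) operates it."""
    hops = re.findall(r"^\s*\d+\s+(\S+)\s+\(", traceroute_output, re.M)
    labeled = []
    for name in hops:
        org = next((o for suf, o in ORG_SUFFIXES.items()
                    if name.endswith(suf)), "unknown")
        labeled.append(f"{name}: {org}")
    return labeled

# Hypothetical traceroute output for illustration.
sample = """ 1  gw.example.edu (192.0.2.1)  0.5 ms
 2  chic-cr5.es.net (198.51.100.2)  11.2 ms
 3  eqch.att.net (203.0.113.3)  12.9 ms"""

for line in orgs_in_path(sample):
    print(line)
```

Remember that this only characterizes one direction; the reverse path needs its own traceroute from the far end.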


Common Problems

  • Poor tuning → poor performance
    • This has become less of a problem thanks to the tuning guidance on fasterdata.es.net
    • Still happens sometimes
    • Easy to spot for TCP – throughput on a loss-free path is window-limited, so it falls off inversely with latency
  • Routing problems (e.g. commodity vs. R&E)
    • Rate limiting, packet loss on commodity
    • Data transfers mistaken for denial of service by commodity ISPs and dropped
  • Tool choice
    • Easy to fix in some environments (use Globus, Aspera, …)
    • Impossible in other environments (e.g. SSH or nothing, in-browser HTTP or nothing)
  • No Science DMZ
    • Prevented by funding
    • Prevented by policy
    • Prevented by process
  • Packet loss → poor performance
    • Note that the *reason* for the packet loss varies widely
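The tuning point above can be made concrete with the bandwidth-delay product: an untuned socket buffer caps TCP at one window per round trip, regardless of link speed. A short sketch (the 10 Gb/s / 50 ms / 4 MiB numbers are illustrative):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the buffer needed to keep the pipe full."""
    return bandwidth_bps / 8 * rtt_s

def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Loss-free TCP throughput when capped by the socket buffer:
    one window delivered per round trip."""
    return window_bytes * 8 / rtt_s

# A 10 Gb/s path at 50 ms RTT needs roughly a 62.5 MB window to fill the
# pipe; an untuned ~4 MiB buffer caps the same path near 670 Mb/s.
need = bdp_bytes(10e9, 0.050)
got = window_limited_bps(4 * 2**20, 0.050)
```

This is why the symptom is so recognizable: the same host performs fine on the LAN and badly on every long path, in proportion to the RTT.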


Packet Loss

[Figure: measured TCP throughput vs. path length (local/LAN, metro area, regional, continental, international) in the presence of a small amount of packet loss – measured TCP Reno, measured HTCP, and measured loss-free curves, with the theoretical TCP Reno curve for comparison.]

See Eli Dart, Lauren Rotman, Brian Tierney, Mary Hester, and Jason Zurawski. The Science DMZ: A Network Design Pattern for Data-Intensive Science. In Proceedings of the IEEE/ACM Annual SuperComputing Conference (SC13), Denver CO, 2013.
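The theoretical TCP Reno curve in the figure comes from the Mathis et al. steady-state model: throughput scales as MSS/(RTT·√p). A sketch showing why the same tiny loss rate that is invisible on a LAN is devastating on a continental path (the 1 ms / 88 ms / 0.01% numbers are illustrative):

```python
import math

def mathis_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. model for TCP Reno steady-state throughput:
    rate ≈ (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2) ≈ 1.22."""
    C = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * (C / math.sqrt(loss_rate))

# Same 0.01% loss rate, MSS of 1460 bytes:
lan = mathis_bps(1460, 0.001, 1e-4)  # 1 ms RTT (LAN): multi-Gb/s
wan = mathis_bps(1460, 0.088, 1e-4)  # 88 ms RTT (continental): tens of Mb/s
# The achievable rate drops by the ratio of the RTTs – 88x here.
```

This is the quantitative core of the Science DMZ argument: loss that tests clean locally still destroys wide-area throughput.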


Common Causes of Packet Loss

  • Fan-in or speed reduction on cheap TOR or cut-through switches
    • Typically at or close to the end host (“WAN problem” isn’t actually a WAN problem)
    • Multiple ingress flows destined for a common egress interface
    • Speed change: 100G in, 10G out → packet loss from 100G DTNs
    • NANOG paper (Michael Smitasin, Brian Tierney): https://www.es.net/assets/pubs_presos/NANOG64mnsmitasinbltierneybuffersize.pptx.pdf
  • Dirty fiber, failing optics – just have to find these (more on that in a bit)
  • End host config (e.g. insufficient receiver resources – NIC, kernel, etc.)
    • Be careful to distinguish between the switch port and the host
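The speed-change case can be quantified: when 100G ingress feeds a 10G egress, the output queue fills at the rate difference, so a shallow-buffered cut-through switch runs out of headroom almost immediately. A sketch (the 12 MB shared-buffer figure is illustrative of a cheap TOR switch):

```python
def burst_absorb_time_s(buffer_bytes: float, in_bps: float,
                        out_bps: float) -> float:
    """How long a switch buffer can absorb a line-rate burst when
    ingress exceeds egress: the queue fills at (in - out) bits/s."""
    return buffer_bytes * 8 / (in_bps - out_bps)

# 100G DTN bursting into a 10G egress through a ~12 MB shared buffer:
# the queue fills in about a millisecond, then tail drop begins.
t = burst_absorb_time_s(12e6, 100e9, 10e9)
```

Combined with the loss-vs-latency math above a single switch like this near the end host can dominate wide-area performance, which is why "WAN problems" so often turn out to be local.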


Approaching WAN Performance Problems

  • Again – start with topology
    • What is talking to what, who is talking to whom
    • Each direction must be characterized separately (path may be different!)
  • Often DTNs aren’t great troubleshooting devices
    • Especially sealed DTNs (e.g. for data portals)
    • Find perfSONAR hosts near them
    • http://stats.es.net/ServicesDirectory/ (go to stats.es.net – easy to type)
  • If the problem isn’t trivial, a pedantically-accurate map is incredibly valuable
    • Every connection (sometimes people will find lost routers!)
    • Every organization in the path
    • Compare perfSONAR vs. DTN results – if perfSONAR works and the DTNs don’t, then the problem is probably near the DTN


Real-World Example – Using perfSONAR

  • Methodology is important
  • Segment-to-segment testing is unlikely to be helpful
    • TCP dynamics will be different, and in this case all the pieces do not equal the whole
      • E.g. high throughput on a 1ms path with high packet loss vs. the same segment in a longer 20ms path
    • Problem links can test clean over short distances
    • An exception to this is hops that go through a firewall
  • Run long-distance tests
    • Run the longest clean test you can, then look for the shortest dirty test that includes the path of the clean test
  • In order for this to work, the test hosts must already be deployed when you start troubleshooting
    • ESnet has at least one perfSONAR host at each hub location.
      • Many (most?) R&E providers in the world have deployed at least one
    • Deployment of test and measurement infrastructure dramatically reduces time to resolution
      • Otherwise the problem resolution is burdened by a deployment effort
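The "longest clean test, shortest dirty test" methodology is effectively a set difference on hops. A minimal sketch (hop names are hypothetical):

```python
def suspect_hops(clean_path: list[str], dirty_path: list[str]) -> list[str]:
    """Given the longest clean test and the shortest dirty test that
    contains it, the suspects are the hops exercised only by the
    dirty test."""
    covered = set(clean_path)
    return [hop for hop in dirty_path if hop not in covered]

# Clean test covered B-C-D; dirty test covered A-B-C-D-E:
# the problem is most likely near A or E.
print(suspect_hops(["B", "C", "D"], ["A", "B", "C", "D", "E"]))  # ['A', 'E']
```

Each new perfSONAR vantage point shrinks the suspect set further, which is why pre-deployed testers matter so much.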

11 – ESnet Science Engagement (engage@es.net) - 10/20/21


Wide Area Testing – User Problem Statement


Wide Area Testing – Full Context


Wide Area Testing – Long Clean Test


Wide Area Testing – Dirty Tests


Wide Area Testing – Problem Localization


Slow tests indicate likely problem area


Example: Chile to California via Miami

  • Test from Atacama Desert (for telescope data) to NERSC at LBL
  • This is the diagram after things were working
  • Temporary 10GE link bypassed switch problems in Santiago
  • Able to show performance baseline from mountain to NERSC


Example: KSTAR (Fusion) to DOE HPC – 2013

  • Test from Korean superconducting tokamak to DOE HPC at NERSC and ORNL
  • Workflow: transfer experiment data, analyze on HPC for experiment guidance
  • KSTAR security was very tight, challenge for perfSONAR
  • KREONET is now 100G across Pacific
  • This work has resulted in multiple papers in fusion journals – a key capability looking forward to ITER


Example: LHC Data – Europe to Pakistan

  • Transfer of LHC data from multiple European WLCG sites to Pakistan site
  • Multiple challenges – packet loss, congestion, packet filters
  • Provisioning issue: 1GE for everything between PERN and TEIN
  • Latency issue: commercial carriers not constrained by R&E politics
  • Key points: congestion causes packet loss, politics affect routing, R&E paths are only better if they are resourced properly


Example: Petascale DTN


  • Data transfer between HPC DTN clusters
  • Very little WAN troubleshooting
  • Lots of DTN and filesystem work
  • Site network upgrades at multiple facilities contributed improvements
  • DTN cluster upgrades contributed improvements
  • DTN design issues were key


Managing Sociology

  • Performance troubleshooting is a socio-technical effort
    • Very important that people are willing to work together
    • Important to avoid shaming or insulting people
      • This seems obvious, but it’s sometimes hard because of cultural differences
      • I have made this mistake multiple times
    • Culture of collaboration is a strength of the R&E community
  • Making maps for troubleshooting can be challenging
    • Most people won’t give you their maps, because they are for internal consumption. You will have to draw the diagram based on input.
    • Often it’s easiest to draw something that’s wrong and send it out. People will correct you when they otherwise won’t contribute.
    • Don’t make them draw it in Visio or whatever – have them print it, sketch the corrections, and send you a picture with their phone. Then you incorporate the changes.
    • Learn to draw maps from traceroute output – sometimes they will send you a traceroute but won’t send you a map.
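Drawing a first-draft map from traceroute output can be automated; even a crude chain diagram gives people something concrete to correct. A sketch that emits Graphviz DOT (the sample hops are hypothetical):

```python
import re

def traceroute_to_dot(traceroute_output: str) -> str:
    """Turn the hop names in traceroute output into a Graphviz DOT
    chain, one edge per consecutive pair of hops."""
    hops = re.findall(r"^\s*\d+\s+(\S+)", traceroute_output, re.M)
    edges = "\n".join(f'  "{a}" -> "{b}";' for a, b in zip(hops, hops[1:]))
    return "digraph path {\n" + edges + "\n}"

# Hypothetical traceroute output for illustration.
sample = """ 1  gw.example.edu (192.0.2.1)  0.5 ms
 2  chic-cr5.es.net (198.51.100.2)  11.2 ms
 3  sunn-cr5.es.net (198.51.100.3)  62.0 ms"""

print(traceroute_to_dot(sample))
```

Render the output with `dot -Tpng`, send the picture out, and let people mark it up – deliberately-wrong-but-correctable beats blank-page every time.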


Community Resources

  • Most people in R&E networking are here for a reason
    • We believe in the research/science mission of our institutions
    • We want humanity to progress
    • We understand that cyberinfrastructure contributes to humanity
  • People want their stuff to work and be useful
    • Most people will work with you if approached correctly
      • “Hi there! I’m seeing something odd – can you please take a look at X?”
  • There are also community resources to help manage multiparty efforts
    • DOE space: ESnet
      • engage@es.net – science engagement group
      • trouble@es.net – opens a ticket with our NOC
    • NSF/University space: EPOC (handoff to Brenna Meade)


Thanks!

Eli Dart

Energy Sciences Network (ESnet)

Lawrence Berkeley National Laboratory