
Network Troubleshooting: Techniques and Approaches

Eli Dart, Network Engineer

ESnet Science Engagement

Lawrence Berkeley National Laboratory

CaRCC Systems-Facing Track meeting

Virtual (Pandemic)

October 21, 2021


Outline

  • Context and Framing

  • Science Networks
    • Structure and relationship to the rest of the Internet
    • Commodity vs. R&E
    • The importance of topology

  • Common problems and how to spot them

  • Framing a troubleshooting task

  • Note: command syntax is out of scope for this talk.

2

10/20/21


The Internet

  • The Internet is composed of a large number of individual networks
    • Each is run by some entity for its own reasons
      • Google
      • US Department of Defense
      • Ford Motor Company
      • US Department of Energy
      • AT&T
    • Each network connects to others for its own reasons
  • In general, networks are more valuable when connected to each other
    • But remember – this connectivity happens for selfish reasons
    • Not all networks are the same – each exists for its own reasons


Selected networks and their missions


Notes about different networks

  • The previous diagram is a drastic simplification
  • Key points:
    • All networks exist for a specific reason
      • Some networks provide connectivity between networks
      • Some networks primarily serve their own users
      • Some networks provide services to users who access them via different networks (e.g. Google)
    • These lines are blurry, but it’s a useful way to think about it
  • Network mission influences engineering, policy, reliability, etc.
    • Not all networks are built the same way
    • Not all networks can support all use models
    • Science networks have a different traffic profile than commercial networks


Topology Matters

  • I always start with traceroute
    • No hard and fast rule about this, but I always seem to start with traceroute
    • Understand who/what is involved
    • Remember to check both directions (perfSONAR is valuable here)
  • Is this going over R&E or commodity?
    • Many shops are losing their BGP expertise
    • R&E paths often have more bandwidth and a longer AS path
      • Without policy action (e.g. localpref), commodity is often preferred
      • Commodity often comes with PMTUD breakage, packet loss, etc.
  • Invisible exchange points (Starlight, MANLAN, PacWave, Equinix, etc.)
    • Typically layer2 only, but it’s Someone Else in the path
    • When one network connects to another, it’s often at an exchange
    • Exchange points aren’t usually the problem (except when they are)
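Annotating traceroute output with who operates each hop is a quick way to see whether a path is R&E or commodity. A minimal sketch in Python – the suffix-to-organization table and the sample hops are illustrative, not real paths:

```python
import re

# Illustrative DNS-suffix-to-operator table; extend for the paths you see.
ORG_SUFFIXES = {
    "es.net": "ESnet (R&E)",
    "internet2.edu": "Internet2 (R&E)",
    "att.net": "AT&T (commodity)",
}

def orgs_in_path(traceroute_output: str) -> list[str]:
    """Extract reverse-DNS hop names from traceroute output and label
    each with the organization that (by DNS suffix) operates it."""
    hops = re.findall(r"^\s*\d+\s+(\S+)\s+\(", traceroute_output, re.M)
    labeled = []
    for name in hops:
        org = next((o for suf, o in ORG_SUFFIXES.items()
                    if name.endswith(suf)), "unknown")
        labeled.append(f"{name}: {org}")
    return labeled

# Hypothetical traceroute output for illustration.
sample = """ 1  gw.example.edu (192.0.2.1)  0.5 ms
 2  chic-cr5.es.net (198.51.100.2)  11.2 ms
 3  eqch.att.net (203.0.113.3)  12.9 ms"""

for line in orgs_in_path(sample):
    print(line)
```

Remember that this only characterizes one direction; the reverse path needs its own traceroute from the far end.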


Common Problems

  • Poor tuning → poor performance
    • This has become less of a problem thanks to the tuning guidance on fasterdata.es.net
    • Still happens sometimes
    • Easy to spot for TCP – throughput on a loss-free path is window-limited, so it falls off inversely with latency
  • Routing problems (e.g. commodity vs. R&E)
    • Rate limiting, packet loss on commodity
    • Data transfers mistaken for denial of service by commodity ISPs and dropped
  • Tool choice
    • Easy to fix in some environments (use Globus, Aspera, …)
    • Impossible in other environments (e.g. SSH or nothing, in-browser HTTP or nothing)
  • No Science DMZ
    • Prevented by funding
    • Prevented by policy
    • Prevented by process
  • Packet loss → poor performance
    • Note that the *reason* for the packet loss varies widely
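The tuning point above can be made concrete with the bandwidth-delay product: an untuned socket buffer caps TCP at one window per round trip, regardless of link speed. A short sketch (the 10 Gb/s / 50 ms / 4 MiB numbers are illustrative):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the buffer needed to keep the pipe full."""
    return bandwidth_bps / 8 * rtt_s

def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Loss-free TCP throughput when capped by the socket buffer:
    one window delivered per round trip."""
    return window_bytes * 8 / rtt_s

# A 10 Gb/s path at 50 ms RTT needs roughly a 62.5 MB window to fill the
# pipe; an untuned ~4 MiB buffer caps the same path near 670 Mb/s.
need = bdp_bytes(10e9, 0.050)
got = window_limited_bps(4 * 2**20, 0.050)
```

This is why the symptom is so recognizable: the same host performs fine on the LAN and badly on every long path, in proportion to the RTT.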


Packet Loss

[Figure: measured TCP throughput vs. path length (local/LAN, metro area, regional, continental, international) in the presence of a small amount of packet loss – measured TCP Reno, measured HTCP, and measured loss-free curves, with the theoretical TCP Reno curve for comparison.]

See Eli Dart, Lauren Rotman, Brian Tierney, Mary Hester, and Jason Zurawski. The Science DMZ: A Network Design Pattern for Data-Intensive Science. In Proceedings of the IEEE/ACM Annual SuperComputing Conference (SC13), Denver CO, 2013.
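The theoretical TCP Reno curve in the figure comes from the Mathis et al. steady-state model: throughput scales as MSS/(RTT·√p). A sketch showing why the same tiny loss rate that is invisible on a LAN is devastating on a continental path (the 1 ms / 88 ms / 0.01% numbers are illustrative):

```python
import math

def mathis_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. model for TCP Reno steady-state throughput:
    rate ≈ (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2) ≈ 1.22."""
    C = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * (C / math.sqrt(loss_rate))

# Same 0.01% loss rate, MSS of 1460 bytes:
lan = mathis_bps(1460, 0.001, 1e-4)  # 1 ms RTT (LAN): multi-Gb/s
wan = mathis_bps(1460, 0.088, 1e-4)  # 88 ms RTT (continental): tens of Mb/s
# The achievable rate drops by the ratio of the RTTs – 88x here.
```

This is the quantitative core of the Science DMZ argument: loss that tests clean locally still destroys wide-area throughput.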


Common Causes of Packet Loss

  • Fan-in or speed reduction on cheap TOR or cut-through switches
    • Typically at or close to the end host (“WAN problem” isn’t actually a WAN problem)
    • Multiple ingress flows destined for a common egress interface
    • Speed change: 100G in, 10G out → packet loss from 100G DTNs
    • NANOG paper (Michael Smitasin, Brian Tierney): https://www.es.net/assets/pubs_presos/NANOG64mnsmitasinbltierneybuffersize.pptx.pdf
  • Dirty fiber, failing optics – just have to find these (more on that in a bit)
  • End host config (e.g. insufficient receiver resources – NIC, kernel, etc.)
    • Be careful to distinguish between the switch port and the host
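The speed-change case can be quantified: when 100G ingress feeds a 10G egress, the output queue fills at the rate difference, so a shallow-buffered cut-through switch runs out of headroom almost immediately. A sketch (the 12 MB shared-buffer figure is illustrative of a cheap TOR switch):

```python
def burst_absorb_time_s(buffer_bytes: float, in_bps: float,
                        out_bps: float) -> float:
    """How long a switch buffer can absorb a line-rate burst when
    ingress exceeds egress: the queue fills at (in - out) bits/s."""
    return buffer_bytes * 8 / (in_bps - out_bps)

# 100G DTN bursting into a 10G egress through a ~12 MB shared buffer:
# the queue fills in about a millisecond, then tail drop begins.
t = burst_absorb_time_s(12e6, 100e9, 10e9)
```

Combined with the loss-vs-latency math above a single switch like this near the end host can dominate wide-area performance, which is why "WAN problems" so often turn out to be local.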


Approaching WAN Performance Problems

  • Again – start with topology
    • What is talking to what, who is talking to whom
    • Each direction must be characterized separately (path may be different!)
  • Often DTNs aren’t great troubleshooting devices
    • Especially sealed DTNs (e.g. for data portals)
    • Find perfSONAR hosts near them
    • http://stats.es.net/ServicesDirectory/ (go to stats.es.net – easy to type)
  • If the problem isn’t trivial, a pedantically-accurate map is incredibly valuable
    • Every connection (sometimes people will find lost routers!)
    • Every organization in the path
    • Compare perfSONAR vs. DTN results – if perfSONAR works and the DTNs don’t, then the problem is probably near the DTN


Real-World Example – Using perfSONAR

  • Methodology is important
  • Segment-to-segment testing is unlikely to be helpful
    • TCP dynamics will be different, and in this case all the pieces do not equal the whole
      • E.g. high throughput on a 1ms path with high packet loss vs. the same segment in a longer 20ms path
    • Problem links can test clean over short distances
    • An exception to this is hops that go through a firewall
  • Run long-distance tests
    • Run the longest clean test you can, then look for the shortest dirty test that includes the path of the clean test
  • In order for this to work, the test hosts must already be deployed when you start troubleshooting
    • ESnet has at least one perfSONAR host at each hub location.
      • Many (most?) R&E providers in the world have deployed at least one
    • Deployment of test and measurement infrastructure dramatically reduces time to resolution
      • Otherwise the problem resolution is burdened by a deployment effort
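The "longest clean test, shortest dirty test" methodology is effectively a set difference on hops. A minimal sketch (hop names are hypothetical):

```python
def suspect_hops(clean_path: list[str], dirty_path: list[str]) -> list[str]:
    """Given the longest clean test and the shortest dirty test that
    contains it, the suspects are the hops exercised only by the
    dirty test."""
    covered = set(clean_path)
    return [hop for hop in dirty_path if hop not in covered]

# Clean test covered B-C-D; dirty test covered A-B-C-D-E:
# the problem is most likely near A or E.
print(suspect_hops(["B", "C", "D"], ["A", "B", "C", "D", "E"]))  # ['A', 'E']
```

Each new perfSONAR vantage point shrinks the suspect set further, which is why pre-deployed testers matter so much.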

11 – ESnet Science Engagement (engage@es.net) - 10/20/21


Wide Area Testing – User Problem Statement


Wide Area Testing – Full Context


Wide Area Testing – Long Clean Test


Wide Area Testing – Dirty Tests


Wide Area Testing – Problem Localization


Slow tests indicate likely problem area


Example: Chile to California via Miami

  • Test from Atacama Desert (for telescope data) to NERSC at LBL
  • This is the diagram after things were working
  • Temporary 10GE link bypassed switch problems in Santiago
  • Able to show performance baseline from mountain to NERSC


Example: KSTAR (Fusion) to DOE HPC – 2013

  • Test from Korean superconducting tokamak to DOE HPC at NERSC and ORNL
  • Workflow: transfer experiment data, analyze on HPC for experiment guidance
  • KSTAR security was very tight, challenge for perfSONAR
  • KREONET is now 100G across Pacific
  • This work has resulted in multiple papers in fusion journals – a key capability looking forward to ITER


Example: LHC Data – Europe to Pakistan

  • Transfer of LHC data from multiple European WLCG sites to Pakistan site
  • Multiple challenges – packet loss, congestion, packet filters
  • Provisioning issue: 1GE for everything between PERN and TEIN
  • Latency issue: commercial carriers not constrained by R&E politics
  • Key points: congestion causes packet loss, politics affect routing, R&E paths are only better if they are resourced properly


Example: Petascale DTN


  • Data transfer between HPC DTN clusters
  • Very little WAN troubleshooting
  • Lots of DTN and filesystem work
  • Site network upgrades at multiple facilities contributed improvements
  • DTN cluster upgrades contributed improvements
  • DTN design issues were key


Managing Sociology

  • Performance troubleshooting is a socio-technical effort
    • Very important that people are willing to work together
    • Important to avoid shaming or insulting people
      • This seems obvious, but it’s sometimes hard because of cultural differences
      • I have made this mistake multiple times
    • Culture of collaboration is a strength of the R&E community
  • Making maps for troubleshooting can be challenging
    • Most people won’t give you their maps, because they are for internal consumption. You will have to draw the diagram based on input.
    • Often it’s easiest to draw something that’s wrong and send it out. People will correct you when they otherwise won’t contribute.
    • Don’t make them draw it in Visio or whatever – have them print it, sketch the corrections, and send you a picture with their phone. Then you incorporate the changes.
    • Learn to draw maps from traceroute output – sometimes they will send you a traceroute but won’t send you a map.
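Drawing a first-draft map from traceroute output can be automated; even a crude chain diagram gives people something concrete to correct. A sketch that emits Graphviz DOT (the sample hops are hypothetical):

```python
import re

def traceroute_to_dot(traceroute_output: str) -> str:
    """Turn the hop names in traceroute output into a Graphviz DOT
    chain, one edge per consecutive pair of hops."""
    hops = re.findall(r"^\s*\d+\s+(\S+)", traceroute_output, re.M)
    edges = "\n".join(f'  "{a}" -> "{b}";' for a, b in zip(hops, hops[1:]))
    return "digraph path {\n" + edges + "\n}"

# Hypothetical traceroute output for illustration.
sample = """ 1  gw.example.edu (192.0.2.1)  0.5 ms
 2  chic-cr5.es.net (198.51.100.2)  11.2 ms
 3  sunn-cr5.es.net (198.51.100.3)  62.0 ms"""

print(traceroute_to_dot(sample))
```

Render the output with `dot -Tpng`, send the picture out, and let people mark it up – deliberately-wrong-but-correctable beats blank-page every time.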


Community Resources

  • Most people in R&E networking are here for a reason
    • We believe in the research/science mission of our institutions
    • We want humanity to progress
    • We understand that cyberinfrastructure contributes to humanity
  • People want their stuff to work and be useful
    • Most people will work with you if approached correctly
      • “Hi there! I’m seeing something odd – can you please take a look at X?”
  • There are also community resources to help manage multiparty efforts
    • DOE space: ESnet
      • engage@es.net – science engagement group
      • trouble@es.net – opens a ticket with our NOC
    • NSF/University space: EPOC (handoff to Brenna Meade)


Thanks!

Eli Dart

Energy Sciences Network (ESnet)

Lawrence Berkeley National Laboratory