1 of 45

perfSONAR Monitoring Update

Shawn McKee / U Michigan, Marian Babik / CERN

on behalf of WLCG Network Throughput WG

2024

At the #52 LHCOPN/LHCONE Meeting, Catania, Italy

https://indico.cern.ch/event/1349135/

2 of 45

Outline

  • Infrastructure Status and Updates
  • DC24 Review
  • Network Analytics

#52 LHCOPN/LHCONE Mtg

3 of 45

perfSONAR News

  • perfSONAR 5.0.8 is the latest release
    • A number of bug fixes since 5.0
    • Weekly meetings with the developers
    • Update campaign in WLCG
      • Various issues, mostly archiving, but also e.g. legacy limits configuration (fix)
    • Toolkit support for the latest CC7 and for Alma/Rocky 8- and 9-compatible systems (Alma).
      • Sites should plan to update by June (end-of-life for CentOS 7)
  • 5.1 Beta Release
    • New Grafana interface - replacing toolkit and MaDDash graphs
    • Threaded iperf3 support
    • Enhanced instrumentation, better troubleshooting of archiving issues
    • OS support: Alma/Rocky 9, Debian 11/12, Ubuntu 20/22 (updated docker)
    • No support for CentOS 7, Debian 10, Ubuntu 18


4 of 45

Network Measurement Platform Status

  • Our platform collects, stores, configures and transports all network metrics
  • Evolution based on perfSONAR 5 is already partially implemented:
    • Now directly publishing results from perfSONARs to ES@UC
    • Collector used only as a fallback
    • WLCG CRIC now used for topology

[Diagram labels: Collector; Store (long-term); Store (short-term); pS Monitoring; pS Configuration; Tape; Experiments, Sites, NRENs; pS Dashboard; HTTParchiver]


5 of 45

Network Measurement Platform Plans

  • Evolution based on perfSONAR 5 is already partially implemented.
    • Forwarding to UNL and backup to FNAL still to be implemented
    • pS Monitoring - update to the latest Checkmk and enable SSO authentication
    • pS Dashboard - integrate with Analytics Platform/Grafana (retire MaDDash)
    • pS Configuration - clarify development roadmap and support

[Diagram labels: Collector; Store (long-term); Network Analytics Platform; pS Monitoring; pS Configuration; Tape; Experiments, Sites, NRENs; pS Dashboard; HTTParchiver]


6 of 45

perfSONAR Infrastructure

  • Most of the infrastructure is already on 5.0.8, but only a small fraction runs EL8/9; a significant fraction is still on CentOS 7
    • We still have some perfSONAR 4.0 instances!
  • About 17% of the infrastructure now has 100Gbps (a few hosts have 200Gbps)
    • Core deployments still run on 10Gbps, which is sufficient for WLCG/OSG testing purposes
    • Important to refresh HW along with the update to 5.1
  • MTU - around 40% on jumbo frames (9000), the rest on standard frames (1500)
  • We have a small testbed of about 10 perfSONARs with BBRv3 enabled
    • Enables testing of TCP congestion algorithm benefits and jumbo frame trade-offs
    • Open for participation


7 of 45

DC24

8 of 45

WLCG DC24

WLCG Data Challenge 2024 (DC24) took place in February 2024, targeting 25% of HL-LHC

Our DC24 plans included the following:

  • Update and utilize perfSONAR to clean up links and fix problems before DC24.
  • Instrument and document site networks, for at least our largest sites.
  • Network planning: we need to make sure our sites and their local and regional networks are aware of our requirements and timeline, and are planning appropriately.
  • IPv6 should be enabled everywhere, not just because of packet marking but because it will allow us to get back to a single stack sooner!


9 of 45

psDash Network Status

Splits metrics into Infrastructure (pS issues) and Network-related (e.g. BW drop)

Classifies metrics into critical, warning and ok, and aggregates them into a status

Network Status dashboard - part of Network Analytics platform - shows network performance based on perfSONAR measurements. Status (ok/warning/critical/unknown) aggregates network and infrastructure metrics.
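
The classify-then-aggregate idea can be illustrated with a minimal sketch. This is not the psDash implementation: the thresholds, the "larger is worse" metric convention, the "worst status wins" rule and all function names are assumptions made for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    UNKNOWN = -1
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify(value, warning, critical):
    """Classify one metric where larger values are worse (e.g. packet loss %)."""
    if value is None:
        return Status.UNKNOWN
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.OK


def aggregate(statuses):
    """Worst known status wins; UNKNOWN only if nothing could be classified."""
    known = [s for s in statuses if s is not Status.UNKNOWN]
    return max(known) if known else Status.UNKNOWN


def site_status(infrastructure, network):
    """Aggregate the two metric groups separately, then into one overall status."""
    infra = aggregate(infrastructure)
    net = aggregate(network)
    return {"infrastructure": infra, "network": net,
            "overall": aggregate([infra, net])}
```

Keeping the infrastructure and network groups separate until the last step preserves the distinction between "the perfSONAR host is broken" and "the network is degraded".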


10 of 45

Site Network Utilisation

Grafana dashboard showing inbound/outbound network utilisation based on data exposed by sites (SNMP counters exposed via a JSON API)

Site Network Utilisation - computed from aggregated utilisation (SNMP counters) provided by sites via simple API. Screenshot shows network utilisation during DC24 as seen by the sites.
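
As a rough sketch of how utilisation can be derived from SNMP octet counters: take two samples, convert the counter difference to bits per second, and divide by link capacity. The field names (`ts`, `in_octets`, `out_octets`) are hypothetical; the JSON schema that sites actually expose is not described here.

```python
def rate_bps(prev_octets, curr_octets, prev_ts, curr_ts, counter_bits=64):
    """Average bit rate between two octet-counter samples, tolerating one wrap."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_octets - prev_octets
    if delta < 0:  # counter wrapped around between the two samples
        delta += 2 ** counter_bits
    return delta * 8 / elapsed


def utilisation(prev, curr, capacity_bps):
    """Fraction of link capacity used inbound/outbound between two samples.

    prev/curr are dicts with the hypothetical keys ts, in_octets, out_octets.
    """
    result = {}
    for key in ("in_octets", "out_octets"):
        bps = rate_bps(prev[key], curr[key], prev["ts"], curr["ts"])
        result[key.replace("_octets", "")] = bps / capacity_bps
    return result
```

For example, 75 GB received in 60 seconds on a 100 Gbps link corresponds to 10 Gbps, i.e. an inbound utilisation of 0.1.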


11 of 45

Analytics

12 of 45

Alarms & Alerts Interface

Two main improvements needed: acknowledging alerts that are being worked on, and adding user notification mailing lists

https://psa.osg-htc.org

(Uses eduGAIN/InCommon)

Purpose: provides user-subscribable alerting for specific types of network issues found by analyzing perfSONAR data


13 of 45

Subscription Interface


14 of 45

Alarm Types and Relation to perfSONAR Data


15 of 45

psDash Alarms Dashboard


16 of 45

Network Analytics R&D

  • Investigate ML models/methods to process network measurements
  • Data-preprocessing, e.g.
    • Train neural networks to predict network paths, e.g. help us fill the gaps in traceroute(s)
  • Build model(s) that represents our network(s)
    • Network measurements are inherently noisy and therefore require robust models
  • Use ML models for anomaly detection (for alerts & alarms)
    • Neural networks (which ones?), Bayesian/probabilistic approaches
    • Detect anomalies in network paths and bandwidth measurements
    • Compare with the existing heuristic algorithms that we have developed
  • Correlate with other data
    • Traceroutes with throughput for example, but also outside of perfSONAR, e.g. FTS
    • New types of data appearing (high-touch, scitags, in-band telemetry, etc.)
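
Because the measurements are noisy, any ML model needs a simple, robust baseline to compare against. The sketch below flags outliers using the median and the median absolute deviation (MAD). It is only an illustrative heuristic, not the working group's existing algorithm; the function name and the 3.5 threshold are assumptions.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the values are identical: the score is undefined.
        return []
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]
```

Unlike a mean/standard-deviation test, the median and MAD are barely moved by the outliers themselves, which matters when a single bad throughput test would otherwise inflate the spread.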


17 of 45

Plans for the Analytics Platform

  • Move the anomaly detection based on Bayesian inference into production
    • Uses RTT, traceroutes, TTLs as input and detects anomalies
  • Continue working on the neural network models that correlate throughputs and traceroutes
    • Generating a real-world model of our entire network (all routers)
    • Not only detecting anomalies, but also trying to pinpoint the location of the issue

  • Improve infrastructure alarming to the point where we can reliably differentiate infrastructure and network issues
  • Network availability dashboard in production


18 of 45

Summary

  • Updates to perfSONAR and OSG/WLCG network measurement platform
    • perfSONAR 5.1 is coming with new features and will require all sites to update OS.
    • Plan to adapt the network measurement platform to benefit from changes in 5.1
  • Ongoing efforts in network analytics and ML methods for our data
    • Focus on pre-processing (gaps, predictive models) and anomaly detection
    • Opportunity to collaborate on models and data sets
  • We are preparing monthly meetings with site network teams:
    • Discuss how sites are deploying, managing and planning for WLCG networking requirements
    • Next meeting: April 18th, 10am EST (to join, email wlcg-site-net-requests@umich.edu)
  • We have to continue to watch our network monitoring infrastructure as it is a complex system with lots of areas for issues to develop.

Questions / Discussion?


19 of 45

Acknowledgements

We would like to thank the WLCG, HEPiX, perfSONAR and OSG organizations for their work on the topics presented.

In addition we want to explicitly acknowledge the support of the National Science Foundation which supported this work via:

  • OSG: NSF MPS-1148698
  • IRIS-HEP: NSF OAC-1836650


20 of 45

Useful URLs


21 of 45

Backup Slides Follow

22 of 45

Tools and Applications for Network Data

  • To organize access to all the various resources we recommend using our Toolkitinfo page: https://toolkitinfo.opensciencegrid.org/
  • Reminder: we already have Kibana dashboards looking at
  • For this meeting we want to give an update on our recent work towards a user-subscribable alerting and alarming service
    • User interface to subscribe is AAAS (ATLAS Alerting and Alarming Service)
    • Tool to explore alerts is pS-Dash (Plotly-based perfSONAR dashboard UI tool)


23 of 45

Alarms & Alerts Service

https://psa.osg-htc.org

(Uses eduGAIN/InCommon)

Purpose: provides user-subscribable alerting for specific types of network issues found by analyzing perfSONAR data


24 of 45

Alarms & Alerts Service

The Alerting and Alarming Tools Subscription Interface


25 of 45

Alarm Types and Relation to perfSONAR Data


26 of 45

pSDash (perfSONAR Dashboard)


29 of 45

WLCG perfSONAR Path Statistics

We uniquely identify each traceroute (route IP path) with a SHA1 hash.
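
A minimal sketch of such a path identifier, assuming the hop IPs are serialised as a comma-joined string before hashing (the actual serialisation used in the platform is not stated here, and `path_id` is a hypothetical name):

```python
import hashlib


def path_id(hops):
    """SHA1 hex digest of the ordered hop IPs of one traceroute."""
    joined = ",".join(hops)  # assumed serialisation: comma-joined IP strings
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()
```

Two traceroutes get the same identifier only if they traverse the same hops in the same order, so counting distinct identifiers per source-destination pair gives the number of unique paths.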

Link = “hop” (IP-to-IP)

Node = “router”

Statistics on the left concern all the “paths” we are tracking, with about 20K unique paths found.

About 50% of src-dest pairs have 4 or fewer paths.

#52 LHCOPN/LHCONE Mtg

30 of 45

AS (Autonomous System) Path Changed

ASN sequence -> Reduced ASNs

  [7896, 7896, 7896, 7896, 57, 57, 57, 293, 293, 293, 293, 293, 293, 43] -> [7896, 57, 293, 43]
  [7896, 7896, 293, 293, 293, 293, 293, 293, 43] -> [7896, 293, 43]
  [7896, 7896, 293, 293, 293, 293, 293, 293] -> [7896, 293]

Baseline: [7896, 293, 43]

ASN not in baseline: 57

Path used in:

  [7896, 293, 43]: 99.3%
  [7896, 293]: 0.3%
  [7896, 57, 293, 43]: 0.3%

NOTE: Paths denoted by route IP are too noisy; instead use AS number
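
The reduction from a per-hop ASN sequence to a reduced AS path, and the comparison against a baseline, can be sketched as follows. Choosing the most frequently seen reduced path as the baseline is an assumption made for this example, and the function names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def reduce_asns(asn_sequence):
    """Collapse runs of the same ASN: [7896, 7896, 293] -> [7896, 293]."""
    return [asn for asn, _ in groupby(asn_sequence)]


def baseline_and_changes(sequences):
    """Take the most frequent reduced path as baseline; report deviations."""
    reduced = [tuple(reduce_asns(s)) for s in sequences]
    baseline, _ = Counter(reduced).most_common(1)[0]
    changes = {}
    for path in set(reduced) - {baseline}:
        changes[path] = {
            "added": sorted(set(path) - set(baseline)),
            "missing": sorted(set(baseline) - set(path)),
        }
    return baseline, changes
```

Applied to the sequences on this slide, the path through ASN 57 is reported with 57 as an added ASN, and the path ending at 293 is reported with 43 missing.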


31 of 45

Example: LHCOPN/LHCONE Load Balancing


32 of 45

Example: LHCOPN Alternate via ESnet


33 of 45

Example: FNAL Incident (BW drop)


34 of 45

Example: Fail-over to Commodity Network (HURRICANE)


35 of 45

Challenges and Ongoing Work

IPv4 vs. IPv6: paths differ significantly


36 of 45

Correlating Tests with Paths: Two Timescales


37 of 45

Connecting Throughput to Traceroute

Our starting choice: Use both tracepaths (just before; just after) as valid paths and attribute BW to both.

Have to see if this is superior to just using the last measured route before the measurement…
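
The starting choice described above can be sketched as follows: for each throughput test, look up the path measured just before and the path measured just after, and attribute the bandwidth to both. The data layout (sorted `(timestamp, path_id)` tuples) and the function names are assumptions for illustration.

```python
from bisect import bisect_right


def bracketing_paths(traceroutes, test_time):
    """Return the paths measured just before and just after a throughput test.

    traceroutes: list of (timestamp, path_id) tuples sorted by timestamp.
    """
    times = [ts for ts, _ in traceroutes]
    i = bisect_right(times, test_time)
    paths = []
    if i > 0:
        paths.append(traceroutes[i - 1][1])  # last traceroute at or before the test
    if i < len(traceroutes):
        paths.append(traceroutes[i][1])      # first traceroute after the test
    return paths


def attribute(traceroutes, throughput_tests):
    """Map each path id to the throughput values attributed to it."""
    by_path = {}
    for test_time, mbps in throughput_tests:
        # A set avoids double-counting when both neighbours are the same path.
        for pid in set(bracketing_paths(traceroutes, test_time)):
            by_path.setdefault(pid, []).append(mbps)
    return by_path
```

Switching to the alternative mentioned above (only the last route measured before the test) would mean keeping just the first of the two lookups.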


38 of 45

Attaching Throughput Results to Sets of Routers/Links

Each colored box represents a specific router along the path


39 of 45

Example Throughput Attribution by Router

Each router on the path gets the closest (in time) throughput values
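
One way to picture this attribution, as a sketch under assumed data layouts (sorted `(timestamp, value)` throughput samples and `(timestamp, routers)` traceroutes; all names are hypothetical): for every traceroute, find the throughput test closest in time and record that value for each router on the path.

```python
from bisect import bisect_left


def nearest_value(samples, when):
    """Value of the (timestamp, value) sample closest in time to `when`.

    samples must be non-empty and sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_left(times, when)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = samples[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda s: abs(s[0] - when))[1]


def throughput_by_router(traceroutes, throughput_tests):
    """For every traceroute, give each router on it the nearest throughput."""
    series = {}
    for ts, routers in traceroutes:
        mbps = nearest_value(throughput_tests, ts)
        for router in routers:
            series.setdefault(router, []).append((ts, mbps))
    return series
```

The result is one throughput-versus-time series per router, built only from the times when that router was actually on a measured path.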


40 of 45

Checking Router Results vs Time


41 of 45

Initial Example Result: One Router; Throughput vs Time

Each point represents the throughput values collected when the node was on the path

[Plot: throughput (Mbits/sec) vs. time]


42 of 45

Other Activities / Plans

Working to organize and annotate our data for ML/AI work (Petya Vasileva)

Working with the RNTWG (see previous RNTWG update talk) on identifying and monitoring network traffic details via the SciTags initiative.

Exploring other network monitoring activities in the perfSONAR space including ARGUS

Planning to augment WLCG-CRIC network metadata (yesterday’s discussion): which paths/networks are LHCOPN / LHCONE / Research & Education / Commercial


43 of 45

Distributions of Throughput


44 of 45

WLCG Network Throughput Support Unit

Support channel where sites and experiments can report potential network performance incidents:

  • Relevant sites and (N)RENs are notified, and the perfSONAR infrastructure is used to narrow the problem down to particular link(s) and segment(s). Past incidents are also tracked.
  • Feedback to WLCG operations and the LHCOPN/LHCONE community

Most common issues: MTU, MTU + load balancing, routing (mainly remote sites), site equipment/design, firewalls, workloads causing high network usage

As there is no consensus on the MTU to recommend on the segments connecting servers and clients, an LHCOPN/LHCONE working group was established to investigate and produce a recommendation. (See the upcoming talk :) )


45 of 45

Importance of Measuring Our Networks

  • End-to-end network issues are difficult to spot and localize
    • Network problems are multi-domain, complicating the process
    • Performance issues involving the network are complicated by the number of components involved end-to-end
    • Standardizing on specific tools and methods focuses resources more effectively and provides better self-support.
  • Network problems can severely impact experiments' workflows and have taken weeks, months and even years to get addressed!
  • perfSONAR provides a number of standard metrics we can use
    • Latency, Bandwidth and Traceroute
    • These measurements are critical for network visibility
  • Without measuring our complex, global networks we wouldn’t be able to reliably use those networks to do science
