1 of 23

WLCG Network Monitoring and Analytics Updates

Shawn McKee / U Michigan, Marian Babik / CERN

Petya Vasileva / U Michigan, Ilija Vukotic / U Chicago

on behalf of WLCG Network Throughput WG

2024

Spring 2024 HEPiX, Paris, France

https://indico.cern.ch/event/1377701/timetable/#20240417

2 of 23

Outline

  • Infrastructure Status and Updates
  • DC24 Review
  • Network Analytics


HEPiX, Spring 2024

3 of 23

perfSONAR News

  • perfSONAR 5.0.8 is the latest release
    • A number of bug fixes since 5.0
    • Weekly meetings with the developers
    • Update campaign in WLCG
      • Various issues, mostly archiving, but also e.g. legacy limits configuration (fix)
    • Toolkit support for latest CC7 and Alma/Rocky 8 and 9 compatible systems (Alma).
      • Sites should plan to update by June (end-of-life for CentOS7)
  • 5.1 Beta Release
    • New Grafana interface, replacing the toolkit and MaDDash graphs
    • Threaded iperf3 support
    • Enhanced instrumentation, better troubleshooting of archiving issues
    • OS support: Alma/Rocky 9, Debian 11/12, Ubuntu 20/22 (updated docker)
    • No support for CentOS 7, Debian 10, Ubuntu 18


4 of 23

Network Measurement Platform Status

  • Our platform collects, stores, configures and transports all network metrics
  • Evolution based on perfSONAR 5 is already partially implemented
    • Now directly publishing results from perfSONARs to ES@UC
    • Collector used only as a fallback
    • WLCG CRIC now used for topology


[Diagram: platform components - Collector, Store (long-term), Store (short-term), pS Monitoring, pS Configuration, Tape, Experiments/Sites/NRENs, pS Dashboard, HTTParchiver]


5 of 23

Network Measurement Platform Plans

  • Evolution based on perfSONAR 5 is already partially implemented
    • Forwarding to UNL and backup to FNAL are still to be implemented
    • pS Monitoring - update to latest Checkmk and enable SSO authentication
    • pS Dashboard - integrate with Analytics Platform/Grafana (retire MaDDash)
    • pS Configuration - clarify development roadmap and support


[Diagram: planned platform components - Collector, Store (long-term), Network Analytics Platform, pS Monitoring, pS Configuration, Tape, Experiments/Sites/NRENs, pS Dashboard, HTTParchiver]


6 of 23

perfSONAR Infrastructure

  • Most of the infrastructure is already on 5.0.8, but only a small fraction runs on EL8/9; a significant fraction is still on CentOS 7
    • We still have some perfSONAR 4.0 instances!
  • Core deployments still run at 10Gbps; about 17% of the infrastructure now has 100Gbps (a few hosts at 200Gbps)
    • For WLCG/OSG testing purposes 10Gbps is still sufficient
    • Important to refresh HW along with the update to 5.1
  • MTU - around 40% use jumbo frames (9000), the rest use standard frames (1500)
  • We have a small testbed of about 10 perfSONARs with BBRv3 enabled
    • Enables testing of TCP congestion control algorithm benefits and jumbo frame trade-offs
    • Open for participation


7 of 23

DC24


8 of 23

WLCG DC24

WLCG Data Challenge 2024 (DC24) took place in February 2024, targeting 25% of HL-LHC

Our DC24 plans included the following:

  • Update and utilize perfSONAR to clean up links and fix problems before DC24.
  • Instrument and document site networks, for at least our largest sites.
  • Network planning: we need to make sure our sites and their local and regional networks are aware of our requirements and timeline, and are planning appropriately.
  • IPv6 should be enabled everywhere, not just because of packet marking, but because it will allow us to get back to a single stack sooner!


9 of 23

psDash Network Status

Splits metrics into infrastructure (pS issues) and network-related (e.g. BW drop)

Classifies metrics into critical, warning and ok, and aggregates them into a status


Network Status dashboard - part of Network Analytics platform - shows network performance based on perfSONAR measurements. Status (ok/warning/critical/unknown) aggregates network and infrastructure metrics.
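
The classification and aggregation described above could be sketched roughly as follows. This is a minimal illustration, not the psDash implementation: the metric names, the thresholds and the worst-status-wins rule are all assumptions made for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    """Severity levels, ordered so that max() returns the worst one."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


# Hypothetical (warning, critical) thresholds; a value at or above a
# threshold triggers that level. Names and numbers are illustrative only.
THRESHOLDS = {
    "packet_loss_pct": (0.5, 2.0),       # network related
    "bandwidth_drop_pct": (30.0, 60.0),  # network related
    "missing_tests_pct": (20.0, 50.0),   # infrastructure (pS issue)
}


def classify(name: str, value: float) -> Status:
    """Map one metric value to ok / warning / critical."""
    warning, critical = THRESHOLDS[name]
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.OK


def aggregate(metrics: dict) -> str:
    """Return the worst status across all metrics, or 'unknown' if none."""
    if not metrics:
        return "unknown"
    return max(classify(n, v) for n, v in metrics.items()).name.lower()


print(aggregate({"packet_loss_pct": 0.1, "bandwidth_drop_pct": 45.0}))
```

Taking the worst level keeps a single critical metric from being hidden by many healthy ones; a production system might weight infrastructure and network metrics differently.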


10 of 23

Site Network Utilisation

Grafana dashboard showing inbound/outbound network utilisation based on the data exposed by sites (SNMP counters exposed via a JSON API)


Site Network Utilisation - computed from aggregated utilisation (SNMP counters) provided by sites via a simple API. Screenshot shows network utilisation during DC24 as seen by the sites.
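
As a rough sketch of how a rate can be derived from SNMP octet counters: take the counter difference between two samples, convert bytes to bits and divide by the sampling interval. The JSON field names (`timestamp`, `in_octets`, `out_octets`) are hypothetical and do not reflect the actual site API; 64-bit counters are assumed for the wrap handling.

```python
import json

COUNTER_MAX = 2**64  # assumption: 64-bit counters (e.g. ifHCInOctets)


def rate_gbps(prev: dict, curr: dict, key: str) -> float:
    """Average rate in Gbps between two counter samples."""
    dt = curr["timestamp"] - prev["timestamp"]
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    # Modulo arithmetic absorbs a single counter wrap between samples.
    delta_octets = (curr[key] - prev[key]) % COUNTER_MAX
    return delta_octets * 8 / dt / 1e9


# Made-up payload in the hypothetical format: two samples 60 s apart.
samples = json.loads("""
[
  {"timestamp": 1000, "in_octets": 0, "out_octets": 0},
  {"timestamp": 1060, "in_octets": 75000000000, "out_octets": 15000000000}
]
""")

inbound = rate_gbps(samples[0], samples[1], "in_octets")
outbound = rate_gbps(samples[0], samples[1], "out_octets")
print(f"in: {inbound:.1f} Gbps, out: {outbound:.1f} Gbps")
```

Dividing the result by the link capacity would give a utilisation fraction; the capacity is not part of this hypothetical payload.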


11 of 23

Analytics


12 of 23

Alarms & Alerts Interface


Two main improvements are needed: acknowledging alerts that are being worked on, and adding user notification mailing lists

https://psa.osg-htc.org

(Uses eduGAIN/InCommon)

Purpose: provides user-subscribable alerting for specific types of network issues found by analyzing perfSONAR data


13 of 23

Subscription Interface


14 of 23

Alarm Types and Relation to perfSONAR Data


15 of 23

psDash Alarms Dashboard


16 of 23

Network Analytics R&D

  • Investigate ML models/methods to process network measurements
  • Data pre-processing, e.g.
    • Train neural networks to predict network paths, e.g. to help us fill the gaps in traceroute(s)
  • Build model(s) that represent our network(s)
    • Network measurements are inherently noisy and therefore require robust models

  • Use ML models for anomaly detection (for alerts & alarms)
    • Neural networks (which ones?), Bayesian/probabilistic approaches
    • Detect anomalies in network paths and bandwidth measurements
    • Compare with the existing heuristic algorithms that we have developed
  • Correlate with other data
    • Traceroutes with throughput for example, but also outside of perfSONAR, e.g. FTS
    • New types of data appearing (high-touch, scitags, in-band telemetry, etc.)
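
To make the idea of filling gaps in traceroutes concrete, here is a deliberately simple heuristic stand-in. It is not the neural network approach mentioned above: it fills a missing hop only when previously observed paths agree on a single router between the same two neighbours. The function name, the path representation (lists with `None` for unresponsive hops) and the example addresses are assumptions for illustration.

```python
def repair_path(path, reference_paths):
    """Fill None hops when reference paths agree on both neighbours."""
    repaired = list(path)
    for i, hop in enumerate(path):
        # Only interior gaps can be checked against two neighbours.
        if hop is not None or i == 0 or i == len(path) - 1:
            continue
        prev_hop, next_hop = path[i - 1], path[i + 1]
        if prev_hop is None or next_hop is None:
            continue  # consecutive gaps are left unresolved
        candidates = {
            ref[j]
            for ref in reference_paths
            for j in range(1, len(ref) - 1)
            if ref[j - 1] == prev_hop
            and ref[j + 1] == next_hop
            and ref[j] is not None
        }
        if len(candidates) == 1:  # fill only when unambiguous
            repaired[i] = candidates.pop()
    return repaired


observed = ["192.0.2.1", None, "192.0.2.3", "192.0.2.4"]
history = [["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]]
print(repair_path(observed, history))
```

A learned model could in principle also handle consecutive gaps and ambiguous cases, which this heuristic leaves untouched.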


[Figures: Uncertainty in traceroute data; Repairing the path]


17 of 23

Plans for the Analytics Platform

  • Put the anomaly detection based on Bayesian inference into production
    • Uses RTT, traceroutes, TTLs as input and detects anomalies
  • Continue working on the neural network models that correlate throughputs and traceroutes
    • Generating a real-world model of our entire network (all routers)
    • Not only detecting anomalies, but also trying to pinpoint the location of the issue

  • Improve infrastructure alarming to the point where we can reliably differentiate infrastructure and network issues
  • Network availability dashboard in production
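
For readers unfamiliar with the approach, Bayesian anomaly detection on RTT can be illustrated with a two-hypothesis toy model. The slides do not describe the production model, so everything here is an assumption: the Gaussian likelihoods, their parameters, the prior, and the treatment of successive RTT samples as independent.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posterior_anomalous(rtts_ms, prior=0.01,
                        normal=(20.0, 2.0), anomalous=(40.0, 10.0)):
    """Sequentially update P(anomalous | RTTs) with Bayes' rule.

    `normal` and `anomalous` are hypothetical (mean, std) pairs in ms.
    """
    p = prior
    for rtt in rtts_ms:
        like_a = gaussian_pdf(rtt, *anomalous)
        like_n = gaussian_pdf(rtt, *normal)
        evidence = like_a * p + like_n * (1 - p)
        if evidence == 0.0:
            continue  # sample is implausible under both hypotheses
        p = like_a * p / evidence
    return p


print(posterior_anomalous([20.0, 21.0, 19.0]))  # stays close to zero
print(posterior_anomalous([38.0, 42.0, 45.0]))  # moves close to one
```

In practice the "normal" parameters would be estimated per path from historical measurements, and traceroute and TTL changes would contribute additional evidence.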


18 of 23

Summary

  • Updates to perfSONAR and OSG/WLCG network measurement platform
    • perfSONAR 5.1 is coming with new features and will require all sites to update their OS.
    • Plan to adapt the network measurement platform to benefit from changes in 5.1
  • Ongoing efforts in network analytics and ML methods for our data
    • Focus on pre-processing (gaps, predictive models) and anomaly detection
    • Opportunity to collaborate on models and data sets
  • We are preparing monthly meetings with site network teams:
    • Discuss how sites are deploying, managing and planning for WLCG networking requirements
    • Next meeting: April 18th, 10am EST (to join, mail wlcg-site-net-requests@umich.edu)
  • We have to continue to watch our network monitoring infrastructure as it is a complex system with lots of areas for issues to develop.

Questions / Discussion?


19 of 23

Acknowledgements

We would like to thank the WLCG, HEPiX, perfSONAR and OSG organizations for their work on the topics presented.

In addition, we want to explicitly acknowledge the National Science Foundation, which supported this work via:

  • OSG: NSF MPS-1148698
  • IRIS-HEP: NSF OAC-1836650


20 of 23

Useful URLs


21 of 23

Backup Slides Follow


22 of 23

WLCG Network Throughput Support Unit

Support channel where sites and experiments can report potential network performance incidents:

  • Relevant sites and (N)RENs are notified, and the perfSONAR infrastructure is used to narrow down the problem to particular link(s) and segment(s). Past incidents are also tracked.
  • Feedback to WLCG operations and LHCOPN/LHCONE community

Most common issues: MTU, MTU+Load Balancing, routing (mainly remote sites), site equipment/design, firewall, workloads causing high network usage

As there is no consensus on the MTU to be recommended on the segments connecting servers and clients, an LHCOPN/LHCONE working group was established to investigate and produce a recommendation.


23 of 23

Importance of Measuring Our Networks

  • End-to-end network issues are difficult to spot and localize
    • Network problems are multi-domain, complicating the process
    • Performance issues involving the network are complicated by the number of components involved end-to-end
    • Standardizing on specific tools and methods focuses resources more effectively and provides better self-support.
  • Network problems can severely impact experiments' workflows and have taken weeks, months and even years to get addressed!
  • perfSONAR provides a number of standard metrics we can use
    • Latency, Bandwidth and Traceroute
    • These measurements are critical for network visibility
  • Without measuring our complex, global networks we wouldn’t be able to reliably use those networks to do science
