WLCG Network Monitoring and Analytics Updates
Shawn McKee / U Michigan, Marian Babik / CERN
Petya Vasileva / U Michigan, Ilija Vukotic / U Chicago
on behalf of WLCG Network Throughput WG
2024
Spring 2024 HEPiX, Paris, France
https://indico.cern.ch/event/1377701/timetable/#20240417
Outline
HEPiX, Spring 2024
perfSONAR News
Network Measurement Platform Status
[Diagram labels: Collector; Store (long-term); Store (short-term); pS Monitoring; pS Configuration; Tape; Experiments, Sites, NRENs; pS Dashboard; HTTParchiver]
Network Measurement Platform Plans
[Diagram labels: Collector; Store (long-term); Network Analytics Platform; pS Monitoring; pS Configuration; Tape; Experiments, Sites, NRENs; pS Dashboard; HTTParchiver]
perfSONAR Infrastructure
DC24
WLCG DC24
WLCG Data Challenge 2024 (DC24) took place in February 2024, targeting 25% of the expected HL-LHC data rates
Our DC24 plans included the following:
psDash Network Status
Splits metrics into Infrastructure (pS issues) and Network-related (e.g. BW drop)
Classifies metrics into critical, warning and ok, and aggregates them into a status
Network Status dashboard - part of Network Analytics platform - shows network performance based on perfSONAR measurements. Status (ok/warning/critical/unknown) aggregates network and infrastructure metrics.
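The classify-then-aggregate idea described above can be sketched as follows. This is a minimal illustration, not the psDash implementation: the threshold values, the "higher is worse" convention, the worst-of aggregation rule and the names `classify`, `aggregate` and `site_status` are all assumptions made for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    """Severity levels, ordered so that max() picks the worst one."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify(value, warning, critical):
    """Classify one metric value; assumes higher values are worse."""
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.OK


def aggregate(statuses):
    """Worst-of aggregation; 'unknown' when no metrics are available."""
    statuses = list(statuses)
    if not statuses:
        return "unknown"
    return max(statuses).name.lower()


def site_status(infrastructure, network):
    """Aggregate the two metric groups separately and together."""
    infrastructure = list(infrastructure)
    network = list(network)
    return {
        "infrastructure": aggregate(infrastructure),
        "network": aggregate(network),
        "overall": aggregate(infrastructure + network),
    }


# Hypothetical thresholds: packet loss fraction and relative bandwidth drop.
packet_loss = classify(0.03, warning=0.01, critical=0.05)
bw_drop = classify(0.6, warning=0.3, critical=0.5)
summary = site_status(infrastructure=[Status.OK], network=[packet_loss, bw_drop])
```

Keeping infrastructure and network statuses separate makes it possible to tell a failing perfSONAR host apart from a genuine network problem before looking at the combined value.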
Site Network Utilisation
Grafana dashboard showing inbound/outbound network utilisation based on data exposed by sites (SNMP counters exposed via a JSON API)
Site Network Utilisation - computed from aggregated utilisation (SNMP counters) provided by sites via simple API. Screenshot shows network utilisation during DC24 as seen by the sites.
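As an illustration of how utilisation can be derived from SNMP octet counters, the sketch below turns two counter samples into inbound and outbound bit rates. It assumes 64-bit counters that wrap at 2^64 and a hypothetical sample layout (`timestamp`, `in_octets`, `out_octets`); the actual JSON schema published by sites is not given in the slides and may differ, for example by reporting rates directly.

```python
COUNTER64_MODULUS = 2 ** 64  # assumed 64-bit octet counters


def rate_bps(prev_octets, curr_octets, interval_s, modulus=COUNTER64_MODULUS):
    """Average bit rate between two counter readings, tolerating one wrap."""
    if interval_s <= 0:
        raise ValueError("samples must be taken at increasing times")
    # Modular subtraction yields the right delta even if the counter wrapped.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


def utilisation(prev, curr):
    """Inbound/outbound rates from two samples (hypothetical field names)."""
    interval_s = curr["timestamp"] - prev["timestamp"]
    return {
        "in_bps": rate_bps(prev["in_octets"], curr["in_octets"], interval_s),
        "out_bps": rate_bps(prev["out_octets"], curr["out_octets"], interval_s),
    }


# Two samples taken 60 seconds apart.
previous = {"timestamp": 0, "in_octets": 1_000, "out_octets": 0}
current = {"timestamp": 60, "in_octets": 750_001_000, "out_octets": 75_000_000}
rates = utilisation(previous, current)
```

Summing such rates over all border interfaces of a site would give an aggregated view, subject to the same assumptions about the counters.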
Analytics
Alarms & Alerts Interface
Two main improvements needed: acknowledging alerts that are being worked on, and adding user notification mailing lists
(Uses eduGAIN/InCommon)
Purpose: provides user-subscribable alerting for specific types of network issues found by analyzing perfSONAR data
Subscription Interface
Alarm Types and Relation to perfSONAR Data
psDash Alarms Dashboard
Network Analytics R&D
noisy and therefore require robust models
Uncertainty in traceroute data
Repairing the path
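The slides do not describe how paths are repaired, so the following is only one simple way to illustrate the idea, not the working group's method. It assumes a traceroute path is a list of hop addresses with `None` for unresponsive hops, and fills a gap only when every previously seen reference path that agrees with the observed hops also agrees on the missing one. The function name `repair_path` and the example addresses are hypothetical.

```python
def repair_path(observed, reference_paths):
    """Fill unresponsive hops (None) when compatible references agree."""
    # A reference is compatible if it has the same length and matches
    # every hop that actually responded in the observed path.
    compatible = [
        ref for ref in reference_paths
        if len(ref) == len(observed)
        and all(o is None or o == r for o, r in zip(observed, ref))
    ]
    repaired = list(observed)
    for i, hop in enumerate(observed):
        if hop is None:
            candidates = {ref[i] for ref in compatible if ref[i] is not None}
            # Only repair when the references leave no ambiguity.
            if len(candidates) == 1:
                repaired[i] = candidates.pop()
    return repaired


observed = ["10.0.0.1", None, "192.0.2.9"]
references = [
    ["10.0.0.1", "198.51.100.7", "192.0.2.9"],
    ["10.0.0.1", "203.0.113.4", "192.0.2.50"],  # different destination hop
]
repaired = repair_path(observed, references)
```

Leaving ambiguous hops as `None` keeps the uncertainty visible instead of hiding it behind a guess, which matters when the repaired paths feed later models.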
Plans for the Analytics Platform
Summary
Questions / Discussion?
Acknowledgements
We would like to thank the WLCG, HEPiX, perfSONAR and OSG organizations for their work on the topics presented.
In addition, we want to explicitly acknowledge the support of the National Science Foundation, which supported this work via:
Useful URLs
Backup Slides Follow
WLCG Network Throughput Support Unit
Support channel where sites and experiments can report potential network performance incidents:
Most common issues: MTU, MTU+Load Balancing, routing (mainly remote sites), site equipment/design, firewall, workloads causing high network usage
As there is no consensus on the MTU to recommend for the segments connecting servers and clients, an LHCOPN/LHCONE working group was established to investigate and produce a recommendation.
Importance of Measuring Our Networks