perfSONAR Monitoring Update
Shawn McKee / U Michigan, Marian Babik / CERN
on behalf of WLCG Network Throughput WG
2024
At the #52 LHCOPN/LHCONE Meeting, Catania, Italy
https://indico.cern.ch/event/1349135/
Outline
#52 LHCOPN/LHCONE Mtg
perfSONAR News
Network Measurement Platform Status
Collector
Store (long-term)
Store (short-term)
pS Monitoring
pS Configuration
Tape
Experiments, Sites, NRENs
pS Dashboard
HTTParchiver
Network Measurement Platform Plans
Collector
Store (long-term)
Network Analytics Platform
pS Monitoring
pS Configuration
Tape
Experiments, Sites, NRENs
pS Dashboard
HTTParchiver
pS Dashboard
perfSONAR Infrastructure
DC24
WLCG DC24
WLCG Data Challenge 2024 took place in Feb 2024, targeting 25% of HL-LHC
Our DC24 plans included the following:
psDash Network Status
Splits metrics into Infrastructure (pS issues) and Network related (e.g. BW drop)
Classifies metrics into critical, warning and ok and aggregates them into status
Network Status dashboard - part of Network Analytics platform - shows network performance based on perfSONAR measurements. Status (ok/warning/critical/unknown) aggregates network and infrastructure metrics.
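The classification and aggregation described above could be sketched as follows. This is a minimal illustration, not the psDash implementation: the metric names, the thresholds and the worst-of aggregation rule are all assumptions made for the example.

```python
# Hypothetical severity ordering; "unknown" is handled separately.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def classify(value, warning_at, critical_at):
    """Classify one metric where larger values are worse (e.g. packet loss %)."""
    if value is None:
        return "unknown"
    if value >= critical_at:
        return "critical"
    if value >= warning_at:
        return "warning"
    return "ok"


def aggregate(statuses):
    """Worst-of aggregation; 'unknown' only if nothing could be classified."""
    known = [s for s in statuses if s in SEVERITY]
    if not known:
        return "unknown"
    return max(known, key=SEVERITY.__getitem__)


def site_status(metrics):
    """metrics: {name: (group, value, warning_at, critical_at)}.

    Returns (status per group, overall status).
    """
    per_group = {}
    for group, value, warning_at, critical_at in metrics.values():
        per_group.setdefault(group, []).append(
            classify(value, warning_at, critical_at)
        )
    groups = {g: aggregate(s) for g, s in per_group.items()}
    return groups, aggregate(groups.values())


# Illustrative input only: names and thresholds are made up.
example_metrics = {
    "packet_loss_pct": ("network", 2.5, 1.0, 5.0),
    "bandwidth_drop_pct": ("network", 10.0, 30.0, 60.0),
    "failed_tests_pct": ("infrastructure", None, 20.0, 50.0),
}
groups, overall = site_status(example_metrics)
```

Keeping the infrastructure and network groups separate before the final aggregation mirrors the split on the slide: a perfSONAR host problem can then be told apart from a real network problem.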
Site Network Utilisation
Grafana dashboard showing inbound/outbound network utilisation based on the data exposed by sites (SNMP counters exposed via a JSON API)
Site Network Utilisation - computed from aggregated utilisation (SNMP counters) provided by sites via simple API. Screenshot shows network utilisation during DC24 as seen by the sites.
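As a rough illustration of how utilisation can be derived from SNMP octet counters, the sketch below turns two counter samples into average bit rates. The JSON field names (`timestamp`, `in_octets`, `out_octets`) are hypothetical and do not describe the actual site API.

```python
import json


def utilisation_bps(prev, curr, counter_bits=64):
    """Average bits/s between two samples of monotonically increasing octet counters."""
    dt = curr["timestamp"] - prev["timestamp"]
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    modulus = 2 ** counter_bits
    rates = {}
    for field in ("in_octets", "out_octets"):
        # The modulo handles a single counter wrap between the two samples.
        delta = (curr[field] - prev[field]) % modulus
        rates[field.replace("_octets", "_bps")] = delta * 8 / dt
    return rates


# Two made-up samples, 60 seconds apart, as a site might publish them.
prev_sample = json.loads('{"timestamp": 0, "in_octets": 1000000, "out_octets": 500000}')
curr_sample = json.loads('{"timestamp": 60, "in_octets": 76000000, "out_octets": 8000000}')
rates = utilisation_bps(prev_sample, curr_sample)
# rates == {"in_bps": 10000000.0, "out_bps": 1000000.0}
```

A dashboard would typically compute this for each consecutive pair of samples and sum over the interfaces a site reports.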
Analytics
Alarms & Alerts Interface
Two main improvements needed: Acknowledging alerts that are being worked on and adding user notification mailing lists
(Uses eduGAIN/InCommon)
Purpose: provides user-subscribable alerting for specific types of network issues found by analyzing perfSONAR data
Subscription Interface
Alarm Types and Relation to perfSONAR Data
psDash Alarms Dashboard
Network Analytics R&D
Plans for the Analytics Platform
Summary
Questions / Discussion?
Acknowledgements
We would like to thank the WLCG, HEPiX, perfSONAR and OSG organizations for their work on the topics presented.
In addition, we want to explicitly acknowledge the support of the National Science Foundation, which supported this work via:
Useful URLs
Backup Slides Follow
Tools and Applications for Network Data
Alarms & Alerts Service
(Uses eduGAIN/InCommon)
Purpose: provides user-subscribable alerting for specific types of network issues found by analyzing perfSONAR data
Alarms & Alerts Service
The Alerting and Alarming Tools Subscription Interface
Alarm Types and Relation to perfSONAR Data
psDash (perfSONAR Dashboard)
WLCG perfSONAR Path Statistics
We uniquely identify each traceroute (route IP path) with a SHA1 hash.
Link = "hop" (IP-to-IP)
Node = "router"
The statistics on the left cover all the "paths" we are tracking; about 20K unique paths were found.
About 50% of src-dest pairs have 4 or fewer paths.
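Identifying a route by a SHA1 hash of its IP path, and splitting it into links (hops) and nodes (routers), could look like the sketch below. How the hops are serialised before hashing is an assumption here (a comma-joined string); the addresses come from the documentation range 192.0.2.0/24.

```python
import hashlib


def path_hash(hops):
    """SHA1 hex digest of an ordered list of hop IP addresses."""
    joined = ",".join(hops)  # serialisation format is an assumption
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def links(hops):
    """IP-to-IP links along the path, in order."""
    return list(zip(hops, hops[1:]))


route_a = ["192.0.2.1", "192.0.2.10", "192.0.2.20"]
route_b = ["192.0.2.1", "192.0.2.11", "192.0.2.20"]

# Identical hop lists give the same identifier; any changed hop gives a new one.
same = path_hash(route_a) == path_hash(list(route_a))
different = path_hash(route_a) != path_hash(route_b)

# Number of distinct paths seen for one src-dest pair.
unique_paths = len({path_hash(r) for r in (route_a, route_b, route_a)})
```

Because hop order is part of the hashed string, two routes through the same routers in a different order count as different paths.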
AS (Autonomous System) Path Changed
ASN sequence (per hop) -> Reduced ASNs
[7896, 7896, 7896, 7896, 57, 57, 57, 293, 293, 293, 293, 293, 293, 43] -> [7896, 57, 293, 43]
[7896, 7896, 293, 293, 293, 293, 293, 293, 43] -> [7896, 293, 43]
[7896, 7896, 293, 293, 293, 293, 293, 293] -> [7896, 293]
Baseline: [7896, 293, 43]
ASN not in baseline: 57
Path used in:
[7896, 293, 43]: 99.3%
[7896, 293]: 0.3%
[7896, 57, 293, 43]: 0.3%
NOTE: Paths denoted by route IP are too noisy; instead use AS number
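The reduction from per-hop ASN sequences to AS paths, and the comparison against a baseline, can be sketched as follows. Taking the most frequently observed reduced path as the baseline is an assumption made for this example; the slide only shows that the baseline is the dominant path.

```python
from collections import Counter
from itertools import groupby


def reduce_asns(asn_sequence):
    """Collapse runs of consecutive identical ASNs into a single entry."""
    return [asn for asn, _ in groupby(asn_sequence)]


def baseline_and_changes(asn_sequences):
    """Return (baseline path, path counts, ASNs differing from baseline per path)."""
    reduced = [tuple(reduce_asns(seq)) for seq in asn_sequences]
    counts = Counter(reduced)
    baseline, _ = counts.most_common(1)[0]  # assumption: most frequent path
    changes = {
        path: sorted(set(path) ^ set(baseline))
        for path in counts
        if path != baseline
    }
    return baseline, counts, changes


# ASN sequences from the slide; the repetition counts are made up.
via_57 = [7896, 7896, 7896, 7896, 57, 57, 57, 293, 293, 293, 293, 293, 293, 43]
usual = [7896, 7896, 293, 293, 293, 293, 293, 293, 43]
truncated = [7896, 7896, 293, 293, 293, 293, 293, 293]

baseline, counts, changes = baseline_and_changes([usual] * 3 + [via_57, truncated])
```

For the path through AS 57 the difference from the baseline is `[57]`; for the truncated path it is `[43]`, the ASN that is missing.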
Example: LHCOPN/LHCONE Load Balancing
Example: LHCOPN Alternate via ESnet
Example: FNAL Incident (BW drop)
Example: Fail-over to Commodity Network (HURRICANE)
Challenges and Ongoing Work
IPv4
IPv6
Paths differ significantly
Correlating Tests with Paths: Two Timescales
Connecting Throughput to Traceroute
Our starting choice: use both tracepaths (the one just before and the one just after the throughput test) as valid paths and attribute the BW to both.
Still to be checked: whether this is superior to using only the last route measured before the throughput test.
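The starting choice described above, attributing each throughput result to the tracepaths measured just before and just after it, might be sketched like this. The data layout (timestamped tuples and string path identifiers) is hypothetical.

```python
from bisect import bisect_right


def bracketing_paths(traceroutes, t_test):
    """traceroutes: list of (timestamp, path_id) sorted by timestamp.

    Returns the path ids measured just before and just after t_test
    (None where no such traceroute exists).
    """
    times = [t for t, _ in traceroutes]
    i = bisect_right(times, t_test)
    before = traceroutes[i - 1][1] if i > 0 else None
    after = traceroutes[i][1] if i < len(traceroutes) else None
    return before, after


def attribute(traceroutes, throughput_tests):
    """Attribute each (timestamp, mbps) test to both bracketing paths."""
    by_path = {}
    for t, mbps in throughput_tests:
        # A set avoids double counting when both traceroutes saw the same path.
        for path_id in set(bracketing_paths(traceroutes, t)) - {None}:
            by_path.setdefault(path_id, []).append((t, mbps))
    return by_path


# Made-up data: the route changes from path "A" to path "B" between t=600 and t=1200.
traceroutes = [(0, "A"), (600, "A"), (1200, "B")]
tests = [(300, 900.0), (900, 400.0)]
by_path = attribute(traceroutes, tests)
```

The second test falls between a traceroute on path A and one on path B, so it is attributed to both. Using only the `before` value would implement the alternative mentioned above.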
Attaching Throughput Results to Sets of Routers/Links
Each colored box represents a specific router along the path
Example Throughput Attribution by Router
Each router on the path gets the closest (in time) throughput values
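One possible reading of "closest in time" is sketched below: for every traceroute, the nearest throughput test is found and its value is attached to each router on that path. The record layout and router names are hypothetical, and the list of tests is assumed to be non-empty and sorted by timestamp.

```python
from bisect import bisect_left


def closest_test(tests, t):
    """tests: non-empty list of (timestamp, mbps) sorted by timestamp."""
    times = [ts for ts, _ in tests]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = tests[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda test: abs(test[0] - t))


def throughput_by_router(traceroutes, tests):
    """traceroutes: list of (timestamp, [router, ...]).

    Returns {router: [(test timestamp, mbps), ...]}.
    """
    series = {}
    for t, routers in traceroutes:
        ts, mbps = closest_test(tests, t)
        for router in routers:
            series.setdefault(router, []).append((ts, mbps))
    return series


# Made-up example: r2 is replaced by r3 between the two traceroutes.
tests = [(100, 900.0), (700, 400.0)]
traceroutes = [(90, ["r1", "r2"]), (650, ["r1", "r3"])]
series = throughput_by_router(traceroutes, tests)
```

A router that appears on many traceroutes can receive the same test more than once; whether to de-duplicate such entries is left open in this sketch.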
Checking Router Results vs Time
Initial Example Result: One Router; Throughput vs Time
Each point represents the throughput values collected when the node was on the path
(Plot axis label: throughput in Mb/s)
Other Activities / Plans
Working to organize and annotate our data for ML/AI work (Petya Vasileva)
Working with the RNTWG (see previous RNTWG update talk) on identifying and monitoring network traffic details via the SciTags initiative.
Exploring other network monitoring activities in the perfSONAR space including ARGUS
Planning to augment WLCG-CRIC (yesterday’s discussion) with network metadata (which paths/networks are LHCOPN / LHCONE / Research & Education / Commercial)
Distributions of Throughput
WLCG Network Throughput Support Unit
Support channel where sites and experiments can report potential network performance incidents:
Most common issues: MTU, MTU+Load Balancing, routing (mainly remote sites), site equipment/design, firewall, workloads causing high network usage
As there is no consensus on the MTU to be recommended on the segments connecting servers and clients, an LHCOPN/LHCONE working group was established to investigate and produce a recommendation. (See coming talk :) )
Importance of Measuring Our Networks