OSG / PATh Staff Meeting
OSG Networking Update
October 23, 2024
Shawn McKee reporting for the extended team:
Marian Babik, Ilija Vukotic and Petya Vasileva
Overview
Shawn McKee - OSG/PATh Staff Mtg
2
10/23/2024
Infrastructure: Status Summary
The OSG/WLCG Global Network Pipeline
The current version of the pipeline is shown in the graphic below and features the use of the HTTParchiver to get data into Elasticsearch.
Pipeline graphic annotations: “Only for data recovery”; “Send from UC Logstash”; “Backup Nebraska EL to FNAL tape to be configured”.
OSG Networking Infrastructure Summary
perfSONAR Infrastructure Monitoring
perfSONAR (Re)-Deployment Plans
As noted, many of our sites are failing to operate consistently over time.
This leads to missing data and a lack of coverage for significant parts of our global networks.
We (the WLCG/OSG-PATh network monitoring team) believe that one contributing issue is the co-deployment of Opensearch for the measurement archive.
We are exploring a new option for our sites: deploy a perfSONAR testpoint container
To do this we need some minor fixes from the perfSONAR team so that the JSON metadata is still available for a testpoint deployment and mesh info is propagated.
This needs testing, and we plan to validate this new option in the coming months.
WLCG Network Data Challenges
WLCG “Network” Data Challenges
A planned series of end-to-end tests to demonstrate the progress of infrastructure, applications and middleware of WLCG sites and experiments to reach the capacity and capability required for HL-LHC scale operations.
These data challenges provide many benefits, allowing sites, networks and experiments to evaluate their progress, motivate and validate their developments in hardware and software and show readiness of technologies at suitable scale.
After DC24 (Feb 2024) we had a lot of data to examine and correlate. Many site issues were identified, along with shortcomings in some central data distribution components such as FTS, Rucio and the associated monitoring.
We have new plans to enable “mini-challenges” that won’t require involvement of experts to allow us to “benchmark” parts of our infrastructure.
DC26/27 Network Planning
The next WLCG data challenge will likely be in late 2026 or early 2027 (DC26/27) and we have a number of targets to work on.
OSG-LHC/IRIS-HEP Planning
At the IRIS-HEP retreat in September 2024, we discussed how to prepare for DC26
As mentioned, mini-challenges are an important tool that we want to enable
Goals for the next DC:
Rough plan:
Will report on this at the upcoming WLCG DOMA meeting
Upgrades and Updates
perfSONAR 5.1.4
Recently released (Oct 17th); about one third of perfSONARs are already on 5.1.4. It includes additional fixes for issues identified by OSG-LHC.
Fixes for configuration and error handling to increase robustness of toolkits
Main features (5.1):
We really need a testpoint deployment container with this version that adds in some needed monitoring and configuration components. We are working with the perfSONAR developers on this.
perfSONAR Dashboard Updates
perfSONAR Dashboard Traceroute Mesh
Dashboard Large Mesh Example
Scitags Updates (Flow & Packet Marking)
The SciTags Initiative has been working on making R&E network traffic “visible” anywhere in the network. We have a few updates since the last presentation.
Details are available in the recent LHCONE/LHCOPN presentation of October 11, 2024
Updated “flowd” deployment containers (see https://github.com/scitags/flowd ) with support for Alma8, RHEL8 and RHEL9.
ESnet monitoring of fireflies: https://dashboard.stardust.es.net/goto/RrVzQwLIg?orgId=2
EOS, StoRM and Xrootd are currently SciTags-capable, and we are working on dCache
New dCache golden release (10.2) has a POC version of fireflies incorporated
It needs to use the Owner/Activity from FTS to be functional, and we are discussing plans with the dCache team
We have a couple demonstrations coming up at Supercomputing 2024 in November
Target reminder: have 80% of ALL production traffic labeled by DC26
Analytics: Finding and Alerting on Problems
Reminder - Alarming: Why and What?
Alarming for us is defined as the identification of some issue that needs fixing.
An alarm:
We have been exploring various algorithms to define alarms of interest.
Since it can be very difficult to identify certain types of issues from our complex data, we are trying various Artificial Intelligence (AI) & Machine Learning (ML) techniques.
As reported last time, we recruited an IRIS-HEP fellow (Yana Holoborodko) to work on this and we will report on her project results below.
Student Summer Project
For 2024, we have identified one student to work with us on Alerting & Alarming:
Yana had a challenging task: she not only had to learn about our tools, data formats and previous results, but also had to come up to speed on the basics of networking, data transfers and how they can go wrong!
She completed her project at the beginning of October and reported on it at a Fellows talk on October 7, 2024.
We may have the option to provide a follow-on Fellowship at CERN for 6-9 months!
Analytics Work
Continuing Analytics Work
There are a number of areas of work ongoing in analyzing our network data.
To support this, we have also been working on determining what is “normal” for our measurements along certain paths or between specific site pairs, i.e., how our measurements behave when everything is working as expected, so that we can identify problems more accurately.
The update for this time is really covered in Petya’s CHEP presentation (given TODAY)
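As a rough illustration of what establishing a “normal” baseline for a site pair could look like, here is a minimal sketch. It is not the team's actual method: the robust median/MAD statistic, the function names (`baseline`, `is_anomalous`), the threshold `k` and the sample throughput values are all assumptions chosen for illustration.

```python
from statistics import median

def baseline(history):
    """Return (median, MAD) of past measurements for one site pair.

    MAD is the median absolute deviation, a spread estimate that is
    not easily distorted by a few bad measurements.
    """
    med = median(history)
    mad = median([abs(x - med) for x in history])
    return med, mad

def is_anomalous(value, history, k=5.0):
    """Flag a new measurement that lies far from the historical baseline.

    k is a hypothetical threshold in units of scaled MAD.
    """
    med, mad = baseline(history)
    if mad == 0:
        # No spread in the history: any different value stands out.
        return value != med
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    # for normally distributed data.
    return abs(value - med) / (1.4826 * mad) > k

# Illustrative (made-up) throughput history in Gbps for one site pair.
history = [9.1, 9.4, 9.0, 9.3, 9.2, 9.5, 9.1]
print(is_anomalous(9.0, history))  # within the usual range
print(is_anomalous(3.0, history))  # far below the baseline
```

A median-based baseline is used here only because occasional failed tests would skew a mean; any other definition of “normal” could be substituted.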
CHEP 2024 Presentation on Network Analytics
Petya Vasileva is presenting our recent work on analytics at CHEP 2024 in Krakow today.
Details are available in the presentation
The core of the work is on correctly identifying the network path and correlating it with our metrics
The new methodology uses AI/ML to identify missing components on the network path
This is critical for then correlating problems with particular paths/locations.
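To make the idea of “missing components on the network path” concrete, the sketch below fills unanswered traceroute hops from previously seen complete traces of the same path. This is a simple lookup heuristic swapped in for illustration, not the AI/ML methodology described in the presentation; the function name `fill_missing_hops`, the list-of-hops data layout and the example addresses are assumptions.

```python
def fill_missing_hops(trace, references):
    """Fill unanswered hops (None) in a traceroute from reference traces.

    A hop is filled only when every reference trace of the same length
    whose neighbouring hops match agrees on a single value; otherwise
    the gap is left as None.
    """
    filled = list(trace)
    last = len(trace) - 1
    for i, hop in enumerate(trace):
        if hop is not None:
            continue
        candidates = set()
        for ref in references:
            if len(ref) != len(trace) or ref[i] is None:
                continue
            # A neighbour matches if it is off the end of the path, or
            # if it is known in the trace and equal to the reference.
            left_ok = i == 0 or (
                trace[i - 1] is not None and trace[i - 1] == ref[i - 1]
            )
            right_ok = i == last or (
                trace[i + 1] is not None and trace[i + 1] == ref[i + 1]
            )
            if left_ok and right_ok:
                candidates.add(ref[i])
        if len(candidates) == 1:
            filled[i] = candidates.pop()
    return filled

# Illustrative addresses from documentation ranges, not real routers.
known = [["10.0.0.1", "192.0.2.1", "192.0.2.9", "198.51.100.7"]]
partial = ["10.0.0.1", None, "192.0.2.9", "198.51.100.7"]
print(fill_missing_hops(partial, known))
```

Leaving ambiguous gaps unfilled is a deliberately conservative choice, since a wrongly inferred hop would mislead any later correlation of problems with locations.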
Outreach & Plans
Outreach: LHCONE/LHCOPN Workshop and Regular Meetings
We had two sessions at the LHCONE/LHCOPN meeting this fall in Beijing:
There are also two Supercomputing 2024 demos for SciTags demonstrating hop-by-hop and 400Gbps packet marking
In 2 weeks we will also present and lead discussions at HEPiX in Oklahoma
Our new meeting series supporting WLCG site networking on the 3rd Thursday of each month has had 5 presentations and 4 meetings (notes/agenda)
We continue our regular network analytics discussion meetings on Thursdays (except the 3rd Thursday) (notes/agenda)
Near Term Work Areas
Summary and Conclusion
The collaboration of OSG, WLCG and various research projects has created an extensive infrastructure to monitor our networks via perfSONAR and provide associated analytics and visualization.
Questions or Comments?
Acknowledgements
We would like to thank the WLCG, HEPiX, perfSONAR and OSG organizations for their work on the topics presented.
In addition we want to explicitly acknowledge the support of the National Science Foundation which supported this work via:
Useful URLs to Reference