1 of 29

OSG / PATh Staff Meeting

OSG Networking Update

October 23, 2024

Shawn McKee reporting for the extended team:

Marian Babik, Ilija Vukotic and Petya Vasileva

2 of 29

Overview

  • This is the eighth OSG/PATh presentation on networking for OSG/PATh staff and I will focus on activities and plans since the June 2024 presentation
  • There have been three high-level activities since last time:
    • Planning for our next steps in WLCG Data Challenges
    • Debugging and hardening the perfSONAR 5.1 deployment
    • Improving our network alerting and alarming
  • Please feel free to ask questions at any time.

Shawn McKee - OSG/PATh Staff Mtg

2

10/23/2024

3 of 29

Infrastructure: Status Summary


4 of 29

The OSG/WLCG Global Network Pipeline

The current version of the pipeline is shown in the graphic below and features the use of the HTTParchiver to get data into Elasticsearch.

  • We have a data recovery mode that still lets us use the collector.
  • Working on restoring the replication to the Nebraska ES and FNAL tape backups.


[Figure: pipeline diagram. Labels: “Only for data recovery”; “Send from UC Logstash”; “Backup Nebraska EL to FNAL tape to be configured”]

5 of 29

OSG Networking Infrastructure Summary

  • Fixes to restore the data replication to Nebraska and the tape backup at FNAL are planned but not yet implemented (stalled for now)
    • This was impacted by the UNL Elasticsearch to OpenSearch migration
  • perfSONAR has been stabilized with the broad release of 5.1.4 but we still see many resiliency issues from perfSONAR.
    • See status on dashboard
    • Major issue is that a large fraction of heavily used perfSONARs still fail to run measurements after 48-72 hours in operation. Still too early to see if 5.1.4 helps.
  • Our IRIS-HEP metric: number of sites marking their traffic with Experiment/Activity
    • This has increased by 100%!! (Nebraska is now joined by UCSD in marking)
  • Analytics tools and alerting components continue to be evolved and tested
  • On the to-do list: work with perfSONAR devs to improve things…
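The "fail to run measurements" issue noted above can be spotted by checking how long each host has been silent. The following is a minimal sketch, not the team's actual check: the hostnames, the `last_result` mapping and the 6-hour silence threshold are all assumptions made for illustration, and in practice the timestamps would come from a query against the central archive.

```python
from datetime import datetime, timedelta, timezone


def find_stale_hosts(last_result, now, max_silence_hours=6.0):
    """Return hosts whose newest result is older than max_silence_hours.

    last_result maps a hostname to the timestamp of its most recent
    measurement (hypothetical input, see the lead-in).
    """
    cutoff = now - timedelta(hours=max_silence_hours)
    return sorted(host for host, seen in last_result.items() if seen < cutoff)


now = datetime(2024, 10, 23, 12, 0, tzinfo=timezone.utc)
last_result = {
    "ps-a.example.org": now - timedelta(hours=1),   # still reporting
    "ps-b.example.org": now - timedelta(hours=30),  # silent for over a day
    "ps-c.example.org": now - timedelta(hours=5),   # within the threshold
}
stale = find_stale_hosts(last_result, now)
print(stale)
```

Lowering the threshold (for example `max_silence_hours=4`) would also flag the third host; the right value depends on how often each test is scheduled.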


6 of 29

perfSONAR Infrastructure Monitoring

  • Updated to Checkmk 2.3.0 (from 1.6.0)
  • Integration with CILogon (single sign-on) - moving away from X.509 certs
  • New tests
    • Node diagnostics based on “pscheduler troubleshoot” command
    • Tracking measurements in central Elasticsearch
  • Now in pre-production at psetf-itb.aglt2.org (need to move to EL9 OS)

7 of 29

perfSONAR (Re)-Deployment Plans

As noted, many of our sites are failing to operate consistently over time.

This leads to missing data and lack of coverage for significant parts of our global networks

We (the WLCG/OSG-PATh network monitoring team) feel that one issue is the co-deployment of OpenSearch for the measurement archive.

  • Too resource intensive
  • Too many components for a typical WLCG perfSONAR node

We are exploring a new option for our sites: deploy a perfSONAR testpoint container

  • Provides a very simple set of needed services to measure our networks
  • Needs Docker or a similar container runtime
  • Data continues to be gathered by our central Elasticsearch

To do this we need some minor fixes from the perfSONAR team so that the JSON metadata is still available for a testpoint deployment and mesh info is propagated.

This needs testing, and we plan to validate this new option in the coming months.

8 of 29

WLCG Network Data Challenges


9 of 29

WLCG “Network” Data Challenges

A planned series of end-to-end tests to demonstrate the progress of infrastructure, applications and middleware of WLCG sites and experiments to reach the capacity and capability required for HL-LHC scale operations.

These data challenges provide many benefits, allowing sites, networks and experiments to evaluate their progress, motivate and validate their developments in hardware and software and show readiness of technologies at suitable scale.

After DC24 (Feb 2024) we had a lot of data to examine and correlate. Many site issues were identified, along with shortcomings in some central data distribution components such as FTS, Rucio and associated monitoring.

We have new plans to enable “mini-challenges” that won’t require involvement of experts to allow us to “benchmark” parts of our infrastructure.

10 of 29

DC26/27 Network Planning

The next WLCG data challenge will likely be in late 2026 or early 2027 (DC26/27) and we have a number of targets to work on.

  • Network planning: make sure our sites and their local and regional networks are aware of WLCG requirements and timeline, and plan accordingly.
    • We need to socialize our requirements at sites and with regional networks
  • Update and utilize perfSONAR to clean up links & fix problems before DC26/27.
  • Instrument and document site networks, for at least our largest sites.
  • IPv6 should be enabled everywhere not just because of packet marking, but because it will allow us to get back to a single stack sooner!
  • Optimize our ability to utilize the network (jumbo frames, protocols, pacing)
  • Improve net traffic visibility (SciTags)
  • Explore and demonstrate the value of Network Orchestration/SDN.

11 of 29

OSG-LHC/IRIS-HEP Planning

At the IRIS-HEP retreat in September 2024, we discussed how to prepare for DC26

As mentioned, mini-challenges are an important tool that we want to enable

Goals for the next DC:

  • Move the majority of our data via IPv6 and have one or more sites IPv6-only
  • Have 80%+ of our traffic identified by SciTags
  • Have SENSE/Rucio used in production at one or more sites
  • Improve site network monitoring to track traffic by LHCONE, LHCOPN, R&E and commodity
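The 80% SciTags goal above can be checked with simple arithmetic over flow records. This is a rough sketch only: it assumes the target is measured by bytes transferred (the slides do not say whether it is by bytes or by flow count), and the record fields `bytes` and `tag` are hypothetical names rather than any real schema.

```python
def tagged_fraction(flows):
    """Byte-weighted fraction of traffic carrying an experiment/activity tag.

    Each flow is a dict with "bytes" and an optional "tag" entry; both field
    names are hypothetical.
    """
    total = sum(f["bytes"] for f in flows)
    if total == 0:
        return 0.0  # no traffic observed, so nothing is tagged
    tagged = sum(f["bytes"] for f in flows if f.get("tag") is not None)
    return tagged / total


flows = [
    {"bytes": 600, "tag": ("atlas", "production")},
    {"bytes": 250, "tag": ("cms", "analysis")},
    {"bytes": 150, "tag": None},  # unmarked traffic
]
fraction = tagged_fraction(flows)
meets_goal = fraction >= 0.80
print(f"{fraction:.0%} tagged, goal met: {meets_goal}")
```

Counting flows instead of bytes would give 2 of 3 (about 67%) for the same sample, so the choice of weighting matters when reporting against the target.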

Rough plan:

  • Before the end of 2024 re-run capacity tests for US sites to determine current values
  • Around February 2025, execute a joint USATLAS-USCMS capacity mini-challenge for North America (identify current throughput limits by site and in aggregate)
  • Early-to-mid Summer 2025, execute a joint USATLAS-USCMS capability mini-challenge for North America (storage tokens, traffic shaping, SciTags, SDN, etc.)

Will report on this at the upcoming WLCG DOMA meeting

12 of 29

Upgrades and Updates


13 of 29

perfSONAR 5.1.4

Recently released (Oct 17th); ~1/3 of perfSONARs are on 5.1.4. Includes additional fixes for issues identified by OSG-LHC

Fixes for configuration and error handling to increase robustness of toolkits

Main features (5.1):

  • New Grafana interface (replaces MaDDash!!)
  • Threaded iperf3 support
  • Conversion of pSConfig from Perl to Python (better support; easier features)
  • Removal of support for EL7, Debian 10 and Ubuntu 18
  • Supports EL9, Debian 11-12, Ubuntu 20, 22
  • Multiple fixes for Elmond, Logstash, registration daemon, pscheduler, …
  • Install help: curl -s https://raw.githubusercontent.com/perfsonar/project/master/install-perfsonar | sh -s - --help

We really need a testpoint deployment container with this version that adds in some needed monitoring and configuration components. Working with perfSONAR devs on this.

14 of 29

perfSONAR Dashboard Updates

  • New dashboard service replacing previous MaDDash
  • Based on Grafana and the perfSONAR 5 code base
    • Modified to use the central Elasticsearch, which uses a different schema
    • Generation of dashboards required some new code which is now upstream

15 of 29

perfSONAR Dashboard Traceroute Mesh

16 of 29

Dashboard Large Mesh Example

17 of 29

Scitags Updates (Flow & Packet Marking)

The SciTags Initiative has been working on making R&E network traffic “visible” anywhere in the network and since the last presentation, we have a few updates.

Details are available in the recent LHCONE/LHCOPN presentation of October 11, 2024

Updated “flowd” deployment containers (see https://github.com/scitags/flowd) with support for Alma8, RHEL8 and RHEL9.

ESnet monitoring of fireflies: https://dashboard.stardust.es.net/goto/RrVzQwLIg?orgId=2

EOS, StoRM and XRootD are currently SciTags-capable and we are working on dCache

New dCache golden release (10.2) has a POC version of fireflies incorporated

It needs to use Owner/Activity from FTS to be functional; we are discussing plans with the dCache team

We have a couple of demonstrations coming up at Supercomputing 2024 in November

Target reminder: have 80% of ALL production traffic labeled by DC26

18 of 29

Analytics: Finding and Alerting on Problems


19 of 29

Reminder - Alarming: Why and What?

Alarming for us is defined as the identification of some issue that needs fixing.

An alarm:

  • Identifies a specific issue
  • Involving one or more sites
  • Occurs at a point in time or exists in a time range
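The three properties listed above map naturally onto a small record type. The sketch below is illustrative only: the class name, field names and example values are assumptions, not the schema used by the actual alarming service.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Alarm:
    issue: str                      # the specific issue identified
    sites: Tuple[str, ...]          # one or more sites involved
    start: datetime                 # when the issue was first seen
    end: Optional[datetime] = None  # None means a single point in time

    def __post_init__(self):
        # Enforce the "one or more sites" and time-ordering constraints.
        if not self.sites:
            raise ValueError("an alarm must involve at least one site")
        if self.end is not None and self.end < self.start:
            raise ValueError("end must not precede start")

    def is_range(self) -> bool:
        """True when the alarm covers a time range rather than an instant."""
        return self.end is not None


alarm = Alarm(
    issue="high packet loss",
    sites=("SITE_A", "SITE_B"),
    start=datetime(2024, 10, 1, 8, 0),
    end=datetime(2024, 10, 1, 14, 0),
)
print(alarm.is_range())
```

Making the record immutable keeps an alarm stable once raised; an ongoing issue would be represented by creating a new record with an updated `end`.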

We have been exploring various algorithms to define alarms of interest.

Since it can be very difficult to identify certain types of issues from our complex data, we are trying various Artificial Intelligence (AI) & Machine Learning (ML) techniques.

As reported last time, we recruited an IRIS-HEP fellow (Yana Holoborodko) to work on this and we will report on her project results below.

20 of 29

Student Summer Project

For 2024, we identified one student to work with us on Alerting & Alarming: Yana Holoborodko

Yana had a challenging task in that she not only had to learn about our tools, data formats and previous results, but also had to come up to speed on the basics of networking, data transfers and how they can go wrong!

She completed her project at the beginning of October and reported on it in a Fellows talk on October 7, 2024

We may have the option to provide a follow-on Fellowship at CERN for 6-9 months!

21 of 29

Analytics Work


22 of 29

Continuing Analytics Work

There are a number of areas of work ongoing in analyzing our network data.

  • One of the most challenging is accurately determining the path specific measurements took through the network.
  • We then need ways to combine diverse measurements into meaningful indicators of problems, including possible problem locations.

To support this, we have also been working on determining what is “normal” for our measurements along certain paths or between specific site pairs, i.e., how our measurements behave when things are working as expected, so that we can more accurately identify when there are problems.
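One simple way to express "normal" for a site pair is a robust baseline: the median of past measurements plus the median absolute deviation (MAD) as a spread. The sketch below uses that technique purely as an illustration; it is not the method used in our analytics, and the sample latencies and the threshold of 5 are made-up values.

```python
from statistics import median


def robust_baseline(values):
    """Return (median, MAD) for a list of historical measurements."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return center, mad


def is_anomalous(value, center, mad, threshold=5.0):
    """Flag a value that lies more than `threshold` MADs from the median."""
    if mad == 0:
        return value != center  # history was constant; any change stands out
    return abs(value - center) / mad > threshold


# Hypothetical latency history (ms) for one site pair.
history_ms = [20.1, 19.8, 20.4, 20.0, 19.9, 20.3, 20.2]
center, mad = robust_baseline(history_ms)
print(is_anomalous(20.5, center, mad))  # small wobble
print(is_anomalous(35.0, center, mad))  # large jump
```

The median and MAD are used here instead of mean and standard deviation because a few bad measurements in the history do not shift them much.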

The update for this time is really covered in Petya’s CHEP presentation (given TODAY)

23 of 29

CHEP 2024 Presentation on Network Analytics

Petya Vasileva is presenting our recent work on analytics at CHEP 2024 in Krakow today.

Details are available in the presentation

The core of the work is on correctly identifying the network path and correlating it with our metrics

The new methodology uses AI/ML to identify missing components on the network path

This is critical for correlating problems with particular paths/locations.
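To make the "missing components" problem concrete, the sketch below finds silent hops in a traceroute and fills one in only when both of its neighbours match a previously seen path. This is a deliberately simple neighbour-matching heuristic, not the AI/ML methodology described in the presentation; the function names and the documentation-range addresses are hypothetical.

```python
def missing_hop_positions(path):
    """Return 1-based TTL positions where no router replied (None entries)."""
    return [ttl for ttl, hop in enumerate(path, start=1) if hop is None]


def fill_from_reference(path, reference):
    """Fill a silent hop only when both neighbours match the reference path."""
    if len(path) != len(reference):
        return list(path)  # paths of different length are not compared
    filled = list(path)
    for i, hop in enumerate(path):
        # Skip hops that replied, and end points that lack two neighbours.
        if hop is not None or i == 0 or i == len(path) - 1:
            continue
        if path[i - 1] == reference[i - 1] and path[i + 1] == reference[i + 1]:
            filled[i] = reference[i]
    return filled


observed = ["192.0.2.1", None, "198.51.100.7", "203.0.113.9"]
reference = ["192.0.2.1", "192.0.2.254", "198.51.100.7", "203.0.113.9"]
print(missing_hop_positions(observed))
print(fill_from_reference(observed, reference))
```

A heuristic like this breaks down when several consecutive hops are silent or when routing changes between measurements, which is part of why more capable methods are of interest.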

24 of 29

Outreach & Plans


25 of 29

Outreach: LHCONE/LHCOPN Workshop and Regular Meetings

We had two sessions at the LHCONE/LHCOPN meeting this fall in Beijing:

  • perfSONAR Monitoring Update
  • SciTags deployment, configuration and use

There are also two Supercomputing 2024 demos for SciTags demonstrating hop-by-hop and 400Gbps packet marking

In 2 weeks we will also present and discuss at HEPiX in Oklahoma

Our new meeting series supporting WLCG site networking on the 3rd Thursday of each month has had 5 presentations and 4 meetings (notes/agenda)

We continue our regular network analytics discussion meetings on Thursdays (except the 3rd Thursday) (notes/agenda)

26 of 29

Near Term Work Areas

  • Analyze ESnet “high-touch” data captured during DC24
    • We have identified a post-doc at PIC (Marc Felix Diez) working on this till Feb 2025
  • Define Post-DC24 plans for mini-challenges, technology prototyping and testing and preparations for DC26 (DC27?)
  • Evolve or replace pS-Dash, improving problem identification and localization
    • Exploring possible LLM augmentation / replacement for pS-Dash…
  • Continue network topology cleaning, analyzing and alerting (Yana continues?)
  • Retire MaDDash and reconfigure to support new Grafana replacement
    • psmad.opensciencegrid.org will be retired/repurposed.
  • Maintain engagement with the global research community
    • Access to R&E monitoring and integration with our tools and datastores
    • Working with ESnet on Stardust dashboard, flow data presentation and other items


27 of 29

Summary and Conclusion

The collaboration of OSG, WLCG and various research projects has created an extensive infrastructure to monitor our networks via perfSONAR and provide associated analytics and visualization.

  • We maintain our engagement with various groups, working on improving networking for scientific research.
  • While we continue to monitor and maintain our infrastructure, we need to also develop and tune our tools and applications.
  • We have lots of activity “pre DC26” to both incorporate improvements and fix issues

Questions or Comments?


28 of 29

Acknowledgements

We would like to thank the WLCG, HEPiX, perfSONAR and OSG organizations for their work on the topics presented.

In addition, we want to explicitly acknowledge the National Science Foundation, which supported this work via:

  • OSG: NSF MPS-1148698
  • IRIS-HEP: NSF OAC-1836650

29 of 29

Useful URLs to Reference
