1 of 37

Science Driven Requirements for 400Gbps Networking

Kate Robinson - katerobinson@es.net

Network Engineering

Jason Zurawski - zurawski@es.net

Science Engagement

Quilt meeting

February 5th, 2025

2 of 37

Agenda

  • ESnet Overview
  • ESnet Requirements Reviews - HEP Focus
  • ESnet HEP Support + Transatlantic Connectivity Updates
  • Ramp up to HL-LHC - WWNFY
  • Data Challenge 24 Lessons, next steps
  • Engagement & Planning Process


3 of 37

ESnet is the DOE’s data circulatory system…

  • ESnet supports the DOE scientific research ecosystem.
  • Interconnects all national labs and user facilities
  • Provides reliable, high-performance connectivity to global research collaborations, the Cloud, and the larger Internet.
  • "User Facility" charged with supporting DOE mission needs
    • The ESnet6 network is the backbone of DOE and large-scale science


4 of 37

…and the stage for a global science laboratory.

ESnet’s Vision

Scientific progress will be completely unconstrained by the physical location of instruments, people, computational resources, or data.

ESnet’s Mission

Networking that accelerates science.


5 of 37

Requirements Review Overview

ESnet’s core partnership program was created to comprehensively evaluate:

    • Major science experiments and facilities, both in operation and planned.
    • The process of science used for knowledge discovery, including scientists’ interactions with the instruments and facilities.
    • The volume of data produced now, and anticipated in the future, with an emphasis on the geographical locations where the data must be shared, computed, and/or stored.
    • The current technology capabilities (network, computation, storage, and software stack) used by each science collaboration/facility as well as any planned upgrades, additions or improvements.


6 of 37

2024 Requirements Review

  • Review not published (yet), but here is the 2020 version:
    • Zurawski, Jason, Brown, Benjamin, Carder, Dale, Colby, Eric, Dart, Eli, Miller, Ken, Patwa, Abid, Robinson, Kate, Rotman, Lauren, and Wiedlea, Andrew. High Energy Physics Network Requirements Review (Final Report, July-October 2020). United States: N. p., 2021. Web. doi:10.2172/1804717.
    • https://doi.org/10.2172/1804717

  • 2024 had a heavy focus on:
    • HL-LHC (we will talk a lot about this today), but also DUNE, Rubin/LSST, and more
    • The role of commercial computing infrastructures
    • Integrated Research Infrastructure (IRI) - new DOE initiative to support scientific research
    • Friction points (networking, storage, software)


7 of 37

Scientific Case Studies (14 total, ~350 pages)

  • Cosmological Simulation Research
  • Dark Energy Science Collaboration (DESC)
  • Dark Energy Spectroscopic Instrument (DESI)
  • The Vera C. Rubin Observatory (Rubin Observatory) & the Legacy Survey of Space and Time (LSST)
  • Cosmic Microwave Background - Stage 4 (CMB-S4)
  • Alpha Magnetic Spectrometer (AMS) Experiment
  • Muon Experimentation at Fermilab
    • Muon g-2
    • Muon-to-electron-conversion experiment (Mu2e)
  • Belle II Experiment
  • Neutrino Experiments at Fermilab
    • Short-Baseline Neutrino Program (SBN)
    • The Deep Underground Neutrino Experiment (DUNE)
  • Super Cryogenic Dark Matter Search (Super CDMS)
  • Large Hadron Collider (LHC) Experimentation and Operation
    • ATLAS (A Toroidal LHC ApparatuS) Experiment
    • Compact Muon Solenoid (CMS) Experiment
    • LHC Operations
    • High Luminosity (HL) Era of the LHC


8 of 37

Technical Case Studies (7 total, ~100 pages)

  • Argonne Leadership Computing Facility (ALCF)
  • Argonne National Laboratory (ANL)
  • Brookhaven National Laboratory (BNL)
  • Fermi National Accelerator Laboratory (Fermilab)
  • Oak Ridge Leadership Computing Facility (OLCF)
  • SLAC National Accelerator Laboratory (SLAC)
  • The National Energy Research Scientific Computing Center (NERSC)


9 of 37

Participating Orgs (36 total)

  • Federal
    • Department of Energy Office of Science
  • International:
    • Beihang University
    • CIEMAT
    • European Organization for Nuclear Research (CERN)
    • Rubin Observatory
    • U.A. Madrid
    • University of Toronto
    • Utrecht University
  • DOE Labs & Facilities
    • Argonne Leadership Computing Facility (ALCF)
    • Argonne National Laboratory
    • Brookhaven National Laboratory (BNL)
    • Energy Sciences Network (ESnet)
    • Fermi National Accelerator Laboratory (Fermilab)
    • Lawrence Berkeley National Laboratory (LBNL)
    • Oak Ridge Leadership Computing Facility (OLCF)
    • Oak Ridge National Laboratory
    • SLAC National Accelerator Laboratory (SLAC)
    • The National Energy Research Scientific Computing Center (NERSC)
  • Universities
    • Duke University
    • Harvard University
    • Indiana University
    • Oregon State University
    • Princeton University
    • San Diego Supercomputer Center (SDSC)
    • Stony Brook University
    • Texas A&M University
    • University of California Berkeley
    • University of California San Diego
    • University of Chicago
    • University of Hawaii at Manoa
    • University of Illinois Urbana-Champaign
    • University of Massachusetts
    • University of Michigan
    • University of Texas at Arlington
    • University of Wisconsin-Madison
    • Vanderbilt University


10 of 37

Breaking news: (rough) 2024 Findings

  • Data volume Increases:
    • Across the board: the PB era is now common for the large experiments; smaller experiments have graduated to the TB scale
    • Frequency of production increasing, along with fidelity
  • Storage still an issue:
    • Wut?!
    • “We still treat the network as infinite storage … but we shouldn’t have to”
  • Ability to support multi-facility workflows:
    • Sensors, computation, storage, and people may all be in different locations - regularly.
    • Focus in DOE space: Integrated Research Infrastructure (https://iri.science)
    • Ability to execute ‘near real-time’ workflows is now routine (thanks R&E Networking Community!)
    • This is the only way we will scale - but requires significant coordination and cooperation


11 of 37

Breaking news: (rough) 2024 Findings

  • Cloud use:
    • We would still consider this the ‘dabbling’ stage for the majority. A significant number of workflows are now ‘cloud-ready’ and can burst there if the costs allow it (hint: they still do not)
    • Some experiments have built a successful hybrid model (e.g., Rubin): a cloud ‘front end’ to support the users, backed by DOE-based storage and production/analysis computation.
  • R&D activities have a positive impact on reducing network use (Caching, Data Lake, SENSE / Rucio, SciTag, etc.):
    • Reducing the use of networks is a goal, since they are not an unlimited resource (a small illustration of the caching effect follows below).
    • The work between ‘now’ and 2029 is critical, because that’s when the hammer drops on the next generation of data-intensive science for the HL-LHC
  • Data Challenges = Good Thing ( more later … )
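
As a minimal illustration of the caching point above (the hit rate and data volume below are assumed figures for intuition, not measured values), a short Python sketch of how a regional cache shrinks WAN traffic:

    # Minimal sketch of how in-network caching reduces WAN traffic.
    # The hit rate and read volume are assumed figures, not measurements.
    dataset_reads_pb = 100.0   # total data read by jobs over some period (PB)
    cache_hit_rate = 0.6       # fraction of reads served from a regional cache

    wan_pb = dataset_reads_pb * (1 - cache_hit_rate)
    print(f"WAN transfer: {wan_pb:.0f} PB instead of {dataset_reads_pb:.0f} PB "
          f"({cache_hit_rate:.0%} of reads served locally)")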


12 of 37

… but

ESnet is here as a partner: let’s work out a plan on this together, in case you are not yet ready.


13 of 37

Why ESnet as a partner?

  • ESnet = Mission Network. Our role is to support DOE Office of Science traffic
  • For other RENs: Connecting & Peering with ESnet is FREE
    • It’s already included in our cost of operation
  • ESnet values its partnerships with regional exchange points, creating connections that benefit all involved
  • The US LHC Tier 1 sites for ATLAS and CMS, Brookhaven National Laboratory and Fermi National Accelerator Laboratory, have primary connectivity via ESnet
  • Significant transatlantic connectivity (current and planned)
  • Half of ESnet’s traffic is LHC; we are invested in this traffic flow


14 of 37

US - Europe Connectivity

  • Collaborating with GEANT to share spectrum on subsea cables
  • Plan: collectively acquire optical spectrum across 4+ diverse cables
  • Both Amitie 400Gs are ready and in use
  • ESnet completed requisition of “cable 1” spectrum with Aquacomms AEC-1
  • Subsea spectrum IRUs are 15+ year contracts
  • Subsea spectrum should provide ~10Tbps aggregate to ESnet
  • Meet HL-LHC commitments plus growth

  • 2027 commitment to the LHC community:
    • 3.2 Tbps of ESnet transatlantic bandwidth (a rough sizing sketch follows below)
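
As a rough sizing sketch only (assuming the commitment is provisioned as 400G waves over the shared subsea spectrum; the real channel plan may differ), the arithmetic behind the 2027 figure in Python:

    # Back-of-envelope sizing for the 2027 transatlantic commitment.
    # Assumes 400G waves carved from the shared subsea spectrum; the
    # actual channel plan may differ.
    COMMITMENT_TBPS = 3.2     # 2027 commitment to the LHC community
    WAVE_GBPS = 400           # assumed wave size
    DIVERSE_CABLES = 4        # planned minimum number of diverse cables

    waves_needed = int(COMMITMENT_TBPS * 1000 / WAVE_GBPS)   # -> 8 waves
    waves_per_cable = waves_needed / DIVERSE_CABLES          # -> 2.0 per cable
    print(f"{waves_needed} x {WAVE_GBPS}G waves, ~{waves_per_cable:.0f} per diverse cable")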


15 of 37

Trans-Atlantic IGP Traffic Engineering

  • Treat as many links as possible as "internal to ESnet"
    • End-to-end traffic engineering is then tractable vs. a piecemeal approach
    • See Michael Sinatra's talk(s) on Segment Routing
  • Configured policy on ESnet routers that have LHCONE connections to load-balance traffic to/from CERN across all available TA paths
    • Overrides shortest-path routing
    • Up to 6 paths to choose from
    • Can weight some paths heavier than others (a conceptual sketch follows below).
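
The sketch below is a conceptual Python illustration of the idea (hypothetical path names and weights, not ESnet router configuration): hash a flow's 5-tuple into a weighted table so every flow is pinned to one of several transatlantic paths, with heavier-weighted paths carrying a larger share of flows:

    import hashlib
    import itertools

    # Hypothetical path names and weights; heavier weight -> larger share of flows.
    PATHS = {"ta-path-1": 3, "ta-path-2": 3, "ta-path-3": 2,
             "ta-path-4": 2, "ta-path-5": 1, "ta-path-6": 1}

    # Expand the weighted table once; a flow's 5-tuple hash indexes into it,
    # so all packets of a given flow take the same path (no reordering).
    TABLE = list(itertools.chain.from_iterable([p] * w for p, w in PATHS.items()))

    def pick_path(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return TABLE[digest % len(TABLE)]

    # Example: a given Tier-1 <-> CERN transfer always maps to the same path.
    print(pick_path("198.51.100.10", "192.0.2.20", 44321, 443))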


16 of 37

Trans-Atlantic IGP Traffic Engineering Example


17 of 37

Current US Tier-1 Status

  • BNL
    • 1.6 Tbit/s
      • 800G (2x400GE) primary for LHCOPN + R&E connectivity
      • 800G (2x400GE) primary for LHCONE
  • FNAL
    • 2.4 Tbit/s
      • 1.6 Tbit/s (4x400GE) primary for LHCOPN & LHCONE
      • 800G (2x400GE) for R&E connectivity

  • In both cases, traffic can fail over between links, and sometimes has to … (a small headroom sketch follows below)
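
To illustrate why the failover caveat matters, a small Python headroom check (the bundle sizes mirror the BNL layout above; the peak loads are made-up illustrative numbers, not ESnet measurements):

    # Failover headroom check for a Tier-1 with two link bundles.
    # Peak loads are assumed for illustration, not measured ESnet data.
    bundles_gbps = {"LHCOPN+R&E": 800, "LHCONE": 800}    # 2 x (2x400GE)
    peak_load_gbps = {"LHCOPN+R&E": 500, "LHCONE": 350}  # assumed peaks

    for failed in bundles_gbps:
        survivor = next(b for b in bundles_gbps if b != failed)
        total = sum(peak_load_gbps.values())          # all traffic on one bundle
        capacity = bundles_gbps[survivor]
        status = "OK" if total <= capacity else "SATURATED"
        print(f"If {failed} fails: {total}G onto {capacity}G {survivor} -> {status}")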


18 of 37

ESnet growth

as of January 2025


LHC accounts for ~50% of ESnet traffic*

* As measured via our traffic engineering; T2-to-T3 traffic may not be captured, etc.

19 of 37


High Luminosity LHC

20 of 37

Preparation for High-Luminosity LHC (~2030)

  • Current projections (a rough back-of-envelope sketch follows below):
    • 400G connectivity needed per T2
      • Recall there are (4) T2s connected to OmniPoP
    • Needs likely driven by storage, with tradeoffs on placement
      • ~5-12 PB typical now, possibly 20-40 PB per site in 2030
    • In-network caches undergoing R&D - not the answer, but part of it …
  • Data Challenges (full software stack)
    • 10% of the target in 2021 (should match Run 3)
    • 20% in 2024 (rescoped to < 100G)
    • 50% in 2027 (possible bursts into 400G)
    • 100% in 2030 (400G for longer durations)
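
A rough Python back-of-envelope of what these projections imply, assuming ideal line rate with no protocol or storage overhead, and scaling the data-challenge percentages against the 400G per-T2 target purely for intuition (the actual challenge targets are aggregate experiment rates):

    # Rough back-of-envelope only; assumes ideal line rate and ignores
    # protocol, storage, and failure overhead. Figures are from the slide above.
    LINK_GBPS = 400                      # projected per-T2 connectivity
    SITE_PB = (20, 40)                   # projected per-site storage in 2030

    for pb in SITE_PB:
        bits = pb * 1e15 * 8             # decimal PB -> bits
        days = bits / (LINK_GBPS * 1e9) / 86400
        print(f"{pb} PB at {LINK_GBPS}G line rate ~= {days:.1f} days to move")

    # Data-challenge ramp, scaled against the per-T2 target for intuition only.
    for year, frac in [(2021, 0.10), (2024, 0.20), (2027, 0.50), (2030, 1.00)]:
        print(f"{year}: {frac:.0%} of target ~= {frac * LINK_GBPS:.0f} Gbps per T2")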


Dates have moved out towards 2030, but activities run up to the turn-on

21 of 37

Ongoing Mini Data challenges

  • Data challenges must be treated as production traffic from the network's perspective so that the entire software stack can be tested.
    • Not just moving bits, but also testing workflow schedulers, data movement algorithms, storage performance, etc.
    • Provide a global baseline for performance capabilities

  • Mini challenges have emerged to target specific use cases
    • Home in on and refine specific areas
    • Prove out new technologies or capabilities as warranted


22 of 37

Data Challenge 2024: Huge Win(s)

  • 2022 Quilt Meeting described many successes of DC1
  • Operational highlights going forward
    • Load balancing needs some engineering work
    • Multiple cases of saturation or close-to-saturation on paths
      • Some in ESnet
      • Some in other networks
    • 100G Connectivity will be inadequate

  • Network upgrades are a focus of conversations for both Run 3 (now) & 4 (~2030)


23 of 37

ESnet peak of 1.83 Tbit/s offered load during DC24

https://dashboard.stardust.es.net/d/Bi0-rzg4z/welcome?orgId=1&from=1708437425634&to=1708467355046


24 of 37

LHCOPN (OSCARS) & LHCONE DC24 Traffic vs. Everything Else on ESnet


(Chart legend: non-LHC traffic in gray; LHCOPN and LHCONE shown separately)

25 of 37

ESnet Transatlantic Usage


26 of 37

ESnet portal weather map (DC24)


27 of 37

OmniPoP 2 x 4 x 100G*


* During DC24, now = 2 x 2 x 400G

28 of 37

Characteristic Tier-2 data rates during DC24

Caltech: bursts believed to be an artifact of the storage system outpacing job submission

29 of 37

Characteristic Tier-2 data rates

Site known to have an internal storage limitation, no bursts


30 of 37

Characteristic Tier-2 data rates

Site known to have a 10G network bottleneck


31 of 37

Characteristic Tier-2 data rates

University of Florida: 100G network bottleneck during DC24, now fixed

Visible SNMP polling errors under investigation

32 of 37

Distributed Networks

  • Interconnection between multiple networks
    • Trans-continental, national, regional, and campus
    • LHCONE traverses many disparate parties, glued together by ESnet

  • In order to prepare for increased load, multiple parties need to plan together:
    • Funding agencies
    • ESnet
    • Exchange points, like OmniPoP, MANLAN, WIX, etc.
    • Regional providers, like LEARN, iLight, GPN, etc.
    • Campus / Research IT
    • PIs and staff



33 of 37

Exchange Points/RONs and LHCONE: By the Numbers

  • Significant number of exchange point interfaces undergoing upgrades
  • Multi-party effort to get this done end to end, through to campuses
  • Huge lead times for infrastructure
  • We need to start now!


34 of 37

ESnet LHC-Engagement Activities

  • ESnet making the rounds talking to every T2 site
    • Outreach starting again this year to support DC27
  • Gathering and helping synchronize plans from
    • Individual PIs
    • Departmental Support Staff
    • Campus IT
    • Regional Networks
    • R&E Exchange points
  • Near-term upgrades for Run 3
    • n*100G is sufficient …
  • …but for the HL-LHC era
    • 400G +


35 of 37

Outyear planning

  • Consider not just your own components, but everything involved in meeting the expectations of large-scale science.
  • ESnet is trying to get the word out ahead of budgeting & planning cycles,
    • in multiple forums
    • and to funding agencies
  • Lead times for infrastructure investments are long
  • Give your campus CIOs and Provosts a heads-up now, and continuously engage for investments through 2029.
  • Engage w/ NSF and DOE in support of R&D efforts


36 of 37

Audience Q/A

  • Tell us:
    • Who you are, what you represent, and where you are in your preparations
    • How can ESnet help you in your future network upgrades?
  • Unrelated … but related:
    • https://fasterdata.es.net/
    • We are going through a large update, but to be effective we can always use help:
      • Are RONs willing to share architecture choices (can be redacted)?
      • What equipment is working well for the use cases?
      • Specific config choices/tips/tricks?
    • We only succeed as a community when we share and encourage progress


37 of 37

Science Driven Requirements for 400Gbps Networking

Kate Robinson - katerobinson@es.net

Network Engineering

Jason Zurawski - zurawski@es.net

Science Engagement

Quilt meeting

February 5th, 2025