Institute for Research and Innovation in Software
for High Energy Physics (IRIS-HEP)
PI: Peter Elmer (Princeton), co-PIs: Brian Bockelman (Morgridge Institute), Gordon Watts (U.Washington) with
UC-Berkeley, University of Chicago, University of Cincinnati, Cornell University, Indiana University, MIT, U.Notre Dame, U.Michigan-Ann Arbor, U.Nebraska-Lincoln, New York University, Stanford University, UC-Santa Cruz, UC-San Diego, U.Illinois at Urbana-Champaign, U.Puerto Rico-Mayaguez and U.Wisconsin-Madison
OAC-1836650
Coordinated Ecosystem for HL-LHC Computing R&D
Plenary Talk
Morning I - Parallel 1: Network & Storage System R&D
Summary by Brian Bockelman
Topic Description
What are the storage and network requirements for successfully executing the HL-LHC’s science program?
In the shorter term, what activities should be considered for DC23, particularly active Storage and Networking R&D?
Summary of the Discussion - Setting the Requirements
Existing planning efforts have provided several of the HL-LHC requirements for facilities. Total storage and WAN requirements are particularly well-scoped.
Missing:
There was general agreement that items (1) and (2) could be spec’d out in a dedicated workshop.
Summary of Discussion / Current State of R&D
What are the funded projects in this area (nationally? Internationally?)
Below is the reverse of the prior slide: a mapping of existing coordinated projects to R&D activities.
What are possible blueprints and existing forums for discussion?
What should this topic look like in 3 years?
How do we know whatever solutions we have will work once deployed?
The organizing checkpoint for this activity is DC23 (~18 months from now). Goals for this challenge should include:
Based on DC23 outcomes, additional activities may include:
Topic Description
Discussion mainly focused on three topics: Integrated RSE teams, Training, and DEI.
Summary of the Discussion
What is the current state?
Integrated RSE teams
Training
DEI
What are the funded projects in this area (nationally? Internationally?)
Integrated RSE teams:
Training:
DEI:
What are possible blueprints and existing forums for discussion?
Integrated RSE teams:
Training:
DEI:
What should this topic look like in 3 years?
Integrated RSE teams:
Training:
DEI:
How do we know whatever solutions we have will work once deployed?
Integrated RSE teams:
Training:
DEI:
Morning II - Parallel 1: AI/ML Coordinated Ecosystem
Topic Description
Observation:
Guidance:
Summary of the Discussion
Evolution from the current period of rapid R&D, prototyping, and bespoke solutions for deployment toward a more mature, established set of practices for ML in various contexts (e.g. trigger, reconstruction, simulation, analysis)
What are the funded projects in this area (nationally? Internationally?)
See talks from Monday afternoon:
What is the current state?
Lots of activity:
Not (yet) much use of “foundation” models, unlike the pattern seen in other fields
What are possible blueprints and existing forums for discussion?
Instead of constantly replacing ML components with a new state-of-the-art approach, we will likely move into a mode where the ML components used in production stabilize.
Infrastructure for supporting benchmarks and challenges
Does the “challenge” infrastructure include training, or just evaluation? (A minimal sketch of both modes follows below.)
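To make that distinction concrete, here is a minimal, hypothetical Python sketch of the two challenge modes: an evaluation-only harness that scores a pre-trained submission on held-out data, versus a retraining harness that also owns the training step. The dataset, model, and names (make_toy_dataset, evaluate_only, retraining_challenge) are illustrative assumptions, not an existing IRIS-HEP or experiment API.

```python
# Minimal sketch (assumed names, toy data) contrasting evaluation-only vs.
# retraining-style "challenge" infrastructure. Not an actual IRIS-HEP API.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def make_toy_dataset(n=5000, seed=0):
    """Stand-in for a real benchmark sample (e.g. a tagging dataset)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 8))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return train_test_split(X, y, test_size=0.3, random_state=seed)


def evaluate_only(model, X_test, y_test):
    """Evaluation-only mode: participants submit a trained model;
    the infrastructure only scores it on a hidden test set."""
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def retraining_challenge(model_factory, X_train, y_train, X_test, y_test):
    """Retraining mode: the infrastructure also runs the training step,
    so data versions, seeds, and resources become part of the test."""
    model = model_factory()
    model.fit(X_train, y_train)
    return evaluate_only(model, X_test, y_test)


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = make_toy_dataset()

    # Evaluation-only: a pre-trained model is handed to the harness.
    pretrained = GradientBoostingClassifier().fit(X_train, y_train)
    print("evaluation-only AUC:", evaluate_only(pretrained, X_test, y_test))

    # Retraining: the harness owns training as well as evaluation.
    print("retraining AUC:", retraining_challenge(
        GradientBoostingClassifier, X_train, y_train, X_test, y_test))
```

The retraining mode is the one that exercises reproducibility of training, which is what a community-defined “retraining challenge” (noted below) would test.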
What should this topic look like in 3 years?
How do we know whatever solutions we have will work once deployed?
The baseline is the current development → deployment → results process
A “Retraining challenge” should be defined by the community (similar to the Analysis Grand Challenge and the Data Challenge), so this should be baked in at some level
Something analogous to the SSL (Scalable Systems Laboratory) as a testbed for approaches that use FPGAs
Morning II - Parallel 2: Resources & Coordination
Topic Description
How do we go beyond that? What other “plots” do we need to produce?
For example: cost estimate, energy/power/carbon footprint estimates
Summary of the Discussion
Given the projected resource shortage, we will need HPC resources to complement T1/T2
To make a stronger case with ASCR and NSF HPC centers, we need to go beyond “we are running x% of our production on your system”
HPC centers like to know we are running unique workflows on their hardware that enable us to increase our physics reach. Some ideas:
What are possible blueprints and existing forums for discussion?
CompHEP (via HEP-CCE?) could sponsor a requirements-gathering step inspired by the ESnet requirements process (on a smaller scale!), perhaps starting from the joint US ATLAS/US CMS opportunities computing document currently in draft form
What is the current state?
What are the funded projects in this area (nationally? Internationally?)
What should this topic look like in 3 years?
How do we know whatever solutions we have will work once deployed?
Afternoon I - Parallel 1: Analysis Facility & Evolution of Facilities
Topic Description
Summary of the Discussion
What is the current state?
What are the funded projects in this area (nationally? Internationally?)
What are possible blueprints and existing forums for discussion?
What should this topic look like in 3 years?
How do we know whatever solutions we have will work once deployed?
Topic Description
Summary of the Discussion
What is the current state?
Only a handful of experts within HEP-CCE, IRIS-HEP, ATLAS, and CMS have sufficient knowledge to program algorithms on GPUs
Needs specialist GPU knowledge and expertise with the experimental software frameworks
Also need to understand the algorithms, or work closely with people who do (“understand/speak the language”)
Need continuity, since these projects are not short-term
What are the funded projects in this area (nationally? Internationally?)
HEP-CCE, IRIS-HEP, SciDAC, LHC experiments’ Operations programs, DOE computational traineeship programs, Research programs…
What are possible blueprints and existing forums for discussion?
What should this topic look like in 3 years?
We should have built a larger pool of experts who can program GPUs
A larger fraction of the experimental code/algorithms should be ported to GPUs (the target fraction needs to be defined)
How do we know whatever solutions we have will work once deployed?