1 of 50

Institute for Research and Innovation in Software

for High Energy Physics (IRIS-HEP)

PI: Peter Elmer (Princeton), co-PIs: Brian Bockelman (Morgridge Institute), Gordon Watts (U.Washington) with

UC-Berkeley, University of Chicago, University of Cincinnati, Cornell University, Indiana University, MIT, U.Notre Dame, U.Michigan-Ann Arbor, U.Nebraska-Lincoln, New York University, Stanford University, UC-Santa Cruz, UC-San Diego, U.Illinois at Urbana-Champaign, U.Puerto Rico-Mayaguez and U.Wisconsin-Madison

OAC-1836650

2 of 50

Coordinated Ecosystem for HL-LHC Computing R&D

Plenary Talk

3 of 50

Morning I - Parallel 1: Network & Storage System R&D

Summary by Brian Bockelman

4 of 50

Topic Description

What are the storage and network requirements for successfully executing the HL-LHC’s science program?

In the shorter term, what activities should be considered for DC23, particularly active storage and networking R&D?

5 of 50

Summary of the Discussion - Setting the Requirements

Existing planning efforts have provided several of the HL-LHC requirements for facilities. Total storage and WAN requirements are particularly well-scoped.

Missing:

  1. What are the requirements for data rates from site storage to (local) compute? Need to cover all US facilities, not just T1s.
  2. How should the resources toward HL-LHC be deployed over time? Need to balance participation in data challenges with the real costs of “buying ahead” too early.
  3. How do new concepts - analysis facilities, inference services - get their requirements captured?

There was general agreement that items (1) and (2) could be spec’d out in a dedicated workshop.

6 of 50

Summary of Discussion / Current State of R&D

  • Leveraging engineered network paths: Dynamically bring up dedicated network paths (providing some sort of guarantee) associated with a flow.
  • Packet marking: Either through IPv6 headers or “firefly packets”, inform network monitoring of the identity of the flow (CMS? ATLAS? Priority? Data type?); see the sketch after this list.
  • Demonstrating capabilities inside the network: Have network locations run additional services beyond packet processing (ServiceX in FAB, Caches at network POPs).
  • Introducing storage specialization for workflows: Instead of having a single hardware/software/service solution for all HL-LHC workflows, can we utilize some differentiation to optimize cost? Example: tiered storage at BNL - less IO-intensive workflows on lower-QOS disk?
  • Demonstrate scaling of existing solutions. Can we take an existing set of production software (XRootD + FTS + Rucio) and run it at larger scale on testbed hardware?
  • Technology evaluations across the community: As always, continuous evaluation of new storage solutions like Ceph (and/or including vendor solutions - e.g., VAST).
  • Other (storage R&D, not necessarily tied to data challenge): computational storage, increasingly lossy data formats, next-gen HDF5.
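As an illustration of the packet-marking item above, here is a minimal sketch of a “firefly”-style marker: a small UDP packet carrying JSON metadata about a flow, sent to a monitoring collector. The collector host/port and the JSON fields below are assumptions for illustration only, not the actual packet-marking specification.

```python
# Illustrative sketch only: emit a "firefly"-style UDP marker describing a data flow.
# The collector host/port and the JSON schema below are assumptions for illustration,
# not the production packet-marking specification.
import json
import socket
import time

def send_firefly(experiment: str, activity: str, src: str, dst: str,
                 collector=("fireflies.example.org", 10514)):
    """Send one UDP packet with flow metadata to a monitoring collector."""
    payload = {
        "timestamp": time.time(),
        "experiment": experiment,   # e.g. "CMS" or "ATLAS"
        "activity": activity,       # e.g. "production" or "analysis"
        "flow": {"src": src, "dst": dst},
    }
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(json.dumps(payload).encode("utf-8"), collector)

# Example (hypothetical endpoints); uncomment to send a marker:
# send_firefly("CMS", "production", "xfer01.site-a.example", "xrootd.site-b.example")
```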

7 of 50

What are the funded projects in this area (nationally? Internationally?)

Below is the reverse of the prior slide – mapping of existing coordinated projects to R&D activities.

  • SENSE (ESnet/DOE): Engineered network path.
  • ESnet6: Packet marking & monitoring
  • IRIS-HEP:
    • Scaling of existing solutions.
    • Packet marking.
    • Engineered network path.
    • Demonstrating capabilities inside the network
  • Facility activities (CREST FNAL): Technology evaluations across the community
  • FAB/FABRIC: Demonstrating capabilities inside the network
  • BNL LDRD: Introducing storage specialization for workflows
  • HEP CCE: IOS activities have overlap with many of the topics in “other”
  • Unknown / Needs to be explored:
    • Unclear what other ASCR activities exist around complex WAN data flows like those of the HL-LHC.
    • Intersection with the DOE’s active projects around I/O (particularly ADIOS).
    • Need engagement with the Integrated Research Infrastructure activity.

8 of 50

What are possible blueprints and existing forums for discussion?

  • Site-level design and requirements workshop: Based on the existing documents (WLCG Network Challenge doc, ESnet Requirements Review), start filling in more details for facilities that need to plan their hardware evolution.
  • Dedicated US-DC23 planning workshop: Identify the US resources to use for DC23, develop a more concrete set of goals for the US resources, and plan any intermediate mini-challenges.
  • Unknown: How do we engage with DOE facilities around IRI (or other ASCR projects)?
    • After their report comes out we should contact Ben and ask how we engage in a discussion of next steps

9 of 50

What should this topic look like in 3 years? How do we know whatever solutions we have will work once deployed?

The organizing checkpoint for this activity is DC23 (~18 months from now). Goals for this challenge should include:

  • Complete pre-challenge activities showing technologies are ready on identified testbed (non-WLCG) sites.
  • Demonstrate the prescribed WAN data rates to existing WLCG facilities.
  • Have CMS utilize SENSE-engineered network paths with production facilities.
  • Have packet marking widely deployed (ATLAS & CMS, T1s & T2s) and show the ability to monitor flows.
  • Utilization of token-based authorization at storage endpoints.
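To make the token-based authorization goal concrete, below is a minimal sketch of accessing an HTTPS/WebDAV-style storage endpoint with a bearer token. The endpoint URL is a placeholder and the token handling is simplified; real deployments obtain scoped tokens from the experiment’s token issuer.

```python
# Minimal sketch of token-based access to an HTTPS/WebDAV storage endpoint.
# The endpoint URL below is a placeholder, and the token is read from an
# environment variable purely for illustration.
import os
import requests

def stat_remote_file(url: str, token: str) -> int:
    """Issue an authenticated HEAD request and return the HTTP status code."""
    resp = requests.head(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
    return resp.status_code

if __name__ == "__main__":
    # In a real deployment the token comes from the experiment's token issuer
    # (e.g. obtained with oidc-agent or htgettoken) with the appropriate scopes.
    token = os.environ.get("BEARER_TOKEN", "")
    try:
        status = stat_remote_file(
            "https://storage.example.org:1094/store/user/some_file.root",  # placeholder
            token,
        )
        print("HTTP status:", status)
    except requests.RequestException as err:
        print("Request failed (placeholder endpoint):", err)
```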

Based on DC23 outcomes, additional activities may include

  • Next round of scaling improvements (moving from 30% -> 60% scale).
  • Broader utilization of SENSE-engineered network paths.

10 of 50

Morning I - Parallel 2: RSEs in the field

(Google Doc Discussion)

11 of 50

Topic Description

Discussion mainly focused on three topics:

  • Integrated RSE / grad student / postdoc teams
    • Finding RSEs, RSE career advancement
  • Coordinating different training efforts
    • Training for the skills that we want in students and RSEs
  • Emphasis on DEI and recruitment
    • Ensuring that engagement with and recruitment of RSEs has DEI clearly in mind

12 of 50

Summary of the Discussion

  • Integrated RSE / grad student / postdoc teams
    • RSEs can be an accelerator for research
    • Benefits come from tight integration of students/postdocs with RSEs
      • Example: GPU programming
    • Having a centralized field pool of RSEs could be advantageous, as not all institutions are going to be able to have their own pool.
  • Coordinating different training efforts
    • HSF has training modules for software engineering skills that we want in the field
    • Other initiatives funded to develop material; require coordination to avoid duplicated efforts; require input from users about prioritization (HSF “State of Training” survey in preparation)
    • Central Entry Point project to organize HEP Software materials from different organizations
  • Emphasis on DEI and recruitment
    • Getting RSEs into the field (postdoc conversion, university hires groups)
    • The RSE field is maturing and self-organizing as well, so we can engage with them directly on some of these questions
    • Be thinking about diversity when engaging with RSE community
      • Hiring not just the same types of people that already exist
    • Need to raise awareness of Code of Conduct handling / violation reporting and of inclusion practices

13 of 50

What is the current state?

Integrated RSE teams

  • Much of the RSE-like work is performed by postdocs or physicists who have transitioned into software-engineering-like roles
  • Some examples of universities with centers / institutes that employ RSEs who work on HEP projects (funded via grants or the ops program).
  • The RSE community is also self-organized and targeting these ideas itself

Training

  • Historically, training resources have been distributed but disparate and static (e.g. a set of slides on a webpage).
  • Now have HSF Training, which is beginning to unify training resources
    • Resources are websites + assets generated from version-controlled Markdown

DEI

  • Multiple trainings exist from different orgs (e.g. FNAL has upcoming trainings)
  • General agreement in the session that the field needs strong support for DEI training and, practically, for how to have and enforce a Code of Conduct

14 of 50

What are the funded projects in this area (nationally? Internationally?)

Integrated RSE teams:

  • Several of our large grants and the Ops programs support RSEs in some way
  • There are professional organizations (US-RSE)
  • There are philanthropically funded efforts that support RSE groups (e.g. S.F.)

Training:

  • IRIS-HEP, CODAS-HEP, DOE Traineeship, Energy Frontier (LPC, HATS)

DEI:

  • Everything implicitly? We talked about new DOE requirements.

15 of 50

What are possible blueprints and existing forums for discussion?

Integrated RSE teams:

  • Federated pool of RSEs at universities and labs that our community can tap into
    • Schmidt Futures (U Washington), M2Lines (Earth System Modeling), NCSA, Cornell, UCSD, Princeton, Notre Dame

Training:

  • Tulika mentioned “obvious next steps”
  • Follow-up to the HSF “State of Training” survey; “Central Entry Point” project (similar to learn.astropy.org) that lists all training material in an easy-to-filter way

DEI:

  • Challenges associated with how we report / respond to CoC complaints & violations: multiple institutes, multiple countries, legal issues, HEP power structures that cross institutes.
    • If HEP could share input (challenges, needs, suggestions) with universities and institutes, it would be valuable
    • A3D3 placed large emphasis on reporting workflows

16 of 50

What should this topic look like in 3 years?

Integrated RSE teams:

  • Field-accepted understanding of resources and methods for procuring RSEs for projects

Training:

  • HSF Training continues to grow and support resources for the community

DEI:

  • Formal documents of support for an enhanced focus on training, to ensure relationships with RSEs are positive and successful
  • Unified Codes of Conduct and frameworks for reporting violations
    • Global issue that is being targeted across fields, so should learn from other experiences
  • Have training be viewed as a norm, similar to workplace safety training

17 of 50

How do we know whatever solutions we have will work once deployed?

Integrated RSE teams:

  • Continual engagement: Can continue to engage with university orgs and other communities that already exist

Training:

  • Training is already being used, so deployment is already happening, but can monitor engagement and contributions

DEI:

  • Not a HEP specific problem, so can engage with existing frameworks that are used across fields and industry
    • e.g. PyData Conference CoC (based on NumFOCUS)

18 of 50

Morning II - Parallel 1: AI/ML Coordinated Ecosystem

19 of 50

Topic Description

Observation:

  • AI Institutes and HDR Institutes did not exist during the last version of this meeting

Guidance:

  • Focus not on AI/ML techniques and opportunities, but on coordination, given all the current activities and anticipated future investments

20 of 50

Summary of the Discussion

Evolution from this period of rapid R&D, prototyping, and bespoke solutions for deployment to a more mature / established set of practices for ML in various contexts (e.g. trigger, reconstruction, simulation, analysis)

21 of 50

What are the funded projects in this area (nationally? Internationally?)

See talks from Monday afternoon:

  • NSF AI Institutes (e.g. IAIFI)
  • NSF HDR institutes (e.g. A3D3)
  • AI 4 HEP awards, etc.
  • Misc. AI/ML awards, e.g. FAIR4HEP
  • Lab efforts: e.g. Fermilab’s Computational Science and Artificial Intelligence Directorate (CSAID)
  • International: ELLIS, Punch4NFDI, etc.
  • Some component of many base grants
  • Projects: SONIC, …
  • This list is not complete, see talks from Monday

22 of 50

What is the current state?

Lots of activity:

  • Prototyping / exploratory research is often done outside of the experiments (on fast simulation, etc.). This leaves a significant gap to integrate results into the experiments.
    • Gap depends on where it will be integrated (trigger, reco, simulation, analysis).
    • Generally there are differences between the development workflow (e.g. creating training data sets, infrastructure used for training & evaluation) and what is used in production (which is still evolving / maturing; there is no equivalent to “FastJet”)
  • Generally very fragmented. Some efforts to organize benchmarks / challenges.
    • Some open data to support R&D, but generally lacking in some way (complexity, realism, volume)
  • Progress is very fast: the network architecture / approach being used today is likely to be supplanted by a new state-of-the-art approach next year.

Not (yet) much use of “foundation” models, unlike the pattern seen in other fields

  • Influences usage patterns, requirements, etc. in various ways

23 of 50

What are possible blueprints and existing forums for discussion?

Instead of constantly replacing ML components with a new state-of-the-art approach, we will likely move into a mode where the ML components used in production stabilize.

  • However, we will still need to retrain them to adjust to new running conditions, re-calibration of inputs, etc.
  • Points to a new pattern where the training and deployment need to be automated.
  • Resonates with themes around reproducibility, automated scientific workflows, version control, etc. Also resonates with AI/ML themes around “distribution shift” (see the sketch after this list).
  • Also points to need for significant computing resources for training as part of operations program.
    • This led to connections to expenditures connected to NAIRR (https://www.ai.gov/nairrtf/) and the question of whether HEP is ready / well positioned to partner with computing centers to be a core science driver
  • Scope out a “Retraining Challenge”?
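To illustrate the automated-retraining pattern above, the sketch below monitors a single input feature for distribution shift against a training-time reference and flags when retraining should be triggered. The two-sample Kolmogorov–Smirnov test, the p-value threshold, and the synthetic data are illustrative choices, not a prescribed recipe.

```python
# Minimal sketch: monitor one input feature for distribution shift relative to the
# training-time reference and decide whether a retraining campaign is warranted.
# The p-value threshold and the synthetic data are placeholders for illustration.
import numpy as np
from scipy.stats import ks_2samp

def needs_retraining(reference: np.ndarray, current: np.ndarray,
                     p_threshold: float = 1e-3) -> bool:
    """Return True if the current feature distribution differs significantly from the reference."""
    result = ks_2samp(reference, current)
    print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.2e}")
    return result.pvalue < p_threshold

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=50_000)  # conditions seen in training
current = rng.normal(loc=0.2, scale=1.1, size=50_000)    # new running conditions (shifted)
if needs_retraining(reference, current):
    print("Distribution shift detected: schedule an automated retraining campaign.")
```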

24 of 50

What are possible blueprints and existing forums for discussion?

Infrastructure for supporting benchmarks and challenges

  • How can it be designed to shrink the gap between exploratory research and production?
    • E.g. could use similar services used in experiments (e.g. SONIC)
    • E.g. could use similar data formats
  • Can they be designed to address the same problem for multiple experiments (e.g. an ATLAS track, a CMS track, and an open data / simulation track)?
  • What effort around open data / open simulations would help (see link)?
    • Examples: common calorimeter simulation similar to what is used for tracking
    • Frank said OSG could host a mirror of CERN open data. It would be nice if identifiers were integrated.
    • Develop a package that wraps such datasets, e.g. `pip install torch-CERNOpenData` (see the sketch after this list)
  • ML community is embracing benchmarks / challenges, and HEP voices have already shown some leadership in this space.
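As a sketch of what a dataset-wrapping package in the spirit of the hypothetical `torch-CERNOpenData` might look like, the snippet below exposes events through a PyTorch Dataset. The class name, record contents, and the in-memory stand-in for downloaded open-data files are all assumptions for illustration.

```python
# Sketch of a PyTorch Dataset wrapper for open-data events, in the spirit of a
# hypothetical `torch-CERNOpenData` package. A real package would download and
# cache files from the CERN Open Data portal; here synthetic arrays stand in.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class OpenDataEvents(Dataset):
    def __init__(self, n_events: int = 1000, n_features: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Stand-in for event-level features read from downloaded ROOT/parquet files.
        self.features = rng.normal(size=(n_events, n_features)).astype(np.float32)
        self.labels = rng.integers(0, 2, size=n_events).astype(np.int64)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx):
        return torch.from_numpy(self.features[idx]), int(self.labels[idx])

# Usage: iterate over mini-batches exactly as with any other PyTorch dataset.
loader = DataLoader(OpenDataEvents(), batch_size=64, shuffle=True)
features, labels = next(iter(loader))
print(features.shape, labels.shape)
```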

Does the “challenge” infrastructure include the training or just evaluation?

  • Could incorporate the continual retraining theme from previous slide

25 of 50

What should this topic look like in 3 years?

  • HEP is identified as a core science driver and has partnerships with computing centers that are responding to funding calls as part of the NAIRR TF’s recommendations
  • HEP has identified various projects for HL-LHC that have topics / challenges that intersect with other science drivers and are of interest to industry (making us well positioned to respond to NSF TIPS)
  • Users can do something like `pip install torch-CERNOpenData`, and there is a work plan to develop infrastructure for benchmarking that integrates some of the same components / services used in production environments
  • “Retraining challenge” defined and underway (not completed)

26 of 50

How do we know whatever solutions we have will work once deployed?

Baseline is the current development → deployment → results process

  • Emphasis is to streamline this process (and not stifle innovation in the process)
  • Reduce bespoke components, increase adoption of common patterns
  • Suggests metrics:

A “Retraining challenge” should be defined by the community (similar to the Analysis Grand Challenge and the Data Challenge), so this should be baked in to some level

  • Should provide input that can inform computing model as ML is increasingly used in trigger, reco, simulation, and analysis (e.g. GPU needs for retraining as part of operations)

Something analogous to SSL as a testbed for approaches that use FPGAs

27 of 50

Morning II - Parallel 2: Resources & Coordination

28 of 50

Topic Description

  • Resource models have done their job in helping us communicate that we have a problem with HL-LHC resources

How do we go beyond that? What other “plots” do we need to produce?

For example: cost estimate, energy/power/carbon footprint estimates

  • Catalogue current HL-LHC activities following the categories formulated during the 2019 CUA blueprint meeting (see notes)

29 of 50

Summary of the Discussion

Given the projected resource shortage, we will need HPC resources to complement the T1/T2 sites

To make a stronger case with ASCR and NSF HPC centers, we need to go beyond “we are running x% of our production on your system”

HPCs like to know that we are running unique workflows on their hardware that enable us to increase our physics reach. Some ideas:

  • Run major production campaigns on a single facility (e.g. an EOY production)
  • Run a “data reduction facility” on an HPC, exploiting their new data-intensive & AI/ML friendly hardware capabilities.
    • Allow analysis groups to customize their analysis derivation formats
    • Possible connection with the HEP object store investigation at FNAL (US CMS) and with CCE
    • Find out possible connections with similar data reduction activities in other domains (light sources, geospatial,...)

30 of 50

What are possible blueprints and existing forums for discussion?

CompHEP (via HEP-CCE?) could sponsor a requirements gathering step inspired by the ESnet requirements process (on a smaller scale!), perhaps starting from the joint US ATLAS/US CMS opportunities computing document currently in draft format

31 of 50

What is the current state?

32 of 50

What are the funded projects in this area (nationally? Internationally?)

33 of 50

What should this topic look like in 3 years?

34 of 50

How do we know whatever solutions we have will work once deployed?

35 of 50

Afternoon I - Parallel 1: Analysis Facility & Evolution of Facilities

36 of 50

Topic Description

  • Making GPUs accessible to users
    • GPUs available to CMS and ATLAS are not widely used in day-to-day analysis
    • ATLAS experience: BNL provides access to batch GPUs and interactive GPUs through jupyter, not much uptake
    • CMS users know that GPUs are useful especially with respect to ML, but tend to use cloud resources for purposes like training, etc.
    • How do we get people to effectively make use of these resources?
  • Facility Evolution at T1s and T2s
    • What do T1s and T2s need to do to evolve into the HL-LHC era?
    • How do T1s and T2s evolve in the presence of an analysis facility?
  • Analysis Facility Evolution over the next 5 years
    • When and how do we solidify the definition of an analysis facility so it is a well-defined concept?
    • How do we ensure that users are able to make effective use of analysis facilities and their software?

37 of 50

Summary of the Discussion

  • Making GPUs accessible to users
    • Users tend to find GPUs difficult to use realistically in analysis
    • First experiences with GPUs predominantly through environments that significantly ease access (colab)
    • Need to disseminate information and set expectations with users to improve adoption
  • Facility Evolution at T1s and T2s
    • Need to understand how T1s and T2s evolve in the face of analysis facilities
    • Do we share resources with analysis facilities (is it needed?), what is the support model?
    • How are resources exposed in these different contexts consistently?
  • Analysis Facility Evolution over the next 5 years
    • Approaches to analysis facility are widely varied between ATLAS and CMS and within different groups (software focus vs. hardware focus predominant)
    • How do we release users from the confines of their experiment frameworks to get things done?
    • Documentation in the form of “user guides” (worked-out examples with explanation) is sorely missing
      • Physics debugging experience pointed out as extremely varied
    • A critical missing piece is the user interface for accessing (and caching) missing analysis specific data

38 of 50

What is the current state?

  • Making GPUs accessible to users
    • Many GPUs available on OSG, cms connect, at T2s and T1s, and various AF prototypes
    • GPU use is predominantly interactive (ML training) and single GPU
    • Most of the GPU usage on OSG is from IceCube (3M / 3.7M GPU-hours, photon transport model)
  • Facility Evolution at T1s and T2s
    • Most analysis facility work is being done next to T2s and T1s (in the US), but it is not widespread
    • Access/authentication to resources is currently being moved to more modern designs
    • GPU usage in offline production is very limited, often restricted to tests, with little operational experience
  • Analysis Facility Evolution over the next 5 years
    • GPU usage in analysis facilities is in the very early prototype phase but demonstrated to be useful
    • User experience is inconsistent but support is improving
    • What services can be deployed at which sites, while keeping a relatively uniform interface, is starting to be understood
    • Scope of analysis lifecycle at AF, and the scope of AFs themselves still being defined.

39 of 50

What are the funded projects in this area (nationally? Internationally?)

  • VO Resource Provisioning: CMS Connect, OSG Connect
  • Analysis Facilities Projects: Coffea-Casa, EAF@FNAL, AF@BNL
  • Inference as a Service: SONIC
  • Training for AFs in particular: None
  • Data analysis tools: awkward-array, uproot, RDF

40 of 50

What are possible blueprints and existing forums for discussion?

  • Blueprints
    • Dynamic resource provisioning between AFs and existing T1/T2 sites
    • Creating a user-facing data extraction facility for exascale datasets with ServiceX
      • Composable / extensible NanoAOD w/ dynamic joins
    • Creating a user guide for next-generation analysis tools
    • Defining the analysis lifecycle with HL-LHC tools
    • Effective patterns for GPU usage adoption in HEP analysis
  • Existing Fora
    • OSG / LCG
    • CMS / ATLAS S&C
    • IRIS-HEP / HEP-CCE

41 of 50

What should this topic look like in 3 years?

  • We understand well what the viable concept(s) of an analysis facility is(are) and we’re all on the same page.
  • Implementation of FAIR principles at AF?
  • GPU usage in our compute infrastructure should be approachable for new students but used when needed (say 20% of analysis routinely use GPUs?)
  • Production-level implementations of analysis facilities with significant knowledge bases
  • A prototype system of caches with policies reasonable for analysis-level use should be readily available
  • Users should be able to cache analysis specific data (self-generated or from heavier data-tiers) in this system of caches

42 of 50

How do we know whatever solutions we have will work once deployed?

  • Users are using analysis facilities and find the experience better than using resources directly
  • There is a user-driven contention for GPU resources
  • There are > 2 analysis facilities deployed hosting > 50 users each, with routine use
  • Users are able to easily derive custom additions to thin, analysis-oriented data tiers (see the sketch below)
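As an illustration of that last point, the sketch below uses awkward-array to attach a user-derived, per-event quantity to a thin, NanoAOD-like jet collection. The field names and values are synthetic; in a real analysis facility workflow the events would be read with uproot and the derived column cached alongside the input data tier.

```python
# Sketch: derive a custom per-event quantity (HT = scalar sum of jet pT) and attach
# it to a thin, NanoAOD-like record. Field names and values are synthetic; in practice
# the events would be read with uproot and the new column cached at the facility.
import awkward as ak

events = ak.Array({
    "Jet_pt":  [[55.2, 32.1, 20.4], [75.0], [40.3, 25.7]],
    "Jet_eta": [[0.3, -1.2, 2.1],   [0.0],  [-0.5, 1.8]],
})

# User-derived column: per-event HT from the jet collection.
ht = ak.sum(events["Jet_pt"], axis=1)
events = ak.with_field(events, ht, where="HT")

print(events["HT"].tolist())  # approximately [107.7, 75.0, 66.0]
```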

43 of 50

Afternoon I - Parallel 2: GPUs and Algorithms

Google doc

44 of 50

Topic Description

  • Algorithms - coordination, future, plans?
    • What do we do to speed this development up?
    • What new things should we be doing?
    • Coordination?
  • Making GPUs accessible to users
  • Portability Frameworks

45 of 50

Summary of the Discussion

  • Progress in the last few years related to GPU programming, offloading algorithms to GPUs, portability studies.
  • GPU expertise in HEP is limited (handful of experts)
  • Important to speed up the development
  • For this, we need to build teams that include GPU programming experts, domain experts (e.g. reco experts) and newcomers who want to learn
    • grow the pool of experts for the future
  • HEP-CCE and IRIS-HEP should play a role in helping train the next generation of parallel computing experts
    • In addition to participating in projects that design GPU-based algorithms, portability studies
    • Explore co-funding with experimental/operations programs and leverage traineeship programs
  • Other items:
    • Useful to have standardized testbeds/setups (take advantage of what is available at the labs, experimental GPU clusters…)
    • Important to have agreement on how to benchmark performance (see the timing sketch after this list)
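On the benchmarking point, even basics such as how kernels are timed need agreement. The sketch below shows one common pattern (warm-up, then explicit device synchronization before reading the clock) using PyTorch; the matrix multiply is just a stand-in for a real reconstruction kernel, and the sizes are arbitrary.

```python
# Minimal sketch of timing a GPU workload fairly: warm up first, then synchronize
# before starting and stopping the clock. The matmul is a stand-in workload.
import time
import torch

def time_matmul(n: int = 4096, repeats: int = 10) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    # Warm-up: the first call pays one-time kernel/library initialization costs.
    torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels are asynchronous; wait before stopping the clock
    return (time.perf_counter() - start) / repeats

print(f"average time per matmul: {time_matmul():.4f} s")
```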

46 of 50

What is the current state?

Only a handful of experts within HEP-CCE, IRIS-HEP, ATLAS and CMS have sufficient knowledge to program algorithms on GPUs

This needs specialist GPU knowledge and expertise with the experimental software frameworks.

Also need to understand the algorithms or work closely with people who do (“understand/speak the language”)

Need continuity since projects are not short-term

47 of 50

What are the funded projects in this area (nationally? Internationally?)

HEP-CCE, IRIS-HEP, SciDAC, LHC experiments’ Operations programs, DOE computational traineeship programs, Research programs…

48 of 50

What are possible blueprints and existing forums for discussion?

  • Create an ecosystem or community that brings people from different experiments together on a regular basis
  • Should include a combination of blueprints/workshops, Slack channels, hackathons and traineeship programs
    • Important to ensure continuity i.e. people do not disperse and lose contact
  • Create a forum that brings together IRIS-HEP, HEP-CCE, Ops, traineeship programs and other interested parties…

49 of 50

What should this topic look like in 3 years?

We should have built a larger pool of experts and people who can program on GPUs

A larger fraction of the experimental code/algorithms should be ported to GPUs (need to define fraction)

50 of 50

How do we know whatever solutions we have will work once deployed?

  • Important to not “throw R&D code over the wall to the experiments and expect them to pick it up”.
    • Code should be integrated within the experiments’ codebases
  • Do we have a larger pool of experts in about 3 years? Are these experts getting recruited and hired by the labs, projects such as HEP-CCE, etc.?