1 of 10

U.S. CMS Operations Program R&D: Goals, Structure, Collaborative Nature

Lindsey Gray for the U.S. CMS S&C Operations Program

A Coordinated Ecosystem for HL-LHC Computing R&D

7 November 2022

U.S. CMS

Operations

Program

A Coordinated Ecosystem for HL-LHC Computing R&D - U.S. CMS S&C Ops Program

Lindsey Gray for the U.S. CMS S&C Operations Program, November 7, 2022

2 of 10

Opening points

Disclaimer

  • I am speaking for the U.S. CMS S&C Operations Program
  • I cannot speak for the whole U.S. CMS Collaboration
    • The S&C Operations Program has in the past enabled and driven new capabilities and guided the Collaboration
    • IRIS-HEP, HEP-CCE and other efforts are funding many projects that the U.S. CMS Collaboration is interested in
    • The S&C Operations Program will continue enabling the U.S. CMS Collaboration together with many partners

Introduction & Outline

  • Collaborations have been beneficial and symbiotic (see the coffea-related work)
  • The U.S. CMS S&C Operations Program regularly updates its HL-LHC R&D strategic plan for the agencies, organized around four grand challenges (term borrowed from IRIS-HEP)
    • GC1: Modernize physics software and improve algorithms
    • GC2: Build infrastructure for exabyte-scale datasets
    • GC3: Transform the scientific data analysis process
    • GC4: Transition from R&D to operations
  • Focus on future issues/concerns from the CMS side, and how partners can stay engaged
  • Forward-looking, focused on the eventual end game for HL-LHC
  • This talk: current USCMS-ops R&D effort levels in areas that are aligned with partner goals

3 of 10

Big Picture: Resource Projections into the HL-LHC Era

July 2022 Estimates for CPU, Tape, and Disk usage for the next 15 years

  • Each plot has a band showing a 10% (low) to 20% (high) year-over-year increase in resources
  • Blue lines show estimated usage with no R&D improvements (solid) and with likely R&D improvements (dashed)
  • Takeaways:
    • CPU use is most critical
    • Tape is going to be on the edge
      • More thought needed
    • Disk is probably ok
    • But, we have to actually do and implement the R&D projects that are underway/planned!
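As a rough illustration of what the band in the plots implies, a fixed 10% or 20% yearly increase compounds considerably over a 15-year window. The sketch below only does that arithmetic. The function name is hypothetical, the starting capacity is normalized to 1 (not a CMS number), and it assumes 15 compounding steps.

```python
def capacity_growth(annual_increase: float, years: int) -> float:
    """Return the capacity multiplier after compounding a fixed yearly increase."""
    return (1.0 + annual_increase) ** years


# Assumption: 15 yearly steps, starting capacity normalized to 1.
low = capacity_growth(0.10, 15)   # lower edge of the band
high = capacity_growth(0.20, 15)  # upper edge of the band

print(f"10%/year over 15 years: x{low:.1f}")   # x4.2
print(f"20%/year over 15 years: x{high:.1f}")  # x15.4
```

The spread between the two edges of the band (roughly a factor of 4 versus a factor of 15) is why the R&D gains matter: whether the estimated usage falls inside the band depends strongly on which curve is realized.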

4 of 10

Big Picture: Resource Projections into the HL-LHC Era

July 2022 Estimates for CPU, Tape, and Disk usage, showing representative fractions

  • Plots show 2031 situation with no R&D
  • Reconstruction algorithms are the main CPU driver
  • Followed by simulation (SIM, GEANT)

  • Raw formats will dominate disk usage
  • Adoption of smaller formats (NanoAOD) can have large impact on disk usage
    • Almost 90% of approved analyses already use NanoAOD (a critical difference from the ATLAS model)

5 of 10

On-going CMS R&D Efforts

Full high-level list from the U.S. CMS program:

  • Analysis Systems:
    • Software Tools (Awkward, Columnar Arrays, RDF incorporation, analysis infrastructure, visualization)
    • Hardware/System configurations (EAF@FNAL, Coffea Casa, + other flavors)
  • Computing and Software Infrastructure
    • Storage (Object Stores, Data Lakes, Archival Storage (CTA))
    • Compute Services (Token migration, K8s infrastructure)
    • HPC Integration and Development (Production workflows on LCFs, strategic infrastructure development)
    • Workflow Development (incorporation of heterogeneous resources)
    • Computing on Heterogeneous Resources (infrastructure development, SONIC, GPU workflows)
    • Advanced Networking (SENSE/Rucio, etc.)
  • Physics Algorithms
    • Adaptation of current algorithms to run on accelerators (GPU & FPGA, portability)
    • Advanced Algorithms (ML-based simulation and reconstruction techniques, new tracking algorithms)

6 of 10

Other R&D Areas that need attention: Networking

Network Capacity/Smart Networks/etc.

  • Recall last year’s ESnet Needs review
    • Estimated that 100 Gb/s average capability to sites will be necessary for “standard” operations, data transfers, etc.
      • Achievable by brute force/buying new hardware
    • 400 Gb/s burst speed for larger transfers
      • Even with new hardware, can we get this?
    • Role of intelligent networks?
      • Work in the ESnet/SENSE/Rucio collaboration to understand data transfer and management infrastructure performance and needs
    • Further collaborative work with partners for Data Challenges essential
  • Need to understand potential impact of GPU offloading/Inference as a Service models (e.g. SONIC)
    • Not included in the ESnet Needs review
      • Need to understand potential scope of deployment
      • Could be used in production workflows or analysis
    • Should, in principle, be low latency
      • Can’t wait days for a response from IaaS
    • Addressing these questions needs further development of GPU offloading and IaaS models for use in CMS
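To put the 100 Gb/s and 400 Gb/s figures above in terms of data volume, the following back-of-the-envelope sketch converts a sustained link rate into petabytes per day and into the time needed for a bulk transfer. The helper names and the 1 PB transfer size are assumptions for illustration only. It uses decimal units and ignores protocol overhead and link sharing.

```python
def petabytes_per_day(rate_gbps: float) -> float:
    """Data volume moved in 24 hours at a sustained rate in gigabits per second."""
    bytes_per_second = rate_gbps * 1e9 / 8
    return bytes_per_second * 86_400 / 1e15


def hours_to_transfer(petabytes: float, rate_gbps: float) -> float:
    """Hours needed to move `petabytes` at a sustained rate in gigabits per second."""
    bytes_per_second = rate_gbps * 1e9 / 8
    return petabytes * 1e15 / bytes_per_second / 3600


daily_volume = petabytes_per_day(100)     # average capability quoted above
burst_hours = hours_to_transfer(1, 400)   # hypothetical 1 PB transfer at burst speed

print(f"100 Gb/s sustained: {daily_volume:.2f} PB/day")      # 1.08 PB/day
print(f"1 PB at 400 Gb/s:   {burst_hours:.1f} hours")        # 5.6 hours
```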

2 FTE

2.5 FTE

7 of 10

Other R&D Areas: Heterogeneous Resources

Fully exploiting facilities with large accelerator resources requires:

  • Having applications that can run efficiently on (at minimum) different GPU flavors
    • Will require much more work in code/algorithm conversion to optimize for accelerators
    • What about training?
      • Having software engineers well-versed in GPU programming and the CMS framework could be very beneficial
        • Provide guidance to eager novices
        • Aid in further development of GPU-ready applications
        • Implement portability migration based on CCE recommendations
  • Having appropriate workflow management infrastructure
    • Must understand coprocessor needs for each workflow piece
    • Must understand how to discover site resources and match to jobs
    • Work ongoing to define aspects of the problem so that R&D can proceed
  • Potentially: having more fine-grained network and data management control
    • Move data off of large HPC sites mid-workflow?
    • Still defining aspects of the problem
  • Heterogeneous Platforms/Resources
    • Further development of CMS Framework, infrastructure such as SONIC
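The resource discovery and matching question above can be pictured with a toy example. This is a minimal sketch and not the CMS workflow management system: the `Site` and `WorkflowStep` classes, the site names, and the coprocessor labels are all hypothetical. A step is considered eligible for a site when every coprocessor it needs is advertised by that site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    coprocessors: frozenset  # hypothetical labels, e.g. {"nvidia-gpu"}


@dataclass(frozen=True)
class WorkflowStep:
    name: str
    needs: frozenset  # empty set means the step is CPU-only


def eligible_sites(step, sites):
    """Return names of sites that advertise every coprocessor the step needs."""
    return [site.name for site in sites if step.needs <= site.coprocessors]


# Hypothetical site inventory and workflow pieces.
sites = [
    Site("cpu-grid-site", frozenset()),
    Site("hpc-a", frozenset({"nvidia-gpu"})),
    Site("hpc-b", frozenset({"amd-gpu"})),
]
steps = [
    WorkflowStep("cpu_only_step", frozenset()),
    WorkflowStep("gpu_step", frozenset({"nvidia-gpu"})),
]

matches = {step.name: eligible_sites(step, sites) for step in steps}
for step_name, site_names in matches.items():
    print(step_name, "->", site_names)
```

In this toy model a step built for one GPU flavor cannot run at `hpc-b`, which illustrates why portability across GPU flavors widens the pool of usable sites.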

*Note: a significant fraction of our HLT algorithms are or will be running on GPU resources in our HLT farm. We plan to have production workflows using some GPU resources in 2023.

1.25 FTE

8 of 10

Other R&D Areas that need attention: AF Infrastructure

Desire for some Analysis Facilities to be production-ready in Run 3

  • Several “default” configurations exist, e.g. “coffea-casa” and the “elastic analysis facility”
    • This could be OK, but it makes generic support more difficult; we should try to share infrastructure across efforts
  • Need for broader user base
    • Support for ROOT RDataFrame (RDF) under development
    • Help from IRIS-HEP has been critical here
  • Working model to add columns to a CMS nanoAOD file not yet demonstrated
    • Although, much progress has been made with auxiliary services like ServiceX/Skyhook
    • Still developing user base with default service modes
  • Infrastructure not yet ready for production
    • Configurations seem to be fragile
    • Not enough experience operating these ecosystems
    • ServiceX usage in CMS is not widespread, whereas awkward and uproot are very widely used
      • Candidate for central service to add columns to CMS nanoAOD in AFs
      • Need more engagement
    • Rapid progress, though!
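To make the "add columns to a nanoAOD file" problem concrete, here is a toy sketch of the core bookkeeping: a derived column produced elsewhere must be aligned with the existing events before it can be attached. This is plain Python on a dict of lists, not ServiceX, Skyhook, or any CMS service. The function name, the `my_score` column, and all values are made up. Only the key columns `run`, `luminosityBlock`, and `event` follow NanoAOD naming.

```python
def add_column(events, name, values_by_event):
    """Return a new columnar table with `name` appended.

    `events` is a dict of equal-length lists (one list per column).
    `values_by_event` maps (run, luminosityBlock, event) to the new value.
    """
    keys = list(zip(events["run"], events["luminosityBlock"], events["event"]))
    missing = [key for key in keys if key not in values_by_event]
    if missing:
        raise KeyError(f"no value for {len(missing)} event(s), e.g. {missing[0]}")
    extended = dict(events)  # shallow copy; existing columns are left untouched
    extended[name] = [values_by_event[key] for key in keys]
    return extended


# Made-up events and a made-up derived quantity, delivered in a different order.
events = {
    "run": [1, 1, 1],
    "luminosityBlock": [10, 10, 11],
    "event": [100, 101, 200],
    "MET_pt": [35.2, 12.8, 60.1],
}
new_values = {(1, 10, 101): 0.91, (1, 11, 200): 0.12, (1, 10, 100): 0.47}

extended = add_column(events, "my_score", new_values)
print(extended["my_score"])  # [0.47, 0.91, 0.12]
```

A production service would additionally have to handle provenance, storage, and caching of such columns at scale, which is where the open questions listed above lie.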

2.6 FTE across coffea casa, EAF

9 of 10

Other R&D Areas that need attention: DOMA

Aspects of the exabyte-scale data ecosystem need to be fleshed out

Analysis Facility Related:

  • Storage R&D
    • Exploration of object stores for efficient retrieval of analysis-level data
      • Promising recent work with S3 + Ceph
  • Role of caches?
    • Will probably be critical in delivering data to Analysis Facilities
    • Central or regional?
    • Can we use OSG infrastructure? ESnet network caches?
      • How to guarantee data privacy?
    • Network caching studies underway
  • Caches will likely be part of larger “data lake” infrastructure
    • SoCal Cache studies
    • Larger-scale test beds need to be planned to demonstrate performance
      • Internal CMS R&D effort beginning work in this direction, further alignment with IRIS-HEP grand challenges beneficial
  • Analysis and Data Challenges will continue to be useful for benchmarking

1.5 FTE, not including tape

10 of 10

Summary & Outlook

  • Outlined R&D focus areas where USCMS R&D effort overlaps significantly with partner goals
  • Partner + USCMS R&D has been wildly successful in delivering useful products that are scalable to HL-LHC
    • awkward, uproot, hist, and the dask collaborations in particular
    • Hardware & infrastructure-level data lake research
  • We would like to collaborate further and round out the set of tools we’re developing
    • Utilization of heterogeneous computing in production workflows and analysis
    • Dynamic generation of user analysis data and corresponding caching
    • Data access tools and infrastructure that can deal with exascale datasets
    • Analysis Facilities that service a wide variety of user needs throughout the analysis life cycle
  • We hope that a future incarnation of IRIS-HEP will continue to align with our Grand Challenges and CMS R&D efforts
