1 of 10

SSL perspectives: workshop outcomes and action items

IRIS-HEP AGC workshop 2023

05.05.2023

Rob Gardner & lots of help from others!

Enrico Fermi Institute

University of Chicago

2 of 10

thanks for the contributions

  • Wei, Fengping, Doug, Ofer, Oksana

as this was shovelled together at the last minute, apologies for the incomplete coverage

3 of 10

Some initial thoughts

  • In addition to being a first point of contact for development releases and scalability tests, the SSL is about identifying models and patterns for provisioning infrastructure in a flexible manner
    • What we mean by "Facility R&D"
    • Which includes integration with dependent systems for context (e.g. caches, DDM, "backend" scheduling)
    • Leading up to the September showcase events, it seems we have some significant dev cycles for both Sx and Coffea
  • In this week's talks we've seen common service bundles (Sx, C-C) deployed in a number of facilities -- demonstrating the "common substrate" approach identified at the 2019 AS/SSL workshop at NYU (this is a significant departure from before)
  • This is further strengthened by GitOps innovations that allow reproducibility across facilities and collaborations, with some rigor too (principles discussed at the pre-CHEP2019 WLCG workshop on AS), the IRIS-HEP 2020 blueprint, HSF Ecosystems, ... (also significant)
  • AF designs, Challenges, and Demonstrators continue to attract outsize attention in our community (experiments' S&C, HSF, ...), and the diverse implementations we've seen across our facilities are providing valuable lessons
  • Keep in mind we are doing this concurrently with supporting Run 3 analyzers in production

4 of 10

Activities at the UC AF

  • Upgrade XCache network
  • Deploy NVMe Ceph
  • Prepare 100% local datasets on NVMe
  • Redo ServiceX benchmarks with improvements
  • Triton service delivery improvements
  • Test scaling to MWT2-UC cores with Coffea
  • Support new physics workflows with ServiceX

5 of 10

Activities at UC, continued

  • ServiceX dev releases on River
  • MLflow
  • XCache token auth
  • AGC benchmarking
  • Additional analysis benchmarks
  • ServiceX on FABRIC at CERN

6 of 10

BNL

  • Exploring deployments and integration with OKD
    • Obstacles imposed by VPN
  • Resources needed for object store output
  • Uncertain adoption path to JupyterHub for existing (command-line) users
  • Exploring Dask on busy, shared resources
  • Exploring A100 scheduling and fair share

7 of 10

Aside - modality of AF usage in US ATLAS (last 30 days)

8 of 10

SLAC AF

  • Prepare to migrate all AF users and envs to a new facility.
  • Testing and validating new ML and/or analysis containers built by ATLAS.
  • Testing methods to allow users to expand existing containers by themselves.

9 of 10

Updates at UNL: coffea-casa related items

  • Update coffea-casa to use latest z2jh chart version
  • Review coffea-casa images
    • Clean up and support more flavours: update CC7, add Alma8?
  • Publish Helm charts for coffea-casa
  • Look into Binderhub charts

10 of 10

Updates at UNL: preparing for AGC setup

  • Switch OD prod facility to use K8s HTCondor Helm chart
    • After testing, the CMS prod facility will be next to switch
  • Test new XCache deployment at CMS prod facility
  • Deploy MLflow, preferably with token integration
