1 of 20

OSG Collaborations

An Update from the Collaboration Support Area


PATh Staff Meeting

June 2022

Pascal Paschos

Area Coordinator for Collaborations

Enrico Fermi Institute, University of Chicago

2 of 20

Objectives of this presentation

  • Provide an update on select engagements and projects during the past 6 months
  • Collaboration support compiles monthly reports, found here. The bulk of the information is integrated into the monthly PATh report.

Contributions to Collaboration Support for this presentation from:

  • Lincoln Bryant: Hosted CE infrastructure & instances on IRIS-HEP SSL & storage
  • Judith Stephen: Rucio services, storage, and systems for SPT-3G & XENON
  • Farnaz Golnaraghi & David Jordan: infrastructure support at UChicago
  • Mats Rynge: OSG VO/OSPool, Pegasus for XENON & VERITAS, Snowmass21
  • Jason Patton, Diego Davila, Fabio Andrijauskas, Jeff Peterson, Jeff Dost, Derek Weitzel, Brian Lin, Igor Sfiligoi, Brian Bockelman, and others in operations, production, and software services
  • Also contributions from James Clark (IGWN), Kurt Strosahl, Wesley Moore & Bryan Hess (JLAB), Evan Shockley (XENON), and David Schultz (IceCube)


3 of 20

Overall Scope of Support


  • Collaboration support coordinates and facilitates service delivery between PATh support groups and collaborations
  • Supports 15 collaborations and virtual organizations (VOs) in
    • Compute, Storage, Network Infrastructure and Hosting Services
    • Scientific Workflows, Data management and Delivery
    • User and Site support
    • Planning and Documentation
    • Dashboards and monitoring
  • Freshdesk, Jira, and internal tickets; regular sync meetings; and technical publications
  • Refer to Ken Herner’s presentation for updates on Fermilab-supported collaborations (DUNE, Muon g-2, etc.)

4 of 20

Compute Usage (past 6 months)


  • ~40 million core hours in a mix of opportunistic and dedicated access; core-hour usage is shown by project name. Excludes collaborations supported by Fermilab (e.g., DUNE)
  • Some collaborations are not their own VO and report under the OSG VO from the Connect APs
  • Some projects run under two separate VOs (e.g., MOLLER, EIC)
  • Includes projects supported out of the Duke Connect AP

5 of 20

New collaborations


  • MOLLER, KOTO, SoLID, UHE-Neutrino Observatory Trinity, EUSO-SPB2
  • Access for MOLLER is via the JLAB and OSG APs; it is to become its own VO and join the collaborations supported jointly by JLAB & PATh staff.
  • KOTO, SoLID, Trinity Observatory and EUSO-SPB2 are/will be using the collab.ci-connect access point.
  • New policies for access
    • New collaborations will be supported until they mature to independence
      • Be their own VO
      • Have their own institutional APs, storage, and dedicated pools
    • Provisioned with a modest amount of stash storage based on needs
    • Emphasis on training and familiarization with OSPool while fostering some level of production

6 of 20

A visual introduction to the science


EUSO-SPB2 gondola

KOTO: CP-symmetry violation in rare kaon decays

MOLLER: parity-violating asymmetry in polarized Møller scattering

SoLID: large-acceptance forward-scattering spectrometer for JLAB’s 12 GeV beam (SIDIS, PVDIS & J/ψ)

Meson production experiments

7 of 20

PATh Infrastructure Updates


  • Completed relocation of existing Collab Support assets to a new data center
  • Meetings with Frank and Brian Bockelman to discuss infrastructure planning - for both RF (OSG Connect) and Collab Support goals
  • Desired capabilities:
    • Replacement for the 2015-era Stash cluster
      • Erasure Coded CephFS, multi-PB of usable storage.
    • High-performance Connect access points
      • Locally attached NVMe storage with XRootD origin for OSG Connect APs
      • Locally attached high capacity storage for Collab Connect APs
      • 100Gbps networking
      • Plus a testbed server for ongoing development
    • Kubernetes Platform for hosted services development and operation
      • Ongoing support for Harbor, HostedCEs, etc
  • RFQ submitted, awaiting vendor proposals

Contribution by Lincoln Bryant, Judith Stephen, David Jordan, Farnaz Golnaraghi

8 of 20

Bridge Infrastructure


  • Need to bridge the gap between Stash and the PATh Ceph cluster
    • Stash has been out of warranty for 2.5 years.
    • Unstable nodes; Ceph version upgrades are at a dead end (newer releases require EL8+)
    • We need a temporary contingency
  • Typhoon - a new Kubernetes cluster deployed with Rook on donated hardware
    • 5 storage nodes, 1 head node, 3 general Kubernetes workers (running MDS, Mon, Dashboard, etc.)
    • Rook with 2+2 erasure-coded CephFS (2 data + 2 parity chunks, so usable capacity is half of raw)
    • ~1PB usable
    • Also a testbed for proving out what we want to do with the new PATh hardware
    • We are syncing:
      • /collab (ongoing, 200TB so far)
      • /public (complete, with weekly incrementals)
  • Typhoon will itself be replaced once the new PATh equipment is in production

Contributions by Lincoln Bryant & Farnaz Golnaraghi

9 of 20

Collaboration impacting projects


  • OSDF performance monitoring
  • Informational Web pages on collaborations
    • Provide a geolocated infographic of the institutional membership and the access/entry points for each collaboration

10 of 20

OSDF monitoring


  • Provide end-to-end monitoring: cache and origin throughput, cache file access and performance, and more than 15 performance metrics
  • Of interest to collaborations that distribute data over CVMFS (e.g., IGWN, IceCube, REDTOP)
  • Challenges:
    • Detecting problems before users do (a minimal single-cache probe sketch follows this list)
    • Monitoring a worldwide network
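
A minimal sketch of the kind of cache-side check involved, assuming a hypothetical cache endpoint and public test object (an illustration only, not the production OSDF monitoring code):

  # Minimal sketch of a single-cache availability/throughput probe -- NOT the
  # production OSDF monitoring. The cache host and object path are hypothetical.
  import time
  import requests

  CACHE = "https://osdf-cache.example.edu:8443"     # hypothetical cache endpoint
  OBJECT_PATH = "/public/testuser/test_1GB.bin"     # hypothetical public test object

  def probe_cache(cache: str, path: str) -> float:
      """Download one object and return the observed throughput in MB/s."""
      start = time.monotonic()
      nbytes = 0
      with requests.get(cache + path, stream=True, timeout=60) as resp:
          resp.raise_for_status()
          for chunk in resp.iter_content(chunk_size=1 << 20):
              nbytes += len(chunk)
      elapsed = time.monotonic() - start
      return (nbytes / 1e6) / elapsed

  if __name__ == "__main__":
      try:
          print(f"throughput: {probe_cache(CACHE, OBJECT_PATH):.1f} MB/s")
      except requests.RequestException as exc:
          # Surfacing failures here is the point: catch them before a user does.
          print(f"cache probe failed: {exc}")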

Contribution by Fabio Andrijauskas

11 of 20

Tracking Institutions/Resources


  • Adding all collaborations with available data:
    • Institutions: author lists provided by collaboration representatives
    • Execution and Access Points from Topology; eventually, live data from jobs

Contribution by Cannon Lock

12 of 20

IGWN (LIGO et al)


  • Most HTCondor-CEs updated to OSG 3.6 or on 3.5 with updates upcoming; pools managed at the CERN gWMS
  • Bulk of support effort in the following areas (past 6 months)
    • CE/AP updates, pilot bearer tokens, IDToken exchanges with Schedds/Collectors
    • OSDF: new node at PSU, operationalizing the OzSTAR node, updating GaTech
    • Deploying CVMFS in user space for Compute Canada (CC)
    • Periodic validation of authenticated access to CVMFS (requires automating token fetching; see the sketch after this list)
    • Execution/Access Points dropping out of the pool - no single cause:
      • Mapping to service users in scitoken-mapfile
      • Request_Disk requirement for local batch jobs affected pilots
      • Frontier Squid services that are not registered with WLCG
      • Host certificates expired or missing required attributes
      • Failures in authenticated CVMFS access for either pilots or payloads
      • Sites disabled at the frontend and/or factory
      • Hitting file-descriptor (FD) limits on the CE (not restricted to IGWN)
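
A minimal sketch of what an automated check of authenticated CVMFS access might look like, assuming the repository's authorization helper honors WLCG bearer-token discovery; the token-fetching wrapper, repository name, and test path are hypothetical placeholders, not the actual IGWN tooling:

  # Hedged sketch of a periodic authenticated-CVMFS check. Assumes the repo's
  # authz helper follows WLCG bearer-token discovery (BEARER_TOKEN_FILE).
  # The token wrapper script, repository, and test path are hypothetical.
  import os
  import subprocess
  import sys

  TOKEN_FILE = "/tmp/cvmfs_check_token"                          # hypothetical location
  FETCH_TOKEN = ["/usr/local/bin/fetch-igwn-token", TOKEN_FILE]  # hypothetical wrapper
  TEST_FILE = "/cvmfs/secure.example.org/README"                 # hypothetical protected file

  def main() -> int:
      # 1. Fetch a fresh bearer token (a cron job or systemd timer would run this).
      subprocess.run(FETCH_TOKEN, check=True)
      os.environ["BEARER_TOKEN_FILE"] = TOKEN_FILE

      # 2. Try to read a small file from the protected repository.
      try:
          with open(TEST_FILE, "rb") as fh:
              fh.read(64)
      except OSError as exc:
          print(f"authenticated CVMFS check FAILED: {exc}", file=sys.stderr)
          return 1
      print("authenticated CVMFS check OK")
      return 0

  if __name__ == "__main__":
      sys.exit(main())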

Challenges in resolution due to limited access to a globally distributed infrastructure

Contributions by Jason Patton & James Clark

13 of 20

JLAB / BNL Collaborations


  • GlueX, CLAS12, MOLLER, EIC (JLAB+BNL)
    • As with IGWN, the bulk of support was on the transition to OSG 3.6 and tokenizing the infrastructure in the JLAB and GlueX pools
    • Expanded to 3 APs at JLAB to spread the load between VOs (+1 development node)
      • APs were affected by the auto-clustering bug at the OSPool CM
    • Completed access to the GlueX pool via the JLAB AP
      • GlueX VOMS server issues at UConn resolved - caused by a single-point hardware failure
    • Persistent problems with access to dedicated resources for GlueX and CLAS12 (Compute Canada, ScotGrid, Lamar)
    • As with IGWN, SciTokens mapfiles needed to be edited for service users other than the default
    • Conflicts between CLAS12 and GlueX on the JLAB APs and the JLAB CE (Periodic Release, CE configuration)
    • JLAB is standing up its own SciTokens issuer via CILogon

Contributions by Jason Patton, Kurt Strosahl & Wesley Moore

14 of 20

Snowmass, REDTOP, Icecube


  • Snowmass
    • Resumed submissions after the white papers were submitted
    • The new round of jobs will further increase the storage footprint on Stash (100TB at present); expect to follow the multiple-origin plan discussed two years ago with BNL and FNAL
  • REDTOP
    • Completed jobs on the OSPool in March; will resume in July
    • Collaboration Support contributed to the REDTOP white paper
    • Existing data is migrating to the bridge storage
  • IceCube
    • Production OAuth2 token-based file transfer mechanism in progress
      • HTTP-based transfers to DESY, replacing GSIFTP (see the sketch after this list)
    • APs can now submit jobs to the OSPool
    • Increased usage of Stash storage at CHTC and data distribution over OSDF
    • Successfully used Kubernetes-native provisioning for GKE resources
    • Studies of auto-scaling HTCondor pools and GPU accounting accepted at PEARC
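
As a minimal illustration of the mechanism (a hedged sketch, not IceCube's production transfer code), the snippet below performs a bearer-token-authorized HTTPS/WebDAV upload of the kind that replaces GSIFTP; the endpoint URL, token path, and file names are hypothetical:

  # Hedged sketch of an OAuth2 bearer-token authorized HTTPS (WebDAV) upload.
  # Endpoint, token path, and file names are hypothetical placeholders.
  import pathlib
  import requests

  TOKEN_FILE = pathlib.Path("/tmp/transfer_token")                 # hypothetical
  LOCAL_FILE = pathlib.Path("run_001234.i3.zst")                   # hypothetical
  REMOTE_URL = "https://webdav.example-desy.de:2880/icecube/run_001234.i3.zst"  # hypothetical

  token = TOKEN_FILE.read_text().strip()
  headers = {"Authorization": f"Bearer {token}"}

  # WebDAV uses plain HTTP verbs: PUT to upload, GET to read back, DELETE to remove.
  with LOCAL_FILE.open("rb") as fh:
      resp = requests.put(REMOTE_URL, data=fh, headers=headers, timeout=600)
  resp.raise_for_status()
  print("upload OK:", resp.status_code)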

Contributions by David Schultz & Igor Sfiligoi

15 of 20

SPT-3G


  • Typical requests in user support and storage management
    • On average SPT-3G uses 95% of available storage
    • Additional storage provisioned at Argonne (500TB), in addition to the dCache mounted on their APs and allocations at NERSC and UChicago HPC storage
  • APs have not yet been updated to OSG 3.6; the update was delayed by missing job history uploads to GRACC (now fixed) and will be completed in June
  • Jobs require GFAL2 GridFTP support and still need to use osg-software 3.5 in OASIS; this will need to be fixed in the future (see the transfer sketch after this list)
  • Dedicated access to the Illinois Campus Cluster
    • Issues with the hosted entry point have been resolved; the collaboration resumed jobs there via the OSPool onto whole-node pilots (at present hitting FD limits on the CE)
    • The addition of two 1TB-memory nodes requires a separate entry in the CE for high-memory scheduling (similar to CPUs/GPUs)
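
To make the GFAL2 dependency concrete, here is a hedged sketch of the kind of GridFTP copy the jobs still rely on, using the gfal2 Python bindings; the source and destination URLs are hypothetical placeholders:

  # Hedged sketch of a GFAL2 GridFTP copy of the kind SPT-3G jobs still depend
  # on (hence the need for osg-software 3.5 in OASIS). URLs are hypothetical.
  import gfal2

  SRC = "gsiftp://gridftp.example.edu/spt3g/raw/obs_12345.g3"   # hypothetical source
  DST = "file:///scratch/obs_12345.g3"                          # hypothetical destination

  ctx = gfal2.creat_context()
  params = ctx.transfer_parameters()
  params.overwrite = True     # replace an existing destination file
  params.timeout = 1800       # per-transfer timeout in seconds
  params.nbstreams = 4        # parallel GridFTP streams

  ctx.filecopy(params, SRC, DST)
  print("copy complete")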

Contributions by Judith Stephen, Sasha Rahlin, Jeff Peterson and Brian Lin

16 of 20

XENON


  • Added an additional XENON RSE hosted on NSDF storage at SDSC
    • Ran into a lack of support for XRootD third-party copy (TPC); switched to using the HTTPS plugin instead
  • Added a WebDAV door at UChicago and HTTPS support for FTS
  • Enabled multi-hop transfers in Rucio to facilitate transfers between sites with incompatible protocols (see the rule sketch after this list)
  • Adding a WebDAV door to the XENON data origin (LNGS DAQ)
  • Uploaded missing Gratia job history from the AP; update to OSG 3.6 in June
  • Persistent issues:
    • Large numbers of small-file transfers running storage sites out of inodes
    • Large numbers of small files also limit access performance on tape systems
    • Tape-to-tape transfers are very slow, limiting progress
    • Rucio is susceptible to backlogs when large numbers of requests are queued or fail
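
For context on how these transfers are driven (a hedged sketch, not the XENON production tooling): data placement is expressed as Rucio replication rules, and with multi-hop enabled the server can route a rule's transfer through an intermediate RSE when the source and destination protocols are incompatible. The scope, dataset, and RSE names below are hypothetical:

  # Hedged sketch of requesting a replica with the Rucio client API. With
  # multi-hop enabled server-side, Rucio can satisfy the rule via an
  # intermediate RSE when source and destination protocols are incompatible.
  # Scope, dataset, and RSE names are hypothetical placeholders.
  from rucio.client import Client

  client = Client()
  rule_ids = client.add_replication_rule(
      dids=[{"scope": "xnt_sr1", "name": "raw_records-abc123"}],  # hypothetical DID
      copies=1,
      rse_expression="SDSC_NSDF_USERDISK",                        # hypothetical RSE
      lifetime=30 * 24 * 3600,                                    # optional lifetime in seconds
  )
  print("created rules:", rule_ids)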

Contributions by Judith Stephen, Brian Bockelman & the Rucio Support Team

17 of 20

XENON Rucio Distributed Storage

17

Contributions by Judith Stephen & Ilija Vukotic

18 of 20

KOTO


  • KOTO is an active HEP experiment (rare Kaon decays) at the KEK laboratory in Japan with participating US institutions
  • The collaboration has so far run only on the KEK-CC HPC cluster - both data-heavy reconstruction workflows and simulations
  • Interest in moving simulations to the OSG to achieve high rate and volume
  • Will use collab AP and stash storage
  • Software stack containerized, deployed in the singularity repo and tested
  • Challenges encountered
    • Expectations on completion time - from the initial pull request to availability in the Singularity repo
    • Documentation for running jobs directly using Docker containers - conversion to Singularity happens at runtime, requiring adjusted storage & memory requests (see the submit sketch after this list)
    • The strong recommendation to use OSGVO base images can conflict with software building requirements
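
A hedged sketch of what a container-based OSPool submission could look like via the HTCondor Python bindings, illustrating why disk and memory requests need headroom for the runtime Docker-to-Singularity conversion; the image path, executable, and resource numbers are hypothetical, and the container attribute conventions should be checked against current OSPool documentation:

  # Hedged sketch of a container-based OSPool submission via the HTCondor
  # Python bindings (recent API with Schedd.submit). Image path, executable,
  # and resource numbers are hypothetical placeholders.
  import htcondor

  submit = htcondor.Submit("""
  executable        = run_koto_sim.sh
  arguments         = $(Process)
  +SingularityImage = "/cvmfs/singularity.opensciencegrid.org/koto/simulation:latest"
  # Runtime Docker->Singularity conversion needs sandbox headroom, so the
  # disk and memory requests are set above the bare simulation needs.
  request_cpus      = 1
  request_memory    = 4GB
  request_disk      = 10GB
  output            = sim.$(Cluster).$(Process).out
  error             = sim.$(Cluster).$(Process).err
  log               = sim.log
  """)

  schedd = htcondor.Schedd()
  result = schedd.submit(submit, count=10)   # submit 10 simulation jobs
  print("submitted cluster", result.cluster())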

Contributions by Mats Rynge & Brian Lin

19 of 20

Summary


  • The past 6 months were dominated by the transition to using tokens and the support items it created
  • Initiated projects that facilitate information sharing, both internally and user/collaboration facing - infographics, monitoring, and dashboards
  • Continue to balance across a diverse range of engagements - working with users & local admins, syncing with stakeholders, and maintaining internal coordination
  • The refresh of the Connect infrastructure under PATh is underway
  • Onboarding new collaborations onto Collab Connect infrastructure under policies that aim to preempt open-ended engagements. This is less of a concern when scaling up the number of collaborations with institutionally supported infrastructure, and even less when there are dedicated support teams on their end.

20 of 20

Questions?
