1 of 20

OSG Collaborations

An Update from the Collaboration Support Area


PATh Staff Meeting

June 2022

Pascal Paschos

Area Coordinator for Collaborations

Enrico Fermi Institute, University of Chicago

2 of 20

Objectives of this presentation

  • Provide an update on select engagements and projects during the past 6 months
  • Collaboration support compiles monthly reports, found here. The bulk of the information is integrated into the monthly PATh report.

Contributions to Collaboration Support for this presentation from:

  • Lincoln Bryant: Hosted CE infrastructure & instances on IRIS-HEP SSL & storage
  • Judith Stephen: Rucio services, storage, and systems for SPT-3G & XENON
  • Farnaz Golnaraghi & David Jordan: infrastructure support at UChicago
  • Mats Rynge: OSG VO/OSPool, Pegasus for XENON & VERITAS, Snowmass21
  • Jason Patton, Diego Davila, Fabio Andrijauskas, Jeff Peterson, Jeff Dost, Derek Weitzel, Brian Lin, Igor Sfiligoi, Brian Bockelman, and others in operations, production, and software services
  • Also contributions from James Clark (IGWN), Kurt Strosahl, Wesley Moore & Bryan Hess (JLAB), Evan Shockley (XENON), and David Schultz (IceCube)


3 of 20

Overall Scope of Support


  • Collaboration support coordinates and facilitates service delivery between PATh support groups and collaborations
  • Supports 15 collaborations and virtual organizations (VOs) in
    • Compute, Storage, Network Infrastructure and Hosting Services
    • Scientific Workflows, Data management and Delivery
    • User and Site support
    • Planning and Documentation
    • Dashboards and monitoring
  • Freshdesk, Jira, and internal tickets; regular sync meetings; and technical publications
  • Refer to Ken Herner’s presentation for updates on Fermilab-supported collaborations (DUNE, Muon g-2, etc.)

4 of 20

Compute Usage (past 6 months)


  • ~40 million core hours in a mix of opportunistic and dedicated access; core-hour usage is shown by project name. Excludes collaborations supported by Fermilab (e.g., DUNE)
  • Some collaborations are not their own VO and report under the OSG VO from the Connect APs
  • Some projects run under two separate VOs (e.g., MOLLER, EIC)
  • Includes projects supported out of the Duke Connect AP

5 of 20

New collaborations


  • MOLLER, KOTO, SoLID, UHE-Neutrino Observatory Trinity, EUSO-SPB2
  • Access for MOLLER is via the JLAB and OSG APs; it is to become its own VO and join the collaborations supported jointly by JLAB & PATh staff.
  • KOTO, SoLID, Trinity Observatory and EUSO-SPB2 are/will be using the collab.ci-connect access point.
  • New policies for access
    • New collaborations will be supported until they mature to independence
      • Be their own VO
      • Have their own institutional APs, storage, and dedicated pools
    • Provisioned with a modest amount of stash storage based on needs
    • Emphasis on training and familiarization with OSPool while fostering some level of production

6 of 20

A visual introduction to the science


EUSO-SPB2 gondola

KOTO: CP-symmetry violation in rare kaon decays

MOLLER: parity-violating asymmetry in polarized Møller scattering

SoLID: large-acceptance forward-scattering spectrometer for JLAB’s 12 GeV beam (SIDIS, PVDIS & J/ψ)

Meson production experiments

7 of 20

PATh Infrastructure Updates


  • Completed relocation of existing Collab Support assets to a new data center
  • Meetings with Frank and Brian Bockelman to discuss infrastructure planning - for both RF (OSG Connect) and Collab Support goals
  • Desired capabilities:
    • Replacement for the 2015-era Stash cluster
      • Erasure Coded CephFS, multi-PB of usable storage.
    • High-performance Connect access points
      • Locally attached NVMe storage with XRootD origin for OSG Connect APs
      • Locally attached high capacity storage for Collab Connect APs
      • 100Gbps networking
      • Plus a testbed server for ongoing development
    • Kubernetes Platform for hosted services development and operation
      • Ongoing support for Harbor, HostedCEs, etc
  • RFQ submitted, awaiting vendor proposals

Contribution by Lincoln Bryant, Judith Stephen, David Jordan, Farnaz Golnaraghi

8 of 20

Bridge Infrastructure


  • Need to bridge the gap between Stash and the PATh Ceph cluster
    • Stash has been out of warranty for 2.5 years.
    • Unstable nodes; Ceph version upgrades are at a dead end (newer releases require EL8+)
    • We need a temporary contingency
  • Typhoon - a new Kubernetes cluster deployed with Rook on donated hardware
    • 5 storage nodes, 1 head node, 3 general Kubernetes workers (running MDS, Mon, Dashboard, etc.)
    • Rook with 2+2 erasure-coded CephFS (2 data + 2 parity chunks, so usable capacity is half of raw)
    • ~1PB usable
    • Also a testbed for proving out what we want to do with the new PATh hardware
    • We are syncing:
      • /collab (ongoing, 200TB so far)
      • /public (complete, with weekly incrementals)
  • Typhoon will itself be replaced once the new PATh equipment is in production

Contributions by Lincoln Bryant & Farnaz Golnaraghi

9 of 20

Collaboration impacting projects


  • OSDF performance monitoring
  • Informational Web pages on collaborations
    • Provide a geolocated infographic of the institutional membership and the access/entry points for each collaboration

10 of 20

OSDF monitoring


  • Provide end-to-end monitoring: cache and origin throughput, cache file access and performance, and more than 15 performance metrics
  • Of interest to collaborations that distribute data over CVMFS (e.g., IGWN, IceCube, REDTOP)
  • Challenges:
    • Detecting problems before users do (a minimal single-cache probe sketch follows this list)
    • Monitoring a worldwide network
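
A minimal sketch of the kind of cache-side check involved, assuming a hypothetical cache endpoint and public test object (an illustration only, not the production OSDF monitoring code):

  # Minimal sketch of a single-cache availability/throughput probe -- NOT the
  # production OSDF monitoring. The cache host and object path are hypothetical.
  import time
  import requests

  CACHE = "https://osdf-cache.example.edu:8443"     # hypothetical cache endpoint
  OBJECT_PATH = "/public/testuser/test_1GB.bin"     # hypothetical public test object

  def probe_cache(cache: str, path: str) -> float:
      """Download one object and return the observed throughput in MB/s."""
      start = time.monotonic()
      nbytes = 0
      with requests.get(cache + path, stream=True, timeout=60) as resp:
          resp.raise_for_status()
          for chunk in resp.iter_content(chunk_size=1 << 20):
              nbytes += len(chunk)
      elapsed = time.monotonic() - start
      return (nbytes / 1e6) / elapsed

  if __name__ == "__main__":
      try:
          print(f"throughput: {probe_cache(CACHE, OBJECT_PATH):.1f} MB/s")
      except requests.RequestException as exc:
          # Surfacing failures here is the point: catch them before a user does.
          print(f"cache probe failed: {exc}")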

Contribution by Fabio Andrijauskas

11 of 20

Tracking Institutions/Resources


  • Adding all collaborations with available data:
    • Institutions: author lists provided by collaboration representatives
    • Execution and Access Points from Topology; eventually, live data from jobs

Contribution by Cannon Lock

12 of 20

IGWN (LIGO et al)


  • Most HTCondor-CEs updated to OSG 3.6 or on 3.5 with updates upcoming; pools managed at the CERN gWMS
  • Bulk of support effort in the following areas (past 6 months)
    • CE/AP updates, pilot bearer tokens, IDToken exchanges with Schedds/Collectors
    • OSDF: new node at PSU, operationalizing the OzSTAR node, updating GaTech
    • Deploying CVMFS in user space for Compute Canada (CC)
    • Periodic validation of authenticated access to CVMFS (requires automating token fetching; see the sketch after this list)
    • Execution/Access Points dropping out of the pool - no single cause:
      • Mapping to service users in scitoken-mapfile
      • Request_Disk requirement for local batch jobs affected pilots
      • Frontier Squid services that are not registered with WLCG
      • Host certificates expired or missing required attributes
      • Failures in authenticated CVMFS access for either pilots or payloads
      • Sites disabled at the frontend and/or factory
      • Hitting file-descriptor (FD) limits on the CE (not restricted to IGWN)
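
A minimal sketch of what an automated check of authenticated CVMFS access might look like, assuming the repository's authorization helper honors WLCG bearer-token discovery; the token-fetching wrapper, repository name, and test path are hypothetical placeholders, not the actual IGWN tooling:

  # Hedged sketch of a periodic authenticated-CVMFS check. Assumes the repo's
  # authz helper follows WLCG bearer-token discovery (BEARER_TOKEN_FILE).
  # The token wrapper script, repository, and test path are hypothetical.
  import os
  import subprocess
  import sys

  TOKEN_FILE = "/tmp/cvmfs_check_token"                          # hypothetical location
  FETCH_TOKEN = ["/usr/local/bin/fetch-igwn-token", TOKEN_FILE]  # hypothetical wrapper
  TEST_FILE = "/cvmfs/secure.example.org/README"                 # hypothetical protected file

  def main() -> int:
      # 1. Fetch a fresh bearer token (a cron job or systemd timer would run this).
      subprocess.run(FETCH_TOKEN, check=True)
      os.environ["BEARER_TOKEN_FILE"] = TOKEN_FILE

      # 2. Try to read a small file from the protected repository.
      try:
          with open(TEST_FILE, "rb") as fh:
              fh.read(64)
      except OSError as exc:
          print(f"authenticated CVMFS check FAILED: {exc}", file=sys.stderr)
          return 1
      print("authenticated CVMFS check OK")
      return 0

  if __name__ == "__main__":
      sys.exit(main())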

Challenges in resolution due to limited access to a globally distributed infrastructure

Contributions by Jason Patton & James Clark

13 of 20

JLAB / BNL Collaborations


  • GlueX, CLAS12, MOLLER, EIC (JLAB+BNL)
    • As with IGWN, the bulk of support was on the transition to OSG 3.6 and tokenizing the infrastructure in the JLAB and GlueX pools
    • Expanded to 3 APs at JLAB to spread the load between VOs (+1 development node)
      • APs were affected by the auto-clustering bug at the OSPool CM
    • Completed access to the GlueX pool via the JLAB AP
      • GlueX VOMS server issues at UConn resolved - caused by a single-point hardware failure
    • Persistent problems with access to dedicated resources for GlueX and CLAS12 (Compute Canada, ScotGrid, Lamar)
    • As with IGWN, SciTokens mapfiles needed to be edited for service users other than the default
    • Conflicts between CLAS12 and GlueX on the JLAB APs and the JLAB CE (Periodic Release, CE configuration)
    • JLAB is standing up its own SciTokens issuer via CILogon

Contributions by Jason Patton, Kurt Strosahl & Wesley Moore

14 of 20

Snowmass, REDTOP, Icecube


  • Snowmass
    • Resumed submissions after the white papers were submitted
    • The new round of jobs will further increase the storage footprint on Stash (100TB at present); expect to follow the multiple-origin plan discussed two years ago with BNL and FNAL
  • REDTOP
    • Completed jobs on the OSPool in March; will resume in July
    • Collaboration Support contributed to the REDTOP white paper
    • Existing data is migrating to the bridge storage
  • IceCube
    • Production OAuth2 token-based file transfer mechanism in progress
      • HTTP-based transfers to DESY, replacing GSIFTP (see the sketch after this list)
    • APs can now submit jobs to the OSPool
    • Increased usage of Stash storage at CHTC and data distribution over OSDF
    • Successfully used Kubernetes-native provisioning for GKE resources
    • Studies of auto-scaling HTCondor pools and GPU accounting accepted at PEARC
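
As a minimal illustration of the mechanism (a hedged sketch, not IceCube's production transfer code), the snippet below performs a bearer-token-authorized HTTPS/WebDAV upload of the kind that replaces GSIFTP; the endpoint URL, token path, and file names are hypothetical:

  # Hedged sketch of an OAuth2 bearer-token authorized HTTPS (WebDAV) upload.
  # Endpoint, token path, and file names are hypothetical placeholders.
  import pathlib
  import requests

  TOKEN_FILE = pathlib.Path("/tmp/transfer_token")                 # hypothetical
  LOCAL_FILE = pathlib.Path("run_001234.i3.zst")                   # hypothetical
  REMOTE_URL = "https://webdav.example-desy.de:2880/icecube/run_001234.i3.zst"  # hypothetical

  token = TOKEN_FILE.read_text().strip()
  headers = {"Authorization": f"Bearer {token}"}

  # WebDAV uses plain HTTP verbs: PUT to upload, GET to read back, DELETE to remove.
  with LOCAL_FILE.open("rb") as fh:
      resp = requests.put(REMOTE_URL, data=fh, headers=headers, timeout=600)
  resp.raise_for_status()
  print("upload OK:", resp.status_code)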

Contributions by David Schultz & Igor Sfiligoi

15 of 20

SPT-3G


  • Typical requests in user support and storage management
    • On average SPT-3G uses 95% of available storage
    • Additional storage provisioned at Argonne (500TB), in addition to the dCache mounted on their APs and allocations at NERSC and UChicago HPC storage
  • APs have not yet been updated to OSG 3.6; the update was delayed by missing job history uploads to GRACC (now fixed) and will be completed in June
  • Jobs require GFAL2 GridFTP support and still need to use osg-software 3.5 in OASIS; this will need to be fixed in the future (see the transfer sketch after this list)
  • Dedicated access to the Illinois Campus Cluster
    • Issues with the hosted entry point have been resolved; the collaboration resumed jobs there via the OSPool onto whole-node pilots (at present hitting FD limits on the CE)
    • The addition of two 1TB-memory nodes requires a separate entry in the CE for high-memory scheduling (similar to CPUs/GPUs)
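
To make the GFAL2 dependency concrete, here is a hedged sketch of the kind of GridFTP copy the jobs still rely on, using the gfal2 Python bindings; the source and destination URLs are hypothetical placeholders:

  # Hedged sketch of a GFAL2 GridFTP copy of the kind SPT-3G jobs still depend
  # on (hence the need for osg-software 3.5 in OASIS). URLs are hypothetical.
  import gfal2

  SRC = "gsiftp://gridftp.example.edu/spt3g/raw/obs_12345.g3"   # hypothetical source
  DST = "file:///scratch/obs_12345.g3"                          # hypothetical destination

  ctx = gfal2.creat_context()
  params = ctx.transfer_parameters()
  params.overwrite = True     # replace an existing destination file
  params.timeout = 1800       # per-transfer timeout in seconds
  params.nbstreams = 4        # parallel GridFTP streams

  ctx.filecopy(params, SRC, DST)
  print("copy complete")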

Contributions by Judith Stephen, Sasha Rahlin, Jeff Peterson and Brian Lin

16 of 20

XENON


  • Added an additional XENON RSE hosted on NSDF storage at SDSC
    • Ran into a lack of support for XRootD third-party copy (TPC); switched to using the HTTPS plugin instead
  • Added a WebDAV door at UChicago and HTTPS support for FTS
  • Enabled multi-hop transfers in Rucio to facilitate transfers between sites with incompatible protocols (see the rule sketch after this list)
  • Adding a WebDAV door to the XENON data origin (LNGS DAQ)
  • Uploaded missing Gratia job history from the AP; update to OSG 3.6 in June
  • Persistent issues:
    • Large numbers of small-file transfers running storage sites out of inodes
    • Large numbers of small files also limit access performance on tape systems
    • Tape-to-tape transfers are very slow, limiting progress
    • Rucio is susceptible to backlogs when large numbers of requests are queued or fail
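
For context on how these transfers are driven (a hedged sketch, not the XENON production tooling): data placement is expressed as Rucio replication rules, and with multi-hop enabled the server can route a rule's transfer through an intermediate RSE when the source and destination protocols are incompatible. The scope, dataset, and RSE names below are hypothetical:

  # Hedged sketch of requesting a replica with the Rucio client API. With
  # multi-hop enabled server-side, Rucio can satisfy the rule via an
  # intermediate RSE when source and destination protocols are incompatible.
  # Scope, dataset, and RSE names are hypothetical placeholders.
  from rucio.client import Client

  client = Client()
  rule_ids = client.add_replication_rule(
      dids=[{"scope": "xnt_sr1", "name": "raw_records-abc123"}],  # hypothetical DID
      copies=1,
      rse_expression="SDSC_NSDF_USERDISK",                        # hypothetical RSE
      lifetime=30 * 24 * 3600,                                    # optional lifetime in seconds
  )
  print("created rules:", rule_ids)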

Contributions by Judith Stephen, Brian Bockelman & the Rucio Support Team

17 of 20

XENON Rucio Distributed Storage

17

Contributions by Judith Stephen & Ilija Vukotic

18 of 20

KOTO


  • KOTO is an active HEP experiment (rare Kaon decays) at the KEK laboratory in Japan with participating US institutions
  • The collaboration has so far run only on the KEK-CC HPC cluster - both data-heavy reconstruction workflows and simulations
  • Interest in moving simulations to the OSG to achieve high rate and volume
  • Will use collab AP and stash storage
  • Software stack containerized, deployed in the singularity repo and tested
  • Challenges encountered
    • Expectations on completion time - from the initial pull request to availability in the Singularity repo
    • Documentation for running jobs directly using Docker containers - conversion to Singularity happens at runtime, requiring adjusted storage & memory requests (see the submit sketch after this list)
    • The strong recommendation to use OSGVO base images can conflict with software building requirements
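
A hedged sketch of what a container-based OSPool submission could look like via the HTCondor Python bindings, illustrating why disk and memory requests need headroom for the runtime Docker-to-Singularity conversion; the image path, executable, and resource numbers are hypothetical, and the container attribute conventions should be checked against current OSPool documentation:

  # Hedged sketch of a container-based OSPool submission via the HTCondor
  # Python bindings (recent API with Schedd.submit). Image path, executable,
  # and resource numbers are hypothetical placeholders.
  import htcondor

  submit = htcondor.Submit("""
  executable        = run_koto_sim.sh
  arguments         = $(Process)
  +SingularityImage = "/cvmfs/singularity.opensciencegrid.org/koto/simulation:latest"
  # Runtime Docker->Singularity conversion needs sandbox headroom, so the
  # disk and memory requests are set above the bare simulation needs.
  request_cpus      = 1
  request_memory    = 4GB
  request_disk      = 10GB
  output            = sim.$(Cluster).$(Process).out
  error             = sim.$(Cluster).$(Process).err
  log               = sim.log
  """)

  schedd = htcondor.Schedd()
  result = schedd.submit(submit, count=10)   # submit 10 simulation jobs
  print("submitted cluster", result.cluster())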

Contributions by Mats Rynge & Brian Lin

19 of 20

Summary


  • The past 6 months were dominated by the transition to using tokens and the support items it created
  • Initiated projects that facilitate information sharing, both internally and user/collaboration facing - infographics, monitoring, and dashboards
  • Continue to balance across a diverse range of engagements - working with users & local admins, syncing with stakeholders, and maintaining internal coordination
  • The refresh of the Connect infrastructure under PATh is underway
  • Onboarding new collaborations onto Collab Connect infrastructure under policies that aim to preempt open-ended engagements. This is less of a concern when scaling up the number of collaborations with institutionally supported infrastructure, and even less when there are dedicated support teams on their end.

20 of 20

Questions?
