OSG Staff Updates

IGWN Computing Workshop readout

Last weekend, I attended an IGWN computing workshop, where I gave a “how OSG Services work”-style overview to the team. Lots of wide-ranging discussions; here are the highlights:

    • O4: IGWN’s O4 run is planned for March 2023; currently trying to finalize computing infrastructure. Much heavier reliance on OSG compared to O3.
    • OSDF: IGWN currently uses OSDF+CVMFS to distribute its proprietary data. Looking to use HTCondor file transfer plugins for data access (for O4, complementing CVMFS-based access), à la OSG-Connect.
    • User support: IGWN has a small-but-helpful operations team; how can we best leverage them?

OSDF and IGWN

IGWN is interested in leveraging the OSDF more heavily:

    • Distribute (potentially private) containers and other large files from the AP. Not doable via CVMFS currently.
    • Force users to declare up front which frame files a job needs, to reduce mid-job failures.

We identified three items needed for IGWN to happily use HTCondor file transfer:

    • Enable the OSDF file transfer plugin in the IGWN pool. Done!
      • Aside: rather silly that we do this pool-by-pool. Goal is to make this an HTCSS feature.
    • (Correct) Token discovery. The plugin currently uses one token from the environment; IGWN foresees many tokens per job. Need to specify which token to use for stage-in.
    • End-to-end checksums. While IGWN frame files are self-checksumming, there’s no mechanism to specify a generic checksum for a downloaded file.
      • Development opportunities in both OSDF and HTCSS.
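
To make the end-to-end checksum item concrete, here is a minimal sketch of what verifying a downloaded file against a declared checksum could look like. It is not how OSDF or HTCSS work today (the bullets above note that no such mechanism exists yet); the function names, the choice of SHA-256, and the idea that the expected value travels with the job are all assumptions for illustration.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Hypothetical post-transfer check: raise if the file does not match.

    `expected_sha256` is assumed to be declared by the user up front,
    alongside the list of input files.
    """
    actual = file_sha256(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return actual
```

A check like this would run after stage-in and before the job starts, so a corrupted transfer fails early rather than partway through the job.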

Once there, they’ll want to enable per-AP origins to allow users to move containers.

User support discussion

IGWN has a small (about to grow from 1 person to 2 people!) operations team. Spent ~2 hours walking through a few past support scenarios. Conclusions:

    • They love the ability to reach out on Slack, hate Freshdesk.
      • Slack feels like talking to people, but it has a higher ‘cost’ to OSG staff and doesn’t build institutional memory.
      • Freshdesk is near-impossible for team-to-team coordination; solely oriented toward team-to-customer communication.
      • Staff thoughts?
    • They’ve been able to triage several problems themselves. Value access to logfiles very highly. Some missing items / wishlist:
      • Ability to directly change/manage frontend configuration. (Separate namespace/flux setup on Tiger for IGWN?)
      • Access to HTCondor-CE logs for hosted CEs. (Syslog forwarding of CEs?)
      • Auto-unpack HTCondor logs from GlideinWMS logs. (Simple addition to factory rsync?)

And now for something completely different – Tiger Kubernetes Cluster updates

The Tiger cluster @ Morgridge is in a major refresh cycle:

    • Changing to new IP address block, enabling IPv6.
      • Didn’t talk about PATh Facility today but IPv6 is required for the AMPATH site.
    • Removal of Docker (use containerd.io engine directly). Required for latest Kubernetes upgrade.
    • Removal of problematic network configurations that cause nodes to go bad post-reboot.
    • Need to schedule two major changes, each involving a ~2 hour outage:
      • Upgrade of the Postgres operator from V4 to V5. Effectively, a backup + restore from backup.
      • How should we coordinate this? Affects OSPool…
