1 of 15

Production Support Update

Ken Herner

OSG Area Coordinators

8 November 2023

2 of 15

6-month activity review

Presenter | Presentation Title or Meeting Title

2

11/9/23

3 of 15

Pilots past 6 months


4 of 15

Pilots past 6 months (2)

  • About 1.6 B wall hours over the past 6 months, an annualized pace of ~3.2 B hours. The 2022 total was just over 2.6 B. The usual caveats apply to pilot-hour reports
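
The annualized figure is a straight extrapolation from the six-month total. A minimal sketch of that arithmetic, using the rounded numbers quoted on this slide (the function and variable names are illustrative only, not part of any OSG accounting tool):

```python
# Rounded figures from the slide; all names here are illustrative.
SIX_MONTH_PILOT_HOURS = 1.6e9   # ~1.6 B wall hours over the past 6 months
TOTAL_2022_HOURS = 2.6e9        # "just over 2.6 B" in 2022, rounded down


def annualized(hours: float, months: float) -> float:
    """Linearly extrapolate a partial-year total to 12 months."""
    return hours * 12.0 / months


pace = annualized(SIX_MONTH_PILOT_HOURS, 6)
growth_vs_2022 = pace / TOTAL_2022_HOURS - 1.0

print(f"annualized pace: {pace / 1e9:.1f} B hours")
print(f"change vs. 2022 total: {growth_vs_2022:+.0%}")
```

A linear extrapolation like this assumes usage stays flat for the rest of the year, so it is only a rough guide.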


5 of 15

Payload hours last 6 months


6 of 15

Opportunistic Payload hours by VO


+10% relative to the last talk

7 of 15

Opportunistic Pilot Hours by facility


8 of 15

Projects last 6 months


9 of 15

GPU Payload jobs


More variety in facilities than in the past, but demand still comes from a fairly small number of groups

10 of 15

Some speed bumps

  • FNAL expts are now tokens-only for job submission

Occasional issue where credmon doesn't keep tokens up to date from the vault on some schedds. Results in massive job cancellation immediately after submission

  • Some recent issues with StashCache requests stepping on each other

FNAL IF uses StashCache over CVMFS (/cvmfs/expt.osgstorage.org repos)

Getting intermittent timeouts and "host serving data too slowly" messages (local redirection to dCache for on-site jobs was broken)

Turned out to be multiple issues (e.g. here); some were fixed in cvmfs 2.11.2

  • Another issue with some jobs having excessive shadow starts/restarts; still investigating


11 of 15

Current concerns for June 2023

  • Tokens remain the top current concern

Train has left the station for FNAL IF experiments; mostly on job submission side for now; interactive usage slowly increasing

Concern now is more with the storage side of things: IF experiments are looking to go tokens-only, perhaps earlier than some WLCG sites. Propose not forcing a proxy drop, but setting dates for offering token support

  • Second concern: OS migration. Things seem to be mostly converging to Alma 8/9 in the HEP world and at this point the relevant OSG software will be ready well before it absolutely has to be.

Planning to start major push on FIFE expts. and DUNE in Q3/Q4 CY23

Containers somewhat mitigate this but there will inevitably be issues when someone needs some state-of-the-art thing backported in a few years

Running into similar issues now with some legacy experiments (can't work with tokens in SL6 containers)

Containers within containers could potentially solve all this, but security questions remain (especially enabling user namespaces for containers instantiated by payload scripts)


12 of 15

Summary

  • Things generally steady
  • On pace for > 2.8 B pilot hours; expect to break 2022 record this month
  • FNAL token transition continues and is making steady progress, with some bumps along the way
  • Concerns are site readiness for tokens (especially on SEs) and upcoming OS transitions


13 of 15

Backup


14 of 15

Tales of Tokens Transitions (from Feb 2023 talk)

How is it going so far? Very much a mixed bag. It (mostly) works (haven't needed to roll anything back) but there has been a lot more pain than expected.

Lack of ability to view job logs in browser caused a headache for people trying to debug problems, especially related to:

Biggest issue so far has actually been related to storage (xrootd clients failing token auth for multiple expts; most were using <= 5.1.0 clients), but that's the reason for retaining proxies. Temporary workaround: default to x509 auth for the root:// protocol on the FNAL dCache door and use roots:// for tokens
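
The workaround above amounts to choosing the URL scheme by credential type. A minimal sketch, assuming a hypothetical helper `dcache_url`, a placeholder door hostname, and plain dotted numeric version strings; none of these names come from the talk or from the XRootD client, and treating 5.1.0 as a hard cutoff is an assumption (the slide only says most failing clients were at or below it):

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '5.1.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Assumed cutoff: most clients seen failing token auth were <= 5.1.0.
LAST_PROBLEM_CLIENT = parse_version("5.1.0")


def client_needs_proxy(client_version: str) -> bool:
    """True if this (hypothetical) policy keeps the client on x509 proxies."""
    return parse_version(client_version) <= LAST_PROBLEM_CLIENT


def dcache_url(door: str, path: str, client_version: str) -> str:
    """Use root:// (x509 default) for old clients, roots:// for token auth."""
    scheme = "root" if client_needs_proxy(client_version) else "roots"
    # An absolute path after the host yields the usual double slash.
    return f"{scheme}://{door}/{path}"


# Placeholder hostname and path, for illustration only.
print(dcache_url("door.example.org", "/pnfs/example/file.root", "5.1.0"))
print(dcache_url("door.example.org", "/pnfs/example/file.root", "5.5.3"))
```

Version strings with suffixes (for example release candidates) would need extra parsing that this sketch leaves out.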

Some lessons so far (personal perspective):

Testing did not really ramp up until a few days before. Expect this. Be receptive to delay requests, but require stakeholders to show you something isn't working (i.e. don't accept "I didn't have time to test yet") otherwise you will wait forever.

Prepare some extended support time/hours


[Figure: approximate distribution of testing frequency. T. Durakiewicz, Physics Today 69, 2, 11 (2016)]

15 of 15

More lessons from token transition (from Feb 2023 talk)

  • It may be worth delaying the transition to implement features that aren't strictly necessary but are popular or convenient.
  • Expect user software to be dated in unexpected ways
  • Record demos/tutorials for those who may not be able to attend scheduled demos
  • Users are used to proxies and being able to have one thing to manipulate. Not a lot of appreciation for different types of tokens and capability sets
  • Looking back, we probably should have:

Delayed until log viewing via browser was working

Made logs experiment/group readable (helps debugging)

Spent more time in demos explaining the different types of tokens and when you use which

Pushed experiments hard to give us some representative workflow examples to test ourselves. Traditionally very hard to get them to do that, but would have caught several problems ahead of time
