Production Support Update
Ken Herner
OSG Area Coordinators
8 November 2023
6-month activity review
Presenter | Presentation Title or Meeting Title
2
11/9/23
Pilots past 6 months
Pilots past 6 months (2)
Payload hours last 6 months
Opportunistic Payload hours by VO
+10% wrt last talk
Opportunistic Pilot Hours by facility
Projects last 6 months
GPU Payload jobs
More variety in facilities than in the past, but demand is still from a fairly small number of groups
Some speed bumps
Occasional issue: on some schedds, credmon fails to keep tokens refreshed from the vault. This results in mass job cancellation immediately after submission
FNAL IF uses StashCache over CVMFS (/cvmfs/expt.osgstorage.org repos)
Getting intermittent timeouts and "host serving data too slowly" messages (local redirection to dCache for on-site jobs was broken)
Turned out to be multiple issues (e.g. here); some were fixed in cvmfs 2.11.2
Current concerns for June 2023
Train has left the station for FNAL IF experiments; mostly on job submission side for now; interactive usage slowly increasing
Concern now is more with the storage side of things: IF experiments are looking to go tokens-only, perhaps earlier than some WLCG sites. Propose not to force a proxy drop, but to set dates for offering token support
Planning to start major push on FIFE expts. and DUNE in Q3/Q4 CY23
Containers somewhat mitigate this but there will inevitably be issues when someone needs some state-of-the-art thing backported in a few years
Running into similar issues now with some legacy experiments (can't work with tokens in SL6 containers)
Containers within containers could potentially solve all this, but security questions remain (especially enabling user namespaces for containers instantiated by payload scripts)
Summary
Backup
Tales of Tokens Transitions (from Feb 2023 talk)
How is it going so far? Very much a mixed bag. It (mostly) works (haven't needed to roll anything back) but there has been a lot more pain than expected.
Lack of ability to view job logs in browser caused a headache for people trying to debug problems, especially related to:
Biggest issue so far has actually been related to storage (xrootd clients failing token auth for multiple expts; most were using <= 5.1.0 clients), but that's the reason for retaining proxies. Temporary workaround: default to x509 auth for the root:// protocol on the FNAL dCache door and use roots:// for tokens
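The workaround above amounts to picking the URL scheme from the credential type. A minimal sketch of that mapping, assuming a client-side helper builds the URLs; the function name `xrootd_url` and the host `door.example.org` are hypothetical and are not the actual FNAL door or tooling.

```python
# Scheme per credential type, following the workaround described above:
# x509 proxies go over root://, tokens go over roots://.
SCHEME_FOR_AUTH = {"x509": "root", "token": "roots"}


def xrootd_url(host: str, path: str, auth: str) -> str:
    """Build an XRootD URL whose scheme matches the credential type."""
    if auth not in SCHEME_FOR_AUTH:
        raise ValueError(f"unknown auth type: {auth!r}")
    # An absolute path yields the usual double slash after the host.
    return f"{SCHEME_FOR_AUTH[auth]}://{host}/{path}"


# Hypothetical host and path, for illustration only.
print(xrootd_url("door.example.org", "/pnfs/data/file.root", "token"))
```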
Some lessons so far (personal perspective):
Testing did not really ramp up until a few days before. Expect this. Be receptive to delay requests, but require stakeholders to show you that something isn't working (i.e. don't accept "I didn't have time to test yet"); otherwise you will wait forever.
Prepare some extended support time/hours
Approximate distribution of testing frequency: [figure; source: T. Durakiewicz, Physics Today 69, 2, 11 (2016)]
More lessons from token transition (from Feb 2023 talk)
Delayed until log viewing via browser was working
Logs were experiment/group readable (helps debugging)
Spent more time in demos explaining the different types of tokens and when you use which
Pushed experiments hard to give us some representative workflow examples to test ourselves. Traditionally very hard to get them to do that, but would have caught several problems ahead of time