Operations Area

Nov 28, 2018

Jeff Dost

Unsupported Pilot Service Certs [in progress]

InCommon / Let’s Encrypt don’t support service certs
Most pilots use service certs to authenticate with CEs, e.g.:

/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=glideinwms/osg-flock.grid.iu.edu

Brian L / Brian B considering implications of using plain host certs for pilots
Problem - requires possibly disruptive changes to site HTCondor CE configs
High urgency - the OSG flock pilot cert expires soonest, on 2019-04-11
Brian B came up with a possible workaround
TODO - test workaround with a frontend using InCommon / Let’s Encrypt hostcert as pilot credential, perhaps the GLOW test frontend?

Requires coordination with factory ops and software team
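
To make the service cert vs host cert distinction concrete, here is a minimal Python sketch that classifies a certificate subject. It assumes the slash-separated subject format shown above and the convention that a service cert's CN has the form service/hostname, as in the glideinwms example; the function and type names are hypothetical and not part of any OSG tool.

```python
from typing import NamedTuple, Optional


class PilotSubject(NamedTuple):
    kind: str               # "service" or "host"
    service: Optional[str]  # e.g. "glideinwms"; None for plain host certs
    host: str


def classify_subject(dn: str) -> PilotSubject:
    """Classify a slash-separated subject as a service cert or a host cert.

    Assumption: the last CN component identifies the host, and a service
    cert writes it as "<service>/<hostname>".
    """
    # The CN of a service cert itself contains "/", so split on the last
    # "/CN=" instead of splitting the whole subject on every "/".
    _, sep, cn = dn.rpartition("/CN=")
    if not sep or not cn:
        raise ValueError(f"no CN found in subject: {dn!r}")
    service, slash, host = cn.partition("/")
    if slash:
        return PilotSubject("service", service, host)
    return PilotSubject("host", None, cn)


flock = classify_subject(
    "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services"
    "/CN=glideinwms/osg-flock.grid.iu.edu"
)
print(flock.kind, flock.service, flock.host)
```

A check like this could help list which pilot credentials would need to change if plain host certs are adopted; it does not verify the certificate itself.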

Aftermath of Suchandra Leaving OSG

UChicago T2 staff took over his OSG Connect responsibilities
Marco Mascheroni and Jeff Dost temporarily took ownership of Hosted CEs

We assume a replacement will be hired at Chicago in early 2019 and can eventually take over ownership

Hosted CE progress

Marco has done most of the work of learning the technical details
CEs worked on:

ASU [operational]
NMSU AGGIE GRID [operational]
TACC [in progress]

Special thanks to Brian Lin for technical support and help with Topology updates
TODO - go through and register existing Hosted CEs that aren’t already in Topology

John Thiltges ramp up @ UNL

John Thiltges has been ramping up on more UNL-based services
Began attending the weekly Operations meeting

Operations Face 2 Face

Hosted at UCSD Tues Jan 29 - Thur Jan 31 (half day on Thur)
Majority of ops team confirmed the dates work
Next step is to plan the schedule
Topic ideas (incomplete list):

On the first day, each operator summarizes the services they run, giving the rest of the team a high-level picture of ops as a whole
Updating service SLAs
Effort tracking
Handling maintenance windows (planned / unplanned)
Better ticket triaging

Updating OSG Service SLAs

From IRIS-HEP deliverables:

Repo of old SLAs on github:

https://github.com/opensciencegrid/operations/tree/master/docs/SLA

Initial thoughts:

Service catalog shows we have 40+ services, but only 20 SLAs in the repo
It won’t scale to write one SLA per service and try to keep them all up to date
Some obvious overlap: Glidein Frontends, StashCache origins / caches
Tim C suggested we have shared SLAs for common services

The SLAs clearly need updating; to start, the focus should be on SLAs of services affecting LHC experiments

Should we just pick an obvious one to start with, e.g. the glideinWMS factory?

Concerns

Big takeaway from OSG Planning - tracking FTE breakdowns required to run 40+ services is not reasonable
But I need to know operator workloads to decide who should manage new services when they come along

Example - Offloading of Hosted CEs when Suchandra left

Potential solution - propose that the ops team send weekly reports tracking their time, as the software team does
