1 of 8

Operations Area

Nov 28, 2018

Jeff Dost

2 of 8

Unsupported Pilot Service Certs [in progress]

  • InCommon / Let’s Encrypt don’t support service certs
  • Most pilots use service certs to authenticate with CEs, e.g.:
    • /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=glideinwms/osg-flock.grid.iu.edu
  • Brian L / Brian B considering implications of using plain host certs for pilots
  • Problem - requires possibly disruptive changes to site HTCondor CE configs
  • High Urgency - the OSG flock pilot cert is the first to expire, on 2019-04-11
  • Brian B came up with a possible workaround
  • TODO - test the workaround with a frontend using an InCommon / Let’s Encrypt host cert as the pilot credential, perhaps the GLOW test frontend?
    • Requires coordination with factory ops and software team

3 of 8

Aftermath of Suchandra Leaving OSG

  • UChicago T2 staff took over his OSG Connect responsibilities
  • Marco Mascheroni and Jeff Dost temporarily took ownership of Hosted CEs
    • We assume a replacement will be hired at Chicago in early 2019 and can eventually reclaim ownership

4 of 8

Hosted CE progress

  • Marco has done most of the work of learning the technical details
  • CEs worked on:
    • ASU [operational]
    • NMSU AGGIE GRID [operational]
    • TACC [in progress]
  • Special thanks to Brian Lin for technical support and for helping with Topology updates
  • TODO - go through the existing Hosted CEs and register in Topology any that aren’t already there

5 of 8

John Thiltges ramp up @ UNL

  • John Thiltges has been ramping up on more UNL-based services
  • He began attending the weekly Operations meeting

6 of 8

Operations Face 2 Face

  • Hosted at UCSD Tues Jan 29 - Thur Jan 31 (half day on Thur)
  • Majority of ops team confirmed the dates work
  • Next step is to plan the schedule
  • Topic ideas (incomplete list):
    • On the first day, each operator gives a summary of the services they run - this gives the rest of the team a high-level picture of ops as a whole
    • Updating service SLAs
    • Effort tracking
    • Handling maintenance windows (planned / unplanned)
    • Better ticket triaging

7 of 8

Updating OSG Service SLAs

  • From IRIS-HEP deliverables:
  • Repo of old SLAs on GitHub:
  • Initial thoughts:
    • Service catalog shows we have 40+ services, but only 20 SLAs in the repo
    • It won’t scale to make 1 SLA per service and try to keep them up to date
    • Some obvious overlap: Glidein Frontends, StashCache origins / caches
    • Tim C suggested we have shared SLAs for common services
  • The SLAs obviously need updating; to start, the focus should be on SLAs of services affecting LHC experiments
    • Should we just pick one that is obvious to start, e.g. the glideinWMS factory?

8 of 8

Concerns

  • Big takeaway from OSG Planning - tracking the FTE breakdowns required to run 40+ services is not reasonable
  • But when new services come along, I need to decide who should manage them based on operator workloads
    • Example - offloading of Hosted CEs when Suchandra left
  • Potential solution - propose that the ops team send weekly reports tracking their time, as the software team does
