1 of 11

Operations Area

Mar 15, 2023 (Since 11-02-22)

Jeff Dost

A Tale of Kubernetes

2 of 11

Request from resource provider for Hosted CE change

  • https://support.opensciencegrid.org/a/tickets/71621/
  • “I reduced the mem request by 2GB, by changing #SBATCH --mem=30000 to #SBATCH --mem=28000, which works. Can you please give that a try?”
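For reference, #SBATCH --mem is the Slurm directive that sets the job's memory request in MB. A minimal illustrative batch header showing the change (a hypothetical script, not the site's actual pilot submit file; the partition and CPU lines are assumptions based on the entry settings on the next slide):

#!/bin/bash
#SBATCH --partition=trevor      # assumed: matches +queue "trevor" on the next slide
#SBATCH --cpus-per-task=12      # assumed: matches GLIDEIN_CPUS / +xcount = 12
#SBATCH --mem=28000             # was --mem=30000; lowered by 2 GB per the site's request
# ... glidein pilot payload follows ...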


3 of 11

  • On the factory, updated /etc/osg-gfactory/OSG_autoconf/10-hosted-ces.auto.yml:


LIGO_AU_SUT-OzStar_trevor:
  limits:
    entry:
      glideins: 1
  attrs:
    GLIDEIN_Supported_VOs:
      value: LIGO
    GLIDEIN_CPUS:
      value: 12
    GLIDEIN_MaxMemMBs:
      value: 28000
  submit_attrs:
    +maxMemory: 28000
    +xcount: 12
    +queue: "trevor"

The attrs values (e.g. GLIDEIN_MaxMemMBs) get into the glidein Startd config on worker node startup.

The submit_attrs values get into the glidein classad at the CE to request the correct amount of resources.
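Applying the change means re-running the factory reconfigure, which invokes OSG_autoconf as a pre-reconfig hook (seen on the next slide). A rough sketch of the apply-and-verify steps (the grep target directory is an assumption about where the generated factory XML lands):

# Regenerate entries from the .auto.yml files and reconfigure the factory
gwms-factory reconfig

# Spot-check that the new memory value landed in the generated factory config
grep -r "GLIDEIN_MaxMemMBs" /etc/gwms-factory/ | grep 28000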

4 of 11

  • Errors running gwms-factory reconfig:


Executing reconfigure hook: /etc/gwms-factory/hooks.reconfig.pre/hostedce_gen.sh
ERROR:root:
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
Unexpected exception. Aborting automatic configuration generation!
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
OSG_autoconf exited with a code different than 0. Aborting.
Press a key to continue...
Continuing with reconfigure and old xmls

(OSG_COLLECTOR here is collector.opensciencegrid.org)

5 of 11


Querying the collector manually reproduced the failure:

condor_status -pool collector.opensciencegrid.org:9619 -sched
Error: communication error
CEDAR:6001:Failed to connect to <128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>
Error: Couldn't contact the condor_collector on
central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>).
Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.
If you are the system administrator, check that the condor_collector is
running on central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>), check the
ALLOW/DENY configuration in your condor_config, and check the MasterLog and
CollectorLog files in your log directory for possible clues as to why the
condor_collector is not responding. Also see the Troubleshooting section of
the manual.
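A CEDAR connect error like this does not by itself distinguish an HTCondor problem from a plain network problem. A quick TCP-level probe, as a sketch (the nc commands are illustrative and not from the original debugging):

# Does a bare TCP connection to the collector port succeed at all?
nc -vz -w 5 central-collector-0.osg.chtc.io 9619

# Same check against the raw IP, in case DNS is part of the problem
nc -vz -w 5 128.104.103.154 9619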

6 of 11

Reached out on Slack #operations

  • Reported the collector issue
  • John Thiltges confirmed the network path to the collector service looked broken
  • The problem also appeared to affect other services running on tiger0002.chtc.wisc.edu (Fabio had reported an OSDF cache issue earlier)


7 of 11

Meanwhile During Ops Standup…


Services with issues (6):

+---------------------------------------------------------------------------------------------+------------------+
| Service                                                                                     | Availability %   |
+=============================================================================================+==================+
| Condor status on collector.opensciencegrid.org                                              | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability LIGO_US_PSU-LIGO (psu-ligo-ce1.svc.opensciencegrid.org)              | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability OSG_US_LSUHSC-Tigerfish-CE2 (lsuhsc-tf-ce2.svc.opensciencegrid.org)  | 0                |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability OSG_US_UTC-Epyc (utc-epyc-ce1.svc.opensciencegrid.org)               | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability VU-AUGIE-CE1 (vu-augie-ce1.svc.opensciencegrid.org)                  | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| XRootD copy from CHTC_TIGER_CACHE (stash-cache.osg.chtc.io)                                 | 36.71            |
+---------------------------------------------------------------------------------------------+------------------+

SLA services:

+---------------------------+---------+------------------+------------------+
| Service                   | SLA %   | Availability %   | SLA Satisfied?   |
+===========================+=========+==================+==================+
| SLA - CE Collectors       | 90      | 36.69            | N                |
+---------------------------+---------+------------------+------------------+
| SLA - Central Managers    | 95      | 100              | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - GRACC               | 95      | 99.69            | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - GWMS Factories      | 95      | 99.87            | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - GWMS Frontends      | 95      | 99.74            | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - Hosted CEs          | 95      | 93.4             | N                |
+---------------------------+---------+------------------+------------------+

The availability numbers all being roughly the same looks suspect! Did all of these services go down at the same time?

8 of 11

Confirmed all problem services were on tiger0002

  • Confirmed the CEs were unreachable from the factory:
  • However, the CE is reachable from another pod inside Tiger! (see the sketch after the output below)


kubectl -n slate-group-osg-ops get pods -o wide | egrep 'utc|vu|psu'
osg-hosted-ce-psu-ligo-756dbbd7f7-vldf9      1/1  Running  0  5d21h  10.129.180.251  tiger0002.chtc.wisc.edu  <none>  <none>
osg-hosted-ce-utc-epyc-7b85574d87-6bjwz      1/1  Running  0  5d21h  10.129.173.107  tiger0002.chtc.wisc.edu  <none>  <none>
osg-hosted-ce-vu-augie-ce1-6c96b657cb-b6xdw  1/1  Running  0  5d22h  10.129.173.74   tiger0002.chtc.wisc.edu  <none>  <none>

From the factory:

condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org
Error: Couldn't contact the condor_collector on
psu-ligo-ce1.svc.opensciencegrid.org:9619.

From another pod inside Tiger:

condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org
-- Schedd: psu-ligo-ce1.svc.opensciencegrid.org : <128.104.103.185:9619?... @ 03/09/23 16:57:02
OWNER  BATCH_NAME  SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
osg07  ID: 479598  2/11 00:25  _     _    _     1      479598.0
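The in-cluster check can be done with kubectl exec from one of the pods listed above. A rough sketch of that approach (using kubectl exec is an assumption about how the in-cluster query was run; the pod name is one of the Hosted CE pods from the listing):

# Run the same query from inside a pod on Tiger to test in-cluster reachability
kubectl -n slate-group-osg-ops exec osg-hosted-ce-utc-epyc-7b85574d87-6bjwz -- \
  condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org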

9 of 11

Continuing debugging on Slack

  • Jeff Peterson quarantined tiger0002 to prevent services from starting on it
  • We had to manually delete the pods to get them to restart on a non-problem host (commands sketched after this list)
    • kubectl -n slate-group-osg-ops delete … was not enough; we also needed --force to make the stuck CEs give back their PVCs (persistent volumes where state is saved between pod restarts)
  • After manual intervention, all services recovered
    • Total outage of services: 3/8 5:34 pm CT - 3/9 1:20 pm CT (~20h)
    • Affected: CE collector, 3 Hosted CEs, and 1 OSDF Cache
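A rough sketch of the manual recovery steps, assuming the quarantine was done with kubectl cordon (the pod name is one from the previous slide; --grace-period=0 is an assumption commonly paired with --force):

# Quarantine the bad node so no new pods are scheduled onto it
kubectl cordon tiger0002.chtc.wisc.edu

# Force-delete a stuck CE pod so it gives back its PVC and reschedules on another node
kubectl -n slate-group-osg-ops delete pod osg-hosted-ce-psu-ligo-756dbbd7f7-vldf9 --force --grace-period=0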


10 of 11

Root Cause

  • Jeff P on Slack:
    • “Looking at other pods there was a metallb eviction about 18h ago off of tiger0002, and the speaker on tiger0002 is showing an error on image pull. and the host os “route” command is hanging some”
  • Brian B reply:
    • “That'd do it... if MetalLB is evicted then it can't advertise IP addresses.”
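A sketch of how this could be confirmed with kubectl, assuming MetalLB is deployed in the conventional metallb-system namespace (the namespace and the event check are assumptions, not taken from the original debugging):

# Is a MetalLB speaker pod actually running (and Ready) on the bad node?
kubectl -n metallb-system get pods -o wide --field-selector spec.nodeName=tiger0002.chtc.wisc.edu

# Recent events should show the eviction and the image-pull error
kubectl -n metallb-system get events --sort-by=.lastTimestamp | tail -20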



11 of 11

Concerns / room for improvement?

  • The error wasn’t obvious at first pass in Kubernetes; it just reported that the pods were happily running
    • Is there any way to catch this from within k8s, or is Check_MK our best tool for detecting broken WAN connectivity to pods? (one possible in-cluster signal is sketched after this list)
  • What should the procedure be for this type of error? It is great that k8s pods can start up on a different host, but it still required humans to notice the problem, quarantine the bad node, and restart the pods
  • In addition to the daily reports, we do get email alerts from Check_MK when things go wrong. However, for the CEs I only got one alert, and it was buried among many other emails about transient warnings. We should clean up the alerts and also ensure critical ones are sent periodically
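On the “catch it from within k8s” question: given the root cause, one candidate in-cluster signal is the MetalLB speaker DaemonSet no longer having a ready pod on every node. A sketch only, assuming the speaker runs as a DaemonSet named speaker in metallb-system:

# Alert when desired != ready, i.e. some node has no working MetalLB speaker
kubectl -n metallb-system get daemonset speaker \
  -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}{"\n"}'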
