1 of 11

Operations Area

Mar 15, 2023 (Since 11-02-22)

Jeff Dost

A Tale of Kubernetes

2 of 11

Request from resource provider for Hosted CE change

  • https://support.opensciencegrid.org/a/tickets/71621/
  • “I reduced the mem request by 2GB, by changing #SBATCH --mem=30000 to #SBATCH --mem=28000, which works. Can you please give that a try?”
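For reference, #SBATCH --mem is the Slurm directive that sets the job's memory request in MB. A minimal illustrative batch header showing the change (a hypothetical script, not the site's actual pilot submit file; the partition and CPU lines are assumptions based on the entry settings on the next slide):

#!/bin/bash
#SBATCH --partition=trevor      # assumed: matches +queue "trevor" on the next slide
#SBATCH --cpus-per-task=12      # assumed: matches GLIDEIN_CPUS / +xcount = 12
#SBATCH --mem=28000             # was --mem=30000; lowered by 2 GB per the site's request
# ... glidein pilot payload follows ...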


3 of 11

  • On the factory, updated /etc/osg-gfactory/OSG_autoconf/10-hosted-ces.auto.yml:


LIGO_AU_SUT-OzStar_trevor:
  limits:
    entry:
      glideins: 1
  attrs:
    GLIDEIN_Supported_VOs:
      value: LIGO
    GLIDEIN_CPUS:
      value: 12
    GLIDEIN_MaxMemMBs:
      value: 28000
  submit_attrs:
    +maxMemory: 28000
    +xcount: 12
    +queue: "trevor"

The attrs values (e.g. GLIDEIN_MaxMemMBs) get into the glidein Startd config on worker node startup.

The submit_attrs values get into the glidein classad at the CE to request the correct amount of resources.
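Applying the change means re-running the factory reconfigure, which invokes OSG_autoconf as a pre-reconfig hook (seen on the next slide). A rough sketch of the apply-and-verify steps (the grep target directory is an assumption about where the generated factory XML lands):

# Regenerate entries from the .auto.yml files and reconfigure the factory
gwms-factory reconfig

# Spot-check that the new memory value landed in the generated factory config
grep -r "GLIDEIN_MaxMemMBs" /etc/gwms-factory/ | grep 28000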

4 of 11

  • Errors running gwms-factory reconfig:


Executing reconfigure hook: /etc/gwms-factory/hooks.reconfig.pre/hostedce_gen.sh
ERROR:root:
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
Unexpected exception. Aborting automatic configuration generation!
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
OSG_autoconf exited with a code different than 0. Aborting.
Press a key to continue...
Continuing with reconfigure and old xmls

(OSG_COLLECTOR here is collector.opensciencegrid.org)

5 of 11


Querying the collector manually reproduced the failure:

condor_status -pool collector.opensciencegrid.org:9619 -sched
Error: communication error
CEDAR:6001:Failed to connect to <128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>
Error: Couldn't contact the condor_collector on
central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>).
Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.
If you are the system administrator, check that the condor_collector is
running on central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>), check the
ALLOW/DENY configuration in your condor_config, and check the MasterLog and
CollectorLog files in your log directory for possible clues as to why the
condor_collector is not responding. Also see the Troubleshooting section of
the manual.
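A CEDAR connect error like this does not by itself distinguish an HTCondor problem from a plain network problem. A quick TCP-level probe, as a sketch (the nc commands are illustrative and not from the original debugging):

# Does a bare TCP connection to the collector port succeed at all?
nc -vz -w 5 central-collector-0.osg.chtc.io 9619

# Same check against the raw IP, in case DNS is part of the problem
nc -vz -w 5 128.104.103.154 9619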

6 of 11

Reached out on Slack #operations

  • Reported the collector issue
  • John Thiltges confirmed the network path to the collector service looked broken
  • The problem also appeared to affect other services running on tiger0002.chtc.wisc.edu (Fabio had reported an OSDF cache issue earlier)


7 of 11

Meanwhile During Ops Standup…


Services with issues (6):

+---------------------------------------------------------------------------------------------+------------------+
| Service                                                                                     | Availability %   |
+=============================================================================================+==================+
| Condor status on collector.opensciencegrid.org                                              | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability LIGO_US_PSU-LIGO (psu-ligo-ce1.svc.opensciencegrid.org)              | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability OSG_US_LSUHSC-Tigerfish-CE2 (lsuhsc-tf-ce2.svc.opensciencegrid.org)  | 0                |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability OSG_US_UTC-Epyc (utc-epyc-ce1.svc.opensciencegrid.org)               | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability VU-AUGIE-CE1 (vu-augie-ce1.svc.opensciencegrid.org)                  | 36.69            |
+---------------------------------------------------------------------------------------------+------------------+
| XRootD copy from CHTC_TIGER_CACHE (stash-cache.osg.chtc.io)                                 | 36.71            |
+---------------------------------------------------------------------------------------------+------------------+

SLA services:

+---------------------------+---------+------------------+------------------+
| Service                   | SLA %   | Availability %   | SLA Satisfied?   |
+===========================+=========+==================+==================+
| SLA - CE Collectors       | 90      | 36.69            | N                |
+---------------------------+---------+------------------+------------------+
| SLA - Central Managers    | 95      | 100              | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - GRACC               | 95      | 99.69            | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - GWMS Factories      | 95      | 99.87            | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - GWMS Frontends      | 95      | 99.74            | Y                |
+---------------------------+---------+------------------+------------------+
| SLA - Hosted CEs          | 95      | 93.4             | N                |
+---------------------------+---------+------------------+------------------+

The availability numbers all being roughly the same looks suspect! Did all of these services go down at the same time?

8 of 11

Confirmed all problem services were on tiger0002

  • Confirmed the CEs were unreachable from the factory:
  • However, the CE is reachable from another pod inside Tiger! (see the sketch after the output below)


kubectl -n slate-group-osg-ops get pods -o wide | egrep 'utc|vu|psu'
osg-hosted-ce-psu-ligo-756dbbd7f7-vldf9      1/1  Running  0  5d21h  10.129.180.251  tiger0002.chtc.wisc.edu  <none>  <none>
osg-hosted-ce-utc-epyc-7b85574d87-6bjwz      1/1  Running  0  5d21h  10.129.173.107  tiger0002.chtc.wisc.edu  <none>  <none>
osg-hosted-ce-vu-augie-ce1-6c96b657cb-b6xdw  1/1  Running  0  5d22h  10.129.173.74   tiger0002.chtc.wisc.edu  <none>  <none>

From the factory:

condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org
Error: Couldn't contact the condor_collector on
psu-ligo-ce1.svc.opensciencegrid.org:9619.

From another pod inside Tiger:

condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org
-- Schedd: psu-ligo-ce1.svc.opensciencegrid.org : <128.104.103.185:9619?... @ 03/09/23 16:57:02
OWNER  BATCH_NAME  SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
osg07  ID: 479598  2/11 00:25  _     _    _     1      479598.0
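The in-cluster check can be done with kubectl exec from one of the pods listed above. A rough sketch of that approach (using kubectl exec is an assumption about how the in-cluster query was run; the pod name is one of the Hosted CE pods from the listing):

# Run the same query from inside a pod on Tiger to test in-cluster reachability
kubectl -n slate-group-osg-ops exec osg-hosted-ce-utc-epyc-7b85574d87-6bjwz -- \
  condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org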

9 of 11

Continuing debugging on Slack

  • Jeff Peterson quarantined tiger0002 to prevent services from starting on it
  • We had to manually delete the pods to get them to restart on a non-problem host (commands sketched after this list)
    • kubectl -n slate-group-osg-ops delete … was not enough; we also needed --force to make the stuck CEs give back their PVCs (persistent volumes where state is saved between pod restarts)
  • After manual intervention, all services recovered
    • Total outage of services: 3/8 5:34 pm CT - 3/9 1:20 pm CT (~20h)
    • Affected: CE collector, 3 Hosted CEs, and 1 OSDF Cache
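A rough sketch of the manual recovery steps, assuming the quarantine was done with kubectl cordon (the pod name is one from the previous slide; --grace-period=0 is an assumption commonly paired with --force):

# Quarantine the bad node so no new pods are scheduled onto it
kubectl cordon tiger0002.chtc.wisc.edu

# Force-delete a stuck CE pod so it gives back its PVC and reschedules on another node
kubectl -n slate-group-osg-ops delete pod osg-hosted-ce-psu-ligo-756dbbd7f7-vldf9 --force --grace-period=0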


10 of 11

Root Cause

  • Jeff P on Slack:
    • “Looking at other pods there was a metallb eviction about 18h ago off of tiger0002, and the speaker on tiger0002 is showing an error on image pull. and the host os “route” command is hanging some”
  • Brian B reply:
    • “That'd do it... if MetalLB is evicted then it can't advertise IP addresses.”
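A sketch of how this could be confirmed with kubectl, assuming MetalLB is deployed in the conventional metallb-system namespace (the namespace and the event check are assumptions, not taken from the original debugging):

# Is a MetalLB speaker pod actually running (and Ready) on the bad node?
kubectl -n metallb-system get pods -o wide --field-selector spec.nodeName=tiger0002.chtc.wisc.edu

# Recent events should show the eviction and the image-pull error
kubectl -n metallb-system get events --sort-by=.lastTimestamp | tail -20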



11 of 11

Concerns / room for improvement?

  • The error wasn’t obvious at first pass in Kubernetes; it just reported that the pods were happily running
    • Is there any way to catch this from within k8s, or is Check_MK our best tool for detecting broken WAN connectivity to pods? (one possible in-cluster signal is sketched after this list)
  • What should the procedure be for this type of error? It is great that k8s pods can start up on a different host, but it still required humans to notice the problem, quarantine the bad node, and restart the pods
  • In addition to the daily reports, we do get email alerts from Check_MK when things go wrong. However, for the CEs I only got one alert, and it was buried among many other emails about transient warnings. We should clean up the alerts and also ensure critical ones are sent periodically
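On the “catch it from within k8s” question: given the root cause, one candidate in-cluster signal is the MetalLB speaker DaemonSet no longer having a ready pod on every node. A sketch only, assuming the speaker runs as a DaemonSet named speaker in metallb-system:

# Alert when desired != ready, i.e. some node has no working MetalLB speaker
kubectl -n metallb-system get daemonset speaker \
  -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}{"\n"}'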
