Operations Area
Mar 15, 2023 (Since 11-02-22)
Jeff Dost
A Tale of Kubernetes
Request from resource provider for Hosted CE change
LIGO_AU_SUT-OzStar_trevor:
  limits:
    entry:
      glideins: 1
  attrs:
    GLIDEIN_Supported_VOs:
      value: LIGO
    GLIDEIN_CPUS:
      value: 12
    GLIDEIN_MaxMemMBs:
      value: 28000
  submit_attrs:
    +maxMemory: 28000
    +xcount: 12
    +queue: '"trevor"'
attrs: get into the glidein StartD config at worker node startup
submit_attrs: get into the glidein classad at the CE to request the correct amount of resources
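As a quick sanity check (a sketch of mine, not from the slides), the CE can be queried to confirm the submit_attrs made it into the glidein job ad; <ce-host> below is a placeholder for the Hosted CE hostname:

# Hedged sketch: verify +maxMemory / +xcount / +queue landed in the CE job ad.
# <ce-host> is a placeholder; attribute names follow the config above.
condor_ce_q -pool <ce-host>:9619 -name <ce-host> -af:j maxMemory xcount queue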
Executing reconfigure hook: /etc/gwms-factory/hooks.reconfig.pre/hostedce_gen.sh
ERROR:root:
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
Unexpected exception. Aborting automatic configuration generation!
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
OSG_autoconf exited with a code different than 0. Aborting.
Press a key to continue...
Continuing with reconfigure and old xmls
(OSG_COLLECTOR here is collector.opensciencegrid.org)
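One way to reproduce the failure in isolation (an assumption on my part; the slides only show the reconfigure output) is to re-run the pre-reconfig hook from the log by hand:

# Hook path taken from the "Executing reconfigure hook" line above;
# assumed to run with no arguments, the same way the factory invokes it.
/etc/gwms-factory/hooks.reconfig.pre/hostedce_gen.sh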
condor_status -pool collector.opensciencegrid.org:9619 -schedd
Error: communication error
CEDAR:6001:Failed to connect to <128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>
Error: Couldn't contact the condor_collector on
central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>).
Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.
If you are the system administrator, check that the condor_collector is
running on central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>), check the
ALLOW/DENY configuration in your condor_config, and check the MasterLog and
CollectorLog files in your log directory for possible clues as to why the
condor_collector is not responding. Also see the Troubleshooting section of
the manual.
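A lower-level probe (not on the slides) helps separate "host unreachable" from "daemon not answering"; the address and port come from the error above:

# Raw TCP probe of the collector address from the error message.
nc -vz -w 5 128.104.103.154 9619
# If TCP connects, check that the daemon completes a READ-level handshake.
condor_ping -address '<128.104.103.154:9619>' READ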
Reached out on Slack #operations
Meanwhile During Ops Standup…
Services with issues (6):
+--------------------------------------------------------------------------------------------+------------------+
| Service | Availability % |
+============================================================================================+==================+
| Condor status on collector.opensciencegrid.org | 36.69 |
+--------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability LIGO_US_PSU-LIGO (psu-ligo-ce1.svc.opensciencegrid.org) | 36.69 |
+--------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability OSG_US_LSUHSC-Tigerfish-CE2 (lsuhsc-tf-ce2.svc.opensciencegrid.org) | 0 |
+--------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability OSG_US_UTC-Epyc (utc-epyc-ce1.svc.opensciencegrid.org) | 36.69 |
+--------------------------------------------------------------------------------------------+------------------+
| Hosted CE availability VU-AUGIE-CE1 (vu-augie-ce1.svc.opensciencegrid.org) | 36.69 |
+--------------------------------------------------------------------------------------------+------------------+
| XRootD copy from CHTC_TIGER_CACHE (stash-cache.osg.chtc.io) | 36.71 |
+--------------------------------------------------------------------------------------------+------------------+
SLA services:
+---------------------------+---------+------------------+------------------+
| Service | SLA % | Availability % | SLA Satisfied? |
+===========================+=========+==================+==================+
| SLA - CE Collectors | 90 | 36.69 | N |
+---------------------------+---------+------------------+------------------+
| SLA - Central Managers | 95 | 100 | Y |
+---------------------------+---------+------------------+------------------+
| SLA - GRACC | 95 | 99.69 | Y |
+---------------------------+---------+------------------+------------------+
| SLA - GWMS Factories | 95 | 99.87 | Y |
+---------------------------+---------+------------------+------------------+
| SLA - GWMS Frontends | 95 | 99.74 | Y |
+---------------------------+---------+------------------+------------------+
| SLA - Hosted CEs | 95 | 93.4 | N |
+---------------------------+---------+------------------+------------------+
Availability numbers that are all roughly the same look suspect! Did all of these go down at the same time?
Confirmed: all of the problem services were running on tiger0002
kubectl -n slate-group-osg-ops get pods -o wide | egrep 'utc|vu|psu'
osg-hosted-ce-psu-ligo-756dbbd7f7-vldf9 1/1 Running 0 5d21h 10.129.180.251 tiger0002.chtc.wisc.edu <none> <none>
osg-hosted-ce-utc-epyc-7b85574d87-6bjwz 1/1 Running 0 5d21h 10.129.173.107 tiger0002.chtc.wisc.edu <none> <none>
osg-hosted-ce-vu-augie-ce1-6c96b657cb-b6xdw 1/1 Running 0 5d22h 10.129.173.74 tiger0002.chtc.wisc.edu <none> <none>
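With every unhealthy service on the same node, a natural next step (my suggestion; the slides jump straight to the CE query) is to inspect the node itself:

# Check node conditions (Ready, MemoryPressure, NetworkUnavailable, ...) and recent events.
kubectl describe node tiger0002.chtc.wisc.edu
kubectl get events --all-namespaces --field-selector involvedObject.name=tiger0002.chtc.wisc.edu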
condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org
Error: Couldn't contact the condor_collector on
psu-ligo-ce1.svc.opensciencegrid.org:9619.
Retrying the same query succeeded:
condor_ce_q -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 -name psu-ligo-ce1.svc.opensciencegrid.org
-- Schedd: psu-ligo-ce1.svc.opensciencegrid.org : <128.104.103.185:9619?... @ 03/09/23 16:57:02
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
osg07 ID: 479598 2/11 00:25 _ _ _ 1 479598.0
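Since the identical query failed once and then succeeded, connectivity looks intermittent; a repeated probe like the sketch below (hypothetical, not from the slides) can make the flapping visible:

# Probe the CE collector every few seconds and log pass/fail with a timestamp.
for i in $(seq 1 20); do
  condor_ce_status -pool psu-ligo-ce1.svc.opensciencegrid.org:9619 >/dev/null 2>&1 \
    && echo "$(date +%T) ok" || echo "$(date +%T) FAIL"
  sleep 5
done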
Continuing debugging on Slack
Root Cause
?
Concerns / room for improvements?