Joseph Areeda*, Cole Bollig†, Joshua Smith*
* - California State University, Fullerton
† - University of Wisconsin Madison
Case study: dynamic jobs, subdags, and resource requests in HTCondor DAGs
Laser Interferometer Gravitational Wave Observatory1
Real time data
Omicron3
Q-transform4
Unclustered
Clustered
GW170814 14 Aug 2017 10:30:43
2 black holes, 30 and 25 solar masses
540 megaparsecs, about 1.8 billion light years from Earth
The problem
We process O(1000) channels at 2 observatories every 5 minutes. Some depend on detector state, but a complete day has 250,000 jobs.
Jobs fail for many reasons; the most common is delayed arrival of data from the real-time front-end system.
This project finds gaps in the results that may be fillable and reprocesses those time intervals quickly and efficiently.
The gap filler job
NB: Time periods can vary from a few hours to a few months.
Gap filler overview
DAG to fill 1 gap in one channel group
A typical day (well, one where it all worked)
Animated GIF (N/A in PDF)
The DAG created by a program each time
The DAG that controls the process
JOB FIND STD5-FIND.submit
JOB FILL STD5-FILL.submit
SCRIPT POST FILL omicron-subdag-create -vvv --group STD5 --inpath gaps-STD5-20240620.000000-20240621.000000 --outpath omicron_subdag.dag
SUBDAG EXTERNAL all_omicron_subdags /omicron_subdag.dag
PARENT FIND CHILD FILL
PARENT FILL CHILD all_omicron_subdags
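Since a program writes this DAG for every run, only the channel group and the time range change from one run to the next. A minimal sketch of such a generator follows. The function name `top_level_dag`, its parameters, and the relative subdag path are hypothetical and not taken from the Omicron tooling; only the file-naming pattern comes from the DAG shown here.

```python
from datetime import datetime


def top_level_dag(group: str, start: datetime, end: datetime,
                  subdag_path: str = "omicron_subdag.dag") -> str:
    """Return the text of a FIND -> FILL -> subdags DAG for one channel group."""
    stamp = "%Y%m%d.%H%M%S"  # matches names like gaps-STD5-20240620.000000-...
    inpath = f"gaps-{group}-{start.strftime(stamp)}-{end.strftime(stamp)}"
    lines = [
        f"JOB FIND {group}-FIND.submit",
        f"JOB FILL {group}-FILL.submit",
        # The POST script runs after FILL finishes and writes the subdag file.
        f"SCRIPT POST FILL omicron-subdag-create -vvv --group {group} "
        f"--inpath {inpath} --outpath {subdag_path}",
        # Placeholder: the file named here does not exist until the POST script runs.
        f"SUBDAG EXTERNAL all_omicron_subdags {subdag_path}",
        "PARENT FIND CHILD FILL",
        "PARENT FILL CHILD all_omicron_subdags",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(top_level_dag("STD5", datetime(2024, 6, 20), datetime(2024, 6, 21)))
```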
FIND job submit file
arguments = "omicron-find-gaps -v --group STD5 --ifo L1 --config-file l1-channels.ini -v --output-dir gaps-STD5 1402876818 1402963218 --"
executable = /home/detchar/.conda/envs/ligo-omicron-3.10/bin/python
log = STD5-FIND.log
error = STD5-FIND.err
output = STD5-FIND.out
request_disk = 500M
request_memory = 1024M
queue 1
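The two trailing integers in the arguments appear to be GPS start and end times, 86400 seconds apart. A small sketch for converting them to UTC follows, assuming a fixed GPS-UTC offset of 18 seconds (correct for dates after 2017-01-01 as long as no new leap second has been added). The helper name `gps_to_utc` is illustrative; a production tool would use a leap-second table.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
# Assumption: GPS time is 18 s ahead of UTC for the dates of interest.
GPS_MINUS_UTC_SECONDS = 18


def gps_to_utc(gps_seconds: int) -> datetime:
    """Convert GPS seconds to a UTC datetime using a fixed leap-second offset."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - GPS_MINUS_UTC_SECONDS)


if __name__ == "__main__":
    print(gps_to_utc(1402876818))  # 2024-06-20 00:00:00+00:00
    print(gps_to_utc(1402963218))  # 2024-06-21 00:00:00+00:00
```

With that offset the interval is exactly one UTC day, consistent with the `gaps-STD5-20240620.000000-20240621.000000` path in the controlling DAG.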
FILL submits an unspecified number of jobs
Submit file uses wildcards in the queue statement
arguments = $(script)
executable = /bin/bash
log = STD5-FILL-$(Process).log
error = STD5-FILL-$(Process).err
output = STD5-FILL-$(Process).out
queue script matching fillgap-*.sh
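To preview what this queue statement will produce before submitting, the wildcard can be expanded by hand. The sketch below is only an approximation: the helper name `preview_fill_jobs` is hypothetical, and sorting the matches is an assumption about ordering, so the pairing of `$(Process)` numbers with scripts may differ from what HTCondor does.

```python
import glob


def preview_fill_jobs(pattern: str = "fillgap-*.sh", prefix: str = "STD5-FILL"):
    """List (Process, script, log file) for each file the pattern matches."""
    jobs = []
    # One job per matching file; $(script) takes the file name and
    # $(Process) counts up from 0 within the cluster.
    for process, script in enumerate(sorted(glob.glob(pattern))):
        jobs.append((process, script, f"{prefix}-{process}.log"))
    return jobs


if __name__ == "__main__":
    for process, script, log in preview_fill_jobs():
        print(f"Process {process}: /bin/bash {script} -> {log}")
```

If the FIND job found no gaps, the list is empty, which is worth checking for before the FILL node runs.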
Dynamic DAGs
The top-level DAG names a placeholder subdag file but does not create it:
JOB FIND STD5-FIND.submit
JOB FILL STD5-FILL.submit
SCRIPT POST FILL omicron-subdag-create --inpath --outpath omicron_subdag.dag
SUBDAG EXTERNAL all_omicron_subdags omicron_subdag.dag
PARENT FIND CHILD FILL
PARENT FILL CHILD all_omicron_subdags
The POST script omicron-subdag-create is a Python program that finds all DAGs created by the FILL jobs and writes the omicron_subdag.dag file.
Subdags created by post script
Each gap has a DAG
SUBDAG EXTERNAL PEM2_00 PEM2-201848-001848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_01 PEM2-001848-041848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_02 PEM2-041848-081848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_03 PEM2-161848-201848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_04 PEM2-081848-121848/condor/omicron-PEM2.dag
<snip>
To be a good citizen, limit the number of DAGs that run in parallel. The lines below place every node in one category, LIMIT, and cap it at 5:
CATEGORY ALL_NODES LIMIT
MAXJOBS LIMIT 5
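A file of this shape could be produced by a short script. The sketch below is not the actual omicron-subdag-create code. It assumes each gap directory holds `condor/omicron-<group>.dag`, as in the listing, and the function name `write_subdag_file` and its parameters are hypothetical.

```python
from pathlib import Path


def write_subdag_file(inpath: str, group: str, outpath: str,
                      max_jobs: int = 5) -> int:
    """Write one SUBDAG EXTERNAL line per gap DAG found under inpath."""
    # Assumed layout: <inpath>/<gap-dir>/condor/omicron-<group>.dag
    dags = sorted(Path(inpath).glob(f"*/condor/omicron-{group}.dag"))
    lines = [f"SUBDAG EXTERNAL {group}_{i:02d} {dag}"
             for i, dag in enumerate(dags)]
    # Throttle: every node joins category LIMIT, capped at max_jobs.
    lines += ["CATEGORY ALL_NODES LIMIT", f"MAXJOBS LIMIT {max_jobs}"]
    Path(outpath).write_text("\n".join(lines) + "\n")
    return len(dags)  # caller may want to treat 0 as "nothing to fill"


if __name__ == "__main__":
    count = write_subdag_file(".", "PEM2", "omicron_subdag.dag")
    print(f"wrote {count} subdag entries")
```

The sketch does not handle the case where no gap DAGs exist; how an empty subdag file should be treated is left open here.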
Dynamic memory requests
About 80% of the trigger generator runs work well in 1 GB of memory, but noisy data increases memory requirements. At 7 GB we declare the data noisy enough that we can stop.
my.InitialRequestMemory = 1000
request_memory = ifthenelse(isUndefined(MemoryUsage), my.InitialRequestMemory, int(2*MemoryUsage))
periodic_release = (HoldReasonCode =?= 26 || HoldReasonCode =?= 34 || HoldReasonCode =?= 46) && (JobStatus == 5) && (time() - EnteredCurrentStatus > 5)
periodic_remove = (JobStatus == 1) && MemoryUsage >= 7000
allowed_job_duration = 10800
Dynamic memory requests step by step
request_memory = ifthenelse(isUndefined(MemoryUsage), my.InitialRequestMemory, int(2*MemoryUsage))
When the job first starts, MemoryUsage is undefined, so the initial request is used. If the job is held and then released, request_memory becomes twice the last observed MemoryUsage.
periodic_release = (HoldReasonCode =?= 26 || HoldReasonCode =?= 34 || HoldReasonCode =?= 46) && (JobStatus == 5) && (time() - EnteredCurrentStatus > 5)
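The escalation these expressions produce can be traced with a toy model. The sketch below is a simplification, not HTCondor behavior in detail: it assumes a job is held exactly when its usage reaches the request, that MemoryUsage then equals that request, and that every hold is one of the released codes. The function names are illustrative.

```python
def next_request(memory_usage_mb, initial_mb=1000):
    """Mirror of the request_memory expression: initial value, else 2x usage."""
    if memory_usage_mb is None:  # stands in for isUndefined(MemoryUsage)
        return initial_mb
    return int(2 * memory_usage_mb)


def simulate(peak_need_mb, limit_mb=7000, initial_mb=1000):
    """Return the sequence of memory requests and the job's final outcome."""
    usage = None
    requests = []
    while True:
        # Mirror of periodic_remove: give up once usage reaches the limit.
        if usage is not None and usage >= limit_mb:
            return requests, "removed"
        request = next_request(usage, initial_mb)
        requests.append(request)
        if peak_need_mb <= request:
            return requests, "completed"
        # Assumption: the job is held with MemoryUsage equal to its request,
        # then released by periodic_release and matched again.
        usage = request


if __name__ == "__main__":
    for need in (800, 3000, 9000):
        print(need, simulate(need))
```

Under these assumptions a job needing 3000 MB runs three times (1000, 2000, 4000 MB), while one needing 9000 MB is removed after its 8000 MB attempt.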
Takeaways
Related links