1 of 20

Joseph Areeda*, Cole Bollig†, Joshua Smith*

* - California State University, Fullerton

† - University of Wisconsin Madison


Case study: dynamic jobs, subdags, and resource requests in HTCondor DAGs

2 of 20

Laser Interferometer Gravitational Wave Observatory [1]

  • 4 km arms change in length as a gravitational wave passes: ΔL = O(10⁻¹⁸ m)
  • The ratio of ΔL to the arm length L = 4 km defines the strain used to characterize gravitational waves (estimated below)
  • This would be like measuring the distance from our Sun to Alpha Centauri and detecting a change about the width of a human hair (Nature blog [2])
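As a rough back-of-the-envelope estimate (not a quoted detector sensitivity figure), the strain implied by these numbers is:

$$ h = \frac{\Delta L}{L} \approx \frac{10^{-18}\,\mathrm{m}}{4 \times 10^{3}\,\mathrm{m}} \approx 2.5 \times 10^{-22} $$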


3 of 20

Real time data

  • The interferometer uses a closed-loop control system. Strain is calculated from control signals at 16 kHz, with a nominal 5-second latency for filters to settle.
  • O(500,000) auxiliary channels are acquired to
    • Operate the interferometer
    • Monitor the health of the interferometer
    • Eliminate any non-gravitational-wave sources of possible detections
  • This case study is concerned with about 1000 channels used to estimate noise and to support noise-source investigation and mitigation


4 of 20

Omicron [3]

  • C++ implementation of the Q-transform, a multi-resolution time-frequency analysis
  • The Q-transform combines Fourier analysis with sine-Gaussian wavelet decomposition, giving good time and frequency resolution (sketched below)
  • It can produce images or apply a threshold to generate triggers.
  • Clustering is a trigger-by-trigger search for neighbors in time and frequency.
  • We use the clustered triggers.
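As a reminder of the form (following Chatterji et al., ref. 4; the notation here is a sketch, not Omicron's exact implementation), the Q-transform projects the data x(t) onto windowed complex exponentials:

$$ X(\tau, f, Q) = \int_{-\infty}^{+\infty} x(t)\, w(t - \tau, f, Q)\, e^{-2\pi i f t}\, dt $$

where w is a Gaussian window whose duration scales as Q/f, so each (f, Q) tile behaves like a sine-Gaussian wavelet.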


5 of 20

Q-transform [4]


[Figure: Q-transform spectrograms, unclustered and clustered triggers, of GW170814, 14 Aug 2017 10:30:43: two black holes of 30 and 25 solar masses, about 540 megaparsecs (roughly 1.8 billion light-years) from Earth.]

6 of 20

The problem

We process O(1000) channels at 2 observatories every 5 minutes. Some jobs depend on detector state, but a complete day amounts to about 250,000 jobs.

Jobs fail for many reasons; the most common is delayed arrival of data from the real-time front-end system.

The project discussed here finds gaps in the results that may be fillable, quickly and efficiently, and reprocesses those time intervals.


7 of 20

The gap filler job

  1. Scan the available data files for times when our input data is present. Where data is available, scan for the appropriate detector state and for missing Omicron trigger files (see the segment-arithmetic sketch below).
  2. Generate HTCondor DAGs to fill each gap for any channel group with missing data.
  3. Submit up to N DAGs at a time to limit resource requests.

NB: Time periods scanned vary from a few hours to a few months.
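A minimal sketch of the segment arithmetic behind step 1, assuming segment lists of (GPS start, GPS end) pairs; the helper names and example numbers are illustrative, not the actual omicron-find-gaps code:

# Illustrative sketch only: a "gap" is time with input data and good detector
# state but no existing Omicron trigger files. Segments are sorted, disjoint
# (start, end) pairs in GPS seconds.

def intersect(a, b):
    """Times covered by both segment lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def subtract(a, b):
    """Times in a that are not covered by b."""
    out = []
    for lo, hi in a:
        cur = lo
        for blo, bhi in b:
            if bhi <= cur or blo >= hi:
                continue
            if blo > cur:
                out.append((cur, blo))
            cur = max(cur, bhi)
        if cur < hi:
            out.append((cur, hi))
    return out

# Hypothetical example segments for one channel group
data_available = [(1402876818, 1402900000), (1402905000, 1402963218)]
good_state     = [(1402876818, 1402963218)]
have_triggers  = [(1402876818, 1402890000), (1402910000, 1402963218)]

processable = intersect(data_available, good_state)
gaps = subtract(processable, have_triggers)
print(gaps)  # -> [(1402890000, 1402900000), (1402905000, 1402910000)]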


8 of 20

Gap filler overview


9 of 20

DAG to fill 1 gap in one channel group


10 of 20

A typical day (well, one where it all worked)


(Animated GIF; not available in the PDF)

11 of 20

The DAG created by a program each time

  • FIND identifies gaps and creates 0-n scripts for FILL; a script may cover more than one gap.
  • FILL creates a DAG to fill each gap.
  • The POST script (omicron-subdag-create) overwrites the placeholder with a DAG that runs each DAG created by FILL, 8 at a time.
  • The now-updated “all_omicron_subdags” subdag actually does the work.


12 of 20

The DAG that controls the process

JOB FIND STD5-FIND.submit

JOB FILL STD5-FILL.submit

SCRIPT POST FILL omicron-subdag-create -vvv --group STD5 --inpath gaps-STD5-20240620.000000-20240621.000000 --outpath omicron_subdag.dag

SUBDAG EXTERNAL all_omicron_subdags /omicron_subdag.dag

PARENT FIND CHILD FILL

PARENT FILL CHILD all_omicron_subdags


13 of 20

FIND job submit file

arguments = "omicron-find-gaps -v --group STD5 --ifo L1 --config-file l1-channels.ini -v --output-dir gaps-STD5 1402876818 1402963218 --"

executable = /home/detchar/.conda/envs/ligo-omicron-3.10/bin/python

log = STD5-FIND.log

error = STD5-FIND.err

output = STD5-FIND.out

request_disk = 500M

request_memory = 1024M

queue 1


14 of 20

FILL submits an unspecified number of jobs

The submit file uses wildcards in the queue statement:

arguments = $(script)

executable = /bin/bash

log = STD5-FILL-$(Process).log

error = STD5-FILL-$(Process).err

output = STD5-FILL-$(Process).out

queue script matching fillgap-*.sh


15 of 20

Dynamic DAGs

The top-level DAG references a placeholder subdag but does not create its contents:

JOB FIND STD5-FIND.submit

JOB FILL STD5-FILL.submit

SCRIPT POST FILL omicron-subdag-create --inpath --outpath omicron_subdag.dag

SUBDAG EXTERNAL all_omicron_subdags omicron_subdag.dag

PARENT FIND CHILD FILL

PARENT FILL CHILD all_omicron_subdags

The POST script omicron-subdag-create is a Python program that finds all of the DAGs created by the FILL job and writes the omicron_subdag.dag file, roughly as sketched below.
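A minimal sketch of what such a POST script could look like, assuming each FILL-generated gap directory contains a condor/omicron-<group>.dag; the argument names and directory layout follow the examples on these slides, but this is not the actual omicron-subdag-create implementation:

#!/usr/bin/env python
# Illustrative sketch: collect the per-gap DAGs written by FILL and write a
# wrapper DAG that runs them as throttled subdags.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument('--group', default='STD5')
parser.add_argument('--inpath', required=True)    # directory FILL wrote the gap DAGs into
parser.add_argument('--outpath', required=True)   # placeholder DAG file to overwrite
parser.add_argument('--maxjobs', type=int, default=5)
args = parser.parse_args()

dags = sorted(Path(args.inpath).glob(f'*/condor/omicron-{args.group}.dag'))
with open(args.outpath, 'w') as out:
    for n, dag in enumerate(dags):
        out.write(f'SUBDAG EXTERNAL {args.group}_{n:02d} {dag}\n')
    # Throttle: put every node in one category and cap how many run at once
    out.write('CATEGORY ALL_NODES LIMIT\n')
    out.write(f'MAXJOBS LIMIT {args.maxjobs}\n')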


16 of 20

Subdags created by the POST script

Each gap has a DAG

SUBDAG EXTERNAL PEM2_00 PEM2-201848-001848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_01 PEM2-001848-041848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_02 PEM2-041848-081848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_03 PEM2-161848-201848/condor/omicron-PEM2.dag
SUBDAG EXTERNAL PEM2_04 PEM2-081848-121848/condor/omicron-PEM2.dag

<snip>

To be a good citizen, limit the number of DAGs running in parallel; a category throttle does this without needing parent/child relationships:

CATEGORY ALL_NODES LIMIT

MAXJOBS LIMIT 5


17 of 20

Dynamic memory requests

About 80% of the trigger-generator runs work well within 1 GB of memory, but noisy data increases the memory requirement. At 7 GB we declare the data noisy enough that we can stop.

my.InitialRequestMemory = 1000

request_memory = ifthenelse(isUndefined(MemoryUsage), my.InitialRequestMemory, int(2*MemoryUsage))

periodic_release = (HoldReasonCode =?= 26 || HoldReasonCode =?= 34 || HoldReasonCode =?= 46) && (JobStatus == 5) && (time() - EnteredCurrentStatus > 5)

periodic_remove = (JobStatus == 1) && MemoryUsage >= 7000

allowed_job_duration = 10800


18 of 20

Dynamic memory requests step by step

request_memory = ifthenelse(isUndefined(MemoryUsage), my.InitialRequestMemory, int(2*MemoryUsage))

When the job first starts, MemoryUsage is undefined, so the initial request is used. Once the job has run and been held, the request becomes twice the observed MemoryUsage (see the worked example below).

periodic_release = (HoldReasonCode =?= 26 || HoldReasonCode =?= 34 || HoldReasonCode =?= 46) && (JobStatus == 5) && (time() - EnteredCurrentStatus > 5)

  • JobStatus 5: Job is held
  • HoldReasonCode 34: Memory usage exceeds a memory limit.
  • HoldReasonCode 26: SYSTEM_PERIODIC_HOLD evaluated to true.
  • HoldReasonCode 46: The job’s allowed duration was exceeded.
  • time() …: wait 5 seconds to avoid a race condition between the request_memory update and periodic_release
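A small worked example (a Python sketch of the arithmetic above, not anything HTCondor runs) showing how the request grows after each memory hold until the 7000 MB removal threshold is reached; the MemoryUsage values are made up:

# Sketch: mirror request_memory = ifthenelse(isUndefined(MemoryUsage),
#                                            my.InitialRequestMemory, int(2*MemoryUsage))
INITIAL_REQUEST_MB = 1000
REMOVE_THRESHOLD_MB = 7000

def next_request(memory_usage_mb):
    if memory_usage_mb is None:            # first start: MemoryUsage undefined
        return INITIAL_REQUEST_MB
    return int(2 * memory_usage_mb)        # after a hold: double the observed usage

# Hypothetical MemoryUsage (MB) observed each time the job is held on code 34
for usage in [None, 1400, 3100, 7200]:
    if usage is not None and usage >= REMOVE_THRESHOLD_MB:
        print(f"MemoryUsage {usage} MB >= {REMOVE_THRESHOLD_MB} MB: periodic_remove fires")
        break
    print(f"MemoryUsage {usage}: request_memory -> {next_request(usage)} MB")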


19 of 20

Takeaways

  • The HTCondor submit language is very powerful, which makes it hard to know everything that is possible.
  • Dynamic DAGs are useful for automating tasks that are data dependent.
  • Managing resource requests not only helps others but also lets our jobs match execute points more quickly.


20 of 20

Related links

  1. LIGO Scientific Collaboration, https://www.ligo.org/index.php
  2. Keen blog, https://www.nature.com/scitable/blog/pop/gravitatonal_waves/
  3. Robinet et al., “Omicron: a tool to characterize transient noise in gravitational-wave detectors”, https://arxiv.org/abs/2007.11374
  4. Chatterji et al., “Multiresolution techniques for the detection of gravitational-wave bursts”, https://arxiv.org/pdf/gr-qc/0412119
  5. HTCondor queue statement variations: https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#submitting-many-similar-jobs-with-one-queue-command
  6. SubDAGs: https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-using-other-dags.html#a-dag-within-a-dag-is-a-subdag
