PATh Staff HTCSS Update
Todd Tannenbaum
Center for High Throughput Computing
University of Wisconsin-Madison
European HTCondor Workshop 2025, Prague
HTCondor 25.x!
HTCondor 25.x rollout started yesterday
Since last year…
New releases monthly containing a total of
+ 130 documented features
+ 193 documented bugfixes
Highlights on the web, details in the Manual
Highlights:
https://htcondor.org/htcondor/release-highlights/
Details:
https://htcondor.readthedocs.io/en/latest/version-history/index.html
-> https://htcondor.readthedocs.io/en/main/version-history/upgrading-from-24-0-to-25-0-versions.html
Dealing with Job Memory
Ex 1
RequestMemory = 1 GB
RetryRequestMemory = 4 GB

Ex 2
RequestMemory = 1 GB
RetryRequestMemory = 4 GB, 16 GB

Ex 3
RequestMemory = 1 GB
RetryRequestMemoryIncrease = RequestMemory * 4
RetryRequestMemoryMax = 16 GB
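To make the three examples concrete, here is a rough sketch of the sequence of memory requests a job might step through on successive retries. It assumes (the slide does not say so) that a retry happens after the job exceeds its current request, that a list given to RetryRequestMemory is tried in order, and that RetryRequestMemoryIncrease is re-applied to the current request until RetryRequestMemoryMax is reached. The function names are hypothetical, values are in GB, and none of this is HTCondor code.

```python
from typing import Callable, List


def explicit_retries(request_gb: int, retry_gb: List[int]) -> List[int]:
    """Ex 1 and Ex 2: the initial request followed by each listed retry value."""
    return [request_gb] + list(retry_gb)


def increasing_retries(request_gb: int,
                       increase: Callable[[int], int],
                       max_gb: int) -> List[int]:
    """Ex 3: apply the increase expression repeatedly, capped at max_gb."""
    sequence = [request_gb]
    current = request_gb
    while current < max_gb:
        nxt = min(increase(current), max_gb)
        if nxt <= current:
            # Assumed guard: stop if the expression does not grow the request.
            break
        sequence.append(nxt)
        current = nxt
    return sequence


ex1 = explicit_retries(1, [4])
ex2 = explicit_retries(1, [4, 16])
ex3 = increasing_retries(1, lambda mem: mem * 4, 16)
print(ex1, ex2, ex3)
```

Under these assumptions, Ex 2 and Ex 3 describe the same progression; Ex 3 just expresses it as a rule instead of a fixed list.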
From command-line…
$ MY_DIR=dir my_command one two three > out 2> err
From command-line… to submit file
Executable = my_command
Arguments = one two three
Output = out
Error = err
Environment = MY_DIR=dir
queue
New “Shell” submit command
$ MY_DIR=dir my_command one two three > out 2> err
shell = MY_DIR=dir my_command one two three > out 2> err
queue
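The mapping from a shell command line to the individual submit commands shown earlier can be sketched in a few lines of Python. This is only an illustration of the idea, not how HTCondor parses the shell command: the function name `shell_to_submit` is hypothetical, and the sketch handles only leading NAME=value assignments and the `>` and `2>` redirections used in the slide's example.

```python
import shlex
from typing import Dict, List


def shell_to_submit(command_line: str) -> Dict[str, str]:
    """Map a simple shell command line onto submit-file style fields."""
    tokens = shlex.split(command_line)
    env: List[str] = []
    words: List[str] = []
    fields: Dict[str, str] = {}
    i = 0
    # Leading NAME=value tokens are treated as environment assignments.
    while i < len(tokens) and "=" in tokens[i] and not tokens[i].startswith("="):
        env.append(tokens[i])
        i += 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in (">", "1>") and i + 1 < len(tokens):
            fields["Output"] = tokens[i + 1]
            i += 2
        elif tok == "2>" and i + 1 < len(tokens):
            fields["Error"] = tokens[i + 1]
            i += 2
        else:
            words.append(tok)
            i += 1
    if not words:
        raise ValueError("no command found")
    fields["Executable"] = words[0]
    fields["Arguments"] = " ".join(words[1:])
    if env:
        fields["Environment"] = " ".join(env)
    return fields


fields = shell_to_submit("MY_DIR=dir my_command one two three > out 2> err")
for key in ("Executable", "Arguments", "Output", "Error", "Environment"):
    print(f"{key} = {fields[key]}")
```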
Cgroup Management w/o root
Improvements from 24.x -> 25.x
HTCondor Python Bindings Version 2
DAG Checker Tool: condor_dag_checker
- Checks a DAG file for errors such as invalid DAG command syntax, references to undefined nodes, and cyclic dependencies.
- Reports statistics about a DAG, such as its node and arc counts.
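As an illustration of the kinds of checks listed above, the sketch below detects references to undefined nodes and cyclic dependencies, and counts nodes and arcs. It is not the implementation of condor_dag_checker: the function `check_dag`, its inputs, and its result keys are hypothetical, and cycle detection here uses Kahn's topological-sort algorithm as one reasonable choice.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def check_dag(nodes: Set[str], arcs: List[Tuple[str, str]]) -> Dict[str, object]:
    """Report undefined node references, cycles, and basic counts."""
    undefined = sorted({n for arc in arcs for n in arc if n not in nodes})
    # Kahn's algorithm over arcs whose endpoints are both defined.
    children: Dict[str, List[str]] = {n: [] for n in nodes}
    indegree: Dict[str, int] = {n: 0 for n in nodes}
    for parent, child in arcs:
        if parent in nodes and child in nodes:
            children[parent].append(child)
            indegree[child] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return {
        "nodes": len(nodes),
        "arcs": len(arcs),
        "undefined": undefined,
        # Any node never reaching indegree 0 sits on or behind a cycle.
        "has_cycle": visited != len(nodes),
    }


good = check_dag({"A", "B", "C"}, [("A", "B"), ("B", "C")])
bad = check_dag({"A", "B"}, [("A", "B"), ("B", "A"), ("B", "Z")])
print(good)
print(bad)
```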
New Job Status
$ htcondor job status 123.45
Job 123.45 is currently running on host exec221.chtc.wisc.edu.
It started running again 2.1 hours ago.
It was submitted 3.6 hours ago.
Its current memory usage is 2.5 GB out of 4.0 GB requested.
Its current disk usage is 3.8 GB out of 5.5 GB requested.
It has restarted 2 times.
Goodput is 80% (0.5 hours badput, 2.1 hours goodput).
What about a DAGMan workflow?
$ htcondor dag status 223
DAGMan Job 223.0 [simple.dag] has been running for 52 days 04:12:46.
DAG has submitted 382 individual job(s), of which:
45 are running.
10 are idle.
0 are held.
162 have completed successfully
DAG has failed nodes but will continue until all possible work is finished: 5 nodes failed.
10 nodes waiting to begin.
24 nodes running.
[###########=======----------------] 34% complete.
Gotchas: New defaults coming up that might be surprising
import htcondor
import htcondor2 as htcondor
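A script that wants to be explicit about which bindings it uses, rather than depending on the default, could select the module itself. The sketch below is one possible pattern, not an official recommendation; the helper `pick_module` is hypothetical, and the preference order (version 2 first, then the original bindings) is an assumption.

```python
import importlib
import importlib.util
from typing import Optional, Sequence


def pick_module(candidates: Sequence[str]) -> Optional[str]:
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Assumed preference: version 2 bindings first, then the original bindings.
chosen = pick_module(("htcondor2", "htcondor"))
if chosen is None:
    print("No HTCondor Python bindings are installed")
else:
    htcondor = importlib.import_module(chosen)
    print(f"Using {chosen}")
```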
STARTER_NESTED_SCRATCH
execute_dir
  dir_xxxx
    scratch (job files)
    htcondor (condor files)
Share Common Files
Shares explicitly-listed common files between jobs in the same job list (cluster) running at the same time.
[Diagram: Access Point and Execution Point; Slot 1 → Job 55.0]
Share Input Common Files
[Diagram: Access Point and Execution Point; Slot 1 → Job 55.0 and Slot 2 → Job 55.1; the common input files are hard-linked]
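The hard-link idea in the diagram can be illustrated with a small filesystem sketch: one copy of a common input file is linked into two per-slot directories, so both jobs see the file without a second copy of the data. This is only a demonstration of hard links, not HTCondor's implementation; the directory names, file name, and the helper `share_by_hard_link` are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile
from typing import List


def share_by_hard_link(common_file: str, scratch_dirs: List[str]) -> List[str]:
    """Hard-link common_file into each scratch directory and return the new paths."""
    linked = []
    for scratch in scratch_dirs:
        os.makedirs(scratch, exist_ok=True)
        target = os.path.join(scratch, os.path.basename(common_file))
        os.link(common_file, target)  # new name for the same inode, no data copied
        linked.append(target)
    return linked


with tempfile.TemporaryDirectory() as root:
    common = os.path.join(root, "common_input.dat")
    with open(common, "w") as handle:
        handle.write("shared reference data\n")
    slots = [os.path.join(root, "slot1"), os.path.join(root, "slot2")]
    paths = share_by_hard_link(common, slots)
    # All three names should refer to one inode.
    inodes = {os.stat(p).st_ino for p in paths + [common]}
    link_count = os.stat(common).st_nlink
    print(len(inodes), link_count)
```

Because hard links cannot cross filesystems, a scheme like this only works when the shared copy and the per-job directories live on the same volume.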
STARTER_NESTED_SCRATCH
[Diagram: execute_dir containing dir_xxxx, which holds the job files and the condor files]
Some European Workshop Take-aways
This work is supported by NSF under Cooperative Agreement OAC-2030508 as part of the PATh Project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
Thank You!
Please add your institution
to our world map of HTCondor Users at:
and click "Add Your Institution" on upper right