1 of 28

PATh Staff HTCSS Update

Todd Tannenbaum

Center for High Throughput Computing

University of Wisconsin-Madison

2 of 28

European HTCondor Workshop 2025, Prague

HTCondor 25.x!

3 of 28

HTCondor 25.x rollout started yesterday

Since last year…

New releases monthly containing a total of

  + 130 documented features
  + 193 documented bugfixes

4 of 28

Highlights on the web, details in the Manual

Highlights:

https://htcondor.org/htcondor/release-highlights/

Details:

https://htcondor.readthedocs.io/en/latest/version-history/index.html

  • Documented all the new features / mechanisms that have been added at each version
  • Notes about "gotchas" when upgrading from version X to Y

-> https://htcondor.readthedocs.io/en/main/version-history/upgrading-from-24-0-to-25-0-versions.html

  • condor_upgrade_check tool assists administrators upgrading an HTCondor installation between major versions (e.g. V24 -> V25)
    • Intended for use by those running an LTS release of HTCondor.
    • Checks the current installation and setup for well-known incompatibilities and tells the administrator what actions to take to resolve them.

5 of 28

Dealing with Job Memory

  • Current default behavior: Hold jobs that use more memory than requested.

6 of 28

Dealing with Job Memory

  • New first class way for users to retry with more memory.
  • Examples:

Ex 1

RequestMemory = 1 GB

RetryRequestMemory = 4 GB

Ex 2

RequestMemory = 1 GB

RetryRequestMemory = 4 GB, 16 GB

Ex 3

RequestMemory = 1 GB

RetryRequestMemoryIncrease = RequestMemory * 4

RetryRequestMemoryMax = 16 GB
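
As a rough illustration of Ex 3, the Python sketch below computes the sequence of memory requests a job would step through if each retry multiplies the previous request by 4, capped at the maximum. The helper name retry_ladder and the exact retry semantics (multiply the last request, never exceed the cap) are assumptions made for illustration; this is not HTCondor's implementation.

```python
def retry_ladder(initial_mb, factor, max_mb):
    """Return the successive memory requests (in MB), ending at the cap.

    Assumed semantics: each retry multiplies the previous request by
    `factor`, and no request ever exceeds `max_mb`.
    """
    if initial_mb <= 0 or factor <= 1:
        raise ValueError("initial_mb must be positive and factor greater than 1")
    requests = [min(initial_mb, max_mb)]
    while requests[-1] < max_mb:
        requests.append(min(requests[-1] * factor, max_mb))
    return requests


# Ex 3 from the slide: start at 1 GB, multiply by 4, cap at 16 GB.
ladder = retry_ladder(1024, 4, 16 * 1024)
print(ladder)
```

Under these assumptions a job that keeps exceeding its request would run at most three times before hitting the cap.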

7 of 28

From command-line…

  • Before…

$ MY_DIR=dir my_command one two three > out 2> err

8 of 28

From command-line… to submit file

Executable = my_command

Arguments = one two three

Output = out

Error = err

Environment = MY_DIR=dir

queue
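
The mapping from the command line to the submit file above is mechanical, which the following Python sketch tries to make explicit. The helper command_to_submit is hypothetical and covers only what appears on the slide: leading VAR=value assignments, plain arguments, and the "> file" and "2> file" redirections. Joining several environment variables with ";" assumes the older unquoted Environment syntax.

```python
import re
import shlex

ENV_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def command_to_submit(command_line):
    """Translate a simple shell command line into submit-file lines."""
    tokens = shlex.split(command_line)

    # Leading VAR=value tokens are environment assignments.
    env = []
    while tokens and ENV_ASSIGNMENT.match(tokens[0]):
        env.append(tokens.pop(0))
    if not tokens:
        raise ValueError("no executable found in command line")
    executable = tokens.pop(0)

    # Remaining tokens are either redirections or arguments.
    args, output, error = [], None, None
    i = 0
    while i < len(tokens):
        if tokens[i] in (">", "1>", "2>"):
            if i + 1 >= len(tokens):
                raise ValueError("redirection without a target file")
            if tokens[i] == "2>":
                error = tokens[i + 1]
            else:
                output = tokens[i + 1]
            i += 2
        else:
            args.append(tokens[i])
            i += 1

    lines = [f"Executable = {executable}"]
    if args:
        lines.append("Arguments = " + " ".join(args))
    if output:
        lines.append(f"Output = {output}")
    if error:
        lines.append(f"Error = {error}")
    if env:
        lines.append("Environment = " + ";".join(env))
    lines.append("queue")
    return lines


for line in command_to_submit("MY_DIR=dir my_command one two three > out 2> err"):
    print(line)
```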

9 of 28

New “Shell” submit command

  • With shell…

$ MY_DIR=dir my_command one two three > out 2> err

10 of 28

New “Shell” submit command

shell = MY_DIR=dir my_command one two three > out 2> err

queue

11 of 28

Cgroup Management w/o root

  • If the base system grants permission, a glidein EP can create cgroups w/o root!
  • Off by default; ENABLE_CGROUP_WITHOUT_ROOT enables it. Will be the default soon.
  • HTCondor creates writable cgroups by default
  • Can also launch an EP via Docker and create writable cgroups
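
To get a feel for what "the base system grants permission" could mean in practice, here is a minimal Python sketch that finds the current process's cgroup v2 path and checks whether that directory is writable. The helper names and the default mount point /sys/fs/cgroup are assumptions; this is one plausible check on a Linux host, not necessarily the test HTCondor performs.

```python
import os


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/self/cgroup content, or None.

    The unified (v2) hierarchy appears as a line of the form `0::/some/path`.
    """
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def can_create_child_cgroup(path, mount_point="/sys/fs/cgroup"):
    """Check whether this process may create a directory in the given cgroup."""
    directory = os.path.join(mount_point, path.lstrip("/"))
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as handle:
            own_path = cgroup_v2_path(handle.read())
    except OSError:
        own_path = None
    if own_path is None:
        print("no cgroup v2 hierarchy found")
    else:
        print(own_path, "writable:", can_create_child_cgroup(own_path))
```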

12 of 28

13 of 28

Improvements from 24.x -> 25.x

  • New and improved Python bindings (WARNING! Python code must be migrated to the new bindings!)

14 of 28

HTCondor Python Bindings Version 2

  • The version 1 bindings depend on an unsupported library (Boost.Python), so we needed to do something to make sure the bindings would remain available.
  • Bindings are intended to be generally compatible; import htcondor2 as htcondor should mostly just work.
    • However: we removed some of the things marked as deprecated in the version 1 bindings, and will be deprecating a few other APIs that are not widely used.
  • Win for users
    • Less picky, just need Python 3.8 or greater
    • Caltech reports 5x speed improvement! (query 200k job ads)
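
For code that has to run on hosts with either generation of the bindings during a migration, one possible pattern is to prefer htcondor2 and fall back to htcondor. The sketch below is an illustration only; the helper load_bindings is hypothetical, and it deliberately degrades to (None, None) when neither package is installed.

```python
import importlib


def load_bindings():
    """Return (module_name, module), preferring the version 2 bindings.

    Returns (None, None) when neither package is installed.
    """
    for name in ("htcondor2", "htcondor"):
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


name, htcondor = load_bindings()
if htcondor is None:
    print("HTCondor Python bindings are not installed")
else:
    print(f"using {name}")
```

Because some deprecated version 1 APIs were removed, a fallback like this only helps for calls that exist in both versions.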

15 of 28

Improvements from 24.x -> 25.x

  • New and improved Python bindings (WARNING! Python code must be migrated to the new bindings!)
  • New condor_dag_checker tool finds syntax and logic errors before run

16 of 28

DAG Checker Tool: condor_dag_checker

  • Good for checking the validity of a DAG file and getting some statistics back without having to place the DAG and dig through the DAGMan debug log file.

    • Check a DAG file for various failures such as invalid DAG command syntax, references to undefined nodes, and cyclic dependencies.

    • Get statistics about a DAG, such as its count of nodes and arcs.
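
The kinds of checks listed above can be sketched in a few lines of Python. This is a toy illustration, not how condor_dag_checker works internally: the function check_dag is hypothetical and understands only JOB and PARENT ... CHILD ... lines, using Kahn's algorithm to detect cycles.

```python
from collections import defaultdict, deque


def check_dag(text):
    """Return (node_count, arc_count, problems) for a minimal DAG description."""
    nodes, arcs, problems = set(), set(), []
    for number, raw in enumerate(text.splitlines(), start=1):
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue
        upper = [w.upper() for w in words]
        if upper[0] == "JOB" and len(words) >= 3:
            nodes.add(words[1])
        elif upper[0] == "PARENT" and "CHILD" in upper:
            split = upper.index("CHILD")
            parents, children = words[1:split], words[split + 1:]
            if not parents or not children:
                problems.append(f"line {number}: PARENT/CHILD needs nodes on both sides")
            arcs.update((p, c) for p in parents for c in children)
        else:
            problems.append(f"line {number}: unrecognized or malformed command")

    # Every node named in an arc must have been declared with JOB.
    undefined = sorted({n for arc in arcs for n in arc if n not in nodes})
    problems.extend(f"undefined node: {n}" for n in undefined)

    # Kahn's algorithm: any node never reaching in-degree 0 is on or behind a cycle.
    indegree = {n: 0 for n in nodes}
    children_of = defaultdict(list)
    for parent, child in arcs:
        if parent in nodes and child in nodes:
            children_of[parent].append(child)
            indegree[child] += 1
    ready = deque(n for n, degree in indegree.items() if degree == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in children_of[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if visited != len(nodes):
        problems.append("cyclic dependency detected")
    return len(nodes), len(arcs), problems


print(check_dag("JOB A a.sub\nJOB B b.sub\nPARENT A CHILD B\nPARENT B CHILD A\n"))
```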

17 of 28

Improvements from 24.x -> 25.x

  • New and improved Python bindings (WARNING! Python code must be migrated to the new bindings!)
  • New condor_dag_checker tool finds syntax and logic errors before run
  • Add the ability to enforce memory and CPU limits on local universe jobs
  • Add job attributes to track why and how often a job is vacated
  • New job attribute to report number of input files transferred by protocol
  • New condor_q -hold-codes produces a summary of held jobs
  • Add new 'halt' and 'resume' verbs to “htcondor dag”
  • “htcondor ap status” now reports the AP's RecentDaemonCoreDutyCycle

18 of 28

New Job Status

$ htcondor job status 123.45

Job 123.45 is currently running on host exec221.chtc.wisc.edu.

It started running again 2.1 hours ago.

It was submitted 3.6 hours ago.

Its current memory usage is 2.5 GB out of 4.0 GB requested.

Its current disk usage is 3.8 GB out of 5.5 GB requested.

It has restarted 2 times.

Goodput is 80% (0.5 hours badput, 2.1 hours goodput).
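
The goodput percentage in the sample output can be reproduced, approximately, from the two durations it reports. The sketch below assumes goodput is simply goodput time divided by total execution time (goodput plus badput); how the tool rounds the result for display is not stated on the slide.

```python
def goodput_fraction(goodput_hours, badput_hours):
    """Fraction of total execution time that counted as goodput."""
    total = goodput_hours + badput_hours
    if total <= 0:
        raise ValueError("no execution time recorded")
    return goodput_hours / total


# Figures from the sample output: 2.1 hours goodput, 0.5 hours badput.
fraction = goodput_fraction(2.1, 0.5)
print(f"{fraction:.1%}")
```

With these inputs the ratio is 2.1 / 2.6, a little under 81%, which is in line with the roughly 80% shown.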

19 of 28

What about a DAGMan workflow?

$ htcondor dag status 223

DAGMan Job 223.0 [simple.dag] has been running for 52 days 04:12:46.

DAG has submitted 382 individual job(s), of which:

45 are running.

10 are idle.

0 are held.

162 have completed successfully

DAG has failed nodes but will continue until all possible work is finished: 5 nodes failed.

10 nodes waiting to begin.

24 nodes running.

[###########=======----------------] 34% complete.

20 of 28

Improvements from 24.x -> 25.x

  • New and improved Python bindings (WARNING! Python code must be migrated to the new bindings!)
  • New condor_dag_checker tool finds syntax and logic errors before run
  • Add the ability to enforce memory and CPU limits on local universe jobs
  • Add job attributes to track why and how often a job is vacated
  • New job attribute to report number of input files transferred by protocol
  • New condor_q -hold-codes produces a summary of held jobs
  • Add new 'halt' and 'resume' verbs to “htcondor dag”
  • “htcondor ap status” now reports the AP's RecentDaemonCoreDutyCycle
  • Can limit the number of times that a job can be released
  • condor_watch_q now displays when file transfer is happening
  • Add ability to use authentication when fetching Docker images
  • HTCondor marks slots as broken when the slot resources cannot be released
  • Improved management and cleanup of EXECUTE directories (needs root at EP)
  • Add Singularity launcher to distinguish runtime failure from job failure
  • Container Universe jobs can now mount a writable directory under scratch
  • New job attributes FirstJobMatchDate and InitialWaitDuration

21 of 28

Gotchas: New defaults coming up might be surprising

  • Python API (bindings) v1 being dropped -> long live v2! Question:

import htcondor

import htcondor2 as htcondor

  • HTCondor CE Routes must use “new” syntax
  • System Swap space (virtual memory) will not be used for jobs on the EP by default
  • Dropping support for multiple queue statements in a single submit file (Use queue foreach, etc.)
  • Partitionable Slots enabled by default (instead of static partitioning)
  • The job’s executable will no longer be renamed to ‘condor_exec.exe’
  • GPU discovery is enabled on all Execution Points by default
  • Nested Job Scratch Directories – jobs must use env to find .jobad, .machinead
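
For the last gotcha, a job that currently opens .job.ad or .machine.ad by a fixed relative path can instead look the files up through the environment. The Python sketch below assumes the _CONDOR_JOB_AD and _CONDOR_MACHINE_AD environment variables that HTCondor sets for running jobs; the helper names are hypothetical, and the parser keeps attribute values as raw, unevaluated strings.

```python
import os


def ad_paths(environ=os.environ):
    """Return the job-ad and machine-ad file paths advertised in the environment."""
    return environ.get("_CONDOR_JOB_AD"), environ.get("_CONDOR_MACHINE_AD")


def read_ad(path):
    """Parse `Name = value` lines into a dict of raw (unevaluated) strings."""
    ad = {}
    with open(path) as handle:
        for line in handle:
            name, separator, value = line.partition("=")
            if separator:
                ad[name.strip()] = value.strip()
    return ad


job_ad_path, machine_ad_path = ad_paths()
if job_ad_path is None:
    print("not running under HTCondor (no _CONDOR_JOB_AD set)")
else:
    print(read_ad(job_ad_path).get("ClusterId"))
```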

22 of 28

STARTER_NESTED_SCRATCH

[Diagram: execute_dir/dir_xxxx contains two subdirectories: scratch (job files) and htcondor (condor files)]

23 of 28

Share Common Files

  • You can try it out as of HTCSS 24.9!

Shares explicitly-listed common files between jobs in the same job list (cluster) running at the same time.

      • Transfers them only once.
      • Makes only one copy on-disk.
      • Controlled by the AP.

[Diagram: Access Point and Execution Point; Slot 1 -> Job 55.0]

24 of 28

Share Input Common Files

  • You can try it out as of HTCSS 24.9!

Shares explicitly-listed common files between jobs in the same job list (cluster) running at the same time.

      • Transfers them only once.
      • Makes only one copy on-disk.
      • Controlled by the AP.

[Diagram: Access Point and Execution Point; Slot 1 -> Job 55.0 and Slot 2 -> Job 55.1, each hard-linked to one copy of the common input files]
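
The "only one copy on-disk" property comes from hard links: several directory entries that point at the same file data. The self-contained Python sketch below demonstrates the idea with temporary directories standing in for slot scratch directories. The file name reference.db and the helper share_common_file are hypothetical, and this shows the filesystem mechanism only, not HTCondor's implementation.

```python
import os
import tempfile


def share_common_file(source, slot_dirs):
    """Hard-link `source` into each slot directory; return the new paths."""
    linked = []
    for slot_dir in slot_dirs:
        target = os.path.join(slot_dir, os.path.basename(source))
        os.link(source, target)  # new name, same inode: no data is copied
        linked.append(target)
    return linked


with tempfile.TemporaryDirectory() as execute_dir:
    common = os.path.join(execute_dir, "reference.db")
    with open(common, "w") as handle:
        handle.write("shared input data\n")

    slots = [os.path.join(execute_dir, name) for name in ("slot1", "slot2")]
    for slot in slots:
        os.mkdir(slot)

    paths = share_common_file(common, slots)
    inodes = {os.stat(p).st_ino for p in paths + [common]}
    link_count = os.stat(common).st_nlink
    print(len(inodes), link_count)
```

All three names resolve to a single inode, so the data occupies disk space once no matter how many slots use it. Hard links require the source and targets to be on the same filesystem.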

25 of 28

STARTER_NESTED_SCRATCH

  • Today

[Diagram: execute_dir/dir_xxxx holds job files and condor files together]

26 of 28

European HTCondor Workshop 2025, Prague

  • 36 people
  • 17 Organizations from 10 countries
  • Next one:
    • Lyon, France
    • Sept 29-Oct 2 2026

27 of 28

Some European Workshop Take-aways

  • Desire for HTCSS Office Hours
  • Run 4 testing at CERN… 2M CMS jobs sustained on a ghost pool.
  • Office hours: Docker in Docker, too many files w/ Galaxy
  • Everyone had a “node_healthy” attribute set by startd cron and referenced in START expression
    • Should be first class?
    • What else should be first class?
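
For readers who have not set up such a check, a startd cron job is essentially a script that prints ClassAd attribute lines for the startd to merge into the machine ad. The Python sketch below shows the shape of one. The disk-space criterion and the 10 GiB threshold are made up for illustration; only the attribute name node_healthy comes from the workshop discussion.

```python
import shutil


def health_ad(free_bytes, min_free_bytes):
    """Format the attribute line a startd cron job would print."""
    healthy = free_bytes >= min_free_bytes
    return f"node_healthy = {'true' if healthy else 'false'}"


if __name__ == "__main__":
    # Hypothetical criterion: at least 10 GiB free on the root filesystem.
    free = shutil.disk_usage("/").free
    print(health_ad(free, 10 * 1024**3))
```

A START expression can then reference node_healthy so that unhealthy nodes stop accepting new jobs.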

28 of 28

This work is supported by NSF under Cooperative Agreement OAC-2030508 as part of the PATh Project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

Thank You!

Please add your institution to our world map of HTCondor users at:

https://htcondor.org/user-map

and click "Add Your Institution" in the upper right.