1 of 42

HTCondor Tips & Tricks:

Using condor_q and condor_history to Learn about Your Jobs

OSG Special Topics Training

Research Computing Facilitation Team

November 2022

2 of 42

Before We Start

We welcome questions! To ask questions, please raise your hand.

Part of this workshop is hands-on! You are welcome to follow along (we will walk through the steps together) or to simply watch.

3 of 42

Introductions

4 of 42

Research Computing Facilitators are Here to Help!

Showmic Islam

Rachel Lombardi

Mats Rynge

Christina Koch

  • Email: support@osg-htc.org
  • Zoom Office Hours: Tuesday 3-4:30pm CT & Thursday 10:30am-12pm CT

5 of 42

Motivation

We talk a lot about how to submit jobs...

...and not so much about how to get information about them after you submit.

This talk will cover different tools that you can use to answer questions about your jobs, like:

• What is my job doing?

• What resources is it using?

• Why is it on hold?

• Etc.

6 of 42

Tools for Learning About Jobs

HTCondor’s job attribute information

• Accessed via condor_q or condor_history

Files

• HTCondor log files

• Standard error/standard output files

7 of 42

Primary Learning Objectives

To learn about your jobs using HTCondor’s condor_q and condor_history features.

Learning Outcomes:

  • Discuss the differences and default behaviors of condor_q and condor_history
  • Understand how to investigate jobs currently in HTCondor’s queue using condor_q and common condor_q flags
  • Investigate jobs that have recently left HTCondor’s queue using condor_history and common condor_history flags

8 of 42

Agenda

  • Introduction to the default behavior of HTCondor’s condor_q and condor_history
  • Running Jobs on the OSPool with HTCondor
  • How to use condor_q and condor_history to understand your jobs

9 of 42

Introduction to condor_q and condor_history

10 of 42

Before We Start: Submit Sample Jobs

There will be a live demo as part of this talk. If you want to follow along, and have an account on an HTCondor Access Point:

• Log in to your access point

• Run:

$ git clone https://github.com/CHTC/job-info-examples

$ cd job-info-examples

$ condor_submit simplejobs.submit

11 of 42

Overview: condor_q and condor_history

condor_q <User, ClusterID, JobID, CustomBatchName>

  • Used to show information for jobs currently in HTCondor’s queue (Idle, Running, Held jobs)

condor_history <User, ClusterID, JobID, CustomBatchName>

  • Used to show information about jobs that have left HTCondor’s queue and now are in HTCondor’s “history” (Done or Removed jobs)
  • HTCondor’s history is constantly being updated with information as jobs exit the queue
  • Jobs leave `condor_history` after several days or weeks

12 of 42

Default Behavior

condor_q <User, ClusterID, JobID, CustomBatchName>

  • By default, the output shows your jobs grouped into “batches”
  • To display individual jobs, use `-nobatch` flag

OWNER  BATCH_NAME    SUBMITTED    DONE  RUN  IDLE  HOLD  TOTAL  JOB_IDS
alice  ID: 25248040  11/14 13:51     1    _     1     1      3  25248040.1-2
alice  ID: 25248044  11/14 13:51     1    _     1     1      3  25248044.1-2
alice  ID: 25249765  11/14 15:47     _    _     5     _      5  25249765.0-4

13 of 42

Default Behavior

condor_q <User, ClusterID, JobID, CustomBatchName>

  • By default, the output shows your jobs grouped into “batches”
  • To display individual jobs, use `-nobatch` flag

condor_history <User, ClusterID, JobID, CustomBatchName>

  • Default output is all jobs in HTCondor’s history, from most recent to oldest.

By default, HTCondor searches its history starting with the most recent entry. Press Ctrl+C to stop the search.

14 of 42

Is my job waiting (idle) or running?

condor_q (with or without -nobatch)

condor_watch_q for an updating display

15 of 42

How long has my job been running?

HTCondor command: condor_q -run

• To see current (not cumulative) run time: condor_q -run -currentrun

File option: the log file

16 of 42

How many resources is my job using?

To answer this question, we need to talk about...

Attributes!

17 of 42

Running Jobs on the OSPool with HTCondor

18 of 42

HTCondor Job Workflow

19 of 42

A Day on the OSPool:

Thousands of Jobs, Thousands of Machines

20 of 42

HTCondor’s Central Manager

HTCondor matches jobs with computers via a “central manager.”

21 of 42

Class Ads

  • HTCondor stores a list of information about each job and each machine in the OSPool
  • This information is stored as a “Class Ad”

  • Class Ads have the format:

AttributeName = value

The value can be a boolean, expression, number, or string.

22 of 42

Job Class Ad

Example Submit File

23 of 42

Machine Class Ad

24 of 42

Job Matching

On a regular basis, the Central Manager reviews Job and Machine Class Ads and matches jobs to computers

25 of 42

Job Execution

(Then the submit and execute points communicate directly.)

26 of 42

Class Ads for People

Class Ads also provide lots of useful information about jobs and computers to HTCondor users and administrators

27 of 42

Job Attributes

HTCondor stores a list of information about each job. This information is stored in this format:

AttributeName = value

You can find a list of attributes for a single job by running:

condor_q -l JobID

You can print out specific attributes by using the “format” or “auto-format” flags with an HTCondor command:

condor_q -af Attribute1 Attribute2

28 of 42

How many resources is my job using?

Use the “Request” and “Usage” attributes.

For example, for memory, use RequestMemory and MemoryUsage:

condor_q -af RequestMemory MemoryUsage

To summarize, add the “sort” and “uniq” commands:

condor_q -af RequestMemory MemoryUsage | sort | uniq -c

You can also swap RequestMemory for RequestCpus or RequestDisk, along with the corresponding usage attribute!

Note that the “Usage” attributes are not always updated in real-time.
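To see what the summary pipeline produces without a running pool, the sketch below feeds it made-up lines in the shape that `condor_q -af RequestMemory MemoryUsage` prints (one job per line, values in MB). The sample values and the function names `sample_jobs` and `summarize` are hypothetical. The "undefined" line stands in for a job whose usage has not been reported yet.

```shell
# Hypothetical sample: one line per job, "RequestMemory MemoryUsage" in MB.
# In real use these lines would come from:
#   condor_q -af RequestMemory MemoryUsage
sample_jobs() {
    printf '%s\n' \
        '1024 150' \
        '1024 150' \
        '2048 900' \
        '1024 undefined' \
        '1024 150'
}

# Group identical lines and count how many jobs share each combination.
# sort is needed first because uniq only merges adjacent duplicates.
summarize() {
    sort | uniq -c
}

sample_jobs | summarize
```

Each output line starts with a count, followed by the request and usage values shared by that many jobs, so a large batch collapses to a handful of lines.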

29 of 42

What resources did my job use?

Can answer the same questions for finished jobs using condor_history

condor_history user -limit 2 -af RequestMemory MemoryUsage

Use the “-limit” flag to get results more quickly. This is really useful for summarizing test results!!

condor_history user -limit 10 -af RequestMemory MemoryUsage | sort | uniq -c

Again, can look at Disk and CPU usage in the same way.

Note that the “Usage” attributes are not always updated in real-time.
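When tuning requests after a test run, the peak usage is often more useful than the full list. A minimal sketch, assuming input in the shape printed by `condor_history user -limit 10 -af RequestMemory MemoryUsage`; the sample numbers and the function names `sample_history` and `peak_usage` are hypothetical.

```shell
# Hypothetical sample in the shape of:
#   condor_history user -limit 10 -af RequestMemory MemoryUsage
sample_history() {
    printf '%s\n' '2048 310' '2048 295' '2048 1210' '2048 305'
}

# Print the largest MemoryUsage seen, next to the amount that job requested.
# Lines whose usage is not a plain number (e.g. "undefined") are skipped.
peak_usage() {
    awk '$2 ~ /^[0-9]+$/ { if ($2 + 0 > max) { max = $2 + 0; req = $1 } }
         END { printf "requested=%s peak_used=%s\n", req, max }'
}

sample_history | peak_usage
```

With the sample above this prints `requested=2048 peak_used=1210`, which suggests the request could be lowered while still leaving headroom.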

30 of 42

Where did my job run?

Attribute option: answerable using condor_history!

condor_history user -limit 2 -af LastRemoteHost

• This can be useful when looking for patterns of failures or successes.

• Only shows the last place a job ran.

File option: the log file (shows all places a job ran)

If a job is still running, can use:

condor_q -run

condor_q -af RemoteHost

31 of 42

Interlude: Submit More Jobs

Are you still in the “job-info-examples” folder? Run:

condor_submit complexjobs.submit

File does not exist

Not a good fit for the OSPool

32 of 42

What did I run? What is still running?

The “Cmd” and “Args” attributes are useful for recovering details of a job (executable, arguments):

condor_q user -af:jh Cmd Args

condor_history user -limit 10 -af:jh Cmd Args

We’re using the “j” (for JobID) and “h” (for “header”) in addition to the auto-format option above -- this prints out the JobID along with the requested attributes.

33 of 42

Where are my files? and other questions.

Can also look up:

• Where the job was submitted: Iwd

• Where the log file is: UserLog

• Where stdout and stderr are: Out, Err

• A job’s exit code: ExitCode

• A job’s requirements: -af:r Requirements

Time attributes

• There *are* timestamps for different events but they are in epoch time, so you have to do some work to read them.
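The "work" of reading an epoch timestamp can be a one-line helper. A minimal sketch, assuming GNU `date` (as found on most Linux access points); the function name `epoch_to_utc` and the second sample value are hypothetical. QDate, the job's submission time, is one attribute that holds such a value.

```shell
# Convert an epoch timestamp (seconds since 1970-01-01 UTC) to readable UTC time.
# Assumes GNU date (the -d @SECONDS form); BSD/macOS date uses -r SECONDS instead.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

epoch_to_utc 0            # 1970-01-01 00:00:00 UTC
epoch_to_utc 1668455460   # hypothetical value, e.g. from: condor_q -af QDate
```

Drop the `-u` and the literal "UTC" in the format string to see the time in the access point's local time zone instead.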

34 of 42

Where can I see all the attributes?

See the manual for a list of all the attributes HTCondor can use.

• Manual > Appendix > ClassAd Attributes > Job ClassAd Attributes

For the JobStatus attribute (which we will see soon), can use this command to print out a list of the JobStatus codes:

condor_q -help status
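Because `-af JobStatus` prints a bare number, a small lookup can make the output easier to read. The sketch below hard-codes the commonly documented JobStatus codes; the function name `job_status_name` is hypothetical, and the mapping is worth confirming against the output of the command above on your own access point.

```shell
# Translate a numeric JobStatus code into a readable name.
# The mapping is an assumption based on the commonly documented codes;
# verify it with: condor_q -help status
job_status_name() {
    case "$1" in
        1) echo "Idle" ;;
        2) echo "Running" ;;
        3) echo "Removed" ;;
        4) echo "Completed" ;;
        5) echo "Held" ;;
        6) echo "Transferring Output" ;;
        7) echo "Suspended" ;;
        *) echo "Unknown" ;;
    esac
}

# Demo with a few codes; in real use the codes could come from:
#   condor_q -af JobStatus
for code in 1 2 5; do
    printf '%s = %s\n' "$code" "$(job_status_name "$code")"
done
```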

35 of 42

Why is my job on hold?

HTCondor command: condor_q -hold

Use job attributes:

condor_q -af HoldReasonCode HoldReason

File option: Log file

36 of 42

How do I see just “x” jobs?

So far, we have simply been selecting specific jobs using their JobID or all the jobs associated with a single user.

What if you want to select a more specific set of jobs, like all jobs that are still idle?

This can be done with attributes and the “-constraint” option to condor_q

condor_q -constraint 'JobStatus == 1'

The constraint option works with other HTCondor job-related commands (condor_hold, condor_rm, condor_release, etc.).
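Constraint expressions contain spaces and quotes, so shell quoting is the usual stumbling block. A minimal sketch of one way to assemble an expression and pass it as a single argument; the helper name `build_constraint` is hypothetical, and it combines the Owner and JobStatus job attributes purely for illustration.

```shell
# Build a constraint expression selecting one user's jobs in a given state.
# Arguments: OWNER (string), STATUS (numeric JobStatus code, e.g. 1 = idle, 5 = held).
# String values need double quotes inside the expression; numbers do not.
build_constraint() {
    printf 'Owner == "%s" && JobStatus == %s\n' "$1" "$2"
}

build_constraint alice 5

# Pass the result as ONE argument (note the double quotes around the substitution):
#   condor_q -constraint "$(build_constraint alice 5)"
```

With the arguments shown, the helper prints `Owner == "alice" && JobStatus == 5`. Checking a constraint with condor_q first is a cautious habit before giving the same expression to condor_rm or condor_hold.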

37 of 42

condor_status: What kind of computers are there?

We can use similar principles to learn about the computers (machines) in an HTCondor Pool using condor_status. To get a list of attributes, use the “long” option:

condor_status -l [name]

Then look at certain attributes (like Machine, TotalCpus, CpuModel) using the “auto format” flag.

condor_status -af Machine TotalCpus CpuModel | sort | uniq -c

38 of 42

Real-world Diagnosing

# My Submit File: test.sub
universe = vanilla
executable = test.sh
arguments = sample313.txt

# This analysis has many important files
transfer_input_files = software/, sample318.txt

# Controlling my job behavior
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = (OSGVO_OS_STRING == "RHEL 7")

# Save standard error, output, and HTCondor log files
log = TestJobOutput/test_job.log
output = TestJobOutput/test_job.out
error = TestJobOutput/test_job.error

+JobDurationCategory = "Medium"
request_cpus = 1
request_memory = 5GB
request_disk = 1GB

queue 1

[alice@login05 analysis-folder]$ ls -lh
total 24K
-rwxrwxr-x 1 alice osg 423M Mar 17 2022 test.sh
-rwxrwxr-x 1 alice osg 517M Mar 17 2022 test.sub
-rwxrwxr-x 1 alice osg 320M Mar 17 2022 download_data.sh
-rw-rw-r-- 1 alice osg  12K Mar 17 2022 README.md
drwxrwxr-x 2 alice osg  30M Mar 17 2022 software/
drwxrwxr-x 2 alice osg  30M Mar 17 2022 TestJobOutput
-rwxrwxr-x 1 alice osg   3G Mar 17 2022 sample313.txt
-rwxrwxr-x 1 alice osg   3G Mar 17 2022 sample445.txt
-rwxrwxr-x 1 alice osg   3G Mar 17 2022 sample917.txt

This job will go on hold for two reasons. Do you know what they are? How would you identify these issues for your held jobs?

Bonus: Do you know what order they will happen in?

39 of 42

Next Steps

Visit our documentation website https://portal.osg-htc.org to learn more about using HTCondor commands like condor_q and condor_history.

Watch the HTCondor ClassAd Language Tutorial and other HTCondor talks on the Center for High Throughput Computing YouTube channel.

Attend office hours with questions, etc.!

40 of 42

Acknowledgements

This material is based upon work supported by the National Science Foundation under Cooperative Agreement OAC-2030508 as part of the PATh Project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

41 of 42

Questions?

What other information would you like to be able to learn about your jobs?

42 of 42
