HTCondor Tips & Tricks:
Using condor_q and condor_history to Learn about Your Jobs
OSG Special Topics Training
Research Computing Facilitation Team
November 2022
Before We Start
We welcome questions! To ask questions, please raise your hand.
Part of this workshop is hands-on! You are welcome to follow along (we will walk through the steps together) or to simply watch.
Introductions
Research Computing Facilitators are Here to Help!
Showmic Islam
Rachel Lombardi
Mats Rynge
Christina Koch
Motivation
We talk a lot about how to submit jobs...
...and not so much about how to get information about them after you submit.
This talk covers tools you can use to answer questions about your jobs, like:
• What is my job doing?
• What resources is it using?
• Why is it on hold?
• Etc.
Tools for Learning About Jobs
HTCondor’s job attribute information
• Accessed via condor_q or condor_history
Files
• HTCondor log files
• Standard error/standard output files
Primary Learning Objectives
To learn about your jobs using HTCondor’s condor_q and condor_history features.
Learning Outcomes:
Agenda
Introduction to condor_q and condor_history
Before We Start: Submit Sample Jobs
There will be a live demo as part of this talk. If you want to follow along, and have an account on an HTCondor Access Point:
• Log in to your access point
• Run:
$ git clone https://github.com/CHTC/job-info-examples
$ cd job-info-examples
$ condor_submit simplejobs.submit
Overview: condor_q and condor_history
condor_q <User, ClusterID, JobID, CustomBatchName>
condor_history <User, ClusterID, JobID, CustomBatchName>
Default Behavior
condor_q <User, ClusterID, JobID, CustomBatchName>
OWNER  BATCH_NAME    SUBMITTED    DONE  RUN  IDLE  HOLD  TOTAL  JOB_IDS
alice  ID: 25248040  11/14 13:51     1    _     1     1      3  25248040.1-2
alice  ID: 25248044  11/14 13:51     1    _     1     1      3  25248044.1-2
alice  ID: 25249765  11/14 15:47     _    _     5     _      5  25249765.0-4
condor_history <User, ClusterID, JobID, CustomBatchName>
By default, condor_history searches the job history starting with the most recent entry. Press Ctrl+C to stop the search.
Is my job waiting (idle) or running?
• condor_q (with or without -nobatch)
• condor_watch_q for an updating display
How long has my job been running?
HTCondor command: condor_q -run
• To see the run time of the current run only (not cumulative): condor_q -run -currentrun
File option: the log file
How many resources is my job using?
To answer this question, we need to talk about...
Attributes!
Running Jobs on the OSPool with HTCondor
HTCondor Job Workflow
A Day on the OSPool:
Thousands of Jobs, Thousands of Machines
HTCondor’s Central Manager
HTCondor matches jobs with computers via a “central manager”
Class Ads
• AttributeName = value
• The value can be a boolean, expression, number, or string
Job Class Ad
Example Submit File
Machine Class Ad
…
Job Matching
On a regular basis, the Central Manager reviews Job and Machine Class Ads and matches jobs to computers
Job Execution
(Then the submit and execute points communicate directly.)
Class Ads for People
Class Ads also provide lots of useful information about jobs and computers to HTCondor users and administrators
Job Attributes
HTCondor stores a list of information about each job. This information is stored in this format:
• AttributeName = value
You can find a list of attributes for a single job by running:
• condor_q -l JobID
You can print out specific attributes by using the “format” or “auto-format” flags with an HTCondor command:
• condor_q -af Attribute1 Attribute2
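As a rough sketch of how the long listing and the attribute format fit together, the snippet below filters a long listing down to a few attributes. The sample listing and its values are made up for illustration, and pick_attrs is a hypothetical helper, not an HTCondor command. On an access point you would pipe in the output of condor_q -l JobID instead.

```shell
# Hypothetical excerpt of `condor_q -l 25249765.0` output (values are illustrative).
sample_ad='ClusterId = 25249765
ProcId = 0
Owner = "alice"
JobStatus = 1
RequestMemory = 1024
RequestDisk = 1048576'

# Print only the named attributes from a long listing read on stdin.
# Usage: pick_attrs Attr1 Attr2 ... < listing
pick_attrs() {
    pattern=$(printf '%s|' "$@")
    # ${pattern%|} drops the trailing "|" so the alternation is well-formed.
    grep -E "^(${pattern%|}) = "
}

printf '%s\n' "$sample_ad" | pick_attrs JobStatus RequestMemory

# On an access point (assumes HTCondor is installed and the job is in the queue):
#   condor_q -l 25249765.0 | pick_attrs JobStatus RequestMemory
#   condor_q 25249765.0 -af JobStatus RequestMemory    # values only, one job per line
```

The long listing is handy for discovering attribute names; once you know the names, the auto-format option is usually the shorter route.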
How many resources is my job using?
Use the “Request” and “Usage” attributes.
For example, for memory, use RequestMemory and MemoryUsage
• condor_q -af RequestMemory MemoryUsage
To summarize, add the “sort” and “uniq” commands:
• condor_q -af RequestMemory MemoryUsage | sort | uniq -c
Can also swap RequestMemory and MemoryUsage for RequestCpus and CpusUsage, or RequestDisk and DiskUsage!
Note that the “Usage” attributes are not always updated in real time.
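To make the summary step concrete, here is a minimal sketch that runs on made-up sample output rather than a live queue. The numbers are illustrative, and max_usage is a hypothetical helper added for this example. It assumes the two columns are RequestMemory and MemoryUsage in megabytes, and that jobs with no reported usage print "undefined".

```shell
# Illustrative output of: condor_q -af RequestMemory MemoryUsage
# (columns: requested MB, used MB; "undefined" means no usage reported yet)
sample='1024 870
1024 870
1024 undefined
2048 512
1024 870'

# Count how many jobs share each request/usage pair.
printf '%s\n' "$sample" | sort | uniq -c

# Largest reported usage per request size, skipping jobs without a usage value.
max_usage() {
    awk '$2 != "undefined" { if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0 }
         END { for (r in max) print r, max[r] }' | sort -n
}
printf '%s\n' "$sample" | max_usage
```

Comparing the largest usage against the request is one way to decide whether a request can be lowered or needs to be raised before submitting a larger batch.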
What resources did my job use?
Can answer the same questions for finished jobs using condor_history
• condor_history user -limit 2 -af RequestMemory MemoryUsage
Use the “-limit” flag to get results more quickly. This is really useful for summarizing test results!
• condor_history user -limit 10 -af RequestMemory MemoryUsage | sort | uniq -c
Again, can look at Disk and CPU usage in the same way.
Note that the “Usage” attributes are not always updated in real time.
Where did my job run?
Attribute option: answerable using condor_history!
• condor_history user -limit 2 -af LastRemoteHost
• This can be useful when looking for patterns of failures or successes.
• Only shows the last place a job ran.
File option: the log file (shows all places a job ran)
If a job is still running, can use:
• condor_q -run
• condor_q -af RemoteHost
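For the log file option, a small sketch of pulling out every place a job started is shown below. The log excerpt is invented for illustration, and the exact layout of real log entries varies by HTCondor version. The sketch relies only on each event starting with a three-digit event code, with 001 marking "job executing". execution_events is a hypothetical helper name.

```shell
# Illustrative log excerpt; real entries vary by HTCondor version.
sample_log='000 (25249765.000.000) 2022-11-14 15:47:12 Job submitted from host: <192.0.2.1:9618>
...
001 (25249765.000.000) 2022-11-14 15:49:03 Job executing on host: <192.0.2.10:9618>
...
004 (25249765.000.000) 2022-11-14 16:02:40 Job was evicted.
...
001 (25249765.000.000) 2022-11-14 16:10:55 Job executing on host: <192.0.2.22:9618>
...
005 (25249765.000.000) 2022-11-14 17:30:01 Job terminated.'

# Keep only "job executing" events (event code 001), one per place the job started.
execution_events() {
    grep -E '^001 \('
}
printf '%s\n' "$sample_log" | execution_events

# With a real log file (path is an example):
#   execution_events < TestJobOutput/test_job.log
```

Unlike LastRemoteHost, this shows every start, so a job that was evicted and restarted appears more than once.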
Interlude: Submit More Jobs
Are you still in the “job-info-examples” folder? Run:
• condor_submit complexjobs.submit
File does not exist
Not a good fit for the OSPool
What did I run? What is still running?
The “Cmd” and “Args” attributes are useful for recovering details of a job (executable, arguments):
• condor_q user -af:jh Cmd Args
• condor_history user -limit 10 -af:jh Cmd Args
We’re adding “j” (for JobID) and “h” (for “header”) to the auto-format option above; this prints the JobID and a header row along with the requested attributes.
Where are my files? and other questions.
Can also look up:
• Where the job was submitted: Iwd
• Where the log file is: UserLog
• Where stdout and stderr are: Out, Err
• A job’s exit code: ExitCode
• A job’s requirements: -af:r Requirements
Time attributes
• There *are* timestamps for different events but they are in epoch time, so you have to do some work to read them.
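The "work" of reading an epoch timestamp can be as small as one date call. The sketch below assumes GNU date, which is typical on Linux access points. epoch_to_utc is a hypothetical helper, and the timestamp is an example value rather than one taken from a real job.

```shell
# Convert a Unix epoch timestamp to a readable UTC time.
# Assumes GNU date; BSD/macOS date uses `date -r SECONDS` instead of `-d @SECONDS`.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

epoch_to_utc 1668455460   # prints 2022-11-14 19:51:00 UTC

# On an access point (assumes jobs in the queue), time attributes such as
# QDate, JobStartDate, and CompletionDate hold epoch seconds:
#   condor_q -af:j QDate
# The ClassAd language also has a formatTime() function you can try inside -af:
#   condor_q -af:j 'formatTime(QDate, "%Y-%m-%d %H:%M:%S")'
```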
Where can I see all the attributes?
See the manual for a list of all the attributes HTCondor can use.
• Manual > Appendix > ClassAd Attributes > Job ClassAd Attributes
For the JobStatus attribute (which we will see soon), can use this command to print out a list of the JobStatus codes:
• condor_q -help status
Why is my job on hold?
HTCondor command: condor_q -hold
Use job attributes:
• condor_q -af HoldReasonCode HoldReason
File option: Log file
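When many jobs are held, it can help to group them by hold code before reading individual reasons. The sketch below works on made-up sample output. The job IDs, codes, and reason text are illustrative, and count_by_code is a hypothetical helper. It assumes the attribute query was run with the j modifier so that the JobID is the first column.

```shell
# Illustrative output of:
#   condor_q -constraint 'JobStatus == 5' -af:j HoldReasonCode HoldReason
sample='25248040.2 13 (input transfer message, illustrative)
25248044.2 13 (input transfer message, illustrative)
25249770.0 34 (memory limit message, illustrative)'

# Count held jobs per HoldReasonCode (second column).
count_by_code() {
    awk '{ n[$2]++ } END { for (c in n) print c, n[c] }' | sort -n
}
printf '%s\n' "$sample" | count_by_code
```

Each output line is a hold code followed by the number of jobs with that code; the manual's Job ClassAd Attributes page lists what each HoldReasonCode means.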
How do I see just “x” jobs?
So far, we have simply been selecting specific jobs using their JobID or all the jobs associated with a single user.
What if you want to select a more specific set of jobs, like all jobs that are still idle?
This can be done with attributes and the “-constraint” option to condor_q
• condor_q -constraint 'JobStatus == 1'
The constraint option works with other HTCondor job-related commands (condor_hold, condor_rm, condor_release, etc.)
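Constraint expressions are easy to mistype, so one option is to build them with a small helper. This is a sketch only: status_code and jobs_in_state are hypothetical helper names, "alice" is an example user, and only the most common JobStatus codes are included.

```shell
# Map a status name to the JobStatus code used in constraints
# (see `condor_q -help status` for the full list).
status_code() {
    case "$1" in
        idle)      echo 1 ;;
        running)   echo 2 ;;
        removed)   echo 3 ;;
        completed) echo 4 ;;
        held)      echo 5 ;;
        *) echo "unknown status: $1" >&2; return 1 ;;
    esac
}

# Build a constraint expression for one user's jobs in a given state.
# Usage: jobs_in_state alice held
jobs_in_state() {
    code=$(status_code "$2") || return 1
    printf 'Owner == "%s" && JobStatus == %s\n' "$1" "$code"
}

jobs_in_state alice held   # prints: Owner == "alice" && JobStatus == 5

# On an access point, the same expression works across commands:
#   condor_q -constraint "$(jobs_in_state alice held)"
#   condor_release -constraint "$(jobs_in_state alice held)"
```

Note the double quotes around the command substitution: the expression contains spaces and quotes and must reach HTCondor as a single argument.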
condor_status: What kind of computers are there?
We can use similar principles to learn about the computers (machines) in an HTCondor Pool using condor_status. To get a list of attributes, use the “long” option:
• condor_status -l [name]
Then look at certain attributes (like Machine, TotalCpus, CpuModel) using the “auto format” flag.
• condor_status -af Machine TotalCpus CpuModel | sort | uniq -c
Real-world Diagnosing
# My Submit File: test.sub
universe = vanilla
executable = test.sh
arguments = sample313.txt
# This analysis has many important files
transfer_input_files = software/, sample318.txt
# Controlling my job behavior
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = (OSGVO_OS_STRING == "RHEL 7")
# Save standard error, output, and HTCondor log files
log = TestJobOutput/test_job.log
output = TestJobOutput/test_job.out
error = TestJobOutput/test_job.error
+JobDurationCategory = "Medium"
request_cpus = 1
request_memory = 5GB
request_disk = 1GB
queue 1
[alice@login05 analysis-folder]$ ls -lh
total 24K
-rwxrwxr-x 1 alice osg 423M Mar 17 2022 test.sh
-rwxrwxr-x 1 alice osg 517M Mar 17 2022 test.sub
-rwxrwxr-x 1 alice osg 320M Mar 17 2022 download_data.sh
-rw-rw-r-- 1 alice osg 12K Mar 17 2022 README.md
drwxrwxr-x 2 alice osg 30M Mar 17 2022 software/
drwxrwxr-x 2 alice osg 30M Mar 17 2022 TestJobOutput
-rwxrwxr-x 1 alice osg 3G Mar 17 2022 sample313.txt
-rwxrwxr-x 1 alice osg 3G Mar 17 2022 sample445.txt
-rwxrwxr-x 1 alice osg 3G Mar 17 2022 sample917.txt
This job will go on hold for two reasons. Do you know what they are? How would you identify these issues for your held jobs?
Bonus: Do you know what order they will happen in?
Next Steps
Visit our documentation website https://portal.osg-htc.org to learn more about using HTCondor commands like condor_q and condor_history.
Watch “The HTCondor ClassAd Language Tutorial” and other HTCondor talks on the Center for High Throughput Computing YouTube channel.
Attend office hours with questions, etc.!
Acknowledgements
This material is based upon work supported by the National Science Foundation under Cooperative Agreement OAC-2030508 as part of the PATh Project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Questions?
What other information would you like to be able to learn about your jobs?