Beginner Slurm
Paul Hall, PhD
Senior Research Software Engineer
HPC - team
Center for Computation and Visualization
Goals
Introduction
Simulation of physical phenomena
Visualization
Oscar: Under the Hood
[Diagram: Oscar architecture]
Gateway nodes: login, desktop, transfer
Storage (GPFS): /home (50 GB), /data (512+ GB), /scratch (up to 12 TB)
Compute nodes: CPU, GPU, VNC
Scheduler: Slurm
Submitting jobs
You can specify a job's options in a batch file
Anatomy of batch file
#!/bin/bash
# Here is a comment
#SBATCH --time=1:00:00
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -J MyJob
#SBATCH -o MyJob-%j.out
#SBATCH -e MyJob-%j.err
module load workshop
hello
This is a bash script with Slurm flags added
Use #SBATCH to specify Slurm flags
To disable a flag, break the pattern, e.g. write # SBATCH (with a space)
Anatomy of batch file (annotated)
--time=1:00:00 - How much time do I need?
-N 1 - How many nodes do I need? Use 1 here if your code is not MPI enabled
-c 1 - How many cores do I need?
-J MyJob - Name of your job
-o MyJob-%j.out, -e MyJob-%j.err - Where to put your output and error files
%j expands into the job number, which is unique
module load workshop, hello - Bash commands to run your job
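The short flags in the batch file also have long-form spellings, which some people find easier to read. A minimal sketch, reusing the values from the slide; the describe_job helper is a hypothetical addition for illustration, and it relies on the SLURM_JOB_ID variable that Slurm sets inside a job:

```shell
#!/bin/bash
# Wall-clock limit, same as the slide
#SBATCH --time=1:00:00
# Long forms of -N 1 and -c 1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
# Long form of -J MyJob
#SBATCH --job-name=MyJob
# Long forms of -o and -e; %j becomes the job number
#SBATCH --output=MyJob-%j.out
#SBATCH --error=MyJob-%j.err

# Hypothetical helper: report the job number Slurm assigned.
# Outside a Slurm job the variable is unset, so fall back to "unknown".
describe_job() {
    echo "Job number: ${SLURM_JOB_ID:-unknown}"
}

describe_job

# The commands from the slide would follow here:
# module load workshop
# hello
```

Because the #SBATCH lines are ordinary comments to bash, the same file also runs unchanged outside the scheduler, which can be handy for a quick syntax check.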
Example batch scripts
Submitting batch files
sbatch <file_name>
sbatch <flags> <file_name>
sbatch -N 2 submit.sh - Flags given on the command line override the corresponding flags in your batch script
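As a sketch of the override behavior, the wrapper below passes extra flags to sbatch ahead of the script name. The submit function is hypothetical, and submit.sh stands for your own batch script; on a machine without Slurm the wrapper only prints the command it would have run:

```shell
#!/bin/bash
# Hypothetical wrapper: submit a batch script with extra command-line flags.
# Flags given here take precedence over the matching #SBATCH lines in the file.
submit() {
    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$@"
    else
        # Not on a Slurm system: show the command instead of running it.
        echo "would run: sbatch $*"
    fi
}

# Ask for 2 nodes and 2 hours, regardless of what submit.sh requests.
submit -N 2 --time=2:00:00 submit.sh
```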
Checking on your jobs
Running/Pending jobs
Completed jobs
Examples
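One common way to look at running, pending, and completed jobs uses the standard Slurm commands squeue and sacct. This is a hedged sketch, not necessarily the commands shown in the workshop; the show_jobs function name is hypothetical, and Oscar may provide its own wrappers as well:

```shell
#!/bin/bash
# Standard Slurm commands; the guard lets the sketch run on machines without Slurm.
show_jobs() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "Slurm commands not found on this machine"
        return 0
    fi
    # Running and pending jobs for the current user.
    squeue -u "$USER"
    # Jobs from today (sacct's default window), including completed ones.
    sacct -u "$USER" --format=JobID,JobName,State,Elapsed
}

show_jobs
```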
What resources should I ask for?
This depends on the code you are running
Nodes/Cores
Q. Is your code parallel? You will need to find out if your code can
- run on multiple cores
- run across multiple nodes
Look at the documentation for your code: is it threaded, multiprocess, or MPI?
Q. Is your code serial?
- This means it can only make use of one core (-n 1)
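For a threaded code, the number of cores requested with -c can be passed on to the program. A minimal sketch, assuming the code uses OpenMP (which reads OMP_NUM_THREADS); other threaded codes may need a different setting, and the thread_count helper is hypothetical:

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 4

# SLURM_CPUS_PER_TASK is set by Slurm when -c is given; default to 1 elsewhere.
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# OMP_NUM_THREADS is read by OpenMP programs (an assumption about your code).
export OMP_NUM_THREADS="$(thread_count)"
echo "Running with ${OMP_NUM_THREADS} thread(s)"
```

Deriving the thread count from the allocation keeps the two numbers in sync when the -c value changes.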
Time
Make an estimate and add a bit.
e.g. if you think your code will take an hour, give it 3 hours
if you think your code will take a day, give it 2-3 days
If you run out of time your job will be killed, so be generous with your estimate
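The padding rule can be turned into a value for --time. The padded_time helper below is hypothetical: it multiplies an estimate in minutes by a safety factor (default 3, matching the one-hour example) and prints the result in the days-hours:minutes:seconds form that Slurm accepts:

```shell
#!/bin/bash
# Hypothetical helper: estimate in minutes -> padded Slurm time string (D-HH:MM:SS).
padded_time() {
    local estimate_min=$1
    local factor=${2:-3}          # safety factor, default 3
    local total=$(( estimate_min * factor ))
    local days=$(( total / 1440 ))
    local hours=$(( (total % 1440) / 60 ))
    local mins=$(( total % 60 ))
    printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
}

padded_time 60        # one-hour estimate padded to three hours: prints 0-03:00:00
padded_time 1440 2    # one-day estimate padded to two days: prints 2-00:00:00
```

The printed string can be used directly, e.g. #SBATCH --time=0-03:00:00.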
Memory
Memory requests can take some trial and error. You can ask for a lot, then measure your usage. If you asked for 100 GB of memory but only used 1 GB, you can reduce the request for your next job. To ask for all the memory available on a node, use #SBATCH --mem=0
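One way to shrink an oversized request is to take the measured peak usage and add some headroom. The suggest_mem_mb helper and the 25% headroom are assumptions for illustration, not an Oscar recommendation:

```shell
#!/bin/bash
# Hypothetical helper: given the peak memory a job used (in MB), suggest a
# request for the next run with some headroom (25% by default, an arbitrary choice).
suggest_mem_mb() {
    local peak_mb=$1
    local headroom_pct=${2:-25}
    # Integer arithmetic, rounded up to the next whole MB.
    echo $(( (peak_mb * (100 + headroom_pct) + 99) / 100 ))
}

# A job that asked for 100 GB but peaked at about 1 GB (1024 MB):
echo "#SBATCH --mem=$(suggest_mem_mb 1024)M"
```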
GPUs
If your code is built to use GPUs, you can submit to the gpu partition. To request 1 GPU:
#SBATCH -p gpu --gres=gpu:1
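A quick check inside a GPU job can confirm that a device was actually allocated. This sketch assumes Slurm's GPU support exports CUDA_VISIBLE_DEVICES to the job, which is typical but depends on the cluster configuration; gpu_report is a hypothetical helper:

```shell
#!/bin/bash
#SBATCH -p gpu --gres=gpu:1
#SBATCH --time=0:10:00

# Report which GPU indices, if any, this job can see.
gpu_report() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES}"
    else
        echo "No GPU allocated (CUDA_VISIBLE_DEVICES is unset or empty)"
    fi
}

gpu_report
```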
I have a condo account. How do I submit to it?
#SBATCH --account=<account-name>
If you do not know the name of your condo account, run the condos command
You need to be explicitly added to a condo. To check:
sacctmgr show assoc where user=$USER
If you haven’t been added, please email support@ccv.brown.edu
Finding out optimal resources for your job
It is good practice to occasionally check what resources your job is using.
myjobinfo -j <job-id> - This command is Oscar specific
seff <job-id>
Why won't my job start?
Reason | What this means |
(Resources) | Waiting for enough resources to be available |
(QOSGrpCpuLimit) | Your condo cores are all in use |
(QOSGrpMemLimit) | Your condo memory is all in use |
(JobHeldUser) | You have put a hold on the job |
(Priority) | Jobs with higher priority are waiting for compute nodes |
(ReqNodeNotAvail) | The nodes you requested are not available |
(PartitionNodeLimit) | You have requested more nodes than are in the partition |
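The table can be kept at hand as a small lookup. The explain_reason function below is a hypothetical helper that returns the explanation for a reason code; on a Slurm system the reason for each of your jobs can be listed with squeue -u "$USER" -o "%i %T %r", where %r is the reason field:

```shell
#!/bin/bash
# Hypothetical helper: translate a pending reason into the explanation from the table.
explain_reason() {
    case "$1" in
        Resources)          echo "Waiting for enough resources to be available" ;;
        QOSGrpCpuLimit)     echo "Your condo cores are all in use" ;;
        QOSGrpMemLimit)     echo "Your condo memory is all in use" ;;
        JobHeldUser)        echo "You have put a hold on the job" ;;
        Priority)           echo "Jobs with higher priority are waiting for compute nodes" ;;
        ReqNodeNotAvail)    echo "The nodes you requested are not available" ;;
        PartitionNodeLimit) echo "You have requested more nodes than are in the partition" ;;
        *)                  echo "Not in this table; see the Slurm documentation" ;;
    esac
}

explain_reason Priority
```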
Understanding queue priority
[Animated diagram: the blue line represents all the cores on Oscar and the x axis is time. Job1, Job2, and Job3 are placed first; Job4 and Job5 are added in later frames, followed by a condo job.]
Interactive jobs
You can start an interactive job using the interact command
Have Questions?