Introduction to High Performance Computing
“High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.”
HPC vs. local computer
Parallel computing
Topology of the HPC
[Diagram: Login nodes → Queue manager / Job Scheduler → Production/Compute nodes, all connected to a Shared Filesystem]
Login nodes
Queue manager / Job Scheduler
Queue manager / Job Scheduler
Single computer environment:
$ bowtie2 -x ref_index -1 reads_1.fastq -2 reads_2.fastq
HPC:
$ sbatch RESOURCE_REQUEST mapping.sh
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST
2048 highmem mapping ht123 R 0:02 1 highmem-node01
Spend time thinking about your resource requests
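The RESOURCE_REQUEST can be given as command-line options to sbatch or as #SBATCH lines inside the script itself. A minimal sketch of mapping.sh (the partition name, resource values and file names are illustrative and will differ on your HPC):

#!/bin/bash
#SBATCH --job-name=mapping            # name shown in squeue
#SBATCH --partition=highmem           # partition/queue to run on (site-specific)
#SBATCH --cpus-per-task=4             # CPU cores for the job
#SBATCH --mem=8G                      # total memory for the job
#SBATCH --time=02:00:00               # maximum run time (HH:MM:SS)
#SBATCH --output=logs/mapping_%j.log  # file capturing the job's output

bowtie2 -x ref_index -1 reads_1.fastq -2 reads_2.fastq -p 4 > mapped.sam

# submit with:
# $ sbatch mapping.sh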
Work together with the Job Scheduler
Pipelines run from beginning to end in one job
vs. pipelines split into stages and submitted as separate jobs
Production/Compute nodes
Different types of nodes accessible via partitions or queues
Filesystem
Often separate “home” and “scratch” storage (e.g. /home and /scratch).
Use the HPC…
Exercise and Quiz
Job submission with SLURM
SLURM Job Scheduler
SLURM is the software that manages the job queue.
SLURM: Key Commands
Let’s do a demo of all these!
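As a reference, these are the core SLURM commands (submit.sh and JOBID are placeholders; exact output varies by site):

$ sbatch submit.sh      # submit a job script to the queue
$ squeue -u $USER       # list your jobs that are queued or running
$ scancel JOBID         # cancel a job
$ sacct -j JOBID        # accounting information (e.g. run time, exit status) for a past job
$ sinfo                 # list partitions/queues and node availability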
SLURM: Summary
Connecting and working on an HPC
Working on Remote Servers: Overview
Connecting to a Remote Server: ssh
The ssh (secure shell) command is used to connect to a remote host:
ssh hm533@login.hpc.cam.ac.uk
You will be asked for your password (and possibly a two-factor authentication code, if set up on your HPC):
(hm533@login.hpc.cam.ac.uk) Password:
⚠ As you type the password, nothing shows up! This is a safety mechanism; your keystrokes are still being registered.
💡 On Windows/Linux use Ctrl + Shift + V to paste. On macOS the usual ⌘ + V works.
Your terminal prompt should change to indicate you are on the remote server.
Editing Files: nano, VS Code, vim
Managing Software
Managing Software
Solution 1: pre-installed software
Solution 2: install it yourself locally (remember you don’t have admin permissions)
Managing Software: Modules
module avail 2>&1 | grep -i bowtie
bowtie2-2.3.1
bowtie2-2.3.5
The modules package adds software to the user’s $PATH
Remember to include the `module load` command in your SLURM submission script for each package you want to use
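For example, a submission script using one of the bowtie2 modules listed above might look like this (a minimal sketch; resource values and file names are illustrative):

#!/bin/bash
#SBATCH -c 4
#SBATCH --mem=8G
#SBATCH -o logs/mapping_%j.log

# make the software available on $PATH for this job
module load bowtie2-2.3.5

bowtie2 -x ref_index -1 reads_1.fastq -2 reads_2.fastq -p 4 > mapped.sam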
Managing Software: Modules
Pros:
Cons:
Managing Software: Mamba/Conda
Mamba environments (each isolated from the others and from the system environment):
Environment 1: Python 3.7, NumPy 1.15, scikit-learn 0.20
Environment 2: Python 3.12, NumPy 1.26, TensorFlow 2.0
System environment: Python 2.7, R 3.1
Managing Software: Conda
mamba install -n datasci -c conda-forge matplotlib=3.8.3
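If the datasci environment does not exist yet, you would create it first (a minimal sketch; the Python version is just an example):

# create a new environment (done once)
mamba create -n datasci python=3.12

# activate it to use its software interactively
mamba activate datasci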
Managing Software: Conda
⚠ Warning!
When submitting jobs to SLURM on the HPC you need to include the following lines of code:
# Always add these two commands to your scripts
eval "$(conda shell.bash hook)"
source $CONDA_PREFIX/etc/profile.d/mamba.sh
# then you can activate the environment
mamba activate datasci
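Putting it together, a submission script using the datasci environment might look like this (a minimal sketch; the analysis script name and resource values are illustrative):

#!/bin/bash
#SBATCH -c 2
#SBATCH --mem=4G
#SBATCH -o logs/analysis_%j.log

# always add these two commands to make conda/mamba work in the job's shell
eval "$(conda shell.bash hook)"
source $CONDA_PREFIX/etc/profile.d/mamba.sh

# activate the environment containing our software
mamba activate datasci

# run the analysis (illustrative script name)
python my_analysis.py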
Managing Software: Conda
Our Mamba installation instructions include a step to automatically search the two most popular channels/repositories.
Therefore, you can leave the channel out of the install command. These are all equivalent:
mamba install -n datasci -c conda-forge numpy=1.26.4
mamba install -n datasci conda-forge::numpy=1.26.4
mamba install -n datasci numpy=1.26.4
Managing Software: Conda
Pros:
Cons:
Job Arrays
Parallel Jobs: Speed Up Processing Time
[Diagram: Input (e.g. FASTQ file) → one CPU → Output (e.g. BAM file)]
[Diagram: Input (e.g. FASTQ file) → multiple CPUs (multi-threading) → Output (e.g. BAM file)]
Parallel Jobs in SLURM: For Loops
for eachFile in *.fastq    # e.g. loop over all FASTQ files in the current directory
do
  sbatch jobScript.sh "$eachFile"
done
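The job script then reads the file name as its first argument (a minimal sketch of jobScript.sh; resource values and the analysis command are illustrative):

#!/bin/bash
#SBATCH -c 2
#SBATCH --mem=4G
#SBATCH -o logs/%j.log

# first command-line argument = the file passed in by the loop above
INPUT="$1"

echo "Processing ${INPUT} on $(hostname)"
# ... run your analysis command on ${INPUT} here ...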
Parallel Jobs in SLURM: Job Arrays
#!/bin/bash
#SBATCH -D /scratch/participant/hpc_workshop
#SBATCH -o logs/parallel_arrays_%a.log
#SBATCH -c 2
#SBATCH --mem=1G
#SBATCH -a 1-3
echo "This is task number $SLURM_ARRAY_TASK_ID"
echo "Using $SLURM_CPUS_PER_TASK CPUs cores"
echo "Running on:"
hostname
The scheduler launches one task per value of SLURM_ARRAY_TASK_ID (1, 2, 3), each writing its own log file:

logs/parallel_arrays_1.log
This is task number 1
Using 2 cores
Running on: node-15

logs/parallel_arrays_2.log
This is task number 2
Using 2 cores
Running on: node-8

logs/parallel_arrays_3.log
This is task number 3
Using 2 cores
Running on: node-10
Advantages and Limitations of Job Arrays
Job Arrays: $SLURM_ARRAY_TASK_ID Tricks
For stochastic simulations you can use it to set a seed (for reproducibility)
#SBATCH -a 1-100
python my_simulation_script.py $SLURM_ARRAY_TASK_ID
Write your script so that it takes a number as input, which is then used to set a seed for a random number generator
(You may not always want to do this - if you’re running many simulations in your work, maybe you shouldn’t use the same set of seeds all the time)
Job Arrays: $SLURM_ARRAY_TASK_ID Tricks
Different inputs
#SBATCH -a 2-3   # task IDs match lines 2-3 of the samplesheet (line 1 is the header)

# extract the sample name and input file from the samplesheet row matching this task's number
SAMPLE=$(head -n $SLURM_ARRAY_TASK_ID samplesheet.csv | tail -n 1 | cut -d "," -f 1)
INPUT=$(head -n $SLURM_ARRAY_TASK_ID samplesheet.csv | tail -n 1 | cut -d "," -f 2)

command --input ${INPUT} --output results/${SAMPLE}.out

samplesheet.csv:
sample,input
patient1,data/XYZ10231.fq
patient2,data/XYZ19381.fq
Do Exercise 2
Green sticky → finished
Red sticky → help!
Job Dependencies
Job Dependencies: Syntax
$ sbatch job1.sh
Submitted batch job 349

$ sbatch --dependency=afterok:349 job2.sh
Submitted batch job 350

$ sbatch --dependency=afterok:350 job3.sh
Submitted batch job 351
Linear pipeline where each script depends on the output of a previous script:
job1.sh → job2.sh → job3.sh
Job Dependencies: “afterok”
Useful for pipelines with a “linear” chain of job dependencies
[Diagram: task1.sh → result_task1.txt → task2.sh → result_task2.txt]
Demo in hpc_workshop/dependency/ok
Job Dependencies: Capturing Job ID
# first task of our pipeline
# capture JOBID into a variable
run1_id=$(sbatch --parsable task1.sh)
# second task of our pipeline
# use the previous variable here
sbatch --dependency afterok:${run1_id} task2.sh
You can write a long linear pipeline like this in a shell script and run it as: bash submit_pipeline.sh
(note: this is not submitted to SLURM → the script is doing the submission for us instead)
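A complete submit_pipeline.sh following this pattern might look like this (a sketch; the task script names match the demos):

#!/bin/bash
# run with: bash submit_pipeline.sh
# (this script is not itself submitted to SLURM; it only submits the jobs)

# first task; --parsable makes sbatch print only the job ID
run1_id=$(sbatch --parsable task1.sh)

# second task starts only if the first finishes successfully
run2_id=$(sbatch --parsable --dependency afterok:${run1_id} task2.sh)

# third task starts only if the second finishes successfully
sbatch --dependency afterok:${run2_id} task3.sh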
Types of Job Dependencies
afterok:jobid[:jobid...]      job can begin after the specified jobs have completed with an exit code of zero
afternotok:jobid[:jobid...]   job can begin after the specified jobs have failed
singleton                     job can begin after all previous jobs with the same name and user have ended; useful to collate results of a swarm or to send a notification at the end of a swarm
after:jobid[:jobid...]        job can begin after the specified jobs have started
afterany:jobid[:jobid...]     job can begin after the specified jobs have terminated
Job Dependencies: “afternotok”
Demo in hpc_workshop/dependency/notok
[Diagram: task_with_checkpoints.sh updates checkpoint.txt every 15s and, when it finishes, writes long_task_result.txt]
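A common use (sketched below with the demo's script name, assuming the script is written to resume from checkpoint.txt) is to submit a "rescue" job that only runs if the first attempt fails, for example because it hit its time limit:

# submit the long-running task and capture its job ID
long_id=$(sbatch --parsable task_with_checkpoints.sh)

# resubmit the same script, to run only if the first attempt fails;
# it can then pick up from the last checkpoint instead of starting over
sbatch --dependency afternotok:${long_id} task_with_checkpoints.sh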
Job Dependencies: “singleton”
Useful for tasks with multiple dependencies
[Diagram: task1.sh → result_task1.txt and task2.sh → result_task2.txt; task3.sh runs after both and produces result_task3.txt]
Demo in hpc_workshop/dependency/singleton
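A sketch of how this could be submitted (the job name is illustrative): the first two tasks run independently, and the third starts only once all jobs with the same name and user have finished.

# two independent tasks submitted under the same job name
sbatch --job-name=my_pipeline task1.sh
sbatch --job-name=my_pipeline task2.sh

# this job waits for *all* previous jobs with the same name to end
sbatch --job-name=my_pipeline --dependency=singleton task3.sh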
Job Dependencies: Complex Pipelines
For complex pipelines, dedicated workflow-management software such as Snakemake or Nextflow may be more suitable.
See: nf-co.re/pipelines
Moving Files
Moving Files to Remote Servers
Moving Files: FileZilla
Pros:
Cons:
Moving Files: scp
$ cd ~/Documents
$ scp -r awesome_proj/data rob123@train.bio:scratch/awesome_proj
Goal: transfer the local data folder into scratch/awesome_proj on the HPC.
-r                     copy directories recursively (like cp)
awesome_proj/data      the source directory/file I want to copy (relative to my current directory)
rob123@train.bio       the credentials to access the HPC (same as with ssh)
:                      a separator between the credentials and the destination path
scratch/awesome_proj   the destination I want to copy into (relative to /home/rob123)
My computer (e.g. macOS): /Users/robin/Documents/awesome_proj/data
The HPC filesystem: /home/rob123/scratch/awesome_proj/ (containing data and results)
Moving Files: scp
$ cd ~/Documents
$ scp -r rob123@train.bio:scratch/awesome_proj/results awesome_proj
Goal: transfer the results folder from the HPC into my local awesome_proj folder.
-r                             copy directories recursively (like cp)
rob123@train.bio               the credentials to access the HPC (same as with ssh)
:                              a separator between the credentials and the source path
scratch/awesome_proj/results   the source directory/file I want to copy (relative to /home/rob123)
awesome_proj                   the destination I want to copy into (relative to my current directory)
Moving Files: scp
Pros:
Cons:
Moving Files: rsync
$ cd ~/Documents
$ rsync -avhu awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/
Goal: synch the local data folder into scratch/awesome_proj/data on the HPC.
-avhu                        only transfer new/updated files (more explanation below)
awesome_proj/data/           the source directory/file I want to synch (relative to my current directory)
rob123@train.bio             the credentials to access the HPC (same as with ssh)
:                            a separator between the credentials and the destination path
scratch/awesome_proj/data/   the destination I want to synch into (relative to /home/rob123)
Moving Files: rsync
$ cd ~/Documents
$ rsync -avhu awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/
⚠ The / at the end of the source path is important! With it, the contents of data are synched into scratch/awesome_proj/data/ on the HPC, as intended (see below for what happens without it).
Moving Files: rsync
rsync -avhu awesome_proj/data rob123@train.bio:scratch/awesome_proj/data/
This would happen if you did not include the / in the source path: the data folder itself is copied inside the destination, creating a nested scratch/awesome_proj/data/data/ on the HPC.
Moving Files: rsync
$ cd ~/Documents
$ rsync -avhu rob123@train.bio:scratch/awesome_proj/results awesome_proj/
Goal: transfer the results folder from the HPC into my local awesome_proj folder.
-avhu                          only transfer new/updated files (more explanation below)
rob123@train.bio               the credentials to access the HPC (same as with ssh)
:                              a separator between the credentials and the source path
scratch/awesome_proj/results   the source directory/file I want to copy (relative to /home/rob123)
awesome_proj/                  the destination I want to copy it into (relative to my current directory)
Moving Files: rsync
$ cd ~/Documents
$ rsync -avhu rob123@train.bio:scratch/awesome_proj/results awesome_proj/
After the transfer, my computer has awesome_proj/data and awesome_proj/results.
⚠ The / at the end of the source path is important!
By excluding the / we are saying: “transfer the entire results folder into awesome_proj”.
Moving Files: rsync
rsync -avhu rob123@train.bio:scratch/awesome_proj/results/ awesome_proj/
This would happen if you included the / in the source path: the contents of results (e.g. file1.csv, file2.txt) are copied directly into awesome_proj, without creating a results folder.
With the / we are saying: “transfer the contents of results into awesome_proj”.
Moving Files: rsync
Options:
-a   “archive” mode: copy recursively and preserve file attributes (timestamps, permissions)
-v   verbose: list the files being transferred
-h   human-readable file sizes
-u   “update”: skip files that are newer on the destination
Moving Files: rsync
💡Tip:
The --dry-run option shows which files/folders rsync would transfer with your command, without actually doing the transfer.
It is a great way to make sure you specified your paths and options correctly!
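For example, using the same transfer as before:

$ rsync -avhu --dry-run awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/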
Moving Files: rsync
Pros:
Cons: