1 of 67

Introduction to High Performance Computing

2 of 67

“High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.”


3 of 67

HPC vs. local computer

  • Shared resources
  • Administrative rights
  • Parallel computing
  • Specific nodes for different resource needs
  • Reliability, redundancy, maintenance, safety

  • Cost of running
  • Impact on the environment


4 of 67

Parallel computing

  • Accelerate a single run by using multiple CPU cores for the calculation
    • Desktop, in-house server, HPC, Smartphone, …
  • Process multiple samples in parallel
    • Multiple desktops or small servers, HPC, distributed computing (SETI@home, Folding@home)
  • Perform a single run on multiple computers (nodes)
    • HPC


5 of 67

Topology of the HPC

[Diagram: Login nodes → Queue manager / Job Scheduler → Production/Compute nodes, all connected to a Shared Filesystem]


9 of 67

Login nodes


  • SSH login, no GUI
  • SFTP / SCP file transfer
  • Running your scripts on a login node is very tempting, but you must resist!
  • Learn the ways of direct file transfer (wget, rsync, service-specific APIs, …)
  • Use commands ethically!



11 of 67

Queue manager / Job Scheduler


Single computer environment:

$ bowtie2 -x ref_index -1 reads_1.fastq -2 reads_2.fastq

HPC:

$ sbatch RESOURCE_REQUEST mapping.sh

$ squeue

JOBID PARTITION NAME USER ST TIME NODES NODELIST

2048 highmem mapping ht123 R 0:02 1 highmem-node01

Spend time thinking about your resource requests (see the example below)

    • Reduce the requested time and number of CPU cores if you can, so the scheduler can backfill your job sooner
    • Monitor production node usage
    • Be aware of time and CPU core restrictions
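For example, a resource request given on the sbatch command line might look like this (a sketch; the partition name, CPU count, memory and time are illustrative values, not site defaults):

$ sbatch -p highmem -c 4 --mem=64G -t 02:00:00 mapping.sh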


12 of 67

Work together with the Job Scheduler


Two approaches:

  • Run the pipeline from beginning to end in a single job
  • Split the pipeline into stages and submit each stage as a separate job



14 of 67


Production/Compute nodes

  • General computing nodes

  • High memory nodes

  • High CPU count nodes

  • GPU nodes

Different types of nodes are accessible via partitions or queues


15 of 67

Filesystem

Often separate “home” and “scratch” storage.


/home

  • Small
  • Backed up
  • Use for software and general scripts

/scratch

  • Large (1 TB+)
  • Not backed up
  • Use as a “working directory” for processing data (see the sketch below)
  • Make sure to regularly back up code from here
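For example, a working directory on scratch could be laid out like this (a sketch; the exact paths depend on your HPC and project):

$ mkdir -p /scratch/$USER/hpc_workshop/{data,results,logs}
$ cd /scratch/$USER/hpc_workshop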


16 of 67

Use the HPC…

  • Ethically
    • Do not use the login node for production runs
  • Smartly
    • Optimise your jobs for CPU and time usage
    • Create universal and software-specific submission scripts (but never sample-specific ones)
    • Reduce the number of CPU cores if it doesn’t slow the job down too much, so it enters production earlier
    • Check production node usage
  • Efficiently
    • Use free account (SL3) if the job is not urgent
    • Run multiple samples in parallel
    • Build up dependency chains


17 of 67

Exercise and Quiz


  • Exercise → think about what your answers to these questions would be
  • Then we do the quiz together: https://www.menti.com/mcv5b2m1mi

18 of 67

Job submission with SLURM


19 of 67

SLURM Job Scheduler

SLURM is the software that manages the job queue:

  • It decides when each job runs, depending on the availability of compute nodes and the resources requested for that job (CPUs and memory).
  • Your job is given an initial “queue position” that depends on:
    • Your account priority level (e.g. paid vs free tier)
    • Resources requested: the more CPUs/memory/time you ask for, the further back in the queue you start
  • Your “queue position” improves the longer you are in the queue


20 of 67

SLURM: Key Commands

  • sbatch → submit a shell script to the queue
  • squeue → see all the jobs in the queue
  • squeue -u USERNAME → see your jobs only
  • scancel JOBID → cancel the job with the specified ID
  • scancel -u USERNAME → cancel all your jobs
  • seff JOBID → get efficiency information about your job


Let’s do a demo of all these!
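For instance, a demo session might look like this (the job ID and username reuse the earlier example and are purely illustrative):

$ sbatch mapping.sh
Submitted batch job 2048
$ squeue -u ht123
$ seff 2048
$ scancel 2048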


21 of 67

SLURM: Summary

  • Write your commands in a shell script
  • Configure your job options using #SBATCH directives
  • Use $SLURM_* environment variables to further customise your commands.
  • Test your code by requesting a terminal on a compute node using sintr (interactive jobs)
    • all the options available with sbatch are also available to sintr (e.g. -c, --mem, -t, etc.)
    • sintr jobs are often limited in the maximum time they can run (e.g. 1h at Cambridge), because an interactive session is not a very efficient way to run large-scale analyses; see the example below.
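For example, a 1-hour interactive session with 2 CPUs and 4 GB of memory could be requested like this (the values are illustrative):

$ sintr -c 2 --mem=4G -t 01:00:00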


22 of 67

Connecting and working on an HPC


23 of 67

Working on Remote Servers: Overview


24 of 67

Connecting to Remote Server: ssh

ssh hm533@login.hpc.cam.ac.uk


The ssh (secure shell) command is used to connect to a remote host.

You will be asked for your password (and possibly a two-factor authentication code, if set up on your HPC).

As you type the password, nothing shows up! This is a security feature; your password is still being registered as you type.

💡 On Windows/Linux use Ctrl + Shift + V to paste. On macOS the usual ⌘ + V works.

Your terminal prompt should change to indicate you are on the remote server.


25 of 67

Editing Files: nano, VS Code, vim


nano

  • Command-line text editor available on most Linux distributions
  • Ctrl + X to exit nano
    • It will ask if you want to save the file → type Y (yes)
    • It will ask to confirm the file name → press Enter

VS Code

  • GUI-based software
  • Has the capability to connect to remote servers using the ssh protocol
    • Instructions in the course materials

vim

  • Command-line text editor available on most Linux distributions
  • Very advanced (but steep learning curve)
  • If you know what this is, you don’t need us to tell you about text editors


26 of 67

Managing Software


27 of 67

Managing Software


Solution 1: pre-installed software

  • Available through the Modules package (if the HPC admins set this up)

Solution 2: install it yourself locally (remember you don’t have admin permissions)

  • Compile software from source → can be more challenging (we’re not covering it here)
  • Use a package manager → Mamba (previously Conda)


28 of 67

Managing Software: Modules


  • List all the software packages available:
    module avail

  • Search for a particular package:

module avail 2>&1 | grep -i bowtie

bowtie2-2.3.1

bowtie2-2.3.5

  • Make the software available in my environment:
    module load bowtie2-2.3.1
    module load samtools-1.9

The modules package adds software to the user’s $PATH

Remember to include the `module load` command in your SLURM submission script for each package you want to use
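For instance, a submission script using these modules might look like this sketch (the resource values and output file names are illustrative, not a recommended configuration):

#!/bin/bash
#SBATCH -c 4
#SBATCH --mem=8G

# make the software available on the compute node
module load bowtie2-2.3.1
module load samtools-1.9

# run the analysis (mapping command from the earlier slide)
bowtie2 -x ref_index -1 reads_1.fastq -2 reads_2.fastq -S mapped.sam
samtools sort -o mapped.bam mapped.sam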


29 of 67

Managing Software: Modules


Pros:

  • Ready and easy to use - no setup time for the user
  • No need to worry about installation dependencies, etc - HPC admins take care of this for you (even if the software is kind of picky about dependencies).
  • Software is compiled in an optimal way for the specific hardware of the HPC (may often increase speed and efficiency of the software)

Cons:

  • If a software/version is not available, need to request it from the HPC admins, which may take some time.


30 of 67

Managing Software: Mamba/Conda


[Diagram: isolated Mamba environments alongside the system environment]

  • Environment 1: Python 3.7, NumPy 1.15, scikit-learn 0.20
  • Environment 2: Python 3.12, NumPy 1.26, TensorFlow 2.0
  • System environment: Python 2.7, R 3.1


31 of 67

Managing Software: Conda


  • Check which packages are available to install via Mamba: anaconda.org
  • Create an environment:
    mamba create -n datasci
  • Install the package(s):
    mamba install -n datasci -c conda-forge numpy=1.26.4
    mamba install -n datasci -c conda-forge matplotlib=3.8.3
  • To make all packages in the environment available:
    mamba activate datasci


32 of 67

Managing Software: Conda


Warning!

When submitting jobs to SLURM on the HPC you need to include the following lines of code:

# Always add these two commands to your scripts

eval "$(conda shell.bash hook)"

source $CONDA_PREFIX/etc/profile.d/mamba.sh

# then you can activate the environment

mamba activate datasci
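Put together, a submission script using a Mamba environment might look like this sketch (the resource values and the analysis.py script are illustrative assumptions):

#!/bin/bash
#SBATCH -c 2
#SBATCH --mem=4G

# always add these two commands before activating an environment
eval "$(conda shell.bash hook)"
source $CONDA_PREFIX/etc/profile.d/mamba.sh

# activate the environment, then run the analysis
mamba activate datasci
python analysis.py   # analysis.py is a hypothetical script using the packages installed above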


33 of 67

Managing Software: Conda


💡 Our Mamba installation instructions include a step to automatically search the two most popular channels/repositories:

  • conda-forge for scientific computing software
  • bioconda for bioinformatics software

Therefore, you can leave the channel out of the install command. These are all equivalent:

mamba install -n datasci -c conda-forge numpy=1.26.4

mamba install -n datasci conda-forge::numpy=1.26.4

mamba install -n datasci numpy=1.26.4


34 of 67

Managing Software: Conda


Pros:

  • Easier than compiling software yourself and automatically installs dependencies
  • Installs software locally (by default in your /home directory)
  • Encapsulate different software versions in environments
  • Can easily recreate an environment on another computer (see the Conda docs)

Cons:

  • Packages are not optimally compiled for the hardware in use
  • For complex environments dependency conflicts can be hard (or impossible) to resolve
  • Not all software is available through Mamba/Conda
  • Can take up a lot of disk space (tip: run mamba clean to remove unused and cached packages)


35 of 67

Job Arrays


36 of 67

Parallel Jobs: Speed Up Processing Time

  • Cores, threads and processors are often used synonymously; usually they all indicate a CPU worker (although technically they are different things)
  • Imagine that processing one input file takes 5 hours
  • 100 files take 500 hours (~21 days)

[Diagram: Input (e.g. FASTQ file) → CPU → Output (e.g. BAM file)]


37 of 67

Parallel Jobs: Speed Up Processing Time

  • Some software tools support multi-threading to speed things up (although speed gains aren’t usually linear)
  • Imagine that using 10 CPUs speeds things up from 5 hours to 1 hour
  • 100 files still take 100 hours (~4 days) if we run them serially

[Diagram: Input (e.g. FASTQ file) → multiple CPUs (multi-threading) → Output (e.g. BAM file)]


38 of 67

Parallel Jobs in SLURM: For Loops

# filesList is a bash array of input file names, e.g. filesList=(data/*.fastq)
for eachFile in "${filesList[@]}"
do
  sbatch jobScript.sh "$eachFile"
done

  • Job submission using loops like this is not efficient with the Slurm Workload Manager
  • Not recommended


39 of 67

Parallel Jobs in SLURM: Job Arrays


#!/bin/bash

#SBATCH -D /scratch/participant/hpc_workshop

#SBATCH -o logs/parallel_arrays_%a.log

#SBATCH -c 2

#SBATCH --mem=1G

#SBATCH -a 1-3

echo "This is task number $SLURM_ARRAY_TASK_ID"

echo "Using $SLURM_CPUS_PER_TASK CPUs cores"

echo "Running on:"

hostname

The array spawns three tasks (SLURM_ARRAY_TASK_ID = 1, 2, 3), each running as a separate job and writing its own log file:

logs/parallel_arrays_1.log → This is task number 1 / Using 2 CPU cores / Running on: node-15
logs/parallel_arrays_2.log → This is task number 2 / Using 2 CPU cores / Running on: node-8
logs/parallel_arrays_3.log → This is task number 3 / Using 2 CPU cores / Running on: node-10


40 of 67

Advantages and Limitations of Job Arrays

  • Advantages
    • Job submission is quite fast (~30,000 jobs in 1-2 milliseconds)
    • Faster than using “for loops”
    • Job management is easy both for us and for SLURM
    • Job array can be handled as a whole
    • Individual jobs in an array can be handled independently

  • Limitations
    • Each job in the array will request the same resources (CPUs, memory, time, etc.)


41 of 67

Job Arrays: $SLURM_ARRAY_TASK_ID Tricks


For stochastic simulations you can use it to set a seed (for reproducibility)

#SBATCH -a 1-100

python my_simulation_script.py $SLURM_ARRAY_TASK_ID

Write your script so that it takes a number as input, which is then used to set a seed for a random number generator

(You may not always want to do this - if you’re running many simulations in your work, maybe you shouldn’t use the same set of seeds all the time)


42 of 67

Job Arrays: $SLURM_ARRAY_TASK_ID Tricks


Different inputs

  • Prepare an input sample sheet (e.g. CSV format)
  • Use some Unix command line skills

#SBATCH -a 2-3

SAMPLE=$(head -n $SLURM_ARRAY_TASK_ID samplesheet.csv | tail -n 1 | cut -d "," -f 1)

INPUT=$(head -n $SLURM_ARRAY_TASK_ID samplesheet.csv | tail -n 1 | cut -d "," -f 2)

command --input ${INPUT} --output results/${SAMPLE}.out

sample,input

patient1,data/XYZ10231.fq

patient2,data/XYZ19381.fq
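💡 If the number of samples changes, the array range can also be set at submission time instead of being hard-coded in the script, for example (a sketch; assumes the first CSV line is a header, a GNU/Linux wc, and a hypothetical script name):

$ sbatch -a 2-$(wc -l < samplesheet.csv) parallel_arrays.sh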

Do Exercise 2

Green sticky → finished

Red sticky → help!


43 of 67

Job Dependencies


44 of 67

Job Dependencies: Syntax


$ sbatch job1.sh
Submitted batch job 349

$ sbatch --dependency=afterok:349 job2.sh

Submitted batch job 350

$ sbatch --dependency=afterok:350 job3.sh

Submitted batch job 351

Linear pipeline where each script depends on the output of a previous script:

job1.sh → job2.sh → job3.sh


45 of 67

Job Dependencies: “afterok”

Useful for pipelines with a “linear” chain of job dependencies

[Diagram: task1.sh → result_task1.txt → task2.sh → result_task2.txt]

Demo in hpc_workshop/dependency/ok

  • Task1 creates a file
  • Task2 takes this file to produce the second file
  • Task2 should only run if task1 is successful


46 of 67

Job Dependencies: Capturing Job ID


# first task of our pipeline

# capture JOBID into a variable

run1_id=$(sbatch --parsable task1.sh)

# second task of our pipeline

# use the previous variable here

sbatch --dependency afterok:${run1_id} task2.sh

You can write a long linear pipeline like this in a shell script and run it as: bash submit_pipeline.sh

(note: this is not submitted to SLURM → the script is doing the submission for us instead)
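For example, submit_pipeline.sh for the three-job chain shown earlier might look like this sketch:

#!/bin/bash
# submit each stage, capture its job ID, and chain the next stage with afterok
job1_id=$(sbatch --parsable job1.sh)
job2_id=$(sbatch --parsable --dependency=afterok:${job1_id} job2.sh)
sbatch --dependency=afterok:${job2_id} job3.sh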


47 of 67

Types of Job Dependencies


  • afterok:jobid[:jobid...] → job can begin after the specified jobs have completed with an exit code of zero
  • afternotok:jobid[:jobid...] → job can begin after the specified jobs have failed
  • singleton → job can begin execution after all previous jobs with the same name and user have ended; useful to collate results of a swarm or to send a notification at the end of a swarm
  • after:jobid[:jobid...] → job can begin after the specified jobs have started
  • afterany:jobid[:jobid...] → job can begin after the specified jobs have terminated


48 of 67

Job Dependencies: “afternotok”

  • Alternative rescue pathway in a pipeline
  • Checkpoint and restart
    • Enable jobs to run longer than time limit
    • Improve jobs’ throughput by exploiting the holes in the SLURM schedule
    • Debug long-running jobs by pausing just before the error & restarting from that point multiple times

  • Time limits are an everyday problem at the Cambridge University HPC (max 12h/36h on SL3/SL2)


49 of 67

Job Dependencies: “afternotok”


Demo in hpc_workshop/dependency/notok

  • Counts to 10, increasing by 1 every 15 seconds
  • At each step it saves the current number in a file checkpoint.txt
  • If checkpoint.txt exists the task resumes from that point
  • We will submit our job with a 1-minute time limit only → meaning we need about 3 reruns to complete

[Diagram: task_with_checkpoints.sh updates checkpoint.txt every 15 s and, once the count reaches 10, writes long_task_result.txt]
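The reruns themselves can be chained with afternotok, reusing the --parsable trick from earlier (a sketch):

# first attempt, plus rescue attempts that only run if the previous one failed (e.g. ran out of time)
run1=$(sbatch --parsable task_with_checkpoints.sh)
run2=$(sbatch --parsable --dependency=afternotok:${run1} task_with_checkpoints.sh)
sbatch --dependency=afternotok:${run2} task_with_checkpoints.sh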


50 of 67

Job Dependencies: “singleton”

Useful for tasks with multiple dependencies

[Diagram: task1.sh → result_task1.txt; task2.sh → result_task2.txt; both feed into task3.sh → result_task3.txt]

Demo in hpc_workshop/dependency/singleton

  • Task1 creates a file
  • Task2 creates another file
  • Task3 needs both to produce a third output

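Submitting this with the singleton dependency might look like this sketch (the shared job name “pipeline” is an illustrative choice):

# give all three jobs the same name; task3 only starts once all earlier jobs with that name have ended
sbatch -J pipeline task1.sh
sbatch -J pipeline task2.sh
sbatch -J pipeline --dependency=singleton task3.sh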



52 of 67

Job Dependencies: Complex Pipelines


For complex pipelines, dedicated workflow-management software such as Snakemake or Nextflow may be more suitable.


53 of 67

Moving Files


54 of 67

Moving Files to Remote Servers


55 of 67

Moving Files: FileZilla


Pros:

  • Easy to use (GUI-based)
  • Data synchronization
  • Can pause/restart transfer if needed

Cons:

  • Not suitable if you want to transfer data between two remote servers (e.g. two HPC servers)


56 of 67

Moving Files: scp

$ cd ~/Documents

$ scp -r awesome_proj/data rob123@train.bio:scratch/awesome_proj

What each part of the command means:

  • -r → copy directories recursively (like cp)
  • awesome_proj/data → the source directory/file I want to copy (relative to my current directory); here I want to transfer the data folder
  • rob123@train.bio → the credentials to access the HPC (same as with ssh)
  • : → a separator between the credentials and the remote path
  • scratch/awesome_proj → the destination I want to copy it into (relative to /home/rob123)

My computer (e.g. a macOS): /Users/robin/Documents/awesome_proj/data
The HPC filesystem: /home/rob123/scratch/awesome_proj/ (containing data and results)


57 of 67

Moving Files: scp

$ cd ~/Documents

$ scp -r rob123@train.bio:scratch/awesome_proj/results awesome_proj

What each part of the command means:

  • -r → copy directories recursively (like cp)
  • rob123@train.bio:scratch/awesome_proj/results → the source directory/file I want to copy (relative to /home/rob123); here I want to transfer the results folder back
  • awesome_proj → the destination I want to copy it into (relative to my current directory)
  • The credentials and the : separator work exactly as before


58 of 67

Moving Files: scp


Pros:

  • Works from the command line
  • Works similarly to standard cp command
  • Can be used to move data between two remote servers:
    • First login to one of the servers with ssh
    • Then use scp from that server to the other (remote) server

Cons:

  • Always copies all the files and overwrites them if they already exist (no sync ability)


59 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/

What each part of the command means:

  • -avhu → only transfer new or changed files (more explanation in the following slides)
  • awesome_proj/data/ → the source directory I want to sync (relative to my current directory); here I want to sync the data folder
  • rob123@train.bio → the credentials to access the HPC (same as with ssh), followed by the : separator
  • scratch/awesome_proj/data/ → the destination I want to sync it into (relative to /home/rob123)


60 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/


The / at the end of the source path is important!

  • If you include the trailing / → the contents inside the folder are transferred into the destination
  • If you don’t → the folder itself is transferred into the destination


61 of 67

Moving Files: rsync

rsync -avhu awesome_proj/data rob123@train.bio:scratch/awesome_proj/data/

Result without the trailing /: the data folder itself is copied inside the existing destination folder, ending up as /home/rob123/scratch/awesome_proj/data/data on the HPC.

This is what would happen if you didn’t include the / in the source path.


62 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu rob123@train.bio:scratch/awesome_proj/results awesome_proj/

What each part of the command means:

  • -avhu → only transfer new or changed files
  • rob123@train.bio:scratch/awesome_proj/results → the source directory I want to copy (relative to /home/rob123); here I want to transfer the results folder
  • awesome_proj/ → the destination I want to copy it into (relative to my current directory)


63 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu rob123@train.bio:scratch/awesome_proj/results awesome_proj/

After the transfer, awesome_proj on my computer contains both data and results, mirroring /home/rob123/scratch/awesome_proj on the HPC.

The / at the end of the source path is important!

By excluding the /, we are saying: “transfer the entire results folder into awesome_proj”.


64 of 67

Moving Files: rsync

rsync -avhu rob123@train.bio:scratch/awesome_proj/results/ awesome_proj/

Result with the trailing /: the contents of results (file1.csv and file2.txt) are copied directly into awesome_proj on my computer, without creating a results folder.

This is what would happen if you included the / in the source path.

With the /, we are saying: “transfer the contents of results into awesome_proj”.


65 of 67

Moving Files: rsync


Options:

  • -a → “archive” mode: copy directories recursively and preserve timestamps, permissions and other file properties (only files that have changed are transferred)
  • -u → in addition, only transfer files if they are newer than those at the destination (avoids overwriting newer files with older versions by accident)
  • -h → print file sizes in a human-readable format (MB, GB, etc.)
  • -v → verbose mode, print some information about what was transferred
  • --progress → show the progress of the file transfer (useful when transferring larger files)
  • -z → compress the data in transit (can save some bandwidth)


66 of 67

Moving Files: rsync


💡Tip:

--dry-run option → shows you what files/folders rsync would transfer with your command, without actually doing the transfer.

Great way to make sure you specified your paths and options correctly!
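For example, adding --dry-run to the earlier sync command:

$ rsync -avhu --dry-run awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/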


67 of 67

Moving Files: rsync


Pros:

  • Works from the command line
  • Can be used to move data between two remote servers
  • Much more flexible than scp → can synchronize files saving time and bandwidth
  • The --dry-run option allows testing what the command would do before actually running it

Cons:

  • The many available options make for a steeper learning curve
  • The subtle / at the end of the source path can cause confusion
