1 of 67

Introduction to High Performance Computing

2 of 67

“High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.”


3 of 67

HPC vs. local computer

  • Shared resources
  • Administrative rights
  • Parallel computing
  • Specific nodes for different resource needs
  • Reliability, redundancy, maintenance, safety

  • Cost of running
  • Impact on the environment


4 of 67

Parallel computing

  • Accelerate a single run by using multiple CPU cores for the calculation
    • Desktop, in-house server, HPC, Smartphone, …
  • Process multiple samples in parallel
    • Multiple desktops or small servers, HPC, distributed computing (SETI@home, Folding@home)
  • Perform a single run on multiple computers (nodes)
    • HPC


5 of 67

Topology of the HPC

[Diagram: Login nodes → Queue manager / Job Scheduler → Production/Compute nodes, all connected to a Shared Filesystem]


9 of 67

Login nodes


  • SSH login, no GUI
  • SFTP / SCP file transfer
  • Running your scripts on a login node is very tempting, but you must resist!
  • Learn the ways of direct file transfer (wget, rsync, service-specific APIs, …)
  • Use commands ethically!



11 of 67

Queue manager / Job Scheduler


Single computer environment:

$ bowtie2 -x ref_index -1 reads_1.fastq -2 reads_2.fastq

HPC:

$ sbatch RESOURCE_REQUEST mapping.sh

$ squeue

JOBID PARTITION NAME USER ST TIME NODES NODELIST

2048 highmem mapping ht123 R 0:02 1 highmem-node01

Spend time thinking about your resource requests (see the example below)

    • Reduce the requested time and number of CPU cores if you can, so the scheduler can backfill your job sooner
    • Monitor production node usage
    • Be aware of time and CPU core restrictions
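For example, a resource request given on the sbatch command line might look like this (a sketch; the partition name, CPU count, memory and time are illustrative values, not site defaults):

$ sbatch -p highmem -c 4 --mem=64G -t 02:00:00 mapping.sh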


12 of 67

Work together with the Job Scheduler


Two approaches:

  • Run the pipeline from beginning to end in a single job
  • Split the pipeline into stages and submit each stage as a separate job



14 of 67


Production/Compute nodes

  • General computing nodes

  • High memory nodes

  • High CPU count nodes

  • GPU nodes

Different types of nodes are accessible via partitions or queues


15 of 67

Filesystem

Often separate “home” and “scratch” storage.


/home

  • Small
  • Backed up
  • Use for software and general scripts

/scratch

  • Large (1 TB+)
  • Not backed up
  • Use as a “working directory” for processing data (see the sketch below)
  • Make sure to regularly back up code from here
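For example, a working directory on scratch could be laid out like this (a sketch; the exact paths depend on your HPC and project):

$ mkdir -p /scratch/$USER/hpc_workshop/{data,results,logs}
$ cd /scratch/$USER/hpc_workshop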


16 of 67

Use the HPC…

  • Ethically
    • Do not use the login node for production runs
  • Smartly
    • Optimise your jobs for CPU and time usage
    • Create universal and software-specific submission scripts (but never sample-specific ones)
    • Reduce the number of CPU cores if it doesn’t slow the job down too much, so it enters production earlier
    • Check production node usage
  • Efficiently
    • Use free account (SL3) if the job is not urgent
    • Run multiple samples in parallel
    • Build up dependency chains


17 of 67

Exercise and Quiz


  • Exercise → think about what your answers to these questions would be
  • Then we do the quiz together: https://www.menti.com/mcv5b2m1mi

18 of 67

Job submission with SLURM


19 of 67

SLURM Job Scheduler

SLURM is the software that manages the job queue:

  • It decides when each job runs, depending on the availability of compute nodes and the resources requested for that job (CPUs and memory).
  • Your job is given an initial “queue position” that depends on:
    • Your account priority level (e.g. paid vs free tier)
    • Resources requested: the more CPUs/memory/time you ask for, the further back in the queue you start
  • Your “queue position” improves the longer you are in the queue


20 of 67

SLURM: Key Commands

  • sbatch → submit a shell script to the queue
  • squeue → see all the jobs in the queue
  • squeue -u USERNAME → see your jobs only
  • scancel JOBID → cancel the job with the specified ID
  • scancel -u USERNAME → cancel all your jobs
  • seff JOBID → get efficiency information about your job


Let’s do a demo of all these!
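For instance, a demo session might look like this (the job ID and username reuse the earlier example and are purely illustrative):

$ sbatch mapping.sh
Submitted batch job 2048
$ squeue -u ht123
$ seff 2048
$ scancel 2048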


21 of 67

SLURM: Summary

  • Write your commands in a shell script
  • Configure your job options using #SBATCH directives
  • Use $SLURM_* environment variables to further customise your commands.
  • Test your code by requesting a terminal on a compute node using sintr (interactive jobs)
    • all the options available with sbatch are also available to sintr (e.g. -c, --mem, -t, etc.)
    • sintr jobs are often limited in the maximum time they can run (e.g. 1h at Cambridge), because an interactive session is not a very efficient way to run large-scale analyses; see the example below.
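For example, a 1-hour interactive session with 2 CPUs and 4 GB of memory could be requested like this (the values are illustrative):

$ sintr -c 2 --mem=4G -t 01:00:00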


22 of 67

Connecting and working on an HPC


23 of 67

Working on Remote Servers: Overview


24 of 67

Connecting to Remote Server: ssh

ssh hm533@login.hpc.cam.ac.uk


The ssh (secure shell) command is used to connect to a remote host.

You will be asked for your password (and possibly a two-factor authentication code, if set up on your HPC).

As you type the password, nothing shows up! This is a security feature; your password is still being registered as you type.

💡 On Windows/Linux use Ctrl + Shift + V to paste. On macOS the usual ⌘ + V works.

Your terminal prompt should change to indicate you are on the remote server.


25 of 67

Editing Files: nano, VS Code, vim


nano

  • Command-line text editor available on most Linux distributions
  • Ctrl + X to exit nano
    • It will ask if you want to save the file → type Y (yes)
    • It will ask to confirm the file name → press Enter

VS Code

  • GUI-based software
  • Has the capability to connect to remote servers using the ssh protocol
    • Instructions in the course materials

vim

  • Command-line text editor available on most Linux distributions
  • Very advanced (but steep learning curve)
  • If you know what this is, you don’t need us to tell you about text editors


26 of 67

Managing Software


27 of 67

Managing Software


Solution 1: pre-installed software

  • Available through the Modules package (if the HPC admins set this up)

Solution 2: install it yourself locally (remember you don’t have admin permissions)

  • Compile software from source → can be more challenging (we’re not covering it here)
  • Use a package manager → Mamba (previously Conda)


28 of 67

Managing Software: Modules


  • List all the software packages available:
    module avail

  • Search for a particular package:

module avail 2>&1 | grep -i bowtie

bowtie2-2.3.1

bowtie2-2.3.5

  • Make the software available in my environment:
    module load bowtie2-2.3.1
    module load samtools-1.9

The modules package adds software to the user’s $PATH

Remember to include the `module load` command in your SLURM submission script for each package you want to use
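For instance, a submission script using these modules might look like this sketch (the resource values and output file names are illustrative, not a recommended configuration):

#!/bin/bash
#SBATCH -c 4
#SBATCH --mem=8G

# make the software available on the compute node
module load bowtie2-2.3.1
module load samtools-1.9

# run the analysis (mapping command from the earlier slide)
bowtie2 -x ref_index -1 reads_1.fastq -2 reads_2.fastq -S mapped.sam
samtools sort -o mapped.bam mapped.sam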


29 of 67

Managing Software: Modules


Pros:

  • Ready and easy to use - no setup time for the user
  • No need to worry about installation dependencies, etc - HPC admins take care of this for you (even if the software is kind of picky about dependencies).
  • Software is compiled in an optimal way for the specific hardware of the HPC (may often increase speed and efficiency of the software)

Cons:

  • If a software/version is not available, need to request it from the HPC admins, which may take some time.


30 of 67

Managing Software: Mamba/Conda


[Diagram: isolated Mamba environments alongside the system environment]

  • Environment 1: Python 3.7, NumPy 1.15, scikit-learn 0.20
  • Environment 2: Python 3.12, NumPy 1.26, TensorFlow 2.0
  • System environment: Python 2.7, R 3.1


31 of 67

Managing Software: Conda


  • Check which packages are available to install via Mamba: anaconda.org
  • Create an environment:
    mamba create -n datasci
  • Install the package(s):
    mamba install -n datasci -c conda-forge numpy=1.26.4
    mamba install -n datasci -c conda-forge matplotlib=3.8.3
  • To make all packages in the environment available:
    mamba activate datasci


32 of 67

Managing Software: Conda


Warning!

When submitting jobs to SLURM on the HPC you need to include the following lines of code:

# Always add these two commands to your scripts

eval "$(conda shell.bash hook)"

source $CONDA_PREFIX/etc/profile.d/mamba.sh

# then you can activate the environment

mamba activate datasci
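Put together, a submission script using a Mamba environment might look like this sketch (the resource values and the analysis.py script are illustrative assumptions):

#!/bin/bash
#SBATCH -c 2
#SBATCH --mem=4G

# always add these two commands before activating an environment
eval "$(conda shell.bash hook)"
source $CONDA_PREFIX/etc/profile.d/mamba.sh

# activate the environment, then run the analysis
mamba activate datasci
python analysis.py   # analysis.py is a hypothetical script using the packages installed above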


33 of 67

Managing Software: Conda


💡 Our Mamba installation instructions include a step to automatically search the two most popular channels/repositories:

  • conda-forge for scientific computing software
  • bioconda for bioinformatics software

Therefore, you can leave the channel out of the install command. These are all equivalent:

mamba install -n datasci -c conda-forge numpy=1.26.4

mamba install -n datasci conda-forge::numpy=1.26.4

mamba install -n datasci numpy=1.26.4


34 of 67

Managing Software: Conda


Pros:

  • Easier than compiling software yourself and automatically installs dependencies
  • Installs software locally (by default in your /home directory)
  • Encapsulate different software versions in environments
  • Can easily recreate an environment on another computer (see the Conda docs)

Cons:

  • Packages are not optimally compiled for the hardware in use
  • For complex environments dependency conflicts can be hard (or impossible) to resolve
  • Not all software is available through Mamba/Conda
  • Can take up a lot of disk space (tip: run mamba clean to remove unused and cached packages)


35 of 67

Job Arrays


36 of 67

Parallel Jobs: Speed Up Processing Time

  • Cores, threads and processors are often used synonymously; usually they all indicate a CPU worker (although technically they are different things)
  • Imagine that processing one input file takes 5 hours
  • 100 files take 500 hours (~21 days)

[Diagram: Input (e.g. FASTQ file) → CPU → Output (e.g. BAM file)]


37 of 67

Parallel Jobs: Speed Up Processing Time

  • Some software tools support multi-threading to speed things up (although speed gains aren’t usually linear)
  • Imagine that using 10 CPUs speeds things up from 5 hours to 1 hour
  • 100 files still take 100 hours (~4 days) if we run them serially

[Diagram: Input (e.g. FASTQ file) → multiple CPUs (multi-threading) → Output (e.g. BAM file)]


38 of 67

Parallel Jobs in SLURM: For Loops

# filesList is a bash array of input file names, e.g. filesList=(data/*.fastq)
for eachFile in "${filesList[@]}"
do
  sbatch jobScript.sh "$eachFile"
done

  • Job submission using loops like this is not efficient with the Slurm Workload Manager
  • Not recommended


39 of 67

Parallel Jobs in SLURM: Job Arrays


#!/bin/bash

#SBATCH -D /scratch/participant/hpc_workshop

#SBATCH -o logs/parallel_arrays_%a.log

#SBATCH -c 2

#SBATCH --mem=1G

#SBATCH -a 1-3

echo "This is task number $SLURM_ARRAY_TASK_ID"

echo "Using $SLURM_CPUS_PER_TASK CPUs cores"

echo "Running on:"

hostname

The array spawns three tasks (SLURM_ARRAY_TASK_ID = 1, 2, 3), each running as a separate job and writing its own log file:

logs/parallel_arrays_1.log → This is task number 1 / Using 2 CPU cores / Running on: node-15
logs/parallel_arrays_2.log → This is task number 2 / Using 2 CPU cores / Running on: node-8
logs/parallel_arrays_3.log → This is task number 3 / Using 2 CPU cores / Running on: node-10


40 of 67

Advantages and Limitations of Job Arrays

  • Advantages
    • Job submission is quite fast (~30,000 jobs in 1-2 milliseconds)
    • Faster than using “for loops”
    • Job management is easy both for us and for SLURM
    • Job array can be handled as a whole
    • Individual jobs in an array can be handled independently

  • Limitations
    • Each job in the array will request the same resources (CPUs, memory, time, etc.)


41 of 67

Job Arrays: $SLURM_ARRAY_TASK_ID Tricks


For stochastic simulations you can use it to set a seed (for reproducibility)

#SBATCH -a 1-100

python my_simulation_script.py $SLURM_ARRAY_TASK_ID

Write your script so that it takes a number as input, which is then used to set a seed for a random number generator

(You may not always want to do this - if you’re running many simulations in your work, maybe you shouldn’t use the same set of seeds all the time)


42 of 67

Job Arrays: $SLURM_ARRAY_TASK_ID Tricks


Different inputs

  • Prepare an input sample sheet (e.g. CSV format)
  • Use some Unix command line skills

#SBATCH -a 2-3

SAMPLE=$(head -n $SLURM_ARRAY_TASK_ID samplesheet.csv | tail -n 1 | cut -d "," -f 1)

INPUT=$(head -n $SLURM_ARRAY_TASK_ID samplesheet.csv | tail -n 1 | cut -d "," -f 2)

command --input ${INPUT} --output results/${SAMPLE}.out

sample,input

patient1,data/XYZ10231.fq

patient2,data/XYZ19381.fq
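💡 If the number of samples changes, the array range can also be set at submission time instead of being hard-coded in the script, for example (a sketch; assumes the first CSV line is a header, a GNU/Linux wc, and a hypothetical script name):

$ sbatch -a 2-$(wc -l < samplesheet.csv) parallel_arrays.sh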

Do Exercise 2

Green sticky → finished

Red sticky → help!


43 of 67

Job Dependencies


44 of 67

Job Dependencies: Syntax


$ sbatch job1.sh
Submitted batch job 349

$ sbatch --dependency=afterok:349 job2.sh

Submitted batch job 350

$ sbatch --dependency=afterok:350 job3.sh

Submitted batch job 351

Linear pipeline where each script depends on the output of a previous script:

job1.sh → job2.sh → job3.sh


45 of 67

Job Dependencies: “afterok”

Useful for pipelines with a “linear” chain of job dependencies

[Diagram: task1.sh → result_task1.txt → task2.sh → result_task2.txt]

Demo in hpc_workshop/dependency/ok

  • Task1 creates a file
  • Task2 takes this file to produce the second file
  • Task2 should only run if task1 is successful


46 of 67

Job Dependencies: Capturing Job ID


# first task of our pipeline

# capture JOBID into a variable

run1_id=$(sbatch --parsable task1.sh)

# second task of our pipeline

# use the previous variable here

sbatch --dependency afterok:${run1_id} task2.sh

You can write a long linear pipeline like this in a shell script and run it as: bash submit_pipeline.sh

(note: this is not submitted to SLURM → the script is doing the submission for us instead)
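For example, submit_pipeline.sh for the three-job chain shown earlier might look like this sketch:

#!/bin/bash
# submit each stage, capture its job ID, and chain the next stage with afterok
job1_id=$(sbatch --parsable job1.sh)
job2_id=$(sbatch --parsable --dependency=afterok:${job1_id} job2.sh)
sbatch --dependency=afterok:${job2_id} job3.sh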


47 of 67

Types of Job Dependencies


  • afterok:jobid[:jobid...] → job can begin after the specified jobs have completed with an exit code of zero
  • afternotok:jobid[:jobid...] → job can begin after the specified jobs have failed
  • singleton → job can begin execution after all previous jobs with the same name and user have ended; useful to collate results of a swarm or to send a notification at the end of a swarm
  • after:jobid[:jobid...] → job can begin after the specified jobs have started
  • afterany:jobid[:jobid...] → job can begin after the specified jobs have terminated


48 of 67

Job Dependencies: “afternotok”

  • Alternative rescue pathway in a pipeline
  • Checkpoint and restart
    • Enable jobs to run longer than time limit
    • Improve jobs’ throughput by exploiting the holes in the SLURM schedule
    • Debug long-running jobs by pausing just before the error & restarting from that point multiple times

  • Time limits are an everyday problem at the Cambridge University HPC (max 12h/36h on SL3/SL2)


49 of 67

Job Dependencies: “afternotok”


Demo in hpc_workshop/dependency/notok

  • Counts to 10, increasing by 1 every 15 seconds
  • At each step it saves the current number in a file checkpoint.txt
  • If checkpoint.txt exists the task resumes from that point
  • We will submit our job with a 1-minute time limit only → meaning we need about 3 reruns to complete

[Diagram: task_with_checkpoints.sh updates checkpoint.txt every 15 s and, once the count reaches 10, writes long_task_result.txt]
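The reruns themselves can be chained with afternotok, reusing the --parsable trick from earlier (a sketch):

# first attempt, plus rescue attempts that only run if the previous one failed (e.g. ran out of time)
run1=$(sbatch --parsable task_with_checkpoints.sh)
run2=$(sbatch --parsable --dependency=afternotok:${run1} task_with_checkpoints.sh)
sbatch --dependency=afternotok:${run2} task_with_checkpoints.sh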


50 of 67

Job Dependencies: “singleton”

Useful for tasks with multiple dependencies

[Diagram: task1.sh → result_task1.txt; task2.sh → result_task2.txt; both feed into task3.sh → result_task3.txt]

Demo in hpc_workshop/dependency/singleton

  • Task1 creates a file
  • Task2 creates another file
  • Task3 needs both to produce a third output

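Submitting this with the singleton dependency might look like this sketch (the shared job name “pipeline” is an illustrative choice):

# give all three jobs the same name; task3 only starts once all earlier jobs with that name have ended
sbatch -J pipeline task1.sh
sbatch -J pipeline task2.sh
sbatch -J pipeline --dependency=singleton task3.sh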



52 of 67

Job Dependencies: Complex Pipelines


For complex pipelines, dedicated workflow-management software such as Snakemake or Nextflow may be more suitable.


53 of 67

Moving Files


54 of 67

Moving Files to Remote Servers


55 of 67

Moving Files: FileZilla


Pros:

  • Easy to use (GUI-based)
  • Data synchronization
  • Can pause/restart transfer if needed

Cons:

  • Not suitable if you want to transfer data between two remote servers (e.g. two HPC servers)


56 of 67

Moving Files: scp

$ cd ~/Documents

$ scp -r awesome_proj/data rob123@train.bio:scratch/awesome_proj

What each part of the command means:

  • -r → copy directories recursively (like cp)
  • awesome_proj/data → the source directory/file I want to copy (relative to my current directory); here I want to transfer the data folder
  • rob123@train.bio → the credentials to access the HPC (same as with ssh)
  • : → a separator between the credentials and the remote path
  • scratch/awesome_proj → the destination I want to copy it into (relative to /home/rob123)

My computer (e.g. a macOS): /Users/robin/Documents/awesome_proj/data
The HPC filesystem: /home/rob123/scratch/awesome_proj/ (containing data and results)


57 of 67

Moving Files: scp

$ cd ~/Documents

$ scp -r rob123@train.bio:scratch/awesome_proj/results awesome_proj

What each part of the command means:

  • -r → copy directories recursively (like cp)
  • rob123@train.bio:scratch/awesome_proj/results → the source directory/file I want to copy (relative to /home/rob123); here I want to transfer the results folder back
  • awesome_proj → the destination I want to copy it into (relative to my current directory)
  • The credentials and the : separator work exactly as before


58 of 67

Moving Files: scp


Pros:

  • Works from the command line
  • Works similarly to standard cp command
  • Can be used to move data between two remote servers:
    • First login to one of the servers with ssh
    • Then use scp from that server to the other (remote) server

Cons:

  • Always copies all the files and overwrites them if they already exist (no sync ability)


59 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/

What each part of the command means:

  • -avhu → only transfer new or changed files (more explanation in the following slides)
  • awesome_proj/data/ → the source directory I want to sync (relative to my current directory); here I want to sync the data folder
  • rob123@train.bio → the credentials to access the HPC (same as with ssh), followed by the : separator
  • scratch/awesome_proj/data/ → the destination I want to sync it into (relative to /home/rob123)


60 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/


The / at the end of the source path is important!

  • If you include the trailing / → the contents inside the folder are transferred into the destination
  • If you don’t → the folder itself is transferred into the destination


61 of 67

Moving Files: rsync

rsync -avhu awesome_proj/data rob123@train.bio:scratch/awesome_proj/data/

Result without the trailing /: the data folder itself is copied inside the existing destination folder, ending up as /home/rob123/scratch/awesome_proj/data/data on the HPC.

This is what would happen if you didn’t include the / in the source path.


62 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu rob123@train.bio:scratch/awesome_proj/results awesome_proj/

What each part of the command means:

  • -avhu → only transfer new or changed files
  • rob123@train.bio:scratch/awesome_proj/results → the source directory I want to copy (relative to /home/rob123); here I want to transfer the results folder
  • awesome_proj/ → the destination I want to copy it into (relative to my current directory)


63 of 67

Moving Files: rsync

$ cd ~/Documents

$ rsync -avhu rob123@train.bio:scratch/awesome_proj/results awesome_proj/

After the transfer, awesome_proj on my computer contains both data and results, mirroring /home/rob123/scratch/awesome_proj on the HPC.

The / at the end of the source path is important!

By excluding the /, we are saying: “transfer the entire results folder into awesome_proj”.


64 of 67

Moving Files: rsync

rsync -avhu rob123@train.bio:scratch/awesome_proj/results/ awesome_proj/

Result with the trailing /: the contents of results (file1.csv and file2.txt) are copied directly into awesome_proj on my computer, without creating a results folder.

This is what would happen if you included the / in the source path.

With the /, we are saying: “transfer the contents of results into awesome_proj”.


65 of 67

Moving Files: rsync


Options:

  • -a → “archive” mode: copy directories recursively and preserve timestamps, permissions and other file properties (only files that have changed are transferred)
  • -u → in addition, only transfer files if they are newer than those at the destination (avoids overwriting newer files with older versions by accident)
  • -h → print file sizes in a human-readable format (MB, GB, etc.)
  • -v → verbose mode, print some information about what was transferred
  • --progress → show the progress of the file transfer (useful when transferring larger files)
  • -z → compress the data in transit (can save some bandwidth)


66 of 67

Moving Files: rsync


💡Tip:

--dry-run option → shows you what files/folders rsync would transfer with your command, without actually doing the transfer.

Great way to make sure you specified your paths and options correctly!
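For example, adding --dry-run to the earlier sync command:

$ rsync -avhu --dry-run awesome_proj/data/ rob123@train.bio:scratch/awesome_proj/data/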


67 of 67

Moving Files: rsync


Pros:

  • Works from the command line
  • Can be used to move data between two remote servers
  • Much more flexible than scp → can synchronize files saving time and bandwidth
  • The --dry-run option allows testing what the command would do before actually running it

Cons:

  • The many available options make for a steeper learning curve
  • The subtle / at the end of the source path can cause confusion
