
The Cambridge University HPC Service

Authors: Lajos Kalmar (MRC Toxicology)


Types of nodes


Cambridge Service for Data-Driven Discovery (CSD3) consists of:

  • CPU cluster → known as Cumulus with 3 partitions:
    • Cascade Lake – 56 CPUs, 192 GiB RAM (384 GiB for high-memory nodes)
    • Ice Lake – 76 CPUs, 256 GiB RAM (512 GiB for high-memory nodes)
    • Sapphire Rapids – 112 CPUs, 512 GiB RAM
  • GPU cluster → known as Wilkes3 with a single partition (-p ampere):
    • 4x NVIDIA A100-SXM-80GB GPUs per node
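
As a minimal sketch of how a job targets one of these partitions (the account name and the workload script are placeholders; check your real account with mybalance):

  #!/bin/bash
  #SBATCH -A MYPROJECT-SL3-CPU   # placeholder account name; see the service-level slides below
  #SBATCH -p icelake             # or cclake / sapphire (CPU), ampere (GPU)
  #SBATCH -c 4                   # number of CPUs
  #SBATCH -t 01:00:00            # wall time

  ./my_analysis.sh               # placeholder workload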


Documentation links



Register for an account




Your PI will receive an email to approve your registration (make sure you talk to them beforehand)



Service levels


  • SL3: free (each PI has 200,000 CPU hours and 3,000 GPU hours per quarter); lower priority; 12h time limit
  • SL2: paid per unit hour (CPU: £0.01; GPU: £0.55); higher priority; 36h time limit

More about policies in the docs.

You can check your accounts and credits with the mybalance command
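
As a sketch of how the service level maps onto job submission (account names below are placeholders, modelled on the PROJECT-SL3-GPU example later in these slides):

  mybalance                      # list your accounts and remaining CPU/GPU hours

  # In the job script, submit against the matching account:
  #SBATCH -A MYPROJECT-SL3-CPU   # SL3: free, lower priority, 12h limit
  ##SBATCH -A MYPROJECT-SL2-CPU  # SL2: paid, higher priority, 36h limit (commented out here)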


Overview of CPU cluster



Charging model - SL2 (paid service)


  • CPU nodes → the currency is the CPU hour (£0.01)
    • A job running for 2h with 1 CPU = 2 CPU hours = £0.02
    • A job running for 2h with 5 CPUs = 10 CPU hours = £0.10
    • A job running for 36h with 76 CPUs = 2736 CPU hours = £27.36

  • GPU nodes → the currency is the GPU hour (£0.55)
    • A job running for 2h with 1 GPU = 2 GPU hours = £1.10
    • A job running for 36h with 4 GPUs = 144 GPU hours = £79.20
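
In other words, cost = wall-clock hours × allocated CPUs (or GPUs) × unit price. As a sketch, you can check what a finished job actually used with sacct (the job ID is a placeholder):

  sacct -j 12345678 --format=JobID,Elapsed,AllocCPUS,Partition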


CPU partitions (icelake, cclake, sapphire)


Memory is tied to CPU

  • E.g. on icelake there is ~3.3 GiB per CPU
    • If a user requests -c 3, they will get ~9.9 GiB of memory
    • If a user requests --mem 15G, they will get 5 CPUs (both forms are sketched below)
  • Therefore, for memory-intensive tasks it is more cost-effective to use the himem partitions
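
A minimal sketch of the two request styles on icelake (directives only; the second is shown commented out):

  #SBATCH -p icelake
  #SBATCH -c 3                   # ~3.3 GiB per CPU → ~9.9 GiB of memory
  ##SBATCH --mem=15G             # alternatively: 15 GiB of memory → 5 CPUs allocated (and charged)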


GPU partitions (ampere)

  • Each node has 4 GPUs (you get 32 CPUs per GPU)
  • Resource request:

#SBATCH -A PROJECT-SL3-GPU
#SBATCH -p ampere
#SBATCH --nodes=1
#SBATCH --gres=gpu:3

This would request 3 (of the 4 max) GPUs in the node.

You need to have a GPU-specific account active (check with mybalance; if you don't have one, request it from support@hpc.cam.ac.uk)
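
Putting the pieces together, a minimal single-node GPU job script could look like this (the module name and workload are placeholders; only the #SBATCH directives above come from the slide):

  #!/bin/bash
  #SBATCH -A PROJECT-SL3-GPU
  #SBATCH -p ampere
  #SBATCH --nodes=1
  #SBATCH --gres=gpu:1           # 1 of the 4 GPUs in the node
  #SBATCH -t 02:00:00

  module load cuda               # placeholder module; check what is available with module avail
  python train_model.py          # placeholder GPU workload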


Cambridge HPC filesystem

  • The home directory has very limited space (50 GB)
  • ~/rds/hpc-work/ is the main working folder, with a 1 TB storage limit (not backed up)
  • The ~/rds/hpc-work/ folder is not shareable
    • If you work in a team, the PI needs to buy additional RDS (research data storage) space → https://selfservice.uis.cam.ac.uk/
  • Because of MFA, FileZilla needs to be configured as detailed in the documentation
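
As a sketch for copying data into the working area from your own machine (the username, hostname, and paths are assumptions; use the login address from your registration details):

  rsync -av --progress ./my_data/ abc123@login.hpc.cam.ac.uk:~/rds/hpc-work/my_data/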



Module system for applications

  • List all available modules

module avail

module avail |& grep -i "appname"

  • Load modules and test them on the login node

Example: module load gromacs/2021/cclake

  • Remember to load modules in your sbatch script (see the sketch after this list)
  • Use modules instead of installing software in your own storage space
  • You can ask HPC support to install new modules
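
A minimal sketch of loading a module inside a batch script, using the gromacs module named above (partition, time, and the test command are illustrative):

  #!/bin/bash
  #SBATCH -p cclake
  #SBATCH -t 00:10:00

  module load gromacs/2021/cclake  # load inside the script, not only in your login shell
  gmx --version                    # quick check that the loaded software is on PATH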



Additional Advice

  • Create some universal sbatch request sets (see the sketch after this list)
    • E.g. cclake-1h-4cpu
  • Use modules instead of downloading / installing software
    • Use the "module avail" command to check availability
  • Pass arguments to the sbatch request set
    • E.g. sample ID, input files
  • Redirect standard output and error to specific files
    • #SBATCH -o and #SBATCH -e
  • Build dependency chains and do checkpointing if the 12/36h time limit is not enough
  • Interactive jobs → see the documentation!
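
A minimal sketch of such a universal request set, taking the sample ID and input file as arguments (account, script, and file names are placeholders):

  #!/bin/bash
  # cclake-1h-4cpu.sh: reusable request set; usage: sbatch cclake-1h-4cpu.sh <sampleID> <input>
  #SBATCH -A MYPROJECT-SL3-CPU     # placeholder account
  #SBATCH -p cclake
  #SBATCH -c 4
  #SBATCH -t 01:00:00
  #SBATCH -o logs/%x_%j.out        # standard output (create the logs/ folder first)
  #SBATCH -e logs/%x_%j.err        # standard error

  SAMPLE=$1                        # sample ID, first command-line argument
  INPUT=$2                         # input file, second command-line argument
  ./run_step.sh "$SAMPLE" "$INPUT" # placeholder analysis step

Submitted from the command line, with a dependency chain so the next step only starts if this one succeeds:

  mkdir -p logs
  jobid=$(sbatch --parsable cclake-1h-4cpu.sh SAMPLE1 reads.fastq)
  sbatch --dependency=afterok:$jobid downstream.sh SAMPLE1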



Additional Advice

  • Check your storage status
    • quota
  • Check your account information (how many CPU hours you have)
    • mybalance
  • Environments / Containers (see the sketch after this list)
    • Docker is not allowed; you can use Singularity instead
    • Conda/Mamba is the go-to solution for local package management
  • Don’t abuse login nodes → you may approach other users if you see them doing it
  • Report problems to support@hpc.cam.ac.uk
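
A minimal sketch of both routes (image, environment, and package names are illustrative):

  # Docker images can be pulled and run with Singularity
  singularity pull docker://ubuntu:22.04
  singularity exec ubuntu_22.04.sif cat /etc/os-release

  # Conda/Mamba environment for local package management
  conda create -n mytools -c conda-forge -c bioconda samtools
  conda activate mytools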



Use login nodes ethically


Email from the HPC Service Desk, Aug 2022:


Do not use `/home` for intensive jobs


Email from the HPC Service Desk, Apr 2023:

The problems which began around midday [frozen home directories to many users] appear to have been caused by a user submitting I/O intensive python jobs reading from /home on multiple cpus on around 100 nodes.

Please can I remind everyone that /home is not designed for intense concurrent I/O from many nodes, and that reading python code is still I/O.

Also the consequences of overloading the NFS server behind /home and /usr/local are global and amount to a denial of service.

Currently nodes which are still alive are responding normally again, however there are a great many nodes (including some login nodes) which have not yet recovered. These will need to be rebooted.

Until the residual problem nodes have been cleared out, Slurm will not start new jobs.
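
The practical takeaway, sketched with placeholder paths: keep the code and data your jobs read under ~/rds/hpc-work/, not /home:

  cp -r ~/my_project ~/rds/hpc-work/my_project
  cd ~/rds/hpc-work/my_project
  sbatch run_analysis.sh           # placeholder submission script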


Don’t use unprotected credentials



Tips


  • For quick testing you can get the highest priority (even on SL3) by adding #SBATCH --qos=intr to your script (or on the command line, as sketched after this list). However:
    • These jobs are limited to 1h
    • You can only have 1 of them at any given time

  • Create aliases in your ~/.bashrc for commonly used commands (look them up if you are not familiar with them). For example:

    alias nodestat='sinfo -o "%15P %.5a %.6D %.6t"'
    alias myqueue='squeue -u $USER'
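
The same QoS can also be given at submission time on the command line (the script name is a placeholder):

  sbatch --qos=intr -t 01:00:00 quick_test.sh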


Use the HPC…

  • Ethically
    • Do not use the login node for production runs
  • Smartly
    • Optimise your jobs for CPU and time usage
    • Create universal and software-specific submission scripts (but never sample-specific ones)
    • Reduce the number of CPU cores if the extra cores do not have a very significant effect, so you can go into production earlier
    • Check production node usage (sinfo -o "%15P %.5a %.6D %.6t")
  • Efficiently
    • Use SL3 if the job is not urgent
    • Run multiple samples in parallel
    • Build up dependency chains
