
The Cambridge University HPC Service

Authors: Lajos Kalmar (MRC Toxicology)


Types of nodes


Cambridge Service for Data-Driven Discovery (CSD3) consists of:

  • CPU cluster → known as Cumulus with 3 partitions:
    • Cascade Lake – 56 CPUs, 192 GiB RAM (384 GiB for high-memory nodes)
    • Ice Lake – 76 CPUs, 256 GiB RAM (512 GiB for high-memory nodes)
    • Sapphire Rapids – 112 CPUs, 512 GiB RAM
  • GPU cluster → known as Wilkes3 with a single partition (-p ampere):
    • 4x NVIDIA A100-SXM-80GB GPUs per node
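
As a minimal sketch of how a job targets one of these partitions (the account name and the workload script are placeholders; check your real account with mybalance):

  #!/bin/bash
  #SBATCH -A MYPROJECT-SL3-CPU   # placeholder account name; see the service-level slides below
  #SBATCH -p icelake             # or cclake / sapphire (CPU), ampere (GPU)
  #SBATCH -c 4                   # number of CPUs
  #SBATCH -t 01:00:00            # wall time

  ./my_analysis.sh               # placeholder workload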


Documentation links



Register for an account




Your PI will receive an email to approve your registration (make sure you talk to them beforehand)



Service levels


  • SL3: free (each PI has 200,000 CPU hours and 3,000 GPU hours per quarter); lower priority; 12h time limit
  • SL2: paid per unit hour (CPU: £0.01; GPU: £0.55); higher priority; 36h time limit

More about policies in the docs.

You can check your accounts and credits with the mybalance command
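
As a sketch of how the service level maps onto job submission (account names below are placeholders, modelled on the PROJECT-SL3-GPU example later in these slides):

  mybalance                      # list your accounts and remaining CPU/GPU hours

  # In the job script, submit against the matching account:
  #SBATCH -A MYPROJECT-SL3-CPU   # SL3: free, lower priority, 12h limit
  ##SBATCH -A MYPROJECT-SL2-CPU  # SL2: paid, higher priority, 36h limit (commented out here)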


Overview of CPU cluster



Charging model - SL2 (paid service)


  • CPU nodes → the currency is the CPU hour (£0.01)
    • A job running for 2h with 1 CPU = 2 CPU hours = £0.02
    • A job running for 2h with 5 CPUs = 10 CPU hours = £0.10
    • A job running for 36h with 76 CPUs = 2736 CPU hours = £27.36

  • GPU nodes → the currency is the GPU hour (£0.55)
    • A job running for 2h with 1 GPU = 2 GPU hours = £1.10
    • A job running for 36h with 4 GPUs = 144 GPU hours = £79.20
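
In other words, cost = wall-clock hours × allocated CPUs (or GPUs) × unit price. As a sketch, you can check what a finished job actually used with sacct (the job ID is a placeholder):

  sacct -j 12345678 --format=JobID,Elapsed,AllocCPUS,Partition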


CPU partitions (icelake, cclake, sapphire)


Memory is tied to CPU

  • E.g. on icelake there is ~3.3 GiB per CPU
    • If a user requests -c 3, they will get ~9.9 GiB of memory
    • If a user requests --mem 15G, they will get 5 CPUs (both forms are sketched below)
  • Therefore, for memory-intensive tasks it is more cost-effective to use the himem partitions
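
A minimal sketch of the two request styles on icelake (directives only; the second is shown commented out):

  #SBATCH -p icelake
  #SBATCH -c 3                   # ~3.3 GiB per CPU → ~9.9 GiB of memory
  ##SBATCH --mem=15G             # alternatively: 15 GiB of memory → 5 CPUs allocated (and charged)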


GPU partitions (ampere)

  • Each node has 4 GPUs (you get 32 CPUs per GPU)
  • Resource request:

#SBATCH -A PROJECT-SL3-GPU
#SBATCH -p ampere
#SBATCH --nodes=1
#SBATCH --gres=gpu:3

This would request 3 (of the 4 max) GPUs in the node.

You need to have a GPU-specific account active (check with mybalance; if you don't have one, request it from support@hpc.cam.ac.uk)
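
Putting the pieces together, a minimal single-node GPU job script could look like this (the module name and workload are placeholders; only the #SBATCH directives above come from the slide):

  #!/bin/bash
  #SBATCH -A PROJECT-SL3-GPU
  #SBATCH -p ampere
  #SBATCH --nodes=1
  #SBATCH --gres=gpu:1           # 1 of the 4 GPUs in the node
  #SBATCH -t 02:00:00

  module load cuda               # placeholder module; check what is available with module avail
  python train_model.py          # placeholder GPU workload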


Cambridge HPC filesystem

  • The home directory has very limited space (50 GB)
  • ~/rds/hpc-work/ is the main working folder, with a 1 TB storage limit (not backed up)
  • The ~/rds/hpc-work/ folder is not shareable
    • If you work in a team, the PI needs to buy additional RDS (research data storage) space → https://selfservice.uis.cam.ac.uk/
  • Because of MFA, FileZilla needs to be configured as detailed in the documentation
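
As a sketch for copying data into the working area from your own machine (the username, hostname, and paths are assumptions; use the login address from your registration details):

  rsync -av --progress ./my_data/ abc123@login.hpc.cam.ac.uk:~/rds/hpc-work/my_data/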



Module system for applications

  • List all available modules

module avail

module avail |& grep -i "appname"

  • Load modules and test them on the login node

Example: module load gromacs/2021/cclake

  • Remember to load modules in your sbatch script (see the sketch after this list)
  • Use modules instead of installing software in your own storage space
  • You can ask HPC support to install new modules
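
A minimal sketch of loading a module inside a batch script, using the gromacs module named above (partition, time, and the test command are illustrative):

  #!/bin/bash
  #SBATCH -p cclake
  #SBATCH -t 00:10:00

  module load gromacs/2021/cclake  # load inside the script, not only in your login shell
  gmx --version                    # quick check that the loaded software is on PATH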



Additional Advice

  • Create some universal sbatch request sets (see the sketch after this list)
    • E.g. cclake-1h-4cpu
  • Use modules instead of downloading / installing software
    • Use the "module avail" command to check availability
  • Pass arguments to the sbatch request set
    • E.g. sample ID, input files
  • Redirect standard output and error to specific files
    • #SBATCH -o and #SBATCH -e
  • Build dependency chains and do checkpointing if the 12/36h time limit is not enough
  • Interactive jobs → see the documentation!
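
A minimal sketch of such a universal request set, taking the sample ID and input file as arguments (account, script, and file names are placeholders):

  #!/bin/bash
  # cclake-1h-4cpu.sh: reusable request set; usage: sbatch cclake-1h-4cpu.sh <sampleID> <input>
  #SBATCH -A MYPROJECT-SL3-CPU     # placeholder account
  #SBATCH -p cclake
  #SBATCH -c 4
  #SBATCH -t 01:00:00
  #SBATCH -o logs/%x_%j.out        # standard output (create the logs/ folder first)
  #SBATCH -e logs/%x_%j.err        # standard error

  SAMPLE=$1                        # sample ID, first command-line argument
  INPUT=$2                         # input file, second command-line argument
  ./run_step.sh "$SAMPLE" "$INPUT" # placeholder analysis step

Submitted from the command line, with a dependency chain so the next step only starts if this one succeeds:

  mkdir -p logs
  jobid=$(sbatch --parsable cclake-1h-4cpu.sh SAMPLE1 reads.fastq)
  sbatch --dependency=afterok:$jobid downstream.sh SAMPLE1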



Additional Advice

  • Check your storage status
    • quota
  • Check your account information (how many CPU hours you have)
    • mybalance
  • Environments / Containers (see the sketch after this list)
    • Docker is not allowed; you can use Singularity instead
    • Conda/Mamba is the go-to solution for local package management
  • Don’t abuse login nodes → you may approach other users if you see them doing it
  • Report problems to support@hpc.cam.ac.uk
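
A minimal sketch of both routes (image, environment, and package names are illustrative):

  # Docker images can be pulled and run with Singularity
  singularity pull docker://ubuntu:22.04
  singularity exec ubuntu_22.04.sif cat /etc/os-release

  # Conda/Mamba environment for local package management
  conda create -n mytools -c conda-forge -c bioconda samtools
  conda activate mytools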



Use login nodes ethically


Email from the HPC Service Desk, Aug 2022:


Do not use `/home` for intensive jobs


Email from the HPC Service Desk, Apr 2023:

The problems which began around midday [frozen home directories to many users] appear to have been caused by a user submitting I/O intensive python jobs reading from /home on multiple cpus on around 100 nodes.

Please can I remind everyone that /home is not designed for intense concurrent I/O from many nodes, and that reading python code is still I/O.

Also the consequences of overloading the NFS server behind /home and /usr/local are global and amount to a denial of service.

Currently nodes which are still alive are responding normally again, however there are a great many nodes (including some login nodes) which have not yet recovered. These will need to be rebooted.

Until the residual problem nodes have been cleared out, Slurm will not start new jobs.
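
The practical takeaway, sketched with placeholder paths: keep the code and data your jobs read under ~/rds/hpc-work/, not /home:

  cp -r ~/my_project ~/rds/hpc-work/my_project
  cd ~/rds/hpc-work/my_project
  sbatch run_analysis.sh           # placeholder submission script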


Don’t use unprotected credentials



Tips


  • For quick testing you can get the highest priority (even on SL3) by adding #SBATCH --qos=intr to your script (or on the command line, as sketched after this list). However:
    • These jobs are limited to 1h
    • You can only have 1 of them at any given time

  • Create aliases in your ~/.bashrc for commonly used commands (look them up if you are not familiar with them). For example:

    alias nodestat='sinfo -o "%15P %.5a %.6D %.6t"'
    alias myqueue='squeue -u $USER'
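
The same QoS can also be given at submission time on the command line (the script name is a placeholder):

  sbatch --qos=intr -t 01:00:00 quick_test.sh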


Use the HPC…

  • Ethically
    • Do not use the login node for production runs
  • Smartly
    • Optimise your jobs for CPU and time usage
    • Create universal and software-specific submission scripts (but never sample-specific ones)
    • Reduce the number of CPU cores if the extra cores do not have a very significant effect, so you can go into production earlier
    • Check production node usage (sinfo -o "%15P %.5a %.6D %.6t")
  • Efficiently
    • Use SL3 if the job is not urgent
    • Run multiple samples in parallel
    • Build up dependency chains
