The Cambridge University HPC Service
Authors: Lajos Kalmar (MRC Toxicology)
Types of nodes
Cambridge Service for Data-Driven Discovery (CSD3) consists of:
- CPU nodes (cclake, icelake and sapphire partitions)
- GPU nodes (ampere partition)
Documentation links
https://docs.hpc.cam.ac.uk
Register for an account
Your PI will receive an email to approve your registration (make sure you talk to them beforehand)
Service levels
|  | SL3 | SL2 |
| --- | --- | --- |
| Cost | Free: each PI has 200,000 CPU hours and 3,000 GPU hours per quarter | Paid per unit hour (CPU: £0.01; GPU: £0.55) |
| Priority | Lower priority | Higher priority |
| Time limit | 12h | 36h |
More about policies in the docs.
You can check your accounts and credits with the `mybalance` command.
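For example, on a login node (a minimal sketch; the exact columns printed by `mybalance` may differ):

mybalance                        # lists your projects and their remaining CPU/GPU credit
# The account names shown here are what you pass to Slurm, e.g.
#   #SBATCH -A MYPROJECT-SL3-CPU   (hypothetical project name)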
Overview of CPU cluster
Charging model - SL2 (paid service)
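As an illustration, assuming SL2 usage is charged per allocated CPU-core-hour at the rates in the table above: a job that holds 56 cores for 10 hours uses 56 × 10 = 560 CPU hours, i.e. about £5.60. GPU jobs are charged per GPU-hour in the same way.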
CPU partitions (icelake, cclake, sapphire)
Memory is tied to CPU: each core you request comes with a fixed amount of RAM, so if you need more memory, request more CPUs (see the sketch below).
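A minimal CPU job header, as a sketch (the account name and program are hypothetical; the amount of memory per core differs between partitions, so check the docs for exact figures):

#!/bin/bash
#SBATCH -A MYPROJECT-SL3-CPU     # CPU account reported by mybalance (hypothetical name)
#SBATCH -p cclake                # or icelake / sapphire
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # more CPUs also means more memory for the job
#SBATCH --time=02:00:00          # keep within the service-level time limit

my_program                       # hypothetical multithreaded program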
GPU partitions (ampere)
#SBATCH -A PROJECT-SL3-GPU
#SBATCH -p ampere
#SBATCH --nodes=1
#SBATCH --gres=gpu:3
This would request 3 of the 4 GPUs available on the node.
You need to have a GPU account active (check with `mybalance`; if not, request one from support@hpc.cam.ac.uk).
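Putting these directives into a complete submission script, as a sketch (the application is hypothetical; the account name follows the PROJECT-SL3-GPU pattern above):

#!/bin/bash
#SBATCH -A PROJECT-SL3-GPU       # GPU account (must show up in mybalance)
#SBATCH -p ampere                # GPU partition
#SBATCH --nodes=1
#SBATCH --gres=gpu:3             # 3 of the 4 GPUs on the node
#SBATCH --time=04:00:00          # within the SL3 12h limit

nvidia-smi                       # log which GPUs the job has been given
my_gpu_program                   # hypothetical GPU application

Submit with `sbatch gpu_job.sh` and monitor with `squeue -u $USER`.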
Cambridge HPC filesystem
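A quick way to orient yourself, as a sketch (assuming the usual CSD3 layout of a small, backed-up home directory plus a larger RDS working area; paths and quotas should be checked in the docs):

echo $HOME                       # /home/<username>: small, backed up, not for heavy job I/O
ls -ld ~/rds/hpc-work            # assumed symlink to your RDS working area for data and job output
quota                            # if available, shows usage against both quotas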
Module system for applications
module avail
module avail |& grep -i "appname"
Example: module load gromacs/2021/cclake
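For instance, finding and loading a module before running it, as a sketch (the GROMACS version string is the one from the example above; what is installed will vary):

module avail |& grep -i gromacs    # case-insensitive search of the module tree
module load gromacs/2021/cclake    # make the application available on PATH
gmx --version                      # confirm it loaded
module list                        # see everything currently loaded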
Additional Advice
Use login nodes ethically
An email from the HPC Service Desk (Aug 2022) reminded users of this.
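For testing and other short interactive work, ask Slurm for a compute node rather than running on the login node; a sketch using standard Slurm commands (the account name is hypothetical):

salloc -A MYPROJECT-SL3-CPU -p cclake --nodes=1 --ntasks=4 --time=01:00:00
srun --pty bash                  # open a shell on the allocated compute node
# ...run the heavy tests here, then `exit` twice to release the allocation

The CSD3 docs also describe an `sintr` helper for interactive sessions.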
Do not use `/home` for intensive jobs
Email Apr 2023 from HPC Service Desk:
The problems which began around midday [frozen home directories for many users] appear to have been caused by a user submitting I/O intensive python jobs reading from /home on multiple cpus on around 100 nodes.
Please can I remind everyone that /home is not designed for intense concurrent I/O from many nodes, and that reading python code is still I/O.
Also the consequences of overloading the NFS server behind /home and /usr/local are global and amount to a denial of service.
Currently nodes which are still alive are responding normally again, however there are a great many nodes (including some login nodes) which have not yet recovered. These will need to be rebooted.
Until the residual problem nodes have been cleared out, Slurm will not start new jobs.
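One way to avoid this pattern, as a sketch (assuming an RDS working area at ~/rds/hpc-work; the project directory and script are hypothetical): keep code, environments and data on RDS and run jobs from there rather than from /home.

cp -r ~/myproject ~/rds/hpc-work/       # one-off: move the project off /home
# in the job script:
cd ~/rds/hpc-work/myproject
python analysis.py                      # reads and writes under RDS, not /home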
Don’t use unprotected credentials
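For example, if you use SSH keys to reach the HPC, protect them, and never keep plain-text passwords or passphrase-less private keys in scripts or shared directories (a sketch using standard OpenSSH tools):

ssh-keygen -t ed25519               # set a passphrase when prompted
chmod 600 ~/.ssh/id_ed25519         # private keys must not be readable by others
eval "$(ssh-agent -s)" && ssh-add   # enter the passphrase once per session instead of storing it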
Tips
Use the HPC…