Introduction to HPC: basic level
Introduction - INCD Computing infrastructure
INCD - Infraestrutura Nacional de Computação Distribuída (National Distributed Computing Infrastructure)
Computing Farm:
Applications run on a compute node. Components shown in the slide diagram:
Job submission node (pauli.ncg.ingrid.pt)
Job scheduling: scheduler (Slurm)
Compute nodes
Storage
Computing Farm Advantages:
How to access the infrastructure
What is ssh & why use it
Generate ssh keys
This example assumes a Linux machine.
Generate an RSA key pair on your local machine with the following command and answer the prompts:
$ ssh-keygen -t rsa -b 4096 -C "your_email"
Generating public/private rsa key pair.
Enter file in which to save the key (/user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Generate ssh keys (cont.)
This will create two files in the directory “~/.ssh”: “id_rsa” and “id_rsa.pub”. The first is your private key and should be kept away from outsiders; the second is the public key, which we will use to grant you access to remote systems.
$ ls -l ~/.ssh
-rw------- 1 user group 3247 Sep 8 11:16 id_rsa
-rw-r--r-- 1 user group 752 Sep 8 11:16 id_rsa.pub
Generate ssh keys (cont.)
You can also use the Ed25519 algorithm instead of RSA; it offers roughly the same security with shorter keys.
The private key will be stored in the file “id_ed25519” and the public key in “id_ed25519.pub”. The command is:
$ ssh-keygen -t ed25519 -C "your_email"
…
$ ls -l ~/.ssh
-rw------- 1 user group 3247 Sep 8 11:16 id_ed25519
-rw-r--r-- 1 user group 752 Sep 8 11:16 id_ed25519.pub
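Typically you will be asked to provide the public key when requesting access. You can print it and copy the single line it contains (a sketch using the Ed25519 key above; the key material shown is illustrative, and the exact registration procedure follows INCD's instructions):
$ cat ~/.ssh/id_ed25519.pub
ssh-ed25519 AAAA... your_email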
Access to INCD in Lisbon: Cirrus-A
Access the INCD advanced computing facility in Lisbon with an ssh session:
We use a new domain name, a.incd.pt, but the old domain name, ncg.ingrid.pt, is still valid, so you can also log in with the second command shown below:
The cirrus.a.incd.pt user interfaces are CentOS 7.9.2009 servers like the cluster worker nodes, but note that they have a different architecture and some applications may not behave as expected on the user interface. Two big differences are that the user interfaces have no InfiniBand network and no GPUs. If you need to test an application interactively, start an interactive session on a worker node, as shown later.
$ ssh -l "username" cirrus.a.incd.pt
$ ssh -l "username" cirrus.ncg.ingrid.pt
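To avoid typing the username every time, you can add an entry to “~/.ssh/config” on your local machine (a minimal sketch; the “cirrus” alias, the username and the key file are examples, adjust them to your own):
$ cat ~/.ssh/config
Host cirrus
    HostName cirrus.a.incd.pt
    User username
    IdentityFile ~/.ssh/id_ed25519
$ ssh cirrus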
How to use Software @INCD cluster
Available software
INCD provides an extensive list of pre-compiled applications and tools, made available through the Environment Modules tool, which makes it easy to manage the Unix shell environment.
Check what is available with the following command; the list is extensive:
To make them easier to find, environments relevant to some of the tutorial modules are also grouped separately (the tut/ sections in the output).
[user@cirrus01 ~]$ module avail
------------------------------------ /cvmfs/sw.el7/modules/hpc ---------------------------
DATK gcc63/ngspice/34 libs/32/jemalloc/5.3.0
FigTree/1.4.4 gcc63/openmpi/1.10.7 libs/blas/3.9.0 ….
--------------------------- /cvmfs/sw.el7/modules/tut/module_3 ----------------------
FigTree/1.4.4 Tracer/1.7.2 gcc83/MrBayes/3.2.7a
…
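If you already know which application you need, you can restrict the listing to it instead of scrolling through the full list (a usage sketch; the output is abbreviated):
[user@cirrus01 ~]$ module avail intel/gromacs
intel/gromacs/2021.5
…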
Load software
In this example we will load the GROMACS application, version 2021.5:
[user@cirrus01 ~]$ which gmx
/usr/bin/which: no gmx in (/usr/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
[user@cirrus01 ~]$ env | grep GMX
[user@cirrus01 ~]$ module load intel/gromacs/2021.5
[user@cirrus01 ~]$ which gmx
/cvmfs/sw.el7/ar/ix_5400/i20/gromacs/2021.5/b01/bin/gmx
[user@cirrus01 ~]$ env | grep GMX
GMXMAN=/cvmfs/sw.el7/ar/ix_5400/i20/gromacs/2021.5/b01/share/man
GMXDATA=/cvmfs/sw.el7/ar/ix_5400/i20/gromacs/2021.5/b01/share/gromacs
GMXBIN=/cvmfs/sw.el7/ar/ix_5400/i20/gromacs/2021.5/b01/bin
GMXLDLIB=/cvmfs/sw.el7/ar/ix_5400/i20/gromacs/2021.5/b01/lib64
List loaded software
We can list the loaded environments with the following command:
Note that the GROMACS Intel compiler dependency was loaded automatically. This may not happen in every case: always check whether the environment includes all needed dependencies and, if not, load the missing modules.
[user@cirrus01 ~]$ module list
Currently Loaded Modules:
1) intel/oneapi/2021.3 2) intel/gromacs/2021.5
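To check what a particular module does to your environment, and which other modules it pulls in, you can display it (a usage sketch; the output is abbreviated and illustrative):
[user@cirrus01 ~]$ module show intel/gromacs/2021.5
…
prepend-path    PATH /cvmfs/sw.el7/ar/ix_5400/i20/gromacs/2021.5/b01/bin
…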
Unload software
The “unload” command removes the target environment from the shell:
The “purge” command unloads all loaded modules at once:
[user@cirrus01 ~]$ module list
Currently Loaded Modules:
1) intel/oneapi/2021.3 2) intel/gromacs/2021.5
[user@cirrus01 ~]$ module unload intel/gromacs/2021.5
[user@cirrus01 ~]$ module list
[user@cirrus01 ~]$ module list
Currently Loaded Modules:
1) intel/gromacs/2021.5 3) intel/openmpi/4.0.3 5) intel/openfoam/2112
2) intel/oneapi/2022.1 4) intel/hdf5/1.12.0
[user@cirrus01 ~]$ module purge
[user@cirrus01 ~]$ module list
How to submit a simple CPU job
We will start with a simple “hello” test job using one “core”.
[user@cirrus01 ~]$ cp -r /data/tutorial/modulo0/hello .
[user@cirrus01 ~]$ cd hello
[user@cirrus01 hello]$ ls -l
-rw-r-----+ 1 user group 322 Sep 8 19:09 hello.sh
[user@cirrus01 hello]$ cat hello.sh
#!/bin/bash
#SBATCH -p short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
echo "Hello, ready to business"
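The script above only sets the partition and the task layout. In practice it is also useful to name the job, set a wall-time limit and choose the output file; a slightly extended sketch (the values are examples, adjust them to your needs):
#!/bin/bash
#SBATCH --partition=short        # same partition as above
#SBATCH --job-name=hello         # name shown by squeue
#SBATCH --nodes=1                # one node
#SBATCH --ntasks-per-node=1      # one task (core)
#SBATCH --time=00:05:00          # wall-time limit (hh:mm:ss)
#SBATCH --output=hello-%j.out    # %j expands to the job id
echo "Hello, ready to business"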
How to submit a simple CPU job (cont.)
[user@cirrus01 hello]$ sbatch hello.sh
Submitted batch job 6956063
[user@cirrus01 hello]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS TRES_PER_NODE NODELIST
6956063 short hello.sh user PD 0:00 1 1 N/A
JOBID PARTITION NAME USER ST TIME NODES CPUS TRES_PER_NODE NODELIST
6956063 short hello.sh user R 0:00 1 1 N/A hpc060
How to submit a simple CPU job (cont.)
[user@cirrus01 hello]$ ls -l
-rw-r-----+ 1 user group 322 Sep 8 19:09 hello.sh
-rw-r-----+ 1 user group 922 Sep 8 19:21 slurm-6956063.out
[user@cirrus01 hello]$ cat slurm-6956063.out
* JOB_NAME : hello.sh
* JOB_ID : 6956063 …
Hello, ready to business
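Once the job has finished it disappears from the squeue output; you can still query its accounting record (a usage sketch with the job id from above; output omitted):
[user@cirrus01 hello]$ sacct -j 6956063 --format=JobID,JobName,Partition,State,Elapsed
…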
How to submit a GROMACS CPU job
We will run a protein analysis using one MPI instance on one CPU “core”:
[user@cirrus01 ~]$ cp -r /data/tutorial/modulo0/grom-cpu-1 .
[user@cirrus01 ~]$ cd grom-cpu-1
[user@cirrus01 grom-cpu-1]$ ls -l
-rw-r----- 1 user group 381 Sep 8 19:09 grom-cpu-1.sh
-rw-r----- 1 user group 1133028 Sep 8 19:59 md.tpr
[user@cirrus01 grom-cpu-1]$ cat grom-cpu-1.sh
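The slide does not reproduce the script body. A minimal sketch of what a single-core GROMACS batch script could look like, using the module loaded earlier (the mdrun options are illustrative assumptions, not the exact tutorial script):
#!/bin/bash
#SBATCH --partition=short
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
module purge
module load intel/gromacs/2021.5
# one rank and one OpenMP thread on the md.tpr input
gmx mdrun -s md.tpr -ntmpi 1 -ntomp 1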
How to submit a GROMACS CPU job (cont.)
[user@cirrus01 grom-cpu-1]$ sbatch grom-cpu-1.sh
Submitted batch job 6959174
[user@cirrus01 grom-cpu-1]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS TRES_PER_NODE NODELIST
6959174 short grom-cpu-1 user R 0:05 1 1 N/A hpc060
[user@cirrus01 grom-cpu-1]$ ls -l
-rw-r-----+ 1 user group 2160 Sep 9 09:55 ener.edr
-rw-r-----+ 1 user group 401 Sep 8 14:14 grom-cpu-1.sh
-rw-r-----+ 1 user group 28180 Sep 9 09:55 md.log
-rw-r-----+ 1 user group 1133028 Sep 8 14:13 md.tpr
-rw-r-----+ 1 user group 4941 Sep 9 09:55 slurm-6959174.out
-rw-r-----+ 1 user group 835180 Sep 9 09:55 state.cpt
-rw-r-----+ 1 user group 126728 Sep 9 09:48 traj_comp.xtc
-rw-r-----+ 1 user group 833184 Sep 9 09:48 traj.trr
How to submit a GROMACS CPU job (cont.)
[user@cirrus01 grom-cpu-1]$ less md.log
Core t (s) Wall t (s) (%)
Time: 180.630 180.631 100.0
(ns/day) (hour/ns)
Performance: 5.789 4.146
Finished mdrun on rank 0 Fri Sep 9 10:36:39 2022
How to submit a GROMACS MPI job
[user@cirrus01 ~]$ cp -r /data/tutorial/modulo0/grom-cpu-4 .
[user@cirrus01 ~]$ cd grom-cpu-4
[user@cirrus01 grom-cpu-4]$ sbatch grom-cpu-4.sh
[user@cirrus01 grom-cpu-4]$ less md.log
Core t (s) Wall t (s) (%)
Time: 731.051 182.764 400.0
(ns/day) (hour/ns)
Performance: 19.856 1.209
Finished mdrun on rank 0 Fri Sep 9 10:45:53 2022
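A quick comparison with the single-core run above: 19.856 / 5.789 ≈ 3.4× faster on 4 cores, i.e. a parallel efficiency of about 3.4 / 4 ≈ 86%; communication overhead keeps it below the ideal 4×.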
How to submit a simple GPU job
How to submit a GROMACS GPU job
We will run the same protein analysis using one MPI instance on one GPU:
[user@cirrus01 ~]$ cp -r /data/tutorial/modulo0/grom-gpu .
[user@cirrus01 ~]$ cd grom-gpu
[user@cirrus01 grom-gpu]$ ls -l
-rw-r----- 1 user group 357 Sep 8 19:09 grom-gpu.sh
-rw-r----- 1 user group 1133028 Sep 8 19:59 md.tpr
[user@cirrus01 grom-gpu]$ cat grom-gpu.sh
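Again the script body is not shown on the slide. A minimal sketch of the GPU variant; the key difference is requesting a GPU with --gres=gpu, as in the interactive example later (the module name and mdrun options are illustrative assumptions, and the GPU-enabled build may use a different module):
#!/bin/bash
#SBATCH --partition=short
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu               # request one GPU on the node
module purge
module load intel/gromacs/2021.5 # assumption: the GPU build may have another module name
# one rank, offloading the non-bonded interactions to the GPU
gmx mdrun -s md.tpr -ntmpi 1 -ntomp 1 -nb gpu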
How to submit a GROMACS GPU job (cont.)
[user@cirrus01 grom-gpu]$ sbatch grom-gpu.sh
Submitted batch job 6959565
[user@cirrus01 grom-gpu]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS TRES_PER_NODE NODELIST
6959565 short grom-gpu.s user R 0:05 1 1 gres:gpu hpc063
[user@cirrus01 grom-gpu]$ less md.log
Core t (s) Wall t (s) (%)
Time: 178.510 178.511 100.0
(ns/day) (hour/ns)
Performance: 120.808 0.199
Finished mdrun on rank 0 Fri Sep 9 10:45:53 2022
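Comparing the three runs: 120.808 ns/day on one GPU versus 5.789 ns/day on one CPU core (≈ 21× faster) and 19.856 ns/day on four cores (≈ 6× faster).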
Remark on GROMACS and others
Some applications, such as GROMACS, may try to take all available resources when not properly configured.
The batch system will not allow them to exceed the request, but such jobs will suffer from poor performance and could even abort.
Users are responsible for configuring applications to use only the requested resources; the IT team will help with the parametrization of the batch system, but software tuning is out of our scope.
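For GROMACS specifically, one simple way to honour the allocation is to derive the thread count from the Slurm environment instead of letting mdrun auto-detect the whole node (a sketch; SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested, hence the default of 1):
# inside the batch script, after loading the module
gmx mdrun -s md.tpr -ntomp ${SLURM_CPUS_PER_TASK:-1}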
How to submit an MPI job
We will calculate π with a multi-core MPI job, using OpenMPI over four nodes and sixteen instances:
[user@cirrus01 ~]$ cp -r /data/tutorial/modulo0/openmpi .
[user@cirrus01 ~]$ cd openmpi
[user@cirrus01 openmpi]$ ls -l
-rw-r----- 1 user group 1518 Sep 8 19:09 cpi_mpi.c
-rw-r----- 1 user group 647 Sep 8 19:59 cpi.sh
[user@cirrus01 openmpi]$ sbatch cpi.sh
Submitted batch job 6959924
[user@cirrus01 openmpi]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS TRES_PER_NODE NODELIST
6959924 short cpi.sh user R 0:05 4 16 N/A hpc[060-063]
How to submit an MPI job (cont.)
[user@cirrus01 openmpi]$ cat slurm-6959924.out
=== Environment ===
=== Compiling Parallel ===
=== Running Parallel ====
pi=3.1415926536607000, error=0.0000000000709068, ncores 16, wall clock time = 24.771612
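The cpi.sh script itself is not reproduced on the slide. A sketch of a script consistent with the output above, compiling the source and then running one MPI process per allocated task (the OpenMPI module name is only an example taken from the earlier listing):
#!/bin/bash
#SBATCH --partition=short
#SBATCH --nodes=4
#SBATCH --ntasks=16
module purge
module load gcc63/openmpi/1.10.7   # example OpenMPI environment
echo "=== Environment ==="
module list
echo "=== Compiling Parallel ==="
mpicc -O2 cpi_mpi.c -o cpi_mpi
echo "=== Running Parallel ===="
srun ./cpi_mpi                     # or: mpirun -np $SLURM_NTASKS ./cpi_mpi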
How to start an interactive session
It may be convenient (and faster) to test and troubleshoot applications interactively from a shell. The user interfaces have a different architecture and are not a good choice for such tests; in these cases you should start an interactive session on the worker nodes.
[user@cirrus01 ~]$ srun -p short --job-name "my_interactive" --pty bash -i
srun: job 6959936 queued and waiting for resources
srun: job 6959936 has been allocated resources
[user@hpc060 ~]$ _
[user@hpc060 ~]$ nvidia-smi
No devices were found
How to start an interactive session (cont.)
[user@cirrus01 ~]$ srun -p short --gres=gpu --job-name "my_interactive" --pty bash -i
srun: job 6959969 queued and waiting for resources
srun: job 6959969 has been allocated resources
[user@hpc063 ~]$ nvidia-smi
+-----------------------------------------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------------------+------------------------------+-------------------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M.|
…
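You can request more resources for the interactive session with the same options used in batch scripts, and leave it with “exit” when you are done (a sketch; the values are examples):
[user@cirrus01 ~]$ srun -p short --ntasks=1 --cpus-per-task=4 --mem=8G --job-name "my_interactive" --pty bash -i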
Production batch system
Differences with the production system
Useful commands
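The command list itself is not reproduced on this slide; as a summary, these are the Slurm and module commands used throughout the tutorial, plus a few standard companions (a sketch, not an exhaustive list):
sbatch <script>        submit a batch job
squeue                 show queued and running jobs
scancel <jobid>        cancel a job
sinfo                  show partitions and node states
sacct -j <jobid>       accounting record of a finished job
srun --pty bash -i     start an interactive session
module avail | load | list | unload | purge    manage the software environment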
Documentation & Helpdesk
Documentation
Q&A