Introduction to Using the Discovery Cluster at NU
This is an informational document prepared in collaboration between ITS and NU faculty and staff.
For questions, contact:
researchcomputing@northeastern.edu
Objective
This guide provides basic information on how to connect to, run calculations on, and transfer data to/from the Discovery Cluster (as of September 2018). It also provides some tips on how to ensure that you are obtaining good performance and that your calculations are efficient. Focusing on performance and efficiency (i.e. more computing per core-hour used) benefits you, as well as the rest of the Discovery users.
This guide is not intended to provide a comprehensive introduction to cluster computing. However, this document does provide links to entry-level resources and step-by-step tutorials. If you are trying to accomplish something that is not described here, please let RC staff know, so they can help you properly implement your tasks.
Finally, this is a living document. If you see ambiguities/errors, or you have suggestions, please let us know. Your fellow Discovery users will appreciate it!
Why read this guide?
Even if you used previous-generation NU resources, it is still important to familiarize yourself with the policies and configuration of the recently-deployed Discovery Cluster.
Major changes include:
Before You Start
Discovery is a Linux-based cluster. If you have never used the command line in Linux, then you will want to get familiar with it before trying to use this resource. Below is information to help you learn the basics of working in a Linux-based environment.
Codecademy intro to the command line
https://www.codecademy.com/learn/learn-the-command-line
In OSX, the command line may be accessed using the “Terminal” application. If you are using a PC, you will likely only use the command line after you connect to the cluster using a Secure Shell. See User-Contributed Tutorials for Windows-specific information.
User-Contributed Tutorials
In addition to this guide, other members of the NU community have kindly volunteered to create additional tutorials. We encourage everyone to consult these other resources, since they complement the current document.
Introduction to Discovery for Windows Users by Cuneyt Eroglu (video)
Step-by-step Examples and Demos for Discovery from Kane Group (wiki)
Discovery Cluster guide and tutorials for Python, Spark, Keras and Pyro provided by the Ioannidis Group (wiki)
Connecting to Discovery
In order to log in to the Discovery cluster, you first need to connect to the front end. This is achieved by establishing a Secure Shell (SSH) connection:
On Linux, or from within the Terminal program when using OSX, you can initiate an SSH connection with the command:
ssh <username>@login.discovery.neu.edu
<username> is your username on discovery.
Windows users: You should use an SSH client, such as PuTTY, to connect. PuTTY may be downloaded at https://www.putty.org. For a video tutorial, see User-Contributed Tutorials.
Transferring Data to/from Discovery
In addition to the front end node, there is also a dedicated data transfer (DT) node for use with sftp, scp or rsync. To ensure stability of the front end, file transfers are only allowed through the data transfer node. Here are some (Linux, or Terminal in OSX) examples for how to transfer data:
scp testfile <username>@xfer.discovery.neu.edu:/scratch/<username>
or
rsync -auv <local directory> <username>@xfer.discovery.neu.edu:<remote directory>
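To copy results back from Discovery to your local machine, simply reverse the source and destination (the directory names are placeholders):
rsync -auv <username>@xfer.discovery.neu.edu:<remote directory> <local directory>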
Windows users: For a video tutorial, see User-Contributed Tutorials
Proper Data Storage Practices
/home directory
Each account is associated with a /home directory. The home directories are intended for storing executables and small persistent files (e.g. general config files, source code, etc). Each user's home directory is limited to 100 GB. NEVER launch a job from your /home directory. In accordance with cluster policies, jobs launched from /home will be terminated without notice.
Proper Data Storage Practices
Shared /scratch directory
/scratch is a 3 PB parallel file server which all users may utilize. Every user should create a subdirectory on /scratch in order to write output from their calculations.
To create a scratch directory, you may use the command: mkdir /scratch/`whoami`
For example, if your login name is john, then your scratch directory would be named /scratch/john
Always launch submitted jobs from /scratch, and transfer your data from /scratch to another resource (e.g. personal/group file server) as soon as possible.
/scratch is for high-speed temporary file access.
/scratch is NOT backed up.
While there is no limit on the amount of data you may store temporarily on /scratch, data removal may be required if the system reaches critically-full levels.
RCC has approved the implementation of purge policies for /scratch. Once implemented, old files (older than approximately 3 months) will be removed automatically. Prior notice will be provided.
Proper Data Storage Practices
ITS staff aims to satisfy your research needs. If you find that additional data storage services are required, please contact RC support and/or your college’s RCC representative. See Policy Questions and Suggestions, for contact information.
Using Modules to Load Executables
If you would like to use pre-built code, rather than compile your own software, you will need to use the "module" command. The basic idea is that, by loading the appropriate modules, you will be able to call executables from the command line with minimal effort (e.g. the commands will be in your path, and libraries will be loaded).
Here are the most commonly-used module options.
module avail : returns a list of available modules
module whatis <module name> : provides information about a specific module, including additional prerequisites
module load <module name> : load the module, so that you may use the executables
module list : list the modules that you have loaded
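For example, a typical session might look like the following (the module name here is illustrative; use "module avail" to see the exact names available on Discovery):
module avail
module whatis python-3.6.6
module load python-3.6.6
module list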
Hardware Descriptions of the Partitions
In the Discovery environment, collections of compute nodes are organized into "partitions". Always submit jobs to specific partitions using sbatch or srun. Never run calculations on the front end node.
Each partition has distinct hardware configurations (e.g. GPU nodes, phi processors, high memory, etc.). There are six publicly-available partitions open to all NU researchers. When submitting jobs, you should always specify a partition name. The names of the public partitions are: general, gpu, fullnode, multigpu, infiniband, and phi.
The hardware details of each partition are described below.
“general” partition
The general partition has a collection of CPU-only nodes. Serial and parallel (multi-core and multi-node) jobs are allowed on this partition.
Each node contains two multi-core CPUs.
Current Node Configurations Available
Dual Intel Xeon E5-2650 @ 2.00GHz, 16 total cores, 128GB memory (7 nodes)
Dual Intel Xeon E5-2680 v2 @ 2.80GHz, 20 total cores, 128GB memory (78 nodes)
Dual Intel Xeon E5-2690 v3 @ 2.60GHz, 24 total cores, 128GB memory (184 nodes)
Hardware and partitions
“gpu” partition
Current Node Configurations Available
Dual Intel Xeon E5-2650 @ 2.00GHz, 16 total cores, 128GB memory
+ one K20 NVIDIA GPGPU (32 nodes)
Dual Intel Xeon E5-2690 v3 @ 2.60GHz, 24 total cores, 128GB memory
Hardware and partitions
“fullnode” partition
This partition is ONLY for jobs that can utilize entire nodes, either in terms of memory or CPUs. If your job requires more than 128 GB of memory, or if you can use all 28 cores with a single submission script, then this partition is appropriate. Each user must apply for access to this partition (See "Applying for Access to Specialized Partitions").
Node Specifications
Dual Intel Xeon E5-2680 v4 @ 2.40GHz, 28 total cores, 256GB memory (416 nodes)
Hardware and partitions
“multigpu” partition
This partition is for jobs that can utilize multiple GPGPUs. Each user must apply for access to this partition (See “Applying for Access to Specialized Partitions”).
Node Specifications
Dual Intel Xeon E5-2680 v4 @ 2.40GHz, 28 total cores, 512GB memory - each node also has 8 NVIDIA K80 GPGPUs (In total, 8 nodes with 64 GPUs)
Dual Intel Xeon E5-2680 v4 @ 2.40GHz, 28 total cores, 512GB memory - each node also has 4 NVIDIA P100 GPGPUs (in total, 8 nodes with 32 GPUs)
Hardware and partitions
“infiniband” partition
Users must apply for access to this partition (See “Applying for Access to Specialized Partitions”).
Each node contains:
Dual Intel Xeon E5-2650 @ 2.00GHz, 16 total cores, 128GB memory (64 nodes)
FDR Infiniband interconnect
Hardware and partitions
“phi” partition - temporarily unavailable
Users must apply for access to this partition (See “Applying for Access to Specialized Partitions”).
Each node contains:
Dual Intel Xeon E5-2650 @ 2.00GHz, 16 total cores, 512GB memory (8 nodes) + one phi coprocessor
Note: The phi coprocessor is not supported by the current version of CentOS (7.5). At this time, we are waiting for a fix to be released. If you need to use these nodes immediately, please file a ticket with RC so that a temporary workaround solution may be explored.
Submitting and Monitoring Jobs on Discovery
In order to perform a calculation on Discovery, you must use SLURM (Simple Linux Utility for Resource Management). The idea is that you ask SLURM for resources (e.g. what type of node, how many cores, how much memory), and then your calculations will be executed once the resources become available.
Submitting and Monitoring Jobs on Discovery
The most common SLURM commands that you will need are:
sbatch <file name> : this will send the job to the scheduler.
srun : If you would prefer to work interactively on a node, you may launch an interactive session with srun.
squeue : see which jobs are currently waiting and running.
scancel <job id> : remove a running or pending job from the queue
scontrol <flags> : find more information about the machine configuration and job settings
seff <job id> : report the computational efficiency of your calculations
Below, we provide examples of how to use these commands appropriately. Even if you have used SLURM in the past, it is useful to review all of the examples below, in order to see how SLURM is configured on Discovery.
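For example, a typical workflow is to submit a script, check on it while it waits and runs, and then check its efficiency once it completes:
sbatch example.script
squeue -u `whoami`
seff <job id>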
sbatch
In order to schedule a job for execution, you need to submit it to SLURM using the command:
sbatch example.script
In this example, the submission script is called example.script. Below, we provide multiple examples of what you could include in this file.
Submitting jobs
Submit Script Format
The general format for a submit script is to provide a set of #SBATCH flags in the header, and then to call the executables in the body of the script.
Note: Most sbatch flags may be given in single-letter or whole-word formats. For example, “-N 1” and “--nodes=1” are equivalent. For transparency, we will use the whole-word convention. To see the complete list of available flags, see man sbatch.
Submitting jobs
Time Limits
The default time limit for all submitted jobs is 24 hours. This is also the maximum allowable wall time. If your job does not complete within the requested time limit, SLURM will automatically terminate the calculation. In addition, if you request more than 24 hours, your job will not launch.
Tip: One factor that SLURM uses to determine job order is the requested time. If you request less time, SLURM may be able to schedule your calculation sooner. For example, if the highest-priority pending job has requested 10 nodes, and SLURM anticipates that 10 nodes will become available in 6 hours, then jobs that require less than 6 hours could be completed before the 10-node job begins. In this case, SLURM will allow these shorter, lower-priority jobs to run while the larger, higher-priority calculation is waiting for available resources.
Submitting jobs
Examples for Serial Job Submission
Submitting jobs
1-core job
If you wanted to run a 1-core job for 4 hours on the general partition, then your submit script would look like the following:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
<commands to execute.>
_____________________________________________________________
Submitting jobs
1-core job + additional memory
By default, SLURM will allow you to use 1GB of memory for every core you have allocated. Here is an example of a 1-core job for 4 hours on the general partition that requires 100 GB of memory:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --mem=100Gb
#SBATCH --partition=general
<commands to execute.>
_____________________________________________________________
Note: If your calculations try to use more memory than you requested, SLURM will automatically kill the job.
Submitting jobs
1-core job with exclusive use of a node
If you wanted to run a 1-core job for 4 hours on the general partition, and you need exclusive access to the node (e.g. perhaps you have high I/O requirements), then you may want to lock down the entire node with --exclusive:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --exclusive
#SBATCH --partition=general
<commands to execute.>
_____________________________________________________________
Submitting jobs
Examples for parallel job submission
Submit script contents
#SBATCH --nodes 2
#SBATCH --tasks-per-node=3
#SBATCH --cpus-per-task=4
[Schematic figure: 2 nodes are reserved; on each node, 3 tasks are reserved (MPI may launch 3 ranks per node), and 4 cores are reserved for each task (e.g. when launching 4 OpenMP threads per MPI rank).]
Schematic example of how to request resources for a program that will employ multi-level MPI-OpenMP parallelization. In this case, reserve 2 nodes, 3 tasks per node (e.g. for use with 3 MPI ranks per node), and 4 cores for each task (e.g. for use with 4 OpenMP threads per rank).
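As a sketch of how such a reservation might be used in the body of the script (the executable name is a placeholder, and the exact launch command depends on the MPI library you load):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun <path to MPI/OpenMP executable>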
8-task job on a single node + additional memory
By default, SLURM will allow you to use 1 GB of memory for every core you have allocated. Here is an example of an 8-task job (e.g. for an 8-rank MPI calculation) that will run on a single node and will require 100 GB of memory:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --mem=100Gb
#SBATCH --partition=general
<commands to execute.>
_____________________________________________________________
Parallel job submission
8-task job on multiple nodes + additional memory
You may want to distribute your tasks across nodes. Here is an example of an 8-task job, where the tasks will be distributed across 4 nodes, with 100 GB of memory requested per node:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --mem=100Gb
#SBATCH --partition=general
<commands to execute.>
_____________________________________________________________
Parallel job submission
8-task, 32-core job on multiple nodes + memory
You may want to distribute your ranks across nodes. Here is an example of an 8-task job, with 4 cores reserved per task, the tasks distributed across 4 nodes, and 100 GB of memory requested per node:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --mem=100Gb
#SBATCH --partition=general
<commands to execute.>
_____________________________________________________________
Parallel job submission
8-task job on multiple nodes + additional memory and exclusive
You may want to distribute your ranks across nodes and have exclusive access to all nodes (e.g. when using multi-level parallelization). Here is an example of an 8-task job, with tasks distributed across 4 nodes, 100 GB of memory requested per node and exclusive use of all nodes.
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=2
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --mem=100Gb
#SBATCH --exclusive
#SBATCH --partition=general
<commands to execute.>
_____________________________________________________________
Examples for GPU job submission
Submitting jobs to the gpu and multigpu partitions is very similar to submitting jobs on the other partitions. The main difference is that you need to tell SLURM how many GPUs you would like to reserve per node. If you do not, the GPUs will not be visible to your executables. You can also optionally specify which types of GPUs are required.
GPU job submission
1 node, using 1 core and 1 GPU
For this type of job, you would want to request the gpu partition.
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
<commands to execute.>
_____________________________________________________________
GPU job submission
1 node, all compute cores and 1 GPU
For this type of job, you would want to request the gpu partition.
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --exclusive
<commands to execute.>
_____________________________________________________________
GPU job submission
4 nodes, all compute cores and 1 GPU per node (4 total GPUs)
For this example, you are only requesting one rank per node.
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --exclusive
<commands to execute.>
_____________________________________________________________
GPU job submission
4 nodes, 12 compute cores and 1 GPU per node (4 total GPUs)
Here, request two tasks/ranks per node (8 tasks, 4 nodes), and 6 cores per task.
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
<commands to execute.>
_____________________________________________________________
GPU job submission
1 node, 8 compute cores and 4 GPUs per node
For multi-GPU-per-node calculations, one must use the multigpu partition.
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=multigpu
#SBATCH --gres=gpu:4
<commands to execute.>
_____________________________________________________________
Note: Do NOT use “--exclusive” with multigpu, unless you want to reserve all compute cores and all GPUs on a node. If you are not going to use every CPU and GPU on a node, then --exclusive would hold the remaining resources idle, while preventing other jobs from running.
Additional Examples of Submission Options
Redirecting stdout/stderr
By default, SLURM will write all stdout and stderr to a single file called: slurm-<job number>.out, where <job number> is assigned at the time of submission. If you would like to write stderr and stdout to specific files, use the flags below:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
<commands to execute.>
_____________________________________________________________
In the above example, "%j" tells SLURM to insert the job number into the file name.
Specialized job submission
Specifying Required Node Features
Since the general partition contains heterogeneous nodes (different core count and speed), you may want to tell SLURM to only run your job on nodes that have specific features. Here is an example where SLURM is told to only run this job on a node that has an Intel E5-2690v3 chip:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
#SBATCH --constraint="E5-2690v3@2.60GHz"
<commands to execute.>
_____________________________________________________________
Specialized job submission
What are the available “Features”?
For a complete list of the available features that may be specified when using SLURM on Discovery, you may use the command:
grep Feature /shared/centos7/etc/slurm/nodes.conf
RC staff is currently preparing a wrapper script with which to easily view the current configuration of all nodes. When it is available, a description will be provided here.
Specialized job submission
Specifying Required GPU Types
If you want to specify that a particular GPU model should be used:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=8
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=gpu
#SBATCH --gres=gpu:k20:1
<commands to execute.>
_____________________________________________________________
The following GPU designations are available:
gpu partition: k20, k40m
multigpu partition: k80, p100 (4 per node)
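The GPU type may also be combined with a count. For example (a sketch; adjust the other requested resources to your needs), to reserve two P100 GPUs on a node in the multigpu partition you could use:
#SBATCH --partition=multigpu
#SBATCH --gres=gpu:p100:2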
Specialized job submission
Excluding Specific Nodes
You may also tell SLURM to exclude specific nodes. This can be useful if you find that specific nodes are problematic. Here is an example where SLURM is told to NOT run this job on node c0100:
_____________________________________________________________
#!/bin/bash
#SBATCH -N 1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
#SBATCH --exclude=c0100
<commands to execute.>
_____________________________________________________________
Specialized job submission
Requesting Specific Nodes
You may also tell SLURM to only run jobs on a set of possible nodes. Here is an example where SLURM is told to only consider running on nodes c0100-c0200:
_____________________________________________________________
#!/bin/bash
#SBATCH -N 1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
#SBATCH --nodelist=c0[100-200]
<commands to execute.>
_____________________________________________________________
Note: if you request more nodes with -N than are listed in --nodelist, additional resources will also be assigned to the job.
Specialized job submission
Using Job Dependencies
Oftentimes, a calculation will require more time than the walltime limit allows. For example, you may have a job that takes 40 hours, but the walltime limit is 24 hours. In that case, you may want to break your calculation into two smaller parts, where the second part only begins after the first part has finished. To include dependencies, you may edit the submit script, or define the dependency on the command line. For these examples, let's assume you have already submitted the first segment of your calculation with the command:
sbatch first.script
You will see a message, such as:
Submitted batch job 45
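If you prefer to define the dependency on the command line rather than in the submit script, you may pass the same flag directly to sbatch (the script name here is illustrative):
sbatch --dependency=afterok:45 second.script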
Specialized job submission
Using Job Dependencies
Now, let’s submit the second job, but include a line in your submit script that will tell SLURM to only start it if the first job finishes without an error.
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
#SBATCH --dependency=afterok:45
<commands to execute.>
_____________________________________________________________
Specialized job submission
Using Job Dependencies
Alternately, you may only want the job to start if the first job has failed:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
#SBATCH --dependency=afternotok:45
<commands to execute.>
_____________________________________________________________
Specialized job submission
Using Job Dependencies
As a third example, you may want the second job to start, regardless of whether the first job finished without an error:
_____________________________________________________________
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=general
#SBATCH --dependency=afterany:45
<commands to execute.>
_____________________________________________________________
Specialized job submission
srun
If you prefer to run a job interactively, you may request a session on a compute node. To do this, please use the command:
srun --pty --export=ALL --tasks-per-node 1 --nodes 1 --mem=10Gb --time=00:30:00 /bin/bash
This example would allocate 1 core on 1 node and 10 GB of memory, and the reservation would be held for 30 minutes. Note: srun automatically logs you in to the compute node; there is no need to additionally ssh to the node. When you are done with your interactive session, you may close the window or type "exit". Note: if you would like to use X forwarding, add the flag --x11.
Note: Since it is very easy to misuse salloc, it is now recommended that all users launch interactive sessions using srun.
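For example (a sketch that combines the interactive session above with the GPU options described earlier), you could request an interactive session with one GPU on the gpu partition:
srun --pty --export=ALL --partition=gpu --gres=gpu:1 --nodes=1 --tasks-per-node=1 --mem=10Gb --time=00:30:00 /bin/bash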
squeue
If you want to see all jobs in the queue, including yours, simply use:
squeue
If you only want to see your jobs:
squeue -u `whoami`
If you only want to see your jobs that are currently running:
squeue -u `whoami` -t RUNNING
scontrol
If you want extensive information about a running (or pending) job, you can use the scontrol command.
scontrol show job <job id>
Where <job id> is the number assigned to the job you would like to check on.
You may also see all SLURM configurations with
scontrol show config
seff
seff is a SLURM tool that will calculate the fraction of reserved resources that were used by a completed job. Here is an example where we will check the efficiency of job 49908:
> seff 49908
Job ID: 49908
Cluster: discovery
User/Group: whitford/users
State: COMPLETED (exit code 0)
Nodes: 8
Cores per node: 28
CPU Utilized: 185-15:16:44
CPU Efficiency: 104.24% of 178-01:55:12 core-walltime
Job Wall-clock time: 19:04:48
Memory Utilized: 753.43 MB
Memory Efficiency: 0.34% of 218.75 GB
Note: You should always strive to use 100% of the CPU time that you reserve. If you are requesting more memory than the default 1 GB/core, then the Memory Efficiency should also be close to 100%. Users who regularly run low-efficiency jobs will have reduced access to the resource.
Software-specific tips
While the above description focused on how to reserve resources, the next step is to effectively use your code. In this section, we provide some guidance on how to use specific applications/languages.
Note: If you are an expert with a particular software package, and you would like to provide tips for use on Discovery, please let us know.
Running Jupyter Notebook
You may want to use a GPU on the cluster to speed up your program, and Jupyter Notebook to have better control over your code.
Since there is no web browser on the cluster, the notebook must be accessed in a different way.
Generate a config file, since we are not going to use the default:
$ jupyter notebook --generate-config
Jupyter Tips
Put the following lines in the config file and save. Use any port number between 1025 and 65535.
c = get_config()
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
Check the IP address of a compute node and run Jupyter Notebook on it.
Open a browser locally and enter the IP address and port number. You should then be able to utilize the resources on Discovery and view the results in your local browser.
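A minimal sketch of these steps, assuming you have already loaded a module that provides Jupyter (the srun options follow the interactive-session example in the srun section, and hostname -I prints the compute node's IP address):
srun --pty --export=ALL --partition=gpu --gres=gpu:1 --nodes=1 --tasks-per-node=1 --mem=10Gb --time=02:00:00 /bin/bash
hostname -I
jupyter notebook
You would then point your local browser to http://<node IP address>:<port>, using the port number you set in the config file.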
Jupyter Tips
Adding your own python packages
If you have a deadline and cannot wait for a package installation request to be fulfilled, you can install the package under your home directory. The only additional step is to tell the Python interpreter where to find it.
Install the package from source, or with other virtual environment tools, such as conda.
In your Python code, put the following lines at the top:
import sys
sys.path.append(PATH_TO_THE_PACKAGE)
Then you should be able to use the Python installation on Discovery along with the packages you installed.
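For instance (a sketch; the directory and package name are illustrative), you could install a package into a directory under your home area with pip, and then use that directory's path in the sys.path.append() call above:
pip install --target=$HOME/python-packages <package name>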
Known Issues
As with any migration process, the transition to the new Discovery cluster will require fine-tuning. Here are the known issues that are currently being worked on:
Policy Questions and Suggestions
Usage policies for the Discovery Cluster are established by the NU Research Computing Committee, which is composed of ITS staff and faculty representatives from every college. We aim to provide thoughtful policies that will support research activities of all users. If you have suggestions on how to improve the utility of the Discovery Cluster, always feel free to reach out. More information about NURCC may be found at: https://web.northeastern.edu/rcac/
Applying for Access to Specialized Partitions
While access to the "general" and "gpu" partitions is automatically granted to all Discovery account holders, the "fullnode", "multigpu" and "infiniband" partitions require that an additional application be completed. Since these are more powerful (and expensive) resources, the application is intended to ensure that you know how to use these nodes appropriately.
The new application for fullnode is now available on the RC page. The application for multigpu is currently being revised.
Using Discovery in Courses
In addition to serving the research community, the Discovery cluster is also a valuable educational resource. If you are an instructor and would like to use the cluster in your class, it is highly recommended that you request support through RCC at least 2 months prior to the beginning of the semester. Ideally, you should coordinate your request with your college's RCC representative. While prior approval is not strictly required, ITS support cannot be guaranteed if it is not obtained.
You may find a recently-approved request here. To streamline the review process, please model your request after this example.
Scheduled Downtime
In order to ensure that the Discovery cluster remains stable and is able to support all users, it is essential that the administrators are able to occasionally power down the system. To allow for this, the Research Computing Committee recommended that ITS/RC implement the following regularly-scheduled maintenance window.
Starting January 1st, 2019, RC/ITS will have a standing reservation for routine maintenance on the cluster. The maintenance window will be from 8am-12pm on the first Monday of each month. If RC/ITS determines that they need to utilize this window, users will be notified 1 week prior. During the downtime, no jobs will be able to run. In addition, SLURM will not initiate calculations that are not expected to terminate before the service window begins.
Getting Help
Self-Serve portal: As of September 2018, there is a Self-Service portal available for filing Research-Computing-specific help requests. Since this is a trackable system, it is the preferred mechanism for obtaining RC assistance. You may file a ticket at: https://northeastern.service-now.com/research
Man pages: This guide only provides a few examples of each command. For more complete listings of options, check the man pages (e.g. issue "man sbatch" on the command line).
In-person consultation: You can always stop by the RC office in 2 Ell Hall to discuss issues you may be having, or to get advice.
Training workshops: ITS/RC organizes introductory workshops for new users to Discovery. Check the RC page for upcoming events.
Discuss with other users: There is a Discovery discussion listserv open to all NU members. You can sign up for “discovery-user-forum” at listserv.neu.edu. See details on next slide.
Email: You may also email the RC computing staff directly at: researchcomputing@northeastern.edu
Getting Help: User Forum Listserv
There is a Discovery discussion listserv open to all NU members. You can join the “discovery-user-forum” by doing the following:
From the email account that you want to use, send the following email:
To: listserv@listserv.neu.edu
Subject: <leave blank>
Content: subscribe discovery-user-forum <firstname> <lastname>
You will receive a confirmation email with a link.
Access archived posts by creating an account with your email address at https://listserv.neu.edu
Go to "Subscriber's Corner" to locate the list, then click on the list to see the archived emails.