Intro_to_Summit_Q&A
Please ask questions below and we will answer them in the following column either during or after the event.
Questions and Answers
Q: How many FLOPS does the CPU provide?
A: ~1.1 TF with 22 cores @ 3.07 GHz
Q: Where can we find the slides?
A: Slides and recordings will be made available on the OLCF website following the presentation.
Q: When is connection from outside to GPFS planned (Globus or otherwise)?
A: This is currently planned for the production I/O system, Alpine, once it is accepted. It is possible, though, that these tools will become available during installation of the production system.
Q: Is there an environment variable for xlc++? (for the path)
A: It looks like the C and C++ binaries are found under the same directory. I'll ask Matt to confirm that the same environment variable should be used for both.

$ ls ${OLCF_XLC_ROOT}/bin
c89 c89_r c99 c99_r cc cc_r cleanpdf genhtml mergepdf showpdf xlC xlC_r xlc xlc++ xlc++_r xlc_configure xlc_r

First response is correct: the C and C++ compilers xlc and xlC/xlc++ are the same binary, which changes behavior based only on the name of the symlink used to invoke it. - MPB
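The name-based dispatch described in that answer can be illustrated with a small shell sketch. This is not how the IBM XL compilers are implemented internally; it only shows the general technique of one program choosing its behavior from the name it was invoked under. The names `demo_compiler`, `demo_cc`, and `demo_c++` are hypothetical.

```shell
# Sketch of name-based dispatch: one script, two symlinks, behavior chosen from $0.
# demo_compiler, demo_cc and demo_c++ are hypothetical names, not IBM XL files.
demo_dir=$(mktemp -d)

cat > "$demo_dir/demo_compiler" <<'EOF'
#!/bin/sh
# Pick a language mode from the name this script was invoked under.
case "$(basename "$0")" in
    demo_c++) echo "C++ mode" ;;
    demo_cc)  echo "C mode" ;;
    *)        echo "unknown mode" ;;
esac
EOF
chmod +x "$demo_dir/demo_compiler"

# Both symlinks point at the same file.
ln -s demo_compiler "$demo_dir/demo_cc"
ln -s demo_compiler "$demo_dir/demo_c++"

c_mode=$("$demo_dir/demo_cc")
cxx_mode=$("$demo_dir/demo_c++")
echo "$c_mode / $cxx_mode"

rm -rf "$demo_dir"
```

Because both links resolve to one file, a single environment variable pointing at the shared directory is enough to locate either compiler.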
Q: What is the equivalent of showstart?
A: The only way for LSF to predict the start time of a job (that I am aware of) requires a "simulation" server to be running. We do not currently run this service, but I (Matt Belhorn) am also interested in this feature being available. I will revisit this with our operations group to see about getting it set up if possible. With the simulation server running, "bjobs" can show a predicted start time.
Q: Is there a plan to support OpenMPI? Much open-source software is built on OpenMPI.
A: Spectrum MPI is based on OpenMPI and uses the same compiler wrappers. SMPI can build virtually anything that was written for OMPI.
Q: Does that mean we can also issue mpirun? I tried to build PyTorch with Spectrum MPI and it didn't work because it was expecting mpirun.
A: There technically is (or was, it may have changed) a way to invoke mpirun, but we don't support it for a variety of reasons, ranging from allocating resources to the job to tracking for our review audits. If you send a message to us directly, we can work with you on this. I'm aware of some ongoing tickets working to get a functional PyTorch with torch.distributed using the MPI backend.
Q: Can I also create a resource set without a GPU? With reference to slide 14 on jsrun, starting some resource sets without a GPU would allow more configurations.
A: Omitting the `-g` flag, or passing jsrun `-g 0`, will create a resource set with no GPUs.
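As a rough illustration of a GPU-free resource set, the sketch below only assembles and prints a jsrun command line; it does not launch anything. The helper `build_jsrun`, the application name `./my_app`, and the specific counts are hypothetical. The `-n`, `-a`, and `-c` flags (resource sets, tasks per resource set, cores per resource set) are not mentioned in the answer above and should be checked against `jsrun --help` on the system.

```shell
# Hypothetical helper: prints (does not run) a jsrun command line.
# Arguments: resource sets, tasks per set, cores per set, GPUs per set, then the program.
build_jsrun() {
    nrs=$1
    tasks=$2
    cores=$3
    gpus=$4
    shift 4
    echo "jsrun -n $nrs -a $tasks -c $cores -g $gpus $*"
}

# Two CPU-only resource sets: 1 task and 4 cores each, no GPUs (-g 0).
cpu_only_cmd=$(build_jsrun 2 1 4 0 ./my_app)
echo "$cpu_only_cmd"
```

Printing the command first makes it easy to review the layout before submitting it inside a batch job.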
Q: Do all resource sets in a jsrun have to be identical? (I assume yes, just making sure)
A: Currently, they must all be identical. However, the last I heard, there are plans to let you specify a file that allows using heterogeneous resource sets.
Follow-up: That's cool!
Q: What does '0 threads per task' mean (on the aprun vs jsrun slide)?
A: He is indicating that there are no OpenMP threads.
Q: What about the $OMP_NUM_THREADS variable? Do we still need to specify that?
A: Yes, with both aprun and jsrun you would still need to specify OMP_NUM_THREADS unless you set the number of threads using the API in your code. The `-d` (aprun) and `-bind` (jsrun) flags simply state how many cores are available per MPI rank. So, for example, if you want 8 OpenMP threads but assign only 1 core per MPI rank, all 8 OpenMP threads would run on the same physical core (instead of 1 OpenMP thread on each physical core).
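The oversubscription example in that answer comes down to simple arithmetic, sketched below. `threads_per_core` is a hypothetical helper, and the sketch assumes threads are spread as evenly as possible over the cores assigned to a rank; it does not query the scheduler or the OpenMP runtime.

```shell
# Sketch: how many OpenMP threads end up sharing each physical core.
# threads_per_core is a hypothetical helper; it rounds up (ceiling division).
threads_per_core() {
    omp_threads=$1      # value of OMP_NUM_THREADS
    cores_per_rank=$2   # physical cores assigned to each MPI rank
    echo $(( (omp_threads + cores_per_rank - 1) / cores_per_rank ))
}

oversubscribed=$(threads_per_core 8 1)   # 8 threads, 1 core per rank
balanced=$(threads_per_core 8 8)         # 8 threads, 8 cores per rank
echo "$oversubscribed $balanced"         # prints: 8 1
```

A result above 1 means threads are competing for the same core, which is the situation the answer warns about.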
Q: What advantages do resource sets provide? Why not always handle nodes the way I did on Titan? When should I think about using resource sets properly?
A: Resource sets allow you to organize a node differently based on need. The task is simpler on Titan, since each node contains only a single GPU. If you do not need the resource set organization, you can simply use one resource set per node that contains all of the node's resources.
Q: Are the slides available later for the public/participants as well?
A: Yes, the slides and recordings will be made available on the OLCF website following the presentation.
Q: Feel free to skip if covered later on: Say I write out my checkpoints to the SSD and my job dies (user error, hardware error, out of time, ...), can I retrieve the checkpointed data? NVM :) (both the tool and "nevermind", pardon the pun)
A: SCR provides the ability to XOR parity data to other nodes, so if a node fails in an application and you have extra nodes, your application can be automatically restarted from that XOR'd data. Generally, restarts will come from persisted files.
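The XOR idea behind that answer can be shown with a toy example. This is not SCR's implementation or on-disk format; it only demonstrates why XOR parity lets one lost piece be rebuilt from the survivors. The variable names and values are hypothetical stand-ins for per-node checkpoint data.

```shell
# Toy XOR parity: three "nodes" each hold one integer of checkpoint data.
node_a=23
node_b=42
node_c=7

# Parity is the XOR of all pieces; imagine it stored on another node.
parity=$(( node_a ^ node_b ^ node_c ))

# Suppose node_b fails. XOR-ing the parity with the surviving pieces
# cancels them out and leaves the lost value.
recovered_b=$(( parity ^ node_a ^ node_c ))
echo "$recovered_b"   # prints: 42
```

The same cancellation works for whichever single piece is lost, but not if two pieces disappear at once, which is one reason restarts generally come from persisted files.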
Q: Earlier it was said that virtualenv is only supported for python2. Are there plans to add that for python3?
A: Python 3 incorporated the third-party virtualenv into its standard library as the module `venv`, so it's already built into the interpreter.

$ virtualenv my_python2_environment
$ python3 -m venv my_python3_environment
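A minimal sketch of checking that a `venv` environment works once created, assuming `python3` is on the PATH. The directory name `demo_env` is hypothetical, and `--without-pip` is used here only to keep the demonstration quick and offline; drop it if you want pip installed into the environment.

```shell
# Sketch: create a Python 3 environment with the built-in venv module and inspect it.
# demo_env is a hypothetical name; --without-pip skips bootstrapping pip.
env_dir="$(mktemp -d)/demo_env"
python3 -m venv --without-pip "$env_dir"

# Inside a venv, sys.prefix points at the environment and differs from sys.base_prefix.
in_venv=$("$env_dir/bin/python" -c 'import sys; print(sys.prefix != sys.base_prefix)')
echo "$in_venv"
```

Calling the environment's own `bin/python` directly avoids having to source the `activate` script in non-interactive batch jobs.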