Open an internet browser and enter: https://notebook.ospool.osg-htc.org
Scaling Out Your Research on the Open Science Pool
Showmic Islam & Rachel Lombardi
OSG Research Facilitation
May 31, 2023
Agenda
Log into an OSPool Access Point
Log in using any of the available authentication options.
Launch Data Sciences Notebook
1. Click the “Basic” box
2. Click the orange “Start” button
Log into an OSPool Access Point
Open a Terminal
What is the OSG?
The OSG Consortium builds and operates a set of pools of shared computing and data capacity for distributed high-throughput computing (dHTC).
https://display.opensciencegrid.org
OSG Services
Open Science Pool (OSPool): National resource of computing capacity for high throughput workloads
OSG-operated Access Points: OSG-operated service to submit jobs to the OSPool
Open Science Data Federation (OSDF): Network of data origins (file servers) and caches for data accessibility
Free!
no allocation
1000s of cores
Manage 1000s of jobs
GPUs
High Throughput Computing (HTC)
One of our favorite HTC examples: baking the world’s largest/longest cake
In computational terms: solving a big problem (the world’s longest cake) by executing many small, self-contained tasks (individual cakes) and joining them.
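To make the cake analogy concrete, here is a minimal Python sketch (not from the original slides). The names `bake_small_cake`, `join_cakes`, `solve`, and the chunk size are hypothetical and chosen only for illustration. Each task depends only on its own task number, so the tasks could run in any order and on different machines.

```python
CHUNK = 1000  # hypothetical size of each small task


def bake_small_cake(task_id: int) -> int:
    """One small, self-contained task: sum the squares in this task's own chunk."""
    start = task_id * CHUNK
    return sum(n * n for n in range(start, start + CHUNK))


def join_cakes(pieces) -> int:
    """Join the independent results into the answer to the big problem."""
    return sum(pieces)


def solve(num_tasks: int) -> int:
    # On the OSPool each call would be a separate job; here they run in a loop.
    return join_cakes(bake_small_cake(i) for i in range(num_tasks))
```

Calling `solve(10)` gives the same total as summing the squares of 0 through 9999 in one pass, which is the point: the big problem is recovered by joining the small results.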
Submitting Jobs to the OSPool
Use Cases
Three Researchers
Mei Monte Carlo | Needs to run many random simulations to estimate a statistical value. |
Ben Bioinformatics | Needs to process 100s of genomic data files. |
Tamara Trials | Needs to test the performance of a machine learning model with many (hyper)parameters. |
Job Component Vocabulary
Executable
Software Environment
Input Files
Output Files
Arguments (text input)
Standard error and output (text output)
Job
Use Case 1: Analyzing Multiple Files
Ben Bioinformatics | Needs to process 100s of genomic data files. |
universe = container
container_image = bwa.sif
executable = bwa.sh
#arguments =
transfer_input_files = R1.fastq, R2.fastq, ref.fastq, bwa.sif
#transfer_output_remaps =
error = test.err
output = test.out
queue 1
Jupyter Guest Accounts
In Jupyter
In an open terminal, run:
$ tutorial bwa
Then click on the downloaded folder (tutorial-bwa) and open the “README.ipynb” file.
Job Component Vocabulary - Expanded
Executable
Software Environment
Unique Input Files
Output Files
Arguments (text input)
Standard error and output (text output)
Job
Shared Input Files
What Varies
x N unique inputs = workload
Many jobs, one per sample listed in list.txt:
executable = bwa.sh
arguments = $(sample)
transfer_input_files = $(sample).R1.fastq, $(sample).R2.fastq, ref.fastq, bwa.sif
transfer_output_remaps = "$(sample).sam=results/$(sample).sam"
error = test.$(sample).err
output = test.$(sample).out
queue sample from list.txt
Single job:
executable = bwa.sh
#arguments =
transfer_input_files = SRR1.R1.fastq, SRR1.R2.fastq, ref.fastq, bwa.sif
transfer_output_remaps = "SRR1.sam=results/SRR1.sam"
error = test.err
output = test.out
queue 1
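The `queue sample from list.txt` form needs a list.txt with one sample name per line. A minimal Python sketch of one way to build that list follows. It assumes the files are named `<sample>.R1.fastq` and `<sample>.R2.fastq`, as in the submit file; the function names are hypothetical and not part of the tutorial.

```python
from pathlib import Path

R1_SUFFIX = ".R1.fastq"
R2_SUFFIX = ".R2.fastq"


def sample_names(filenames):
    """Return the sorted sample names that have both an R1 and an R2 file."""
    filenames = list(filenames)  # allow generators; we scan the names twice
    r1 = {f[: -len(R1_SUFFIX)] for f in filenames if f.endswith(R1_SUFFIX)}
    r2 = {f[: -len(R2_SUFFIX)] for f in filenames if f.endswith(R2_SUFFIX)}
    return sorted(r1 & r2)


def write_sample_list(data_dir=".", list_path="list.txt"):
    """Write one sample name per line, the layout `queue ... from` reads."""
    names = sample_names(p.name for p in Path(data_dir).iterdir())
    Path(list_path).write_text("".join(f"{name}\n" for name in names))
    return names
```

Running `write_sample_list()` in the directory that holds the FASTQ files would produce the list.txt; samples missing either read file are left out so that no job starts with incomplete input.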
Queue Multiple Jobs
Syntax | List of Values | Variable Name |
queue N | Integers: 0 through N-1 | $(ProcID) |
queue Var matching pattern* | List of values that match the wildcard pattern. | $(Var); if no variable name is provided, the default is $(Item) |
queue Var in (item1 item2 …) | List of values within parentheses. | $(Var); if no variable name is provided, the default is $(Item) |
queue Var from list.txt | List of values from list.txt where each value is on its own line. | $(Var); if no variable name is provided, the default is $(Item) |
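As a rough illustration of the table, this Python sketch mimics which values each `queue` form would hand to the jobs. It is an emulation written for this explanation, not HTCondor's implementation. The function names are hypothetical, and details such as result ordering and skipping blank lines are assumptions of the sketch.

```python
import fnmatch


def queue_n(n):
    """queue N: the integers 0 through N-1."""
    return list(range(n))


def queue_matching(pattern, filenames):
    """queue Var matching pattern*: names that match the wildcard pattern."""
    return sorted(fnmatch.filter(filenames, pattern))


def queue_in(items):
    """queue Var in (item1 item2 ...): the values inside the parentheses."""
    return items.split()


def queue_from(list_text):
    """queue Var from list.txt: one value per line of the file's text."""
    return [line.strip() for line in list_text.splitlines() if line.strip()]
```

For example, `queue_from("SRR1\nSRR2\n")` yields two values, so the corresponding submit file would create two jobs, one per value.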
In Jupyter
Continue working with the bwa tutorial.
Apply to Your Workflow
Use Case 2: Random Simulations
Mei Monte Carlo | Needs to run many random simulations to estimate a statistical value. |
universe = container
container_image = R.sif
executable = mcpi.R
arguments = $(Process)
#transfer_input_files =
#transfer_output_remaps =
output = $(Process).out
error = $(Process).err
queue 40
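The tutorial's executable is an R script (mcpi.R) whose contents are not shown here. As a stand-in, this minimal Python sketch shows the general pattern such a job could follow: use the process number passed as the argument to seed the random generator, so each of the 40 jobs draws different points, then average the per-job estimates afterwards. The function names and the number of points are hypothetical.

```python
import random


def estimate_pi(process_id: int, num_points: int = 100_000) -> float:
    """Estimate pi from random points in the unit square.

    Seeding with the process number gives each job its own reproducible stream.
    """
    rng = random.Random(process_id)
    inside = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / num_points


def combine(estimates) -> float:
    """Join the per-job estimates by averaging them."""
    estimates = list(estimates)
    return sum(estimates) / len(estimates)
```

In a real job, the process number would come from the command line (the value of `$(Process)`), and each job would print its estimate so that it lands in that job's output file.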
In Jupyter
Back in the terminal tab, run:
$ cd ~
$ tutorial ScalingUp-R
Then click on the downloaded folder and open the “README.ipynb” file.
Apply to Your Workflow
Use Case 3: Different Parameters
Scale Out Your Research on the OSPool - C. Koch, 3/27/23
Tamara Trials | Needs to test the performance of a machine learning model with many (hyper)parameters. |
Use Case 3: Different Parameters, v1
universe = container
container_image = pytorch-gpu.sif
executable = main.py
arguments = $(nnopt)
transfer_input_files = mnist.tar.gz,
#transfer_output_remaps =
request_gpus = 1
queue nnopt from list.txt
list.txt:
--batch-size 4 --epochs 5 --seed 5
Use Case 3: Different Parameters, v2
universe = container
container_image = pytorch-gpu.sif
executable = main.py
arguments = --batch $(b) --epochs $(e)
transfer_input_files = mnist.tar.gz,
#transfer_output_remaps =
request_gpus = 1
queue b,e from list.txt
list.txt:
4, 5
4, 6
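When there are many parameter combinations, typing the two-column list.txt by hand gets tedious. The following Python sketch shows one way to generate the lines; the function name `parameter_lines` and the example values are hypothetical, and the "b, e" layout is taken from the list shown for v2.

```python
from itertools import product


def parameter_lines(batch_sizes, epochs):
    """Return one "b, e" line for every batch-size and epoch combination."""
    return [f"{b}, {e}" for b, e in product(batch_sizes, epochs)]


def list_file_text(batch_sizes, epochs) -> str:
    """Join the lines into the text of a list.txt, one combination per line."""
    return "".join(line + "\n" for line in parameter_lines(batch_sizes, epochs))
```

Writing `list_file_text([4], [5, 6])` to list.txt reproduces the two lines above, and longer value lists scale the number of jobs without any change to the submit file.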
Apply to Your Workflow
Building a Workload
Patterns for Scaling Out
Special Considerations
Additional Considerations
Additional Considerations (Resource)
Resource | Ideal Jobs! (up to 10,000 cores across Jobs, per user!) | Still Very Advantageous! | Less-so, but maybe |
Cores (GPUs) | 1 (1; non-specific type) | <8 (1; specific GPU type) | >8 (or MPI) (multiple) |
Walltime | <10 hrs (or checkpointable) | <20 hrs (or checkpointable) | >20 hrs |
RAM | <few GB | <10s GB | >10s GB |
Input | <500 MB | <10 GB | >10 GB |
Output | <1 GB | <10 GB | >10 GB |
Software | ‘portable’ (pre-compiled binaries, transferable, containerizable, etc.) | most other than → | Licensed software; non-Linux |
Features to support your work
Next Steps
Get a Full Account
Stay Connected With Facilitation
Our team is here to help!!
Showmic Islam
Andrew Owen
Christina Koch
Rachel Lombardi
Mats Rynge
Get Connected With Community
Questions?
Strategize
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 2030508. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Jupyter Access Point
Access Point
Execute Point
/home/user
/condor/scratch
HTCondor