1 of 52

Open an internet browser and enter: https://notebook.ospool.osg-htc.org


2 of 52

Scaling Out Your Research on the Open Science Pool

Showmic Islam & Rachel Lombardi

OSG Research Facilitation

May 31, 2023


3 of 52

Agenda

  • Get connected to a personal Access Point / resource pool (via the Jupyter interface)
  • Introduction to High Throughput Computing and the Open Science Pool
  • Three use cases
    • Analyzing multiple files (hands-on)
    • Running multiple random simulations (hands-on)
    • Running multiple machine learning models
  • Planning your own workload
  • Next steps on the OSPool


4 of 52

Open an internet browser and enter: https://notebook.ospool.osg-htc.org


5 of 52

Log into an OSPool Access Point

Log in using any of the available authentication options. Some choices:

  • Institution account
  • Google (e.g., Gmail)
  • GitHub
  • ORCID


6 of 52

Launch Data Sciences Notebook

1. Click the “Basic” box

2. Click orange “Start” button


7 of 52

Log into an OSPool Access Point

Open a Terminal


8 of 52

What is the OSG?


The OSG Consortium builds and operates a set of pools of shared computing and data capacity for distributed high-throughput computing (dHTC).

https://display.opensciencegrid.org

9 of 52

OSG Services

Open Science Pool (OSPool): National resource of computing capacity for high throughput workloads

OSG-operated Access Points: OSG-operated service to submit jobs to the OSPool

Open Science Data Federation (OSDF): Network of data origins (file servers) and caches for data accessibility

  • Free! (no allocation required)
  • 1000s of cores
  • Manage 1000s of jobs
  • GPUs

10 of 52

High Throughput Computing (HTC)


One of our favorite HTC examples: baking the world’s largest/longest cake

In computational terms: solving a big problem (the world’s longest cake) by executing many small, self-contained tasks (individual cakes) and joining them.
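The cake analogy maps directly onto code. A minimal Python sketch of the pattern (the functions are illustrative stand-ins, not part of any OSG tool):

```python
# HTC pattern: many small, self-contained tasks, combined at the end.

def bake_one_cake(segment):
    # A self-contained task: it depends only on its own input,
    # so every task can run independently (e.g. as its own job).
    return segment * 2  # stand-in for the real per-task work

def join(results):
    # The combining step that stitches independent results together.
    return sum(results)

segments = list(range(10))                      # the big problem, split up
results = [bake_one_cake(s) for s in segments]  # each call could be one job
total = join(results)                           # 2 * (0 + 1 + ... + 9) = 90
```

On the OSPool, each `bake_one_cake` call would become one HTCondor job; the joining step runs after all jobs finish.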

11 of 52

Submitting Jobs to the OSPool


12 of 52

Use Cases


13 of 52

Three Researchers


Mei Monte Carlo

Needs to run many random simulations to estimate a statistical value.

Ben Bioinformatics

Needs to process 100s of genomic data files.

Tamara Trials

Needs to test the performance of a machine learning model with many (hyper)parameters.

14 of 52

Job Component Vocabulary

A job consists of:

  • Executable
  • Software environment
  • Input files
  • Output files
  • Arguments (text input)
  • Standard error and output (text output)

15 of 52

Use Case 1: Analyzing Multiple Files


16 of 52

Use Case 1: Analyzing Multiple Files

  • Software: bwa aligner
  • Executable: Shell script with bwa commands
  • Arguments: None (for now…)
  • Input files:
    • Many pairs of fastq files
    • Reference file
  • Output files: aligned .sam files


Ben Bioinformatics

Needs to process 100s of genomic data files.

17 of 52

Use Case 1: Analyzing Multiple Files

  • Software: bwa aligner
  • Executable: Shell script with bwa commands
  • Arguments: None (for now…)
  • Input files:
    • Many pairs of fastq files
    • Reference file
  • Output files: aligned .sam files


universe = container
container_image = bwa.sif
executable = bwa.sh
#arguments =
transfer_input_files = R1.fastq, R2.fastq, ref.fastq, bwa.sif
#transfer_output_remaps =
error = test.err
output = test.out

queue 1

18 of 52

Jupyter Guest Accounts


19 of 52

In Jupyter

In an opened terminal, run:

$ tutorial bwa

Then click on the downloaded folder (tutorial-bwa) and open the “README.ipynb” file.


20 of 52

Job Component Vocabulary - Expanded

A job consists of:

  • Executable
  • Software environment
  • Shared input files
  • Unique input files (what varies)
  • Arguments (text input; what varies)
  • Output files
  • Standard error and output (text output)

One job × N unique inputs = a workload.

21 of 52

Use Case 1: Analyzing Multiple Files


Single job:

executable = bwa.sh
#arguments =
transfer_input_files = SRR1.R1.fastq, SRR1.R2.fastq, ref.fastq, bwa.sif
transfer_output_remaps = "SRR1.sam=results/SRR1.sam"
error = test.err
output = test.out

queue 1

Many jobs, one per sample:

executable = bwa.sh
arguments = $(sample)
transfer_input_files = $(sample).R1.fastq, $(sample).R2.fastq, ref.fastq, bwa.sif
transfer_output_remaps = "$(sample).sam=results/$(sample).sam"
error = test.$(sample).err
output = test.$(sample).out

queue sample from list.txt

22 of 52

Queue Multiple Jobs

Syntax                          List of Values                            Variable Name
queue N                         Integers: 0 through N-1                   $(ProcID)
queue Var matching pattern*     Values that match the wildcard pattern    $(Var)
queue Var in (item1 item2 …)    Values listed within the parentheses      $(Var)
queue Var from list.txt         Values from list.txt, one per line        $(Var)

If no variable name is provided, the default is $(Item).
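The `queue sample from list.txt` form can be pictured as a loop over the lines of the file: HTCondor creates one job per line and substitutes that line's value wherever $(sample) appears. A minimal Python sketch of that expansion (illustrative only; HTCondor performs this internally):

```python
# Illustrative expansion of "queue sample from list.txt":
# one job per line, with $(sample) replaced by that line's value.

submit_template = "transfer_input_files = {sample}.R1.fastq, {sample}.R2.fastq"

list_txt = ["SRR1", "SRR2", "SRR3"]  # stand-in for list.txt, one value per line

jobs = [submit_template.format(sample=s) for s in list_txt]
for job in jobs:
    print(job)
```

Three lines in list.txt means three jobs, each with its own pair of fastq files.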

23 of 52

In Jupyter

Continue working with the bwa tutorial.


24 of 52

Apply to Your Workflow

  • Processing MRI or other imaging data
  • Molecule/protein docking
  • Simulations that are described by an input file
  • Feature extraction
  • …anything that has many unique input files, each representing a self-contained job producing unique output.


25 of 52

Use Case 2: Random Simulations


26 of 52

Use Case 2: Random Simulations

  • Software: R
  • Executable: R script
  • Arguments: Numeric value
  • Input files: none
  • Output files: none
  • Output text: calculated value


Mei Monte Carlo

Needs to run many random simulations to estimate a statistical value.

27 of 52

Queue Multiple Jobs

Syntax                          List of Values                            Variable Name
queue N                         Integers: 0 through N-1                   $(ProcID)
queue Var matching pattern*     Values that match the wildcard pattern    $(Var)
queue Var in (item1 item2 …)    Values listed within the parentheses      $(Var)
queue Var from list.txt         Values from list.txt, one per line        $(Var)

If no variable name is provided, the default is $(Item).

28 of 52

Use Case 2: Random Simulations

  • Software: R
  • Executable: R script
  • Arguments: Numeric value
  • Input files: none
  • Output files: none
  • Output text: calculated value


universe = container
container_image = R.sif
executable = mcpi.R
arguments = $(Process)
#transfer_input_files =
#transfer_output_remaps =
output = $(Process).out
error = test.err

queue 40
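The tutorial's mcpi.R performs this estimation in R. Here is the same Monte Carlo idea sketched in Python (the function name and sample count are illustrative): draw random points in the unit square and use the fraction landing inside the quarter circle to estimate π.

```python
import random

def estimate_pi(n_samples, seed):
    # Each job would call this with a different seed (e.g. $(Process)),
    # making the runs independent; their estimates can be averaged later.
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4.
    return 4.0 * inside / n_samples

print(estimate_pi(100_000, seed=0))  # an estimate near 3.14
```

Running 40 such jobs (`queue 40`) with seeds 0–39 gives 40 independent estimates whose average is more accurate than any single run.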

29 of 52

In Jupyter

Back in the terminal tab, run:

$ cd ~

$ tutorial ScalingUp-R

Then click on the downloaded folder and open the “README.ipynb” file.


30 of 52

Apply to Your Workflow

  • Statistical estimations
  • Monte Carlo approaches
  • …anything where you have the same code that you want to run with random (or minimally specified) input arguments.


31 of 52

Use Case 3: Different Parameters


32 of 52

Use Case 3: Different Parameters

  • Software: Python + Pytorch
  • Executable: Python script
  • Arguments: Neural net options
  • Input files: MNIST data set
  • Output files: Summary file

  • Needs GPUs


Tamara Trials

Needs to test the performance of a machine learning model with many (hyper)parameters.

33 of 52

Use Case 3: Different Parameters, v1

  • Software: Python + Pytorch
  • Executable: Python script
  • Arguments: Neural net options
  • Input files: MNIST data set
  • Output files: Summary file

  • (Optional) Wants GPUs


universe = container
container_image = pytorch-gpu.sif
executable = main.py
arguments = $(nnopt)
transfer_input_files = mnist.tar.gz,
#transfer_output_remaps =
request_gpus = 1

queue nnopt from list.txt

Example line in list.txt:
--batch-size 4 --epochs 5 --seed 5

34 of 52

Use Case 3: Different Parameters, v2

  • Software: Python + Pytorch
  • Executable: Python script
  • Arguments: Neural net options
  • Input files: MNIST data set
  • Output files: Summary file

  • (Optional) Wants GPUs


universe = container
container_image = pytorch-gpu.sif
executable = main.py
arguments = --batch $(b) --epochs $(e)
transfer_input_files = mnist.tar.gz,
#transfer_output_remaps =
request_gpus = 1

queue b,e from list.txt

Example lines in list.txt:
4, 5
4, 6
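A sketch of how a script like main.py might consume those arguments (a hypothetical parser; the tutorial's real options may differ):

```python
import argparse

def parse_args(argv=None):
    # Hypothetical options matching "arguments = --batch $(b) --epochs $(e)".
    parser = argparse.ArgumentParser(description="Train one model configuration")
    parser.add_argument("--batch", type=int, required=True,
                        help="mini-batch size for this run")
    parser.add_argument("--epochs", type=int, required=True,
                        help="number of training epochs for this run")
    return parser.parse_args(argv)

# HTCondor would supply these values from one line of list.txt.
args = parse_args(["--batch", "4", "--epochs", "5"])
print(args.batch, args.epochs)
```

Each queued job then trains one (batch, epochs) combination, so the whole hyperparameter grid runs in parallel.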

35 of 52

Apply to Your Workflow

  • Parameter search/exploration
  • Sensitivity analysis
  • Generating a “heat map” of results


36 of 52

Building a Workload


37 of 52

Patterns for Scaling Out

  • “What is a job?”
    • Define your unit of work and how many you need to run
    • Identify components (shared and unique/varied) of a single job
  • Generate Inputs
    • Do you need to generate unique input files?
    • How about a list of inputs for your jobs?
  • Plan to summarize
    • What steps, if any, are needed to combine results?


38 of 52

Patterns for Scaling Out

  • Write modular code
    • Write one executable that 1) takes in unique inputs and 2) produces unique outputs.
  • Think about organization
    • How do you want to arrange the components for your jobs?
  • Test, test, test
    • Always test one job, then a small batch before doing a large run.
    • How much space is needed for job components?


39 of 52

Special Considerations

  • Consequences of distributed, heterogeneous resources
    • No shared file system
    • Practical limitations to data size (up to ~10GB / job on the OSDF)
    • Varied operating system, base software installation
    • Opportunistic resources: jobs can be interrupted (the scheduler automatically manages and re-runs interrupted jobs)
      • If you are interested in guaranteed resources, talk to us about the new PATh Facility! https://path-cc.io/facility/ or https://portal.path-cc.io/
  • Available to US-based academic, non-profit, or government researchers and their collaborators


40 of 52

Additional Considerations

  • Data movement
    • For input/output files between 1 and 20 GB, you need a scalable data staging tool
    • Open Science Data Federation
      • Network of data origins and caches to efficiently move data
    • Most OSPool Access Points have an associated data origin.


41 of 52

Additional Considerations

  • Software environment
    • You have to bring your software environment along with each job
    • Containers – we provide several, with directions for building your own and virtual office hours for consultation
    • File-based – bring along binary files or zipped software directories
      • (Conda environments can be used this way)


42 of 52

Additional Considerations (Resource)

              Ideal Jobs!*                Still Very Advantageous!     Less-so, but maybe
Cores (GPUs)  1 (1; non-specific type)    <8 (1; specific GPU type)    >8 or MPI (multiple)
Walltime      <10 hrs**                   <20 hrs**                    >20 hrs
RAM           < a few GB                  < 10s of GB                  > 10s of GB
Input         <500 MB                     <10 GB                       >10 GB
Output        <1 GB                       <10 GB                       >10 GB
Software      'portable'***               most others                  licensed software; non-Linux

*Up to 10,000 cores across jobs, per user!
**Or checkpointable.
***Pre-compiled binaries, transferable, containerizable, etc.

43 of 52

Additional Considerations

  • Multi-Step workflows
    • DAGMan – comes with HTCondor
    • Pegasus - https://pegasus.isi.edu/


44 of 52

Features to support your work

  • Access Point provides a home for scaling out computing
    • Access data from multiple sources
    • Utilize compute capacity from multiple providers
    • Use multiple interfaces (command line, Jupyter)
  • After initial set up, HTCondor is designed for easy multiplication of tasks
  • OSPool has extensive CPU / GPU capacity
  • OSPool Access Points are accessible to US-based researchers and their collaborators


45 of 52

Next Steps


46 of 52

Get a Full Account

  • You can continue to test-drive simple jobs using Jupyter and the guest access we used today.
  • Request a full account by following the sign-up instructions in the OSPool documentation.
  • A full account gives you access to a full Access Point, a persistent (but not backed up!) home directory, all the compute capacity of the OSPool, and a data origin with space for larger data files.


47 of 52

Stay Connected With Facilitation

Our team is here to help!

  • Email: support@osg-htc.org
  • Office Hours:
    • Tues/Thurs, contact us for zoom link (Help Page)
  • Training Opportunities:
    • 1st/3rd Tuesday of the month (Training Page)
  • Guides: https://portal.osg-htc.org/documentation/


Showmic Islam · Andrew Owen · Christina Koch · Rachel Lombardi · Mats Rynge

48 of 52

Get Connected With Community

  • Conference: Throughput Computing 2023
    • July 10-14, in Madison
    • OSG organizational meeting + HTCondor Software Suite user conference
    • Registration and event link: https://agenda.hep.wisc.edu/event/2014/


49 of 52

Questions?


50 of 52

Strategize

  • What are your most pressing computational/data needs?
  • Assuming we have the computing/data capacity you need, what next steps would get your work on the OSPool?
  • What process would get you there?
    • Hackathon with teammates
    • Come to OSPool office hours or attend a training
    • Set a reminder to come back to it


51 of 52

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 2030508. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


52 of 52

Jupyter Access Point

[Diagram: HTCondor connects the Access Point (/home/user) to Execute Points (/condor/scratch).]