1 of 52

Open an internet browser and enter: https://notebook.ospool.osg-htc.org


2 of 52

Scaling Out Your Research on the Open Science Pool

Showmic Islam & Rachel Lombardi

OSG Research Facilitation

May 31, 2023


3 of 52

Agenda

  • Get connected to a personal Access Point / resource pool (via the Jupyter interface)
  • Introduction to High Throughput Computing and the Open Science Pool
  • Three use cases
    • Analyzing multiple files (hands-on)
    • Running multiple random simulations (hands-on)
    • Running multiple machine learning models
  • Planning your own workload
  • Next steps on the OSPool


4 of 52

Open an internet browser and enter: https://notebook.ospool.osg-htc.org


5 of 52

Log into an OSPool Access Point

Log in using any of the available authentication options. Some choices:

  • Institution account
  • Google (e.g., Gmail)
  • GitHub
  • ORCID


6 of 52

Launch Data Sciences Notebook

1. Click the “Basic” box

2. Click orange “Start” button


7 of 52

Log into an OSPool Access Point

Open a Terminal


8 of 52

What is the OSG?


The OSG Consortium builds and operates a set of pools of shared computing and data capacity for distributed high-throughput computing (dHTC).

https://display.opensciencegrid.org

9 of 52

OSG Services

Open Science Pool (OSPool): National resource of computing capacity for high throughput workloads

OSG-operated Access Points: OSG-operated service to submit jobs to the OSPool

Open Science Data Federation (OSDF): Network of data origins (file servers) and caches for data accessibility

  • Free! (no allocation required)
  • 1000s of cores
  • Manage 1000s of jobs
  • GPUs

10 of 52

High Throughput Computing (HTC)


One of our favorite HTC examples: baking the world’s largest/longest cake

In computational terms: solving a big problem (the world’s longest cake) by executing many small, self-contained tasks (individual cakes) and joining them.
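The cake analogy maps directly onto code. A minimal Python sketch of the pattern (the functions are illustrative stand-ins, not part of any OSG tool):

```python
# HTC pattern: many small, self-contained tasks, combined at the end.

def bake_one_cake(segment):
    # A self-contained task: it depends only on its own input,
    # so every task can run independently (e.g. as its own job).
    return segment * 2  # stand-in for the real per-task work

def join(results):
    # The combining step that stitches independent results together.
    return sum(results)

segments = list(range(10))                      # the big problem, split up
results = [bake_one_cake(s) for s in segments]  # each call could be one job
total = join(results)                           # 2 * (0 + 1 + ... + 9) = 90
```

On the OSPool, each `bake_one_cake` call would become one HTCondor job; the joining step runs after all jobs finish.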

11 of 52

Submitting Jobs to the OSPool


12 of 52

Use Cases


13 of 52

Three Researchers


Mei Monte Carlo

Needs to run many random simulations to estimate a statistical value.

Ben Bioinformatics

Needs to process 100s of genomic data files.

Tamara Trials

Needs to test the performance of a machine learning model with many (hyper)parameters.

14 of 52

Job Component Vocabulary

A job consists of:

  • Executable
  • Software environment
  • Input files
  • Output files
  • Arguments (text input)
  • Standard error and output (text output)

15 of 52

Use Case 1: Analyzing Multiple Files


16 of 52

Use Case 1: Analyzing Multiple Files

  • Software: bwa aligner
  • Executable: Shell script with bwa commands
  • Arguments: None (for now…)
  • Input files:
    • Many pairs of fastq files
    • Reference file
  • Output files: aligned .sam files


Ben Bioinformatics

Needs to process 100s of genomic data files.

17 of 52

Use Case 1: Analyzing Multiple Files

  • Software: bwa aligner
  • Executable: Shell script with bwa commands
  • Arguments: None (for now…)
  • Input files:
    • Many pairs of fastq files
    • Reference file
  • Output files: aligned .sam files


universe = container
container_image = bwa.sif
executable = bwa.sh
#arguments =
transfer_input_files = R1.fastq, R2.fastq, ref.fastq, bwa.sif
#transfer_output_remaps =
error = test.err
output = test.out

queue 1

18 of 52

Jupyter Guest Accounts


19 of 52

In Jupyter

In an opened terminal, run:

$ tutorial bwa

Then click on the downloaded folder (tutorial-bwa) and open the “README.ipynb” file.


20 of 52

Job Component Vocabulary - Expanded

A job consists of:

  • Executable
  • Software environment
  • Shared input files
  • Unique input files (what varies)
  • Arguments (text input; what varies)
  • Output files
  • Standard error and output (text output)

One job × N unique inputs = a workload.

21 of 52

Use Case 1: Analyzing Multiple Files


Single job:

executable = bwa.sh
#arguments =
transfer_input_files = SRR1.R1.fastq, SRR1.R2.fastq, ref.fastq, bwa.sif
transfer_output_remaps = "SRR1.sam=results/SRR1.sam"
error = test.err
output = test.out

queue 1

Many jobs, one per sample:

executable = bwa.sh
arguments = $(sample)
transfer_input_files = $(sample).R1.fastq, $(sample).R2.fastq, ref.fastq, bwa.sif
transfer_output_remaps = "$(sample).sam=results/$(sample).sam"
error = test.$(sample).err
output = test.$(sample).out

queue sample from list.txt

22 of 52

Queue Multiple Jobs

Syntax                          List of Values                            Variable Name
queue N                         Integers: 0 through N-1                   $(ProcID)
queue Var matching pattern*     Values that match the wildcard pattern    $(Var)
queue Var in (item1 item2 …)    Values listed within the parentheses      $(Var)
queue Var from list.txt         Values from list.txt, one per line        $(Var)

If no variable name is provided, the default is $(Item).
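The `queue sample from list.txt` form can be pictured as a loop over the lines of the file: HTCondor creates one job per line and substitutes that line's value wherever $(sample) appears. A minimal Python sketch of that expansion (illustrative only; HTCondor performs this internally):

```python
# Illustrative expansion of "queue sample from list.txt":
# one job per line, with $(sample) replaced by that line's value.

submit_template = "transfer_input_files = {sample}.R1.fastq, {sample}.R2.fastq"

list_txt = ["SRR1", "SRR2", "SRR3"]  # stand-in for list.txt, one value per line

jobs = [submit_template.format(sample=s) for s in list_txt]
for job in jobs:
    print(job)
```

Three lines in list.txt means three jobs, each with its own pair of fastq files.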

23 of 52

In Jupyter

Continue working with the bwa tutorial.


24 of 52

Apply to Your Workflow

  • Processing MRI or other imaging data
  • Molecule/protein docking
  • Simulations that are described by an input file
  • Feature extraction
  • …anything that has many unique input files, each representing a self-contained job producing unique output.


25 of 52

Use Case 2: Random Simulations


26 of 52

Use Case 2: Random Simulations

  • Software: R
  • Executable: R script
  • Arguments: Numeric value
  • Input files: none
  • Output files: none
  • Output text: calculated value


Mei Monte Carlo

Needs to run many random simulations to estimate a statistical value.

27 of 52

Queue Multiple Jobs

Syntax                          List of Values                            Variable Name
queue N                         Integers: 0 through N-1                   $(ProcID)
queue Var matching pattern*     Values that match the wildcard pattern    $(Var)
queue Var in (item1 item2 …)    Values listed within the parentheses      $(Var)
queue Var from list.txt         Values from list.txt, one per line        $(Var)

If no variable name is provided, the default is $(Item).

28 of 52

Use Case 2: Random Simulations

  • Software: R
  • Executable: R script
  • Arguments: Numeric value
  • Input files: none
  • Output files: none
  • Output text: calculated value


universe = container
container_image = R.sif
executable = mcpi.R
arguments = $(Process)
#transfer_input_files =
#transfer_output_remaps =
output = $(Process).out
error = test.err

queue 40
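The tutorial's mcpi.R performs this estimation in R. Here is the same Monte Carlo idea sketched in Python (the function name and sample count are illustrative): draw random points in the unit square and use the fraction landing inside the quarter circle to estimate π.

```python
import random

def estimate_pi(n_samples, seed):
    # Each job would call this with a different seed (e.g. $(Process)),
    # making the runs independent; their estimates can be averaged later.
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4.
    return 4.0 * inside / n_samples

print(estimate_pi(100_000, seed=0))  # an estimate near 3.14
```

Running 40 such jobs (`queue 40`) with seeds 0–39 gives 40 independent estimates whose average is more accurate than any single run.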

29 of 52

In Jupyter

Back in the terminal tab, run:

$ cd ~

$ tutorial ScalingUp-R

Then click on the downloaded folder and open the “README.ipynb” file.


30 of 52

Apply to Your Workflow

  • Statistical estimations
  • Monte Carlo approaches
  • …anything where you have the same code that you want to run with random (or minimally specified) input arguments.


31 of 52

Use Case 3: Different Parameters


32 of 52

Use Case 3: Different Parameters

  • Software: Python + Pytorch
  • Executable: Python script
  • Arguments: Neural net options
  • Input files: MNIST data set
  • Output files: Summary file

  • Needs GPUs


Tamara Trials

Needs to test the performance of a machine learning model with many (hyper)parameters.

33 of 52

Use Case 3: Different Parameters, v1

  • Software: Python + Pytorch
  • Executable: Python script
  • Arguments: Neural net options
  • Input files: MNIST data set
  • Output files: Summary file

  • (Optional) Wants GPUs


universe = container
container_image = pytorch-gpu.sif
executable = main.py
arguments = $(nnopt)
transfer_input_files = mnist.tar.gz,
#transfer_output_remaps =
request_gpus = 1

queue nnopt from list.txt

Example line in list.txt:
--batch-size 4 --epochs 5 --seed 5

34 of 52

Use Case 3: Different Parameters, v2

  • Software: Python + Pytorch
  • Executable: Python script
  • Arguments: Neural net options
  • Input files: MNIST data set
  • Output files: Summary file

  • (Optional) Wants GPUs


universe = container
container_image = pytorch-gpu.sif
executable = main.py
arguments = --batch $(b) --epochs $(e)
transfer_input_files = mnist.tar.gz,
#transfer_output_remaps =
request_gpus = 1

queue b,e from list.txt

Example lines in list.txt:
4, 5
4, 6
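A sketch of how a script like main.py might consume those arguments (a hypothetical parser; the tutorial's real options may differ):

```python
import argparse

def parse_args(argv=None):
    # Hypothetical options matching "arguments = --batch $(b) --epochs $(e)".
    parser = argparse.ArgumentParser(description="Train one model configuration")
    parser.add_argument("--batch", type=int, required=True,
                        help="mini-batch size for this run")
    parser.add_argument("--epochs", type=int, required=True,
                        help="number of training epochs for this run")
    return parser.parse_args(argv)

# HTCondor would supply these values from one line of list.txt.
args = parse_args(["--batch", "4", "--epochs", "5"])
print(args.batch, args.epochs)
```

Each queued job then trains one (batch, epochs) combination, so the whole hyperparameter grid runs in parallel.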

35 of 52

Apply to Your Workflow

  • Parameter search/exploration
  • Sensitivity analysis
  • Generating a “heat map” of results


36 of 52

Building a Workload


37 of 52

Patterns for Scaling Out

  • “What is a job?”
    • Define your unit of work and how many you need to run
    • Identify components (shared and unique/varied) of a single job
  • Generate Inputs
    • Do you need to generate unique input files?
    • How about a list of inputs for your jobs?
  • Plan to summarize
    • What steps, if any, are needed to combine results?


38 of 52

Patterns for Scaling Out

  • Write modular code
    • Write one executable that 1) takes in unique inputs and 2) produces unique outputs.
  • Think about organization
    • How do you want to arrange the components for your jobs?
  • Test, test, test
    • Always test one job, then a small batch before doing a large run.
    • How much space is needed for job components?


39 of 52

Special Considerations

  • Consequences of distributed, heterogeneous resources
    • No shared file system
    • Practical limitations to data size (up to ~10GB / job on the OSDF)
    • Varied operating system, base software installation
    • Opportunistic resources: jobs can be interrupted (the scheduler automatically manages and re-runs interrupted jobs)
      • If you are interested in guaranteed resources, talk to us about the new PATh Facility! https://path-cc.io/facility/ or https://portal.path-cc.io/
  • Available to US-based academic, non-profit, or government researchers and their collaborators


40 of 52

Additional Considerations

  • Data movement
    • For input/output files between 1 and 20 GB, you need a scalable data staging tool
    • Open Science Data Federation
      • Network of data origins and caches to efficiently move data
    • Most OSPool Access Points have an associated data origin.


41 of 52

Additional Considerations

  • Software environment
    • You have to bring your software environment along with each job
    • Containers – we provide several, with directions for building your own and virtual office hours for consultation
    • File-based – bring along binary files or zipped software directories
      • (Conda environments can be used this way)


42 of 52

Additional Considerations (Resource)

              Ideal Jobs!*                Still Very Advantageous!     Less-so, but maybe
Cores (GPUs)  1 (1; non-specific type)    <8 (1; specific GPU type)    >8 or MPI (multiple)
Walltime      <10 hrs**                   <20 hrs**                    >20 hrs
RAM           < a few GB                  < 10s of GB                  > 10s of GB
Input         <500 MB                     <10 GB                       >10 GB
Output        <1 GB                       <10 GB                       >10 GB
Software      'portable'***               most others                  licensed software; non-Linux

*Up to 10,000 cores across jobs, per user!
**Or checkpointable.
***Pre-compiled binaries, transferable, containerizable, etc.

43 of 52

Additional Considerations

  • Multi-Step workflows
    • DAGMan – comes with HTCondor
    • Pegasus - https://pegasus.isi.edu/


44 of 52

Features to support your work

  • Access Point provides a home for scaling out computing
    • Access data from multiple sources
    • Utilize compute capacity from multiple providers
    • Use multiple interfaces (command line, Jupyter)
  • After initial set up, HTCondor is designed for easy multiplication of tasks
  • OSPool has extensive CPU / GPU capacity
  • OSPool Access Points are accessible to US-based researchers and their collaborators


45 of 52

Next Steps


46 of 52

Get a Full Account

  • You can continue to test-drive simple jobs using Jupyter and the guest access we used today.
  • Request a full account by following the sign-up instructions in the OSPool documentation.
  • A full account gives you access to a full Access Point, a persistent (but not backed up!) home directory, all the compute capacity of the OSPool, and a data origin with space for larger data files.


47 of 52

Stay Connected With Facilitation

Our team is here to help!

  • Email: support@osg-htc.org
  • Office Hours:
    • Tues/Thurs, contact us for zoom link (Help Page)
  • Training Opportunities:
    • 1st/3rd Tuesday of the month (Training Page)
  • Guides: https://portal.osg-htc.org/documentation/


Showmic Islam · Andrew Owen · Christina Koch · Rachel Lombardi · Mats Rynge

48 of 52

Get Connected With Community

  • Conference: Throughput Computing 2023
    • July 10-14, in Madison
    • OSG organizational meeting + HTCondor Software Suite user conference
    • Registration and event link: https://agenda.hep.wisc.edu/event/2014/


49 of 52

Questions?


50 of 52

Strategize

  • What are your most pressing computational/data needs?
  • Assuming we have the computing/data capacity you need, what next steps would get your work on the OSPool?
  • What process would get you there?
    • Hackathon with teammates
    • Come to OSPool office hours or attend a training
    • Set a reminder to come back to it


51 of 52

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 2030508. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


52 of 52

Jupyter Access Point

[Diagram: HTCondor connects the Access Point (/home/user) to Execute Points (/condor/scratch).]