Bioinformatics Analyses on the OSPool: A BWA Example
OSG Research Facilitation Team
1
Before We Start
We welcome questions! To ask questions, please raise your hand.
Part of this workshop is hands-on! You are welcome to follow along (we will walk through the steps together) or to simply watch.
2
Introductions
3
Research Computing Facilitators are Here to Help!
Showmic Islam
Rachel Lombardi
Mats Rynge
Christina Koch
4
Agenda
5
Primary Learning Objective
To understand the principles of running a bioinformatics workflow on the OSPool.
Learning Outcomes:
6
Introduction to the OSG and Open Science Pool
7
What is the OSG?
The OSG Consortium builds and operates a set of pools of shared computing and data capacity for distributed high-throughput computing (HTC).
https://display.opensciencegrid.org
8
Open Science Pool
One of these pools, the Open Science Pool (OSPool), is operated for all US-associated open science.
OSPool
Computer by miracle from NounProject.com
9
OSPool Access Point
10
Using the OSPool
11
HTCondor Job Flow
OSPool Access Point
Job Components
HTCondor Submit File
/home/user
12
HTCondor Job Flow
OSPool Access Point
Job Components
HTCondor Submit File
/home/user
$ condor_submit SubmitFile.submit
13
HTCondor Job Flow
OSPool Access Point
Job Components
HTCondor Submit File
OSPool Execute Point
/home/user
Job specifications
HTCondor
14
HTCondor Job Flow
OSPool Access Point
Job Components
HTCondor Submit File
Software
Scripts
Input Data
OSPool Execute Point
/home/user
/condor/scratch
Job specifications
HTCondor
15
HTCondor Job Flow
OSPool Access Point
Job Components
HTCondor Submit File
Software
Scripts
Input Data
Output Data
Log/Error/Out
OSPool Execute Point
/home/user
/condor/scratch
Output transferred
back
HTCondor
16
HTC on the Open Science Pool
The OSPool is a good fit for HTC workloads that can be distributed and open:
17
What workloads are good for the OSPool*?
| | Ideal Jobs! (1,000s of concurrent jobs) | Still Very Advantageous! (100s of concurrent jobs) | Less-so, but maybe |
| Cores (GPUs) | 1 (1; non-specific type) | <8 (1; specific GPU type) | >8 (or MPI) (multiple) |
| Walltime | <10 hrs (or checkpointable) | <20 hrs (or checkpointable) | >20 hrs |
| RAM | <few GB | <10s GB | >10s GB |
| Input | <500 MB | <10 GB | >10 GB |
| Output | <1 GB | <10 GB | >10 GB |
| Software | ‘portable’ (pre-compiled binaries, transferable, containerizable, etc.) | most other than → | Licensed software; non-Linux |
* the “less-so, but maybe” column could still be an HTC workload, but one that would run more effectively on a local, dedicated HTC system instead of the OSG
18
HTC-Friendly Research Problems*
RNA/DNA sequence alignment | statistical model optimization | parameter sweep | multiple image/sample analysis |
*not exhaustive!
DNA by Arafat Uddin from the Noun Project
Image by Shastry from the Noun Project
grid by Nawicon Studio from the Noun Project
Line Graph by Gonzalo Bravo from the Noun Project
19
High-Throughput BWA Read Mapping
https://datacarpentry.org/
20
Background
21
Details about how the data was modified can be found at https://datacarpentry.org/
Sample BWA Workflow
Quality Control
Align sequenced reads to a reference
Alignment cleanup
Variant Calling
Variant Annotation and Interpretation
Example Next Generation Sequencing Analysis Workflow
BWA (Burrows-Wheeler Aligner)
A software package that maps sequencing reads to a reference genome
23
Sample BWA Workflow
Output
24
PE 1 (Forward Read)
SRR263_1.fastq
PE 2 (Reverse Read)
SRR263_2.fastq
Reference File
ecoli_rel606.fasta.gz
bwa executable
+
+
Input
Sample BWA Workflow
Output
25
PE 1 (Forward Read)
SRR263_1.fastq
PE 2 (Reverse Read)
SRR263_2.fastq
Reference File
ecoli_rel606.fasta.gz
+
+
Input
Executable: bwa-analysis.sh
bwa executable
Sample BWA Workflow
Output
26
PE 1 (Forward Read)
SRR263_1.fastq
PE 2 (Reverse Read)
SRR263_2.fastq
Reference File
ecoli_rel606.fasta.gz
“Sequence Alignment Map” (SAM) Output File
SRR263.aligned.sam
+
+
Input
Executable: bwa-analysis.sh
bwa executable
Let’s Get Started!
1. Download the BWA tutorial materials
$ cd
$ pwd
$ tutorial bwa-materials
2. Navigate to the tutorial-bwa-materials folder
$ cd tutorial-bwa-materials
3. Explore our work environment
$ ls
27
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
Our Workspace
Installing Software
Software/packages/programs only need to be installed once!
Many bioinformatics tools are available as ready-to-use Singularity or Docker containers or with Anaconda/conda environments!
28
Preparing Software to use in Jobs
29
Installing BWA
To learn how to install BWA, let’s go to one of BWA’s manual pages:
From https://github.com/lh3/bwa
30
Steps to install BWA
Image taken from: https://github.com/lh3/bwa
31
Installing BWA
Overview of Installing BWA
BWA Installation:
$ cd ~/tutorial-bwa-materials/software
$ git clone https://github.com/lh3/bwa.git
$ cd bwa
$ make
$ export PATH=$PATH:$PWD
Choose a location to install bwa
32
BWA Installation:
$ cd ~/tutorial-bwa-materials/software
$ git clone https://github.com/lh3/bwa.git
$ cd bwa
$ make
$ export PATH=$PATH:$PWD
Install BWA
Steps taken from BWA manual
Choose a location to install bwa
33
Overview of Installing BWA
BWA Installation:
$ cd ~/tutorial-bwa-materials/software
$ git clone https://github.com/lh3/bwa.git
$ cd bwa
$ make
$ export PATH=$PATH:$PWD
Install BWA
Steps taken from BWA manual
Tell the system where to find our software
Choose a location to install bwa
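The `export PATH` step only affects the current shell session, so it can help to confirm that the shell really finds the program afterwards. The sketch below illustrates the check with a hypothetical stand-in program (`mytool_demo`, created in a temporary directory) rather than a real bwa build; with the actual installation the equivalent check would be `command -v bwa` after the export.

```shell
#!/bin/bash
# Illustration only: "mytool_demo" is a hypothetical stand-in for a freshly built program.
tooldir=$(mktemp -d)
printf '#!/bin/bash\necho "mytool_demo ran"\n' > "$tooldir/mytool_demo"
chmod +x "$tooldir/mytool_demo"

# Same pattern as on the slide: append the install directory to PATH.
export PATH="$PATH:$tooldir"

# command -v prints the full path when the shell can find the program.
found=$(command -v mytool_demo)
echo "Found: $found"

# Run it by name only, which works because its directory is now on PATH.
result=$(mytool_demo)
echo "$result"
```

If `command -v` prints nothing, the directory was not added to PATH in this session and the export needs to be repeated.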
34
Overview of Installing BWA
Preparing Software to be Sent in a Job
Once we have tested our BWA installation, we create a compressed tarball of the software so that it is smaller and quicker to transfer to jobs on the OSPool.
Tarball (.tar.gz)
35
To do this, navigate to the directory with the bwa executable:
$ cd ~/tutorial-bwa-materials/software/bwa
$ tar -czvf bwa.tar.gz bwa
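Because the tarball is created from inside the bwa folder, it holds the executable itself, so it unpacks directly into whatever directory the job runs in. A quick way to confirm what an archive contains before sending it with jobs is to list it and unpack it somewhere disposable. The sketch below does this with a dummy file named `bwa` in a temporary directory; the dummy file and the `unpack_test` folder are assumptions made for illustration.

```shell
#!/bin/bash
# Illustration only: a dummy file named "bwa" stands in for the compiled executable.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '#!/bin/bash\necho "stand-in for bwa"\n' > bwa
chmod +x bwa

# Create the tarball with the same options as on the slide (minus -v).
tar -czf bwa.tar.gz bwa

# List the archive contents without extracting anything.
contents=$(tar -tzf bwa.tar.gz)
echo "Archive contains: $contents"

# Unpack into a fresh folder to mimic what the job will do.
mkdir unpack_test
tar -xzf bwa.tar.gz -C unpack_test
ls -l unpack_test
```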
Analyze a Single Biological Sample with BWA (Submit a single HTCondor job)
36
37
#!/bin/bash
echo "Unpacking software"
tar -xzf bwa.tar.gz
echo "Setting PATH for bwa"
export PATH=$_CONDOR_SCRATCH_DIR:$PATH
Executable = bwa-analysis.sh
To analyze one sample (SRR263)
38
#!/bin/bash
echo "Unpacking software"
tar -xzf bwa.tar.gz
echo "Setting PATH for bwa"
export PATH=$_CONDOR_SCRATCH_DIR:$PATH
echo "Indexing E. coli genome"
bwa index ecoli_rel606.fasta.gz
echo "Starting bwa alignment"
bwa mem ecoli_rel606.fasta.gz SRR263_1.fastq SRR263_2.fastq > SRR263.aligned.sam
Executable = bwa-analysis.sh
To analyze one sample (SRR263)
39
#!/bin/bash
echo "Unpacking software"
tar -xzf bwa.tar.gz
echo "Setting PATH for bwa"
export PATH=$_CONDOR_SCRATCH_DIR:$PATH
echo "Indexing E. coli genome"
bwa index ecoli_rel606.fasta.gz
echo "Starting bwa alignment"
bwa mem ecoli_rel606.fasta.gz SRR263_1.fastq SRR263_2.fastq > SRR263.aligned.sam
echo "Cleaning up files generated from genome indexing"
rm ecoli_rel606.fasta.gz.amb
rm ecoli_rel606.fasta.gz.ann
rm ecoli_rel606.fasta.gz.bwt
rm ecoli_rel606.fasta.gz.pac
rm ecoli_rel606.fasta.gz.sa
Executable = bwa-analysis.sh
To analyze one sample (SRR263)
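The five `rm` commands delete the index files that `bwa index` writes next to the reference, presumably so that they are not carried back with the job's results. The same cleanup can be written as a loop over the file suffixes. The sketch below shows that pattern on empty placeholder files in a temporary directory, so it runs without bwa; it is one possible style, not a required change to the script.

```shell
#!/bin/bash
# Illustration only: empty placeholder files stand in for real bwa index output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
ref=ecoli_rel606.fasta.gz
touch "$ref"
for ext in amb ann bwt pac sa; do
    touch "$ref.$ext"
done

# Remove each index file; -f avoids an error if one is already missing.
for ext in amb ann bwt pac sa; do
    rm -f "$ref.$ext"
done

# Only the reference itself should remain.
remaining=$(ls)
echo "Files left: $remaining"
```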
Prepare Submit File
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
40
Executable = bwa-analysis.sh
Prepare Submit File
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
Reminder:
Need to transfer bwa.tar.gz file,
the reference genome, and the .fastq files
41
Prepare Submit File
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.error
42
Prepare Submit File
Queue One Job
Queue one job to analyze one sample with BWA
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.error
queue 1
43
Queue One Job
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.error
queue 1
44
We are ready to submit, but before we do, let’s think about our BWA output files!
Scaling Up & Keeping Organized
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
bwa_test.log
…
software/
bwa/
bwa.tar.gz
SRR263.aligned.sam
Current Workspace
45
Scaling Up & Keeping Organized
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
bwa_test.log
…
software/
bwa/
bwa.tar.gz
SRR263.aligned.sam
Current Workspace
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
...
ref_genome/
ecoli_rel606.fasta.gz
log/
...
software/
bwa/
bwa.tar.gz
results/
SRR263.aligned.sam
...
Desired Workspace
46
Use HTCondor’s Submit File to Organize Files
| Syntax | Purpose | Features |
| transfer_output_remaps = "file1.out=path/to/file1.out; file2.out=path/to/renamedFile2.out" | Saves output files to a specific path under a chosen name | Saves output files to a specific folder; renames output files to avoid overwriting existing files |
You must create the destination folder before submitting the job.
47
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
results/
SRR263.aligned.sam
…
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
transfer_output_remaps = "SRR263.aligned.sam=results/SRR263.aligned.sam"
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.error
queue 1
48
Use transfer_output_remaps
Create results/ directory before submitting job
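Since the destination folders have to exist before submission, a small preparation step can create and check them. The sketch below is a minimal version that uses a temporary directory as a stand-in for the project folder; in practice the `mkdir -p` line would be run inside tutorial-bwa-materials.

```shell
#!/bin/bash
# Illustration only: a temporary directory stands in for tutorial-bwa-materials.
projdir=$(mktemp -d)
cd "$projdir" || exit 1

# -p creates the folders if needed and stays silent if they already exist.
mkdir -p results log

# Confirm both destinations exist before running condor_submit.
for d in results log; do
    if [ -d "$d" ]; then
        echo "ok: $d/"
    else
        echo "missing: $d/"
    fi
done
```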
Let’s Analyze One Biological Sample!
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
transfer_output_remaps = "SRR263.aligned.sam=results/SRR263.aligned.sam"
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.err
queue 1
Prepare submit file for analyzing one sample with BWA.
49
Analyze Many Biological Samples Using a Single Submit File
50
Queue Multiple Jobs
| Syntax | List of Values | Variable Name |
| queue N | Integers: 0 through N-1 | $(ProcID) |
| queue Var matching pattern* | List of values that match the wildcard pattern | $(Var); if no variable name is provided, the default is $(Item) |
| queue Var in (item1 item2 …) | List of values within the parentheses | $(Var); if no variable name is provided, the default is $(Item) |
| queue Var from list.txt | List of values from list.txt, where each value is on its own line | $(Var); if no variable name is provided, the default is $(Item) |
51
First, Create the List of Inputs
SRR263
SRR266
SRR244
…
Make a file called samples.txt containing the names of the samples we want to analyze:
$ pwd
../tutorial-bwa-materials/data/fastq/
$ ls *.fastq | cut -f 1 -d '_' | uniq > samples.txt
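To see what this pipeline produces, the sketch below runs it on empty placeholder files named after the samples shown on the slide, in a temporary directory. Two assumptions are worth noting: `uniq` only collapses adjacent repeats, which works here because `ls` returns the names sorted, and `cut` keeps only the text before the first underscore, so sample names that themselves contain an underscore would be truncated.

```shell
#!/bin/bash
# Illustration only: empty files with the slide's sample names stand in for real reads.
fastqdir=$(mktemp -d)
cd "$fastqdir" || exit 1
touch SRR263_1.fastq SRR263_2.fastq SRR266_1.fastq SRR266_2.fastq SRR244_1.fastq SRR244_2.fastq

# Same pipeline as on the slide: keep the text before the first "_" and collapse repeats.
ls *.fastq | cut -f 1 -d '_' | uniq > samples.txt

# Show the list and count the samples it contains.
cat samples.txt
n_samples=$(wc -l < samples.txt | tr -d ' ')
echo "Number of samples: $n_samples"
```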
52
Submit File to Queue One Job
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
results/
SRR263.aligned.sam
…
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
transfer_output_remaps = "SRR263.aligned.sam=results/SRR263.aligned.sam"
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.error
queue 1
53
Edit the Queue Statement to use Variables
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
results/
SRR263.aligned.sam
…
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
# arguments =
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/SRR263_1.fastq, data/fastq/SRR263_2.fastq
transfer_output_remaps = "SRR263.aligned.sam=results/SRR263.aligned.sam"
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.error
queue sample from data/fastq/samples.txt
54
Replace Changing Values with Variables
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
results/
SRR263.aligned.sam
…
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
arguments = $(sample)
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/$(sample)_1.fastq, data/fastq/$(sample)_2.fastq
transfer_output_remaps = "$(sample).aligned.sam=results/$(sample).aligned.sam"
log = log/bwa_test.log
output = log/bwa_test.out
error = log/bwa_test.error
queue sample from data/fastq/samples.txt
55
Use Variables with log/error/out Files
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq
SRR263_2.fastq
…
ref_genome/
ecoli_rel606.fasta.gz
log/
software/
bwa/
bwa.tar.gz
results/
SRR263.aligned.sam
…
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
arguments = $(sample)
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/$(sample)_1.fastq, data/fastq/$(sample)_2.fastq
transfer_output_remaps = "$(sample).aligned.sam=results/$(sample).aligned.sam"
log = log/bwa_$(sample).log
output = log/bwa_$(sample).out
error = log/bwa_$(sample).err
queue sample from data/fastq/samples.txt
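To get a feel for what the `$(sample)` substitutions produce, the expansion can be imitated in plain shell. The sketch below does not invoke HTCondor at all; it reads a stand-in list (a temporary file holding the three sample names from the slides) and prints the file names each job would use.

```shell
#!/bin/bash
# Illustration only: imitates in plain shell how $(sample) is filled in for each job.
# HTCondor is not invoked; the temporary list stands in for data/fastq/samples.txt.
listfile=$(mktemp)
printf 'SRR263\nSRR266\nSRR244\n' > "$listfile"

njobs=0
while read -r sample; do
    # Skip blank lines so they do not count as jobs in this sketch.
    [ -z "$sample" ] && continue
    echo "job $njobs: inputs data/fastq/${sample}_1.fastq, data/fastq/${sample}_2.fastq"
    echo "        output results/${sample}.aligned.sam, log log/bwa_${sample}.log"
    njobs=$((njobs + 1))
done < "$listfile"
echo "Total jobs: $njobs"
```

Each line of the list becomes one job, which is why every per-job file name needs to include the variable.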
56
Edit the Executable to use Variables
We need to edit the executable to use a variable so that we can pass a different sample name to it as an argument for each job.
57
Currently, running ./bwa-analysis.sh analyzes only one sample (SRR263).
We will edit our executable so that we can run:
./bwa-analysis.sh SRR263
./bwa-analysis.sh SRR266
./bwa-analysis.sh SRR244
Sample Name/ID
58
#!/bin/bash
echo "Unpacking software"
tar -xzf bwa.tar.gz
echo "Setting PATH for bwa"
export PATH=$_CONDOR_SCRATCH_DIR:$PATH
echo "Indexing E. coli genome"
bwa index ecoli_rel606.fasta.gz
echo "Starting bwa alignment"
bwa mem ecoli_rel606.fasta.gz SRR263_1.fastq SRR263_2.fastq > SRR263.aligned.sam
echo "Cleaning up files generated from genome indexing"
rm ecoli_rel606.fasta.gz.amb
rm ecoli_rel606.fasta.gz.ann
rm ecoli_rel606.fasta.gz.bwt
rm ecoli_rel606.fasta.gz.pac
rm ecoli_rel606.fasta.gz.sa
Executable = bwa-analysis.sh
59
#!/bin/bash
echo "Unpacking software"
tar -xzf bwa.tar.gz
echo "Setting PATH for bwa"
export PATH=$_CONDOR_SCRATCH_DIR:$PATH
echo "Indexing E. coli genome"
bwa index ecoli_rel606.fasta.gz
echo "Define variable"
sample=$1
echo "Starting bwa alignment"
bwa mem ecoli_rel606.fasta.gz ${sample}_1.fastq ${sample}_2.fastq > ${sample}.aligned.sam
echo "Cleaning up files generated from genome indexing"
rm ecoli_rel606.fasta.gz.amb
rm ecoli_rel606.fasta.gz.ann
rm ecoli_rel606.fasta.gz.bwt
rm ecoli_rel606.fasta.gz.pac
rm ecoli_rel606.fasta.gz.sa
Executable = bwa-analysis.sh
60
Let’s make these changes now
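Once the script depends on `$1`, running it without an argument would silently build file names such as `_1.fastq`. One optional safeguard is to check the argument before using it. The sketch below shows the idea with a hypothetical helper function, `get_sample`, which is not part of the tutorial script; in bwa-analysis.sh an equivalent check would sit where `sample=$1` is set.

```shell
#!/bin/bash
# Illustration only: get_sample is a hypothetical helper, not part of the tutorial script.
get_sample() {
    if [ -z "$1" ]; then
        echo "Usage: bwa-analysis.sh <sample-id>" >&2
        return 1
    fi
    echo "$1"
}

# With an argument, the helper echoes the sample ID.
sample=$(get_sample SRR263)
echo "Sample: $sample"
echo "Would read ${sample}_1.fastq and ${sample}_2.fastq"

# Without an argument, the helper fails instead of producing empty file names.
if get_sample 2>/dev/null; then
    missing_status=0
else
    missing_status=1
fi
echo "Status with no argument: $missing_status"
```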
Prepare Submit File to Analyze Many Sequencing Files
Prepare submit file for a full workload submission to analyze many .fastq files
# submit file name: bwa-analysis.submit
executable = bwa-analysis.sh
arguments = $(sample)
transfer_input_files = software/bwa/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/fastq/$(sample)_1.fastq, data/fastq/$(sample)_2.fastq
transfer_output_remaps = "$(sample).aligned.sam=results/$(sample).aligned.sam"
log = log/bwa_$(sample).log
output = log/bwa_$(sample).out
error = log/bwa_$(sample).err
queue sample from data/fastq/samples.txt
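Before submitting a full workload, it can be worth confirming that every sample in the list has both of its read files, since a job whose input is missing cannot run. The sketch below is one way to do such a pre-flight check. It builds a mock project in a temporary directory and deliberately leaves out one file to show what a failed check looks like; the mock layout and the variable names are assumptions made for illustration.

```shell
#!/bin/bash
# Illustration only: a temporary directory stands in for the project folder, and
# SRR244_2.fastq is deliberately left out to show what a failed check looks like.
projdir=$(mktemp -d)
cd "$projdir" || exit 1
mkdir -p data/fastq
touch data/fastq/SRR263_1.fastq data/fastq/SRR263_2.fastq
touch data/fastq/SRR266_1.fastq data/fastq/SRR266_2.fastq
touch data/fastq/SRR244_1.fastq
printf 'SRR263\nSRR266\nSRR244\n' > data/fastq/samples.txt

# For every sample in the list, check that both paired-end files exist.
missing=0
while read -r sample; do
    for read_end in 1 2; do
        f="data/fastq/${sample}_${read_end}.fastq"
        if [ ! -f "$f" ]; then
            echo "Missing input: $f"
            missing=$((missing + 1))
        fi
    done
done < data/fastq/samples.txt
echo "Missing files: $missing"
```

A count of zero suggests the list and the data folder agree, and the workload is ready for condor_submit.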
61
Our New Project Directory
bwa-analysis.submit
bwa-analysis.sh
data/
fastq/
SRR263_1.fastq SRR266_1.fastq SRR244_1.fastq
SRR263_2.fastq SRR266_2.fastq SRR244_2.fastq
ref_genome/
ecoli_rel606.fasta.gz
software/
bwa/
bwa.tar.gz
results/
SRR263.aligned.sam SRR266.aligned.sam SRR244.aligned.sam
log/
bwa_SRR263.log bwa_SRR266.log bwa_SRR244.log
bwa_SRR263.err bwa_SRR266.err bwa_SRR244.err
bwa_SRR263.out bwa_SRR266.out bwa_SRR244.out
Organized Workflow
62
Key Takeaways
We have learned how to:
✓ Install software to use in jobs
✓ Convert an existing bioinformatics workflow to run on the OSPool
✓ Keep an organized workflow using HTCondor submit file options
63
OSG Documentation Website
OSG User Documentation: https://portal.osg-htc.org
Information about:
64
We also have information on getting started with other bioinformatics tools! (BLAST, SAMtools)
65
Acknowledgements
This material is based upon work supported by the National Science Foundation under Cooperative Agreement OAC-2030508 as part of the PATh Project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
66
Questions?
67