1 of 58

Research with Computational Biology (ReComBio)

General Instructions

2 of 58

2024 Bioinformatics Bridge Course

3 of 58

Research in Computational Biology

  • Introduction

  • Research in Computational Biology (ReComBio) PyaR tutorials were conceived as a systematic approach, under the mission of the Science Internship Program (SIP) and the Genomics Institute at UCSC, to create equity in STEAM (Science, Technology, Engineering, Arts and Mathematics)

  • Initially, bioinformatics projects mentored by Gepoliano Chaves were run on the UCSC main campus in Santa Cruz (CA) as part of the Science Internship Program (SIP) during the summer

  • SIP projects included a wet lab component where students were exposed to experiments such as DNA and protein extraction, PCR, RT-PCR, western blot, DNA sequencing and/or nanopipette technology demonstration

4 of 58

Research in Computational Biology

  • Due to the COVID-19 pandemic, SIP was first run remotely in 2020

  • Over time, it became necessary to provide further support to interns who want to use bioinformatics tools in their research projects

  • Support is needed for software installation (R, RStudio and Bash) on Windows, macOS and (potentially) other operating systems

5 of 58

SIP 2020, 2021 and 2022 Projects

  • Links to descriptions of the projects identifying SARS-CoV-2 variants can be found in the BME Department at UCSC section of the SIP website.

  • The link below is a good starting point for exploring the research behind these projects:

  • Variant detection projects, including Huntington’s Disease projects, can be found on the SIP web page indicated above

6 of 58

Computational tasks: Algorithms

  • The next section covers tasks that can be accomplished using the computational concept of an Algorithm

7 of 58

The concept of a Computational Algorithm

8 of 58

Pseudocode and code steps

  1. Describe in human language what the question is
    1. How does a gene set affect the event-free survival (EFS) and overall survival (OS) rates?
  2. Start writing the steps in R, mixed with human language (pseudocode)
  3. Write your command or script
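
The same three steps apply to Bash as well as R. The sketch below walks a simpler, hypothetical question (how many sequences does a FASTA file contain?) from human language to pseudocode to a command. The function name count_sequences and the demonstration file are assumptions made for illustration only.

```shell
# Step 1 (human language): how many sequences does a FASTA file contain?
# Step 2 (pseudocode): for each line in the file,
#                      if the line starts with ">", add 1 to the count
# Step 3 (command): grep -c counts the lines that match a pattern

count_sequences() {
  # $1: path to a FASTA file
  grep -c '^>' "$1"
}

# Tiny demonstration file with two made-up records
demo_fasta=$(mktemp)
printf '>sample_A\nACGT\n>sample_B\nACGA\n' > "$demo_fasta"

count_sequences "$demo_fasta"   # prints 2
```

Writing the pseudocode as comments first keeps the reasoning next to the command, which is the habit the notebooks ask for.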

9 of 58

Clustering and Classification: the concept of Algorithm

  • How many paws does the animal have? (2 or 4)

  • Does the animal have feathers or wool? (1 or 2)
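
The two questions above already form a small classification algorithm: answer them in a fixed order and the combination of answers decides the class. A minimal sketch in Bash follows. The function name classify_animal and the class labels ("bird", "sheep") are illustrative assumptions, not part of the slides, and the second answer is passed as a word rather than the 1/2 code.

```shell
# Classify an animal from the answers to two questions.
#   $1: number of paws (2 or 4)
#   $2: body cover ("feather" or "wool")
classify_animal() {
  paws=$1
  cover=$2
  if [ "$paws" -eq 2 ] && [ "$cover" = "feather" ]; then
    echo "bird"
  elif [ "$paws" -eq 4 ] && [ "$cover" = "wool" ]; then
    echo "sheep"
  else
    # Any combination the rules do not cover
    echo "unclassified"
  fi
}

classify_animal 2 feather
classify_animal 4 wool
```

Clustering works in the opposite direction: instead of applying rules written in advance, it groups samples whose answers (features) are similar.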

10 of 58

SARS-CoV-2 Clustering based on Mutation Frequency (Incidence - Epidemiology)

  • A Bash (Linux) pipeline calculates mutation frequency in VCF files

  • This frequency is used to cluster SARS-CoV-2 sequences from different regions of the globe

11 of 58

Algorithm or Pipeline

The algorithm (also called a pipeline) needs to objectively explain how we go about answering our question or solving a problem

    • Align to reference sequence (FASTA)

    • Compare alignment to reference (SAM)

    • Annotate differences (mutations) (VCF)

    • Extract mutations from VCF (Frequency Table)
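
The first three steps above can be sketched as a Bash function. This is a minimal sketch under stated assumptions, not the course pipeline itself: it assumes bwa (a version that supports the -o option of bwa mem), samtools and bcftools are installed, the file names are placeholders, --ploidy 1 is chosen because the viral genome is haploid, and the run helper with its DRY_RUN switch is a hypothetical convenience for previewing the commands.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# variant_call REF SAMPLE OUTDIR
#   REF     reference genome (FASTA)
#   SAMPLE  sequences to compare with the reference (FASTA or FASTQ)
#   OUTDIR  directory for intermediate and final files
variant_call() {
  ref=$1; sample=$2; out=$3
  run mkdir -p "$out"
  # Index the reference so that bwa can align against it
  run bwa index "$ref"
  # Align the sample to the reference (SAM)
  run bwa mem -o "$out/aligned.sam" "$ref" "$sample"
  # Sort the alignment and store it in binary form (BAM)
  run samtools sort -o "$out/aligned.sorted.bam" "$out/aligned.sam"
  # Summarize the bases observed at each reference position (BCF)
  run bcftools mpileup -O b -o "$out/raw.bcf" -f "$ref" "$out/aligned.sorted.bam"
  # Keep only the positions that differ from the reference (VCF)
  run bcftools call --ploidy 1 -m -v -o "$out/variants.vcf" "$out/raw.bcf"
}

# Preview the commands without running them
DRY_RUN=1
variant_call reference.fasta region_samples.fasta results
```

Setting DRY_RUN=0 runs the same commands for real. Previewing first makes it easier to reason about the command order before any files are written.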

12 of 58

Bioinformatics Software Development

  1. Software development first considers the analytical steps in human language
    • What are the exact steps necessary to execute the analysis?

  2. Then, the software product considers the steps the machine will execute
    • How are the files produced, and what are the processing steps?
    • Where in the computational infrastructure are the files stored?
    • How can these files be accessed?
    • What information is contained in the files?

13 of 58

Bioinformatics Software Development

  1. The present program is about how a scientific question is answered, not what the final answer is

  2. If we do not address how the question is answered, we lose the information embedded in the process of data analysis

  3. This is an important notion to keep in mind when developing computational tools that answer a scientific question

14 of 58

Project Assignments

By the end of next week I would like you to have decided which of these groups you want to join:

  • Group 1: Huntington’s Disease Visualization

  • Group 2: Pipeline development (Variant Call)

  • Group 3: Database Construction (SARS-CoV-2 and Neuroblastoma)

  • Group 4: Dashboard Visualization (SARS-CoV-2 and Neuroblastoma)

    • SARS-CoV-2 Data visualization
    • Neuroblastoma Data Visualization

15 of 58

Sub-Group 1: Pipeline development

  • Download data from GISAID

    • Define regions of interest and samples to be processed (bat and pangolin samples)

    • Think critically about the pipeline design, the commands and the command order

    • Work in the Bash environment

16 of 58

Sub-Group 2: Database Construction

  • Organize the information from the downloaded FASTQ and/or FASTA files

    • Extract mutation information from the VCF files

    • Design strategies to handle the number of files produced by the Sub-Group 1 pipeline

    • Construct a Frequency Table for Classification of SARS-CoV-2 samples by geographic regions
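
One possible way to build such a frequency table is sketched below, assuming one VCF file per sample. The function counts, for every mutation (position, reference base, alternate base), how many files carry it and divides by the number of files. The function name mutation_frequency, the output layout and the two toy VCF files are assumptions made for illustration; they are not course data.

```shell
# mutation_frequency VCF...
# Prints tab-separated POS, REF, ALT, COUNT, FREQUENCY, where FREQUENCY
# is the fraction of the given VCF files that carry the mutation.
mutation_frequency() {
  [ "$#" -gt 0 ] || { echo "usage: mutation_frequency VCF..." >&2; return 1; }
  awk -F'\t' -v n="$#" '
    /^#/ { next }                        # skip VCF header lines
    {
      key = $2 "\t" $4 "\t" $5           # POS, REF and ALT columns
      if (!seen[FILENAME, key]++)        # count each file only once
        count[key]++
    }
    END {
      for (key in count)
        printf "%s\t%d\t%.2f\n", key, count[key], count[key] / n
    }
  ' "$@" | sort -n
}

# Two toy VCF files (illustrative records only)
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nref\t241\t.\tC\tT\nref\t23403\t.\tA\tG\n' > "$tmp/sample1.vcf"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nref\t23403\t.\tA\tG\n' > "$tmp/sample2.vcf"

mutation_frequency "$tmp/sample1.vcf" "$tmp/sample2.vcf"
```

Running the function once per geographic region gives one frequency column per region, which is the kind of table a heatmap or hierarchical clustering step can take as input.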

17 of 58

BME02 Sub-groups notebooks

  • Every Sub-Group should have a code notebook

  • By the 4th or 5th week I expect to see your code and notebooks with Descriptors of what the code does
    • The Descriptor should express the task assigned to each Sub-Group

  • I will help you document your code on GitHub/GitLab

18 of 58

GISAID Sample Download

  • Proceed to Slide 45

  • Part 3: Variant Call Pipeline using GISAID data

  • Start addressing GISAID sample data download

19 of 58

Study Plan for 09/29/2021

Introduction to the Command Line for Genomics with AMI

  1. Introducing the Shell
  2. Navigating Files and Directories
  3. Working with Files and Directories
  4. Redirection
  5. Writing Scripts and Working with Data
  6. Project Organization
  7. Finish

Data Wrangling and Processing for Genomics

  1. Background and Metadata
  2. Assessing Read Quality
  3. Trimming and Filtering
  4. Variant Calling Workflow
  5. Automating a Variant Calling Workflow
  6. Finish

20 of 58

Study Plan for 09/30/2021

Finish Data Wrangling and Processing for Genomics;

Start the R Markdown Part (Cover Software Installation): R Markdown Notebook 1

  • R Markdown Basics
    • Insert and run chunks (R and Bash)
    • knit to PDF and HTML
    • Write code notes to keep it reproducible
    • Introduction to basic programming in R
      • R basic commands
      • R basic operations
      • Dataframes and file upload
      • Dataset processing
      • Package Installation

R and Bash in R Markdown: The basics (Cover Biological Association - Notebook 1)

  • Package Installation
    • ggplot2
    • t-test
    • pheatmap
    • qqman
  • R basic functions
    • dim()
    • head()
    • View()
    • class()
    • str()
  • Manhattan plot of a dataframe
    • Biological Association (Huntington’s Disease paper)

21 of 58

Study Plan for 10/01/2021

R and Bash in R Markdown: SARS-CoV-2 Variant Call Pipeline - Notebook 2

  • Reading: Chaves et al. (2019); SNP variants in Huntington’s Disease
  • Bash installation on MacBook and Windows
  • Bash basic commands
  • Pipeline: Identification of SARS-CoV-2 Variants
    • Software used in the pipeline:
      • Anaconda installation?
      • BWA
      • Samtools
      • BCFtools
    • Files generated by pipeline:
      • FASTA file
      • FASTQ file
      • SAM file
      • BAM file
      • BCF file
      • VCF file
  • Visualization of SAM file in Genome Browser
  • Visualization of VCF file in Genome Browser

Notebook 3 - Hierarchical Clustering of SARS-CoV-2 Mutations

  • Reading: Chaves et al. (2017) - Heatmap Visualization of Differential Gene Expression Analysis (DGE)
  • Exploratory Data Analysis of SARS-CoV-2 Mutations identified by students
  • The data frame contains the information for plotting the heatmap
    • See the Excel spreadsheet
    • Include region of interest in the spreadsheet
    • Heatmap and Hierarchical Clustering:
      • Construction of VOC Frequency Table
  • https://github.com/gepolianochaves/Espanol/tree/main/Agrupacion_Jerarquica

22 of 58

Study Plan for Week 1

Finish Data Wrangling and Processing for Genomics;

Start the R Markdown Part (Cover Software Installation): R Markdown Notebook 1

  • R Markdown Basics
    • Insert and run chunks (R and Bash)
    • knit to PDF and HTML
    • Write code notes to keep it reproducible
    • Introduction to basic programming in R
      • R basic commands
      • R basic operations
      • Dataframes and file upload
      • Dataset processing
      • Package Installation
        • ggplot2
        • t-test
        • pheatmap
        • qqman
      • R basic functions
        • dim()
        • head()
        • View()
        • class()
        • str()
      • Manhattan plot of a dataframe

23 of 58

Study Plan for Week 2

R and Bash in R Markdown: SARS-CoV-2 Variant Call Pipeline - Notebook 2

  • Reading: Chaves et al. (2019); SNP variants in Huntington’s Disease
  • Bash installation on MacBook and Windows
  • Bash basic commands
  • Pipeline: Identification of SARS-CoV-2 Variants
    • Software used in the pipeline:
      • Anaconda installation?
      • BWA
      • Samtools
      • BCFtools
    • Files generated by pipeline:
      • FASTA file
      • FASTQ file
      • SAM file
      • BAM file
      • BCF file
      • VCF file
  • Visualization of SAM file in Genome Browser
  • Visualization of VCF file in Genome Browser

24 of 58

Study Plan for Week 3

  • Introduction to Python and Machine Learning prediction of neuroblastoma

25 of 58

Example of Hierarchical Clustering Visualization

Cluster 1: China, Bat, Pangolin

Cluster 2: Germany-Argentina

Cluster 3: Australia

26 of 58

Study Plan for 10/01/2021

Notebook 4 - Variant Frequency Visualization

  • Notebook to be uploaded to GitHub

27 of 58

Operating System (OS)

  • I am assuming people use either an Apple MacBook or a Windows machine

  • Both Windows and MacBook support RStudio usage

  • The advantage of a MacBook is the built-in Terminal interface which can be easily accessed and used by the student/researcher

  • If your machine is different from the machines described above, please talk to me so we can find a solution

28 of 58

Operating System (OS)

  • Windows users need to install PuTTY, Git or WSL
    • PuTTY seems to be the easiest solution for Windows users
    • I used PuTTY when I was a Windows user

  • For PuTTY installation, follow the recommendations in the Amazon Machine Image (AMI) link (the same link as in the syllabus) in the Data Carpentry Genomics Workshop
    • https://datacarpentry.org/genomics-workshop/AMI-setup/

  • For Ubuntu (WSL) installation, follow the recommendations in the link below.
  • It should guide you to install Bash for the R Markdown notebooks:

https://ubuntu.com/tutorials/ubuntu-on-windows#1-overview

  • More instructions on WSL and Ubuntu installation can be found in the Slack channel and in the next slide

29 of 58

Sub-Group 3: Dashboard Visualization

  • Construct visualization tools

    • Visualization through a Dashboard or a web application

    • Use Shiny (R) or Python to visualize information provided by Sub-Groups 1 and 2

    • Understand the processing steps taken by Sub-Groups 1 and 2

    • Make the application available on the internet

    • Make the application accept user input

30 of 58

Installing Ubuntu (Windows users)

  • Windows users need to install the Windows Subsystem for Linux (WSL) to have access to the Linux/Bash Terminal

  • Follow the instructions in the following link for Ubuntu installation
    • https://docs.microsoft.com/en-us/windows/wsl/install-win10

  • After installing Ubuntu, you can access the AMI by using ssh from the Ubuntu Linux Terminal

31 of 58

WSL/Ubuntu installation (Windows)

32 of 58

Complete the 6 steps in the instructions

33 of 58

Download and Install R and RStudio

  • R

The R Project for Statistical Computing

  • RStudio

https://www.rstudio.com/products/rstudio/download/

34 of 58

R, RStudio and R Markdown installation (Windows users)

  • Watch the ReComBio video on “Bash in Windows for R Markdown”, recorded on Monday, 07/19/2021

  • The video starts with instructions on how to download R and RStudio

  • 00:15:00 - Access .Rmd file
  • 00:17:00 - Install packages in R
  • 00:21:00 - Bash chunks (need Bash/Ubuntu installed in Windows)
  • 00:50:00 - Dealing with path problems in R in Windows
  • 01:14:00 - 01:28:00 - Ubuntu and WSL installation
  • 01:24:00 - Troubleshooting Ubuntu Password
  • 01:26:00 - Connecting Ubuntu to R Markdown (RStudio needs to be re-started)
  • 01:27:00 - Terminal -> Terminal Options -> New Terminals open with -> Bash (WSL)

35 of 58

R Markdown installation for Windows users

  • The critical point of Ubuntu installation for Windows users is around 01:27:00
    • Seen in Video “Bash in Windows for R Markdown”, recorded on Monday, 07/19/2021

  • As seen in the next slide, once Ubuntu is installed, Bash can be used from Rmd following these steps
    • Terminal -> Terminal Options -> New Terminals open with -> Bash (WSL)

36 of 58

R Markdown installation for Windows users

37 of 58

Downloading Git

  • When helping users install Bash for R Markdown in Windows, we noticed that Bash is not installed on their Windows machines

  • After installing Git, the user will still face the challenge of downloading bwa, samtools and bcftools

  • Downloading this software is necessary to run Bash from inside the R Markdown notebooks

  • Alternatively, the user can run the alignment pipeline from the Command Line

  • 00:35:00 - “bash not recognized” error (not addressed)

38 of 58

Download Git (Windows Users)

  • Google Search: rstudio git bash terminal

  • https://www.bioinformatics.babraham.ac.uk/training/RStudio_GitHub/Initial_setup.html

39 of 58

Installing Shiny in a machine running Git

  • The link below helped me while installing Shiny on a Windows machine

https://community.rstudio.com/t/install-packages-unable-to-access-index-for-repository-try-disabling-secure-download-method-for-http/16578

  • For this question in R: Update all/some/none? [a/s/n]:
    • Start with “n” (none), which is the same as answering “No”

40 of 58

R Markdown installation and Layout

Once properly installed, these are the main advantages of R Markdown:

  • Documentation of the code with comments and notes

  • Publication-quality figures and slides

  • User is free to use multiple languages in parallel
    • Python
    • R
    • Bash

41 of 58

R Markdown Layout

[Annotated screenshot of the R Markdown layout; labels: Insert Chunk, Comments about code, Code Chunk, Console, Plots]

42 of 58

R Markdown Cook-Book

  • R Markdown cheat-sheet:

43 of 58

Part 1: The Data Carpentry Genomics Workshop

  • Data Carpentry Genomics workshop

  • Following the instructions in the onboarding slides should be sufficient for these two interactions

44 of 58

Part 2: Three Iterations of Software Installation

  • There should be plenty of information on installation in the recorded videos and with instructors

  • Software to be installed by 09/03:
    • R and RStudio (1)
    • Ubuntu/Bash/WSL and R Markdown (2)

  • SARS-CoV-2 Pipeline
    • splitfasta
  • Anaconda (3)
  • bwa
  • Samtools
  • Samtools and Bcftools (Alternative)

45 of 58

Download Anaconda

Docs instructions:

https://docs.google.com/document/d/1OWulKk9-9cJTExe4Bo5xq-VKgWrdw2Uy54RndjyrzJM/edit

Installing Anaconda allows bwa and split-fasta to be downloaded

https://anaconda.org/bioconda/bwa

https://pypi.org/project/split-fasta/

The following link downloads a shell script (Bash) that is run with the bash command to install Anaconda on the machine. One option is the 64-Bit Command Line Installer (584 MB):

https://www.anaconda.com/products/distribution

The following link documents the steps to download and install Anaconda on macOS:

https://docs.anaconda.com/anaconda/install/mac-os/
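
After installation, it helps to confirm that the pipeline tools are visible from the terminal before opening the notebooks. The sketch below is one way to do that. The function name check_tools is hypothetical, and the suggested conda command assumes the bioconda channel linked above is reachable from your machine.

```shell
# check_tools TOOL...
# Reports, for each named command, whether it can be found on the PATH.
# Returns 0 when every tool is found and 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Tools named in the slides; a failure here only means that
# something still needs to be installed
check_tools bwa samtools bcftools ||
  echo "Install what is missing, for example: conda install -c conda-forge -c bioconda bwa samtools bcftools"
```

Running the check from the RStudio Terminal as well as from the system terminal shows whether both see the same PATH, which is a common source of "command not found" errors in Bash chunks.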

46 of 58

Part 3: Variant Call Pipeline using GISAID data

  • Students need to request access prior to downloading data from GISAID

  • Global Initiative on Sharing Avian Influenza Data

  • Request access to GISAID:
    • https://www.gisaid.org/register/
    • This step may take some time

47 of 58

GISAID Download Instructions

  • GISAID download Instructions are available in the SIP2022 BME02 Recording of 2022-06-24

  • The recording time was 17.03.17
    • There are two recordings for 2022-06-24
    • This image is from the 17.03.17 (5:03:17 PM) recording

  • Download instructions start at 01:12:55 of the video
    • Make this video available in ReComBio folder, not only SIP2022

48 of 58

GISAID Download Instructions

  • Note the Search Tab indicated by the red arrow
    • You should be able to click the Search Tab

  • After logging into GISAID, you need to click the Search tab

  • After clicking Search, follow the sequence: Search -> Location -> Collection Date -> Submission Date to filter the files for your region of interest
    • Note how this is done in the image on the left

  • The image on the left illustrates filtering GISAID data using the following parameters:
    • Location: Brazil
    • Collection: 2020-07-01
    • Submission: 2020-12-31

49 of 58

Access to GitHub Markdown Notebooks

  • As shown in the next slide, click “Code”, then “Download ZIP”

  • The notebook should then be ready to use on your computer
    • To run the notebook, you will need to have R and RStudio installed

50 of 58

Download R Markdown Notebook from GitHub

51 of 58

SIP2022 BME02 Markdown Notebooks

  • Notebooks for SIP2022 can be found at

  • Identify the Download button, near the Clone button as indicated in the next slide

  • You need to decide which of the 4 notebooks you want to download

  • You can download for instance the 1-Introduction repository or all of the 4 repositories at the same time

52 of 58

SIP2022 BME02 Markdown Notebooks

  • If you click the download button, you will be led to download the main repository, containing the 4 sub-repositories indicated by the green arrows

  • You can also click one of the four repositories (1-Introduction, 2-Variant-Call-Pipeline, etc.) instead

  • This last action will download one sub-repository at a time

    • The image below is the screen from https://gitlab.com/genomics-research/recombio

53 of 58

GitLab Commit Documentation

  • Commits are used to record modifications made to the repository

  • Show example of commit in GitLab Repository

  • Notebook descriptor

54 of 58

Git clone repository in the local machine

  • Be careful because you may delete files when using Git in the command line

  • Running git clone $path in the directory of interest copies the GitHub repository to the local machine

  • Later, we will be able to collaborate as we progress to develop our coding abilities

  • Collaboration is done through GitHub Branching

55 of 58

Create Branches and Push Modifications to GitLab Repo

  • git checkout -b YourBranch
    • For example: git checkout -b rachelle_shiny_r_script

  • git add ./

  • git commit -m "Rachelle Shiny Application Notebook"
    • A new local branch has no upstream branch yet, so a plain git push fails by default
    • You need to use git push --set-upstream origin YourBranch

  • git push --set-upstream origin YourBranch
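
The branch, add and commit steps can be wrapped in one small function so that they always run in the same order. This is a sketch under stated assumptions: the function name start_branch is hypothetical, the repository directory is an existing clone, and the push is left as a separate manual step because it needs access to the remote.

```shell
# start_branch REPO_DIR BRANCH MESSAGE
# Creates BRANCH in an existing clone, stages every change and commits it.
# Each step runs only if the previous one succeeded.
start_branch() {
  repo=$1; branch=$2; message=$3
  git -C "$repo" checkout -b "$branch" &&
  git -C "$repo" add . &&
  git -C "$repo" commit -m "$message"
}

# Example call (placeholders):
#   start_branch path/to/clone YourBranch "Describe your change"
#
# After a successful commit, publish the branch and set its upstream:
#   git -C path/to/clone push --set-upstream origin YourBranch
```

Chaining the commands with && stops the sequence at the first failure, for example when the branch name already exists, so nothing is committed to the wrong branch by accident.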

56 of 58

Create a Merge Request

  • In blue pop-up click “Create Merge Request”
  • Assign it to me

57 of 58

Instructions for R Markdown Notebook Use

  • Notebook 1: Introduction to R and Genotype-Phenotype Association

  • Notebook 2: SARS-CoV-2 Variant Call Pipeline - NB2
    • Requires Anaconda installation
    • Requires several software installation steps
      • These may differ between Windows and Unix-like (Linux, macOS) systems
      • After software installation, the folders fasta_reference_file and SARS-CoV-2_Regions need to be filled with FASTA files from the region of interest

  • Notebook 3: Hierarchical Clustering of SARS-CoV-2 Mutations
    • Need to decide whether to incorporate the Web Application development aspect
    • This was awarded 1st place in the Poster Presentation of a Research Symposium

58 of 58

Alex’s Lemonade Stand Foundation (ALSF) Git Workshop

Course Website

  • https://alexslemonade.github.io/reproducible-research/

Schedule (Contains Slides at the end of page)

Resources List (Variety of relevant resources that may be of interest)

  • https://alexslemonade.github.io/reproducible-research/reproducibility_resources.html
