1 of 49

An introduction to computational genomics

Spring 2023

This work, “An introduction to computational genomics”, is a derivative of “Course overview and introduction to Unix” by Aaron Quinlan, used under CC BY SA 4.0. This work is licensed under CC BY SA 4.0 by Dennis Hazelett.

2 of 49

Why computational genomics?

Graduate school in 1998: People studied single genes in a single model (organism).

  • Perturb the system (drugs, surgery, environment)
  • Look for changes under microscope
  • Gene expression (RT-PCR, ISH)

Graduate school in 2023: Study hundreds or thousands of genes, in situ or in populations of individual cells

  • Comparative studies are more feasible
  • Still under microscope
  • New methods, new analysis
  • Computer skills are no longer optional

3 of 49

Scientists must have the ability to create, collect AND analyze their own experimental data, from start to finish

4 of 49

Objectives

Become competent at

  • Working with Unix-like operating systems
  • Manipulating files and their contents
  • Understanding NGS bioinformatically
  • What kind of analyses are common with NGS data, how are they conducted?
    • Bedtools
    • DESeq2 etc
    • Diffbind
    • Single cell analyses
  • Microbiome
  • Machine learning pipelines

5 of 49

About you

Range of bench and computer skills, prior experience

Range of access to computational resources

You are RESOURCEFUL

6 of 49

Expectations

Attendance is mandatory for an A (except grad school excused absences).

Do every assignment, completely, on time. Including readings.

Do your own work.

Participate in class! If you are confused, so are others!!!

Computer skills are learned by doing, not by osmosis.

7 of 49

Overview:

Course website https://junkdnalab.github.io/acg_2023/

Course instructors:

Dennis Hazelett

Simon Coetzee

Peter Nguyen (TA)

Guests:

Ivan Vujkovic-Cvijin

Obtained from nature.com

8 of 49

Acknowledgements

Hagen, NRG 2000

Gauthier et al, Briefings in Bioinformatics 2018

Extensively borrowed/shamelessly copied from Aaron Quinlan @aaronquinlan

His slides and lectures are available online: https://github.com/quinlan-lab/applied-computational-genomics

9 of 49

A brief, staggeringly incomplete history of computational biology

10 of 49

Comp. biology was born in the 1960s

Joel Hagen, “The origins of bioinformatics”, NRG, Dec. 2000

  • Expanding collection of amino acid sequences in the 1960s
  • Need for computational power to answer questions and study protein biology
  • Scarcity of academic computers was no longer a major problem

11 of 49

Polypeptide theory of protein structure

  • Between 1945 and 1955, Fred Sanger and colleagues sequenced insulin, the first complete protein sequence
  • Established that every protein had a characteristic primary structure
  • Moore and Stein developed semi-automated sequencing techniques that transformed protein sequencing (sound familiar?)

Frederick Sanger

His first Nobel Prize

1958

12 of 49

DNA and the genetic code

  • Rosalind Franklin, Francis Crick, James Watson, and Maurice Wilkins contributed to solving DNA structure and the proposed system for replication
  • Sequencing DNA proved to be a formidable task

Franklin’s x-ray diffraction of B form DNA

13 of 49

DNA sequencing: Maxam-Gilbert, Sanger

Sanger sequencing key points:

  • sequencing by synthesis (not degradation); radioactive primers hybridize to DNA
  • polymerase + dNTPs + ddNTP terminators at low concentration
  • 1 lane per base, visually interpret ladder

Frederick Sanger

His second Nobel Prize

1980

14 of 49

Margaret Dayhoff

  • Likely the first computational biologist
  • Trained in math and quantum chemistry
  • Associate director of the newly-formed National Biomedical Research Foundation
  • Wrote seminal FORTRAN programs to derive amino acid sequences from the partial overlaps of fragmented amino acid sequences.
  • From months to minutes!
  • Realized the applications to nucleic acids and gene sequences.

Margaret Dayhoff

15 of 49

Sequence searching and alignment

Bill Pearson

David Lipman

FASTA (1985)

BLAST (1990)

Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman

Innovation: heuristic database search (speed), followed by optimal alignment (accuracy, statistics)

Accessed via Unix programs GCG & DNA Strider

16 of 49

Genome assembly

https://dazzlerblog.files.wordpress.com/2016/03/asm-history.pdf

Gene Myers

  • Inherently challenging computational problem
  • Large, repetitive genomes
  • Small “alphabet” (A, C, G, T)
  • Overlap-Layout-Consensus
  • String graph
  • De Bruijn graphs

17 of 49

The HMM bible.

  • Highly influential book describing the use of Hidden Markov Models as probabilistic models of biological sequences
  • For example, how do we identify a CpG island?

Sean Eddy

Richard Durbin

18 of 49

How HMMs work (roughly)

Tyra Banks, model/entrepreneur/reality host

19 of 49

Importance of HMM to the field of epigenomics

Jason Ernst (currently UCLA) and Manolis Kellis (MIT), 2012

20 of 49

New: Application of ML and AI tech to genomics

  • Variant effect prediction:
    • Clinical variants (VUS)
    • Non-coding variants and RNAs
    • Splicing variants
  • Imputation of epigenomics data from incomplete data
  • Cell populations in single cell data
  • Language Models
    • BERT and cousins: representational medical reference
    • Electronic Health Records (EHR) (at Cedars: Ruowang Li)
    • Literature curation
    • Ontologies, pathway DBs, PPI, hallmarks, etc.

21 of 49

What is Unix?

Definition 1: Unix is not an acronym; it is a pun on "Multics". Multics was a large multi-user operating system that was being developed at Bell Labs shortly before Unix was created in the early '70s. Brian Kernighan is credited with the name.

Definition 2: Where computational genomics is done.

Definition 3: Your dear friend.

Recommended reading: “The Evolution of the Unix Time-sharing system”, Dennis M. Ritchie https://pdfs.semanticscholar.org/f64f/6e66da16e93ebf4221fc8915b2420fd56b66.pdf

22 of 49

What is GNU?

Pronounced “g’noo”; the name stands for “GNU’s Not Unix”

Linux: a Unix-like operating system built around the Linux kernel with “free” software

Many of the command-line tools we will encounter come from the GNU project

Much of the internet is powered by Linux via Apache servers (~47%)

Richard Stallman - Linus Torvalds

23 of 49

Unix history

  • Initial file system, command interpreter (shell), and process management started by Ken Thompson
  • Device files and further development from Dennis Ritchie, as well as McIlroy and Ossanna (to a lesser degree)
  • Vast array of simple, dependable tools that each do one simple task.
  • By combining these tools, one can conduct rather sophisticated analyses
  • Wildly popular platform for high performance computing. Supports parallelism.
  • SunOS/Solaris, IBM's AIX, Hewlett-Packard HP-UX, OSX, Linux, Android, etc.

Credit:https://en.wikipedia.org/wiki/History_of_Unix

Ken Thompson (sitting) and Dennis Ritchie working together at a PDP-11

24 of 49

Connecting to a Unix computer via the terminal

“Terminal” in OSX

“putty” for Windows

25 of 49

Launching OSX “Terminal”

Applications

Utilities

Terminal

26 of 49

Launching Putty for Windows

iapetus.csmc.edu

Save as “iapetus”

491XX

27 of 49

Launching Putty for Windows

iapetus.csmc.edu

Type your uNID (e.g., hazelettd) and hit enter.

Then enter your password when prompted and hit enter.

28 of 49

Another option for windows users: WSL

29 of 49

The “prompt”

The prompt is just a patient little thing that waits around for you to tell it what to do via “commands”.

Command syntax must be exact. In this way, Unix is dumb. It cannot infer what you meant if you misspell, provide the wrong syntax, etc.
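To make the point about exact syntax concrete, here is a minimal sketch. It assumes a POSIX shell; “hed” is a hypothetical typo for “head” chosen for illustration, and the sketch assumes no program with that name is installed.

```shell
# Correct spelling: the shell finds the program and runs it.
if head -n 1 /dev/null; then good_status=0; else good_status=$?; fi

# Hypothetical typo ("hed" instead of "head"): the shell cannot guess what
# was meant, so it reports "not found" and returns a failing exit status.
if hed -n 1 /dev/null 2>/dev/null; then bad_status=0; else bad_status=$?; fi

echo "head exit status: $good_status"
echo "hed exit status:  $bad_status"
```

An exit status of 0 means success; a POSIX shell uses 127 for a command it could not find.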

30 of 49

Connect to VPN

31 of 49

Option 2

Virtual desktop:

Login to workspace.cshs.org

32 of 49

rstudio Info - password is the first part of your email

Student               Port
Nicolas Angelillis    49153
Basia Gala            49154
Elena Ivleva          49155
Na Jeong Kim          49156
Nimisha Mazumdar      49157
Maya Modak            49158
Asli Beyza Ozdemir    49159
Roberta Piras         49160
Inga Yenokian         49166

33 of 49

That’s no moon…

mimas.csmc.edu:<port>

Mimas - moon of Saturn

rstudio Info - password is the first part of your email

34 of 49

Linux Terminal (bash) in rstudio

Menu: Tools > Terminal > New Terminal

Appears behind “console”

Virtual machine has 7 cores & 24 GB RAM

35 of 49

Resources

36 of 49

Unix basics (May 26th)

37 of 49

First: Feedback forms

Do them right after class. Why?

Do not skip! Attendance credit is given for completion of the feedback.

Specific issues:

  1. Pros and Cons of Putty vs. Terminal (WSL)
  2. Relevance of HMM to bioinformatics (genomics! CpG islands, ChromHMM)
  3. Syllabus not timely
  4. Too much history
  5. Background on clusters and why they are better
  6. “The lecture mentioned that most computational genomics is made with Unix, and I would be curious to understand what makes Unix particularly useful for computational genomics. Is it mainly power and dependability, or is there more to it?”

38 of 49

After Login - change your password

$ passwd

Changing password for coetzeesg.

Current password:

New password:

Retype new password:

passwd: password updated successfully

39 of 49

The Unix file system. A tree just like OSX and Windows

40 of 49

The ls command (list files and directories)

(What files and directories can be found in the current directory?)

Example Unix file system (a “tree”)

/
    home
        luke    (Luke’s home directory)
            hi.txt
            proj1
                data1.txt
                data2.txt
        leia
    bin
        ls
        head
        cat
    vol

Imagine we are user “luke”

$ ssh luke@iapetus.csmc.edu

Command (list luke’s home directory):

$ ls

Result, written to standard output (stdout):

hi.txt proj1

41 of 49

The ls command (list files and directories)

(What files and directories can be found in the current directory?)


$ ls ~

hi.txt proj1

list luke’s home directory

42 of 49

The ls command (list files and directories)

(What files and directories can be found in the current directory?)

$ ls ~

hi.txt proj1


List the “home” directory:

$ ls /home

luke leia

Q: Why does this directory need a leading “slash”?

A: The leading slash makes “/home” an absolute path, that is, one that starts at the root (“/”) of the tree. “home” sits one “level” below root.

List luke’s home directory (the tilde means “home”):

$ ls ~luke

hi.txt proj1

List the contents of luke’s “proj1” directory:

$ ls proj1

data1.txt data2.txt
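The ls examples above can be tried safely on a scratch copy of the tree. This is a minimal sketch, assuming a POSIX shell with mktemp available; the scratch directory and the variable names (sandbox, home_listing, proj_listing, users_listing) are illustrative and are not part of the course servers.

```shell
# Build a throwaway copy of the example tree (assumption: mktemp exists).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/home/luke/proj1" "$sandbox/home/leia" \
         "$sandbox/bin" "$sandbox/vol"
touch "$sandbox/home/luke/hi.txt" \
      "$sandbox/home/luke/proj1/data1.txt" \
      "$sandbox/home/luke/proj1/data2.txt"

# Pretend to be luke: start in luke's home directory.
cd "$sandbox/home/luke"

# No argument: list the current directory.
home_listing=$(ls)
echo "$home_listing"

# Relative path: proj1 is looked up inside the current directory.
proj_listing=$(ls proj1)
echo "$proj_listing"

# Absolute path: starts at the top of the tree, so it works from anywhere.
users_listing=$(ls "$sandbox/home")
echo "$users_listing"
```

Note that ls sorts names alphabetically, so the last listing shows leia before luke. When its output is captured or piped rather than shown in a terminal, ls prints one name per line.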

43 of 49

The cd command (change directories)

(cd helps to navigate through the Unix directory tree)

Imagine we are user “luke”

List luke’s home directory:

$ ls

hi.txt proj1

Move to luke’s “proj1” directory:

$ cd proj1

List luke’s “proj1” directory:

$ ls

data1.txt data2.txt

Move “up” the tree to luke’s home directory:

$ cd ..

$ ls

hi.txt proj1

Move “up” the tree to the “home” directory:

$ cd ..

$ ls

luke leia

Move directly to luke’s “proj1” directory:

$ cd luke/proj1

Move to the system “bin” directory:

$ cd /bin
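The difference between relative moves (proj1, .., luke/proj1) and absolute moves (/bin) can be sketched on a scratch tree. This assumes a POSIX shell with mktemp; the sandbox directory and the variable names are illustrative only.

```shell
# Scratch tree standing in for the example file system (assumption).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/home/luke/proj1" "$sandbox/bin"

cd "$sandbox/home/luke"        # start in luke's home directory
cd proj1                       # relative path: down one level
in_proj1=$(basename "$PWD")

cd ..                          # up one level, back to luke
cd ..                          # up again, to home
in_home=$(basename "$PWD")

cd luke/proj1                  # relative path spanning two levels
in_deep=$(basename "$PWD")

cd "$sandbox/bin"              # absolute path: does not depend on where we are
in_bin=$(basename "$PWD")

echo "$in_proj1 $in_home $in_deep $in_bin"
```

The shell keeps the current directory in the PWD variable; basename is used here only to show the last path component.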

44 of 49

The pwd command (print working directory)

(Where am I? That is, in which directory am I?)

Imagine we are user “luke”

Where am I?

$ pwd

/home/luke

Move to luke’s “proj1” directory:

$ cd proj1

Where am I?

$ pwd

/home/luke/proj1

Move back to luke’s home directory:

$ cd ..

Where am I?

$ pwd

/home/luke
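A small sketch of how pwd tracks each cd, again on a scratch tree. It assumes a POSIX shell with mktemp; the sandbox prefix is stripped from the output only so that the paths read like the ones on the slide.

```shell
# Scratch tree (assumption: mktemp exists). Running pwd inside the new
# directory records its path exactly as the shell will report it later.
sandbox=$(cd "$(mktemp -d)" && pwd)
mkdir -p "$sandbox/home/luke/proj1"

cd "$sandbox/home/luke"
before=$(pwd)            # absolute path of luke's home directory
cd proj1
inside=$(pwd)            # one level deeper
cd ..
after=$(pwd)             # back where we started

# Remove the sandbox prefix so the paths look like /home/luke.
before=${before#"$sandbox"}
inside=${inside#"$sandbox"}
after=${after#"$sandbox"}

echo "$before"
echo "$inside"
echo "$after"
```

pwd always reports an absolute path, which is why its output begins with a slash.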

45 of 49

The mkdir command (make a new directory)

Imagine we are user “luke”

Where am I?

$ pwd

/home/luke

List luke’s home directory:

$ ls

hi.txt proj1

Move to the “proj1” directory:

$ cd proj1

Make a new “data” directory in proj1:

$ mkdir data

Example Unix file system (a “tree”), now with the new “data” directory:

/
    home
        luke
            hi.txt
            proj1
                data1.txt
                data2.txt
                data
        leia
    bin
        ls
        head
        cat
    vol
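A short sketch of mkdir on a scratch directory, assuming a POSIX shell with mktemp. The directory proj2 and the variable plain_result are illustrative additions. The -p option is a standard mkdir option that creates missing parent directories.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir proj1                  # create one directory
mkdir proj1/data             # the parent (proj1) exists, so this works

# Without -p, mkdir refuses to create a directory whose parent is missing.
if mkdir proj2/data 2>/dev/null; then
    plain_result="created"
else
    plain_result="failed"
fi
echo "mkdir proj2/data: $plain_result"

# With -p, any missing parent directories are created along the way.
mkdir -p proj2/data
ls
```

Creating directories one level at a time, as on the slide, and creating a nested path in one step with -p give the same result.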

46 of 49

The touch command (create an empty file)

Imagine we are user “luke”

Where am I?

$ pwd

/home/luke

List luke’s home directory:

$ ls

hi.txt proj1

Move to the “proj1/data” directory:

$ cd proj1/data

Create an empty text file called “frost.txt”:

$ touch frost.txt

Example Unix file system (a “tree”), now with the new “frost.txt” file:

/
    home
        luke
            hi.txt
            proj1
                data1.txt
                data2.txt
                data
                    frost.txt
        leia
    bin
        ls
        head
        cat
    vol
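A minimal sketch of what touch does and does not do, assuming a POSIX shell with mktemp; the variable names are illustrative. On a file that does not exist, touch creates it empty. On a file that already exists, it only updates the timestamp and leaves the contents alone.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

touch frost.txt                              # new file: created empty
empty_size=$(( $(wc -c < frost.txt) ))       # size in bytes

echo "Whose woods these are I think I know." > frost.txt
touch frost.txt                              # existing file: contents are kept
kept_lines=$(( $(wc -l < frost.txt) ))       # number of lines still present

echo "bytes after first touch: $empty_size"
echo "lines after second touch: $kept_lines"
```

The arithmetic expansion around wc is there only to strip the padding that some versions of wc add to their output.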

47 of 49

The head command

(peek at the first n lines in an input file or stream)

The contents of frost.txt:

Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
The only other sound’s the sweep
Of easy wind and downy flake.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.

$ head -n 5 frost.txt

Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer

$ head -n 10 frost.txt

Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.

48 of 49

The head command

(peek at the first n lines in an input file or stream)


Important!

Each line ends with a special, hidden character called the newline (\n) character.

This is how “head” knows where the lines start and end.


-n is an option (an argument or parameter) to the head command that modifies its behavior. In this case, it tells head to report the first 5 lines instead of the first 10, which is the default.
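The role of -n and of the newline character can be checked with a small file. This is a minimal sketch, assuming a POSIX shell with mktemp; the file lines.txt and the variable names are made up for illustration.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Write 12 numbered lines; each \n in the printf format ends one line.
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    printf 'line %s\n' "$i"
done > lines.txt

# Count how many lines each form of head passes along.
first_five=$(( $(head -n 5 lines.txt | wc -l) ))   # -n 5: first five lines
default_ten=$(( $(head lines.txt | wc -l) ))       # no -n: first ten lines
newline_count=$(( $(wc -l < lines.txt) ))          # wc -l counts newlines

echo "head -n 5 gives $first_five lines"
echo "head gives $default_ten lines"
echo "lines.txt contains $newline_count newline characters"
```

Because wc -l counts newline characters, it reports the same number of lines that head sees.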

49 of 49

Unix reference “cheat sheet”: (print or bookmark the link below!)

http://practicalcomputing.org/files/PCfB_Appendices.pdf