An introduction to computational genomics
Spring 2023
This work, “An introduction to computational genomics”, is a derivative of “Course overview and introduction to Unix” by Aaron Quinlan, used under CC BY SA 4.0. This work is licensed under CC BY SA 4.0 by Dennis Hazelett.
Why computational genomics?
Graduate school in 1998: People studied single genes in a single model (organism).
Graduate school in 2023: Study hundreds or thousands of genes, in situ or in populations of individual cells
Scientists must have the ability to create, collect AND analyze their own experimental data, from start to finish
Objectives
Become competent at
About you
Range of bench and computer skills, prior experience
Range of access to computational resources
You are RESOURCEFUL
Expectations
Attendance is mandatory for an A (except grad school excused absences).
Do every assignment, completely, on time. Including readings.
Do your own work.
Participate in class! If you are confused, so are others!!!
Computer skills are learned by doing, not by osmosis.
Overview:
Course website https://junkdnalab.github.io/acg_2023/
Course instructors:
Dennis Hazelett
Simon Coetzee
Peter Nguyen (TA)
Guests:
Ivan Vujkovic-Cvijin
Obtained from nature.com
Acknowledgements
Hagen, NRG 2000
Gauthier et al, Briefings in Bioinformatics 2018
Extensively borrowed/shamelessly copied from Aaron Quinlan @aaronquinlan
His slides and lectures are available online: https://github.com/quinlan-lab/applied-computational-genomics
A brief, staggeringly incomplete history of computational biology
Comp. biology was born in the 1960s
Joel Hagen, “The origins of bioinformatics”, NRG, Dec. 2000
Polypeptide theory of protein structure
Frederick Sanger
His first Nobel Prize
1958
DNA and the genetic code
Franklin’s x-ray diffraction of B form DNA
DNA sequencing: Maxam-Gilbert, Sanger
Sanger sequencing key points:
Frederick Sanger
His second Nobel Prize
1980
Margaret Dayhoff
Sequence searching and alignment
Bill Pearson
David Lipman
FASTA (1985)
BLAST (1990)
Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman
Innovation: heuristic database search (speed), followed by optimal alignment (accuracy, statistics)
Accessed via Unix programs GCG & DNA Strider
Genome assembly
https://dazzlerblog.files.wordpress.com/2016/03/asm-history.pdf
Gene Myers
The HMM bible.
Sean Eddy
Richard Durbin
How HMMs work (roughly)
Tyra Banks, model/entrepreneur/reality host
Importance of HMM to the field of epigenomics
Jason Ernst (currently UCLA) and Manolis Kellis (MIT), 2012
New: Application of ML and AI tech to genomics
What is Unix?
Definition 1: Unix is not an acronym; it is a pun on "Multics". Multics was a large multi-user operating system that was being developed at Bell Labs shortly before Unix was created in the early '70s. Brian Kernighan is credited with the name.
Definition 2: Where computational genomics is done.
Definition 3: Your dear friend.
Recommended reading: “The Evolution of the Unix Time-sharing system”, Dennis M. Ritchie https://pdfs.semanticscholar.org/f64f/6e66da16e93ebf4221fc8915b2420fd56b66.pdf
What is GNU?
Pronounced “g’noo”; stands for “GNU’s Not Unix”
Linux: a Unix-like operating system built around the Linux kernel with “free” software
Many of the command line tools we will encounter come from the GNU project
Much of the internet is powered by Linux via Apache servers (~47%)
Richard Stallman - Linus Torvalds
Unix history
Credit:https://en.wikipedia.org/wiki/History_of_Unix
Ken Thompson (sitting) and Dennis Ritchie working together at a PDP-11
Connecting to a Unix computer via the terminal
“Terminal” in OSX
“putty” for Windows
Launching OSX “Terminal”
Applications
Utilities
Terminal
Launching Putty for Windows
iapetus.csmc.edu
Save as “iapetus”
491XX
Launching Putty for Windows
iapetus.csmc.edu
Enter your uNID (e.g., hazelettd) and hit enter.
Then enter your password when prompted and hit enter.
Another option for Windows users: WSL
The “prompt”
The prompt is just a patient little thing that waits around for you to tell it what to do via “commands”.
Command syntax must be exact. In this way, Unix is dumb. It cannot infer what you meant if you misspell, provide the wrong syntax, etc.
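A minimal sketch of what "exact" means in practice. The misspelled name lss_typo_example is made up for illustration (an assumption, not a real program); the point is only that the shell reports an error rather than guessing that ls was meant.

```shell
# Spelled correctly: ls runs and lists the root directory.
ls /
ok_status=$?                 # $? holds the exit status of the last command
echo "ls exit status: $ok_status"

# Misspelled (hypothetical name): the shell cannot find it and says so.
# "||" runs the right-hand side only when the command on the left fails.
typo_status=0
lss_typo_example / || typo_status=$?
echo "typo exit status: $typo_status"
```

An exit status of 0 means success; shells use 127 to signal "command not found".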
Connect to VPN
Option 2
Virtual desktop:
Login to workspace.cshs.org
rstudio Info - password is the first part of your email
Student | Port |
Nicolas Angelillis | 49153 |
Basia Gala | 49154 |
Elena Ivleva | 49155 |
Na Jeong Kim | 49156 |
Nimisha Mazumdar | 49157 |
Maya Modak | 49158 |
Asli Beyza Ozdemir | 49159 |
Roberta Piras | 49160 |
Inga Yenokian | 49166 |
That’s no moon…
mimas.csmc.edu:<port>
Mimas - moon of Saturn
rstudio Info - password is the first part of your email
Linux Terminal (bash) in rstudio
Menu: Tools > Terminal > New Terminal
Appears behind “console”
Virtual machine has 7 cores & 24 GB RAM
Resources
Unix basics (May 26th)
First: Feedback forms
Do them right after class. Why?
Do not skip! Attendance credit is given for completion of the feedback.
Specific issues:
After Login - change your password
$ passwd
Changing password for coetzeesg.
Current password:
New password:
Retype new password:
passwd: password updated successfully
The Unix file system. A tree just like OSX and Windows
The ls command (list files and directories)
(What files and directories can be found in the current directory?)
Example Unix file system (a “tree”)
/
  home
    luke
      hi.txt
      proj1
        data1.txt
        data2.txt
    leia
  bin
    ls
    head
    cat
  vol
Imagine we are user “luke”
$ ssh luke@iapetus.csmc.edu
$ ls
hi.txt proj1
Command: list luke’s home directory (the current directory right after login)
The result goes to standard output (stdout)
$ ls ~
hi.txt proj1
list luke’s home directory
$ ls /home
luke leia
list the “home” directory
Q: Why does this directory need a leading “slash”?
A: The leading slash makes this an absolute path, one that starts at the root (“/”) of the tree. Without it, ls would look for “home” inside the current directory.
$ ls ~luke
hi.txt proj1
list luke’s home directory (the tilde means “home”)
$ ls proj1
data1.txt data2.txt
list contents of luke’s “proj1” directory
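To try these listings without touching a real home directory, the example tree can be rebuilt in a scratch location. This is a sketch under assumptions: the variable name root and the use of mktemp are illustrative choices, and the empty files only stand in for the slide's files.

```shell
# Build a throwaway copy of the example tree under a temporary directory.
root=$(mktemp -d)
mkdir -p "$root/home/luke/proj1" "$root/home/leia" "$root/bin" "$root/vol"
touch "$root/home/luke/hi.txt"
touch "$root/home/luke/proj1/data1.txt" "$root/home/luke/proj1/data2.txt"

# Pretend to be luke by moving into his directory.
cd "$root/home/luke"

ls                 # no argument: lists the current directory (hi.txt and proj1)
ls proj1           # relative path: looked up inside the current directory
ls "$root/home"    # absolute path: works no matter where we are (leia and luke)
```

Note that ls sorts names alphabetically, so leia is listed before luke.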
The cd command (change directories)
(cd helps to navigate through the Unix directory tree)
Example Unix file system (a “tree”), as shown for the ls command
Imagine we are user “luke”
$ ls
hi.txt proj1
list luke’s home directory
$ cd proj1
move to luke’s “proj1” directory
$ ls
data1.txt data2.txt
list luke’s “proj1” directory
$ cd ..
move “up” the tree to luke’s home directory
$ ls
hi.txt proj1
$ cd ..
move “up” the tree to the “home” directory
$ ls
luke leia
$ cd luke/proj1
move directly to luke’s “proj1” directory
$ cd /bin
move to the system “bin” directory
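The same moves can be rehearsed in a scratch copy of the tree. This is only a sketch: base is a hypothetical variable holding a temporary directory that plays the role of the root.

```shell
# Scratch tree: base stands in for "/" on the slide.
base=$(mktemp -d)
mkdir -p "$base/home/luke/proj1" "$base/bin"

cd "$base/home/luke/proj1"   # start inside proj1
cd ..                        # up one level: now in luke
cd ..                        # up again: now in home
cd luke/proj1                # relative path: down two levels in one step
cd "$base/bin"               # absolute path: jump across the tree
cd "$base/home"              # and back again

contents=$(ls)               # what does the directory we ended in contain?
echo "$contents"
```

Relative paths such as luke/proj1 only work from the right starting point; absolute paths work from anywhere.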
The pwd command (print working directory)
(Where am I? That is, in which directory am I?)
Example Unix file system (a “tree”), as shown for the ls command
Imagine we are user “luke”
$ pwd
/home/luke
Where am I?
$ cd proj1
move to luke’s “proj1” directory
$ pwd
/home/luke/proj1
Where am I?
$ cd ..
move back to luke’s home directory
$ pwd
/home/luke
Where am I?
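A short sketch of the same pattern: check the location with pwd after every cd. The temporary directory held in start is an assumption standing in for luke's home directory, so the printed paths will differ from the slide.

```shell
# start plays the role of /home/luke.
start=$(mktemp -d)
mkdir "$start/proj1"

cd "$start"
pwd                # prints the temporary directory's full path

cd proj1
inside=$(pwd)      # capture the output of pwd in a variable
echo "$inside"     # the same path, now ending in /proj1

cd ..
pwd                # back where we started
```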
The mkdir command (make a new directory)
Example Unix file system (a “tree”), now with a new “data” directory inside proj1
Imagine we are user “luke”
$ pwd
/home/luke
Where am I?
$ ls
hi.txt proj1
List luke’s home directory
$ cd proj1
Move to the “proj1” directory
$ mkdir data
Make a new “data” directory in proj1
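A sketch of mkdir in a scratch directory. The names work, proj2 and raw are invented for illustration. The -p option, which is not on the slide, is a standard mkdir option that creates any missing parent directories.

```shell
work=$(mktemp -d)
cd "$work"

mkdir proj1               # make one new directory
mkdir proj1/data          # make a directory inside it (proj1 must already exist)

# Without -p this would fail, because proj2 and proj2/data do not exist yet.
mkdir -p proj2/data/raw

ls proj1                  # shows data
ls proj2/data             # shows raw
```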
The touch command (create an empty file)
Example Unix file system (a “tree”), now with frost.txt inside proj1/data
Imagine we are user “luke”
$ pwd
/home/luke
Where am I?
$ ls
hi.txt proj1
List luke’s home directory
$ cd proj1/data
Move to the “proj1/data” directory
$ touch frost.txt
Create an empty text file called “frost.txt”
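A sketch confirming that the file touch creates really is empty. The scratch directory and the [ -s ... ] test (true only when a file has a size greater than zero) are additions for illustration, not part of the slide.

```shell
scratch=$(mktemp -d)
cd "$scratch"

touch frost.txt           # create the file
ls                        # frost.txt now appears in the listing

# -s is true when the file exists and is not empty.
if [ -s frost.txt ]; then
    echo "frost.txt has content"
else
    echo "frost.txt is empty"
fi
```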
The head command
(peek at the first n lines of an input file or stream)
The contents of frost.txt:
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
The only other sound’s the sweep
Of easy wind and downy flake.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
$ head -n 5 frost.txt
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
$ head -n 10 frost.txt
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
Important!
Each line ends with a special, hidden character called the newline (\n) character.
This is how “head” knows where the lines start and end.
[Figure: frost.txt with a \n marking the end of each of its 16 lines]
-n is an option (an argument or parameter) to the head command that modifies its behavior. In this case it tells head to report the first 5 lines instead of the first 10, which is the default.
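A sketch showing both points at once: head counts lines by their newline characters, and -n changes how many lines are reported. The file lines.txt and its contents are made up for the demonstration; printf, wc and the > redirection are standard tools that are not otherwise covered here.

```shell
demo=$(mktemp -d)
cd "$demo"

# printf repeats its format for each argument, so this writes 12 lines,
# each ending in a newline (\n), into lines.txt.
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > lines.txt

head lines.txt         # default: the first 10 lines
head -n 5 lines.txt    # -n 5: only the first 5 lines
head -n 1 lines.txt    # -n 1: just the first line

wc -l lines.txt        # counts newline characters: 12
```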
Unix reference “cheat sheet”: (print/mark the link below!)
http://practicalcomputing.org/files/PCfB_Appendices.pdf