1 of 28

Presenter:

Nanjala Ruth

2 of 28

Introduction

Computers & specialized software have become an essential part of the biologist’s toolkit for:

Routine DNA or protein sequence analysis
Parsing meaningful information from the data

With the advent of next-generation sequencing (NGS), it is easy to believe that modern bioinformatics emerged from it.
However, bioinformatics began more than 50 years ago, when desktop computers were still a hypothesis & DNA could not yet be sequenced.

3 of 28

Objective

To present an integrative timeline of key events in bioinformatics, some background on parallel advances in molecular biology & computer science, & some reflections on the future of bioinformatics.

4 of 28

1950-1970: The Origins

1950s - The status of DNA as the carrier of genetic information was still controversial.
Avery, MacLeod & McCarty (1944) showed that uptake of pure DNA from a virulent bacterial strain could confer virulence on a non-virulent strain, but their results were not immediately accepted.
Many thought that proteins were the carriers of genetic information.

5 of 28

1950-1970: The Origins

1952 - The role of DNA as the carrier of genetic information was validated.
Hershey & Chase showed that it was DNA, & not protein, that was taken up & transmitted by bacterial cells infected by a bacteriophage.

6 of 28

1950-1970: The Origins

Little was yet known about the arrangement of DNA, only that pairs of its monomers occurred in equimolar proportions.
1953 - The double helix structure of DNA was solved by Watson & Crick.

7 of 28

1950-1970: The Origins

It took 13 more years to decipher the genetic code.
It took 25 more years before the first DNA sequencing methods became available.
The use of bioinformatics in DNA analysis therefore lagged nearly two decades behind the analysis of proteins, whose chemical nature was already better understood.

8 of 28

Protein analysis was the starting point

1950s

Major advances in determination of protein structure through crystallography.
The first sequence of a protein (insulin) was published.

This encouraged the development of more efficient methods for obtaining protein sequences (Edman degradation).
The method was, however, poorly suited to large proteins, as assembling a whole protein sequence from hundreds of short Edman peptide sequences was cumbersome.
1960s - The first bioinformatics software was developed to solve this problem.
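The assembly problem described above can be sketched in a few lines. This is a minimal illustration, assuming a simple greedy overlap-merge strategy; it is not the algorithm used by the original 1960s software, and the peptide fragments, the function names `overlap` and `greedy_assemble`, and the `min_len` parameter are all hypothetical.

```python
def overlap(a, b, min_len=2):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len residues exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=2):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    # Drop exact duplicates (keeping order) and fragments contained in others.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; return the unmerged pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Hypothetical short peptides (one-letter amino acid code), not real data.
peptides = ["TAYIAK", "MKTAY", "IAKQR"]
print(greedy_assemble(peptides))  # ['MKTAYIAKQR']
```

A greedy merge like this can be misled by repeated motifs, which is one reason assembling hundreds of short peptides by hand or by naive rules was error-prone.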

9 of 28

Protein analysis was the starting point

10 of 28

Dayhoff: the first bioinformatician

Margaret Dayhoff (1925-1983) pioneered the application of computational methods to biochemistry.
1958-1962 - Margaret Dayhoff & Robert Ledley combined their expertise & developed COMPROTEIN, a program designed to determine protein primary structure from Edman sequencing data (the first occurrence of a de novo sequence assembler).
COMPROTEIN used the three-letter amino acid code; Dayhoff later developed the one-letter amino acid code that is still in use today.

11 of 28

Dayhoff: the first bioinformatician

12 of 28

The Computer assisted genealogy of life

1970 - Needleman & Wunsch developed the first algorithm for pairwise protein sequence alignment.
1980s - The first multiple sequence alignment (MSA) algorithms emerged; they were, however, computationally impractical for aligning more than a few long protein sequences.
1987 - Da-Fei Feng & Russell Doolittle developed the first practical approach to MSA (progressive sequence alignment).
1988 - The CLUSTAL software was developed.

A mathematical framework for amino acid substitutions

1978 - Dayhoff, Schwartz & Orcutt developed the first probabilistic model of amino acid substitutions, the Point Accepted Mutation (PAM) matrix.

13 of 28

The Computer assisted genealogy of life

Needleman-Wunsch algorithm
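
The idea behind the Needleman-Wunsch algorithm can be sketched as a dynamic programming table in which each cell holds the best score for aligning two prefixes. The sketch below is illustrative only: the scoring values (match = 1, mismatch = -1, gap = -1) and the function name `needleman_wunsch` are assumptions, and real protein alignments would use a substitution matrix such as PAM instead of a flat match/mismatch score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The table has (n + 1) x (m + 1) cells, so time and memory grow with the product of the two sequence lengths. Extending the same table to many sequences at once adds one dimension per sequence, which is why the earliest MSA approaches were impractical.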

14 of 28

1970-1980: Paradigm shift from Protein to DNA

Deciphering the DNA language: the genetic code

The specifications of any living thing are encoded in the nucleotide arrangement of the DNA molecule (Central Dogma of molecular biology).
By 1968, all of the 64 codons were deciphered; DNA was now readable.

Cost efficient reading of DNA

1975 - Sanger’s team developed the plus and minus DNA sequencing method.
1976 - The Maxam-Gilbert method became the first widely adopted DNA sequencing method.
1977 - Technical modifications to the plus & minus method led to the development of the now-common Sanger chain-termination method.
1979 - The first software dedicated to analyzing Sanger reads was published.

15 of 28

1970-1980: Paradigm shift from Protein to DNA

Using DNA sequences in phylogenetic inferences

The first molecular phylogenetic trees were reconstructed from protein sequences, typically using maximum parsimony.
Using DNA sequences in phylogenetics provided information that could not be obtained from amino acid sequences (synonymous mutations).
Felsenstein developed a maximum likelihood (ML) method to infer phylogenetic trees from DNA sequences.
His work inspired the development of several bioinformatics tools using ML & other statistical methods.
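
The point about synonymous mutations can be made concrete with a short sketch. The two DNA fragments below are hypothetical, and the codon table is only the small part of the standard genetic code needed for them; the helper names `translate` and `count_differences` are illustrative.

```python
# Partial standard genetic code (DNA codons); only the codons used below.
CODON_TABLE = {
    "CTG": "L", "CTA": "L",  # leucine
    "GAA": "E", "GAG": "E",  # glutamate
    "AAA": "K", "AAG": "K",  # lysine
}


def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_differences(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(s, t))


# Two hypothetical homologous gene fragments (illustrative, not real data).
seq_a = "CTGGAAAAA"
seq_b = "CTAGAGAAG"

dna_diffs = count_differences(seq_a, seq_b)
protein_diffs = count_differences(translate(seq_a), translate(seq_b))
print(dna_diffs, protein_diffs)  # 3 0
```

Both fragments encode the same peptide (LEK), so a protein-based comparison sees no change, while the DNA comparison records three substitutions that can inform a phylogenetic tree.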

16 of 28

1980-1990: Parallel advances in biology and computer science

Molecular methods to target & amplify specific genes

Introduction of gene cloning: the isolation & amplification of genes from their source organism.
Invention of the PCR method by Kary Mullis.

Access to computers & specialized software

1977 - The first wave of microcomputers hit the consumer market.
1984 - The GCG package for manipulating DNA, RNA & protein sequences was published.
The DNASTAR package was developed in the same year.

17 of 28

1980-1990: Parallel advances in biology and computer science

Bioinformatics & the free software movement

1985 - Richard Stallman published the GNU Manifesto, which outlined a free Unix-like operating system.
EMBL, GenBank & DDBJ united to form the International Nucleotide Sequence Database Collaboration.
1985 - The journal Computer Applications in the Biosciences (now Bioinformatics) was established.

18 of 28

1980-1990: Parallel advances in biology and computer science

Desktop Computers & new programming languages

Desktop computers emerged during this period.
1987 - Larry Wall created the Perl scripting language, which was the lingua franca of bioinformatics until the 2000s.
Perl’s flexibility, coupled with its heavily punctuated syntax, could easily result in poorly readable code that was difficult to maintain.
Python, which was first implemented in 1989 by Guido van Rossum, became a major player in the bioinformatics scene in the late 2000s.
Several non-scripting languages originated in the early 1990s.

19 of 28

1980-1990: Parallel advances in biology and computer science

20 of 28

1990-2000: Genomics, structural bioinformatics & the information superhighway

Dawn of the genomics era

1995 - The first complete genome of a free-living organism, Haemophilus influenzae, was sequenced.
The turning point for the genomics era was the publication of the human genome at the start of the 21st century.
The globalized information network was another important player, as the NIH-funded human genome sequencing project made its data available through this network.

21 of 28

1990-2000: Genomics, structural bioinformatics & the information superhighway

Bioinformatics online

1990 - Tim Berners-Lee initiated the World Wide Web.
1993 - The EMBL Nucleotide Sequence Data Library was made available on the web.
1994 - NCBI was made available online.
1995 - The Genomes database was established.
1997 - The PubMed database was established.
1999 - The Human Genome database was established.

22 of 28

2000-2010: High throughput bioinformatics

Second generation sequencing

DNA sequencing was democratized with the advent of NGS.

Biological Big Data

Since 2008, Moore’s law has no longer been an accurate predictor of sequencing costs, which dropped by several orders of magnitude with the arrival of massively parallel sequencing technology.
This led to a surge of sequences in public databases & growing concern about the big data problem.
Major computational resources are needed to handle all this information.

23 of 28

2000-2010: High throughput bioinformatics

24 of 28

2000-2010: High throughput bioinformatics

High performance computing & collaborative computing

Several government-sponsored organizations specializing in HPC have emerged.
A collaborative platform that allows users to make their computers available for distributed calculations for different projects has also emerged (BOINC).

25 of 28

2010-Today: Present and future perspectives

Clearly defining the bioinformatics profession

There is no clear definition of who a bioinformatician is.
The ISCB published guidelines & recommendations on the core competencies that a bioinformatician should have.

Is the term bioinformatics obsolete?

Today, when virtually any research requires the use of computers, one may question the relevance of the term.
C. Titus Brown, at the 15th Bioinformatics Open Source Conference, argued that there is no need to distinguish bioinformatics from biology, as the two are intertwined.

26 of 28

2010-Today: Present and future perspectives

Towards modelling life as a whole: systems biology

Instead of independently investigating whole genomes, whole transcriptomes, or whole metabolomes, whole living organisms & their environments will be computationally modelled, with all molecular categories taken into account simultaneously.

27 of 28

Conclusion

Bioinformatics began more than 50 years ago.
In the 1960s, the first de novo peptide sequence assembler, the first protein sequence database & the first amino acid substitution model were developed.
Through the 1970s & 1980s, parallel advances in molecular biology & computer science set the path for increasingly complex endeavours such as analyzing complete genomes.
In the 1990s-2000s, use of the internet, coupled with NGS, led to an exponential influx of data & a rapid proliferation of bioinformatics tools.
Today, bioinformatics faces multiple challenges such as handling big data, ensuring reproducibility of results & achieving proper integration into academic curricula.
