1 of 28

Presenter:

Nanjala Ruth

2 of 28

Introduction

  • Computers & specialized software have become an essential part of the biologist’s toolkit for:
    • Routine DNA or protein sequence analysis
    • Parsing meaningful information from data
  • With the advent of next-generation sequencing (NGS), it is easy to believe that modern bioinformatics emerged from it.
  • However, bioinformatics began more than 50 years ago, when desktop computers were still a hypothesis & DNA could not yet be sequenced.

3 of 28

Objective

  • To present an integrative timeline of key events in bioinformatics, some background on parallel advances in molecular biology & computer science, & some reflections on the future of bioinformatics.

4 of 28

1950-1970: The Origins

  • 1950s - The status of DNA as the carrier of genetic information was still controversial.
  • Avery, MacLeod & McCarty (1944) showed that uptake of pure DNA from a virulent bacterial strain could confer virulence on a non-virulent strain, but their results were not immediately accepted.
  • Many thought that proteins were the carriers of genetic information.

5 of 28

1950-1970: The Origins

  • 1952 - The role of DNA as the carrier of genetic information was validated.
  • Hershey & Chase showed that it was DNA, not protein, that was taken up & transmitted by bacterial cells infected by a bacteriophage.

6 of 28

1950-1970: The Origins

  • Still, not much was known about the arrangement of DNA, only that pairs of its monomers (A & T, G & C) occurred in equimolar proportions.
  • 1953 - The double-helix structure of DNA was solved by Watson & Crick.

7 of 28

1950-1970: The Origins

  • It took 13 more years to decipher the genetic code.
  • It took 25 more years before the 1st DNA sequencing methods became available.
  • As a result, the use of bioinformatics in DNA analysis lagged nearly two decades behind protein analysis, since the chemical nature of proteins was already better understood.

8 of 28

Protein analysis was the starting point

  • 1950s
    • Major advances in the determination of protein structure through crystallography.
    • The first sequence of a protein (insulin) was published.
  • This encouraged the development of more efficient methods for obtaining protein sequences (Edman degradation).
  • Edman degradation was, however, limited for large proteins: assembling a whole protein sequence from hundreds of short Edman peptide sequences was cumbersome.
  • 1960s - Bioinformatics software was developed to solve this problem.

10 of 28

Dayhoff: the first bioinformatician

  • Margaret Dayhoff (1925-1983) pioneered the application of computational methods to biochemistry.
  • 1958-1962 - Margaret Dayhoff & Robert Ledley combined their expertise & developed COMPROTEIN, a program designed to determine protein primary structure from Edman peptide sequencing data (the first occurrence of a de novo sequence assembler).
  • COMPROTEIN used the 3-letter amino acid code; Dayhoff later developed the one-letter amino acid code that is still in use today.
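
The idea of assembling a protein from overlapping Edman peptides can be illustrated with a small greedy overlap-merge sketch. This is not COMPROTEIN's actual procedure; the function names, the `min_len` threshold and the example peptides are hypothetical, and the peptides are written in the one-letter code.

```python
def overlap(a: str, b: str, min_len: int = 2) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(peptides: list[str], min_len: int = 2) -> str:
    """Greedily merge the pair of fragments with the largest overlap."""
    # Drop exact duplicates, then fragments fully contained in another one.
    frags = list(dict.fromkeys(peptides))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the fragments cannot be joined
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    # Return the longest contig (the only one if everything merged).
    return max(frags, key=len)


# Hypothetical peptides covering a made-up 10-residue protein.
print(assemble(["TAYIAK", "IAKQR", "MKTAY"]))
```

A greedy strategy like this can be misled by repeated motifs, which is one reason real assemblers use more careful overlap graphs.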

12 of 28

The Computer assisted genealogy of life

  • 1970 - Needleman & Wunsch developed the 1st dynamic programming algorithm for pairwise protein sequence alignment.
  • 1980s - The first multiple sequence alignment (MSA) algorithms emerged; they were, however, computationally impractical for aligning tens of proteins, especially long ones.
  • 1987 - Feng & Doolittle developed the first practical approach to MSA (progressive sequence alignment).
  • 1988 - The CLUSTAL software was developed.

A mathematical framework for amino acid substitutions

  • 1978 - Dayhoff, Schwartz & Orcutt developed the 1st probabilistic model of amino acid substitutions, the Point Accepted Mutation (PAM) matrix.
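
To give a feel for how a PAM-style model works, the sketch below extrapolates a one-step substitution matrix to larger evolutionary distances by repeated multiplication and converts it to rounded log-odds scores. The two-residue alphabet, the probabilities and the background frequencies are invented for illustration and are not Dayhoff's published values; entry `[i][j]` is assumed to be the probability that residue `j` is replaced by residue `i`.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power (PAM-n from PAM-1)."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(m, freqs):
    """Rounded scores 10 * log10(M[i][j] / f[i])."""
    size = len(m)
    return [[round(10 * math.log10(m[i][j] / freqs[i])) for j in range(size)]
            for i in range(size)]


# Toy values (assumptions): 1% chance of substitution per PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
TOY_FREQS = [0.5, 0.5]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print(log_odds(TOY_PAM1, TOY_FREQS))
print(log_odds(toy_pam250, TOY_FREQS))
```

In the toy model, conserved positions score positively and substitutions negatively at short distances, and the scores flatten towards zero as the distance grows.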

13 of 28

The Computer assisted genealogy of life

Needleman-Wunsch algorithm
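
A minimal sketch of global pairwise alignment in the textbook dynamic programming form usually attributed to Needleman & Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; the 1970 paper and practical tools use other scoring schemes, such as substitution matrices.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GATTACA", "GCATGCU")
print(best)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths, which hints at why naive extensions to many sequences became impractical.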

14 of 28

1970-1980: Paradigm shift from Protein to DNA

Deciphering the DNA language: the genetic code

  • The specifications of any living thing are encoded in the nucleotide arrangement of the DNA molecule (Central Dogma of molecular biology).
  • By 1968, all of the 64 codons were deciphered; DNA was now readable.
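
What "readable" means in practice can be sketched in a few lines: the 64 codons are the 4 x 4 x 4 triplets of bases, and a lookup table maps each one to an amino acid or a stop signal. The compact amino acid string below follows the standard genetic code laid out in T, C, A, G order; the function name and example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")

# Map each of the 64 codons to its amino acid.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE))        # 64
print(translate("ATGGCCTAA"))  # MA
```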

Cost efficient reading of DNA

  • 1976 - The Maxam-Gilbert method, the first DNA sequencing method to be widely adopted, was introduced.
  • Sanger’s team had developed the plus and minus DNA sequencing method (1975).
  • 1977 - Technical modifications to the plus & minus method led to the development of the common Sanger chain-termination method.
  • 1979 - The first software dedicated to analyzing Sanger reads was published.

15 of 28

1970-1980: Paradigm shift from Protein to DNA

Using DNA sequences in phylogenetic inferences

  • The first molecular phylogenetic trees were reconstructed from protein sequences (maximum parsimony).
  • DNA sequences added phylogenetic information that amino acid sequences could not provide (synonymous mutations).
  • Felsenstein developed a maximum likelihood (ML) method to infer phylogenetic trees from DNA sequences.
  • His work inspired the development of several bioinformatics tools using ML & other statistical methods.
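
The extra signal carried by DNA can be shown with a toy comparison: some codon changes leave the encoded amino acid unchanged (synonymous) and are therefore invisible at the protein level. The sketch uses a small hand-picked subset of the standard genetic code; the two sequences and the function name are hypothetical.

```python
# Subset of the standard genetic code, enough for this example only.
TOY_CODONS = {
    "CTT": "L", "CTC": "L",   # leucine
    "GAA": "E", "GAG": "E",   # glutamate
    "GAT": "D",               # aspartate
    "AAA": "K", "AAG": "K",   # lysine
}


def compare_codons(seq1: str, seq2: str) -> tuple[int, int]:
    """Count (synonymous, nonsynonymous) codon differences.

    Assumes both sequences are in frame, equally long, and use only
    codons present in TOY_CODONS.
    """
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        if TOY_CODONS[c1] == TOY_CODONS[c2]:
            synonymous += 1      # DNA differs, protein does not
        else:
            nonsynonymous += 1   # both DNA and protein differ
    return synonymous, nonsynonymous


# Species A encodes L-E-K, species B encodes L-D-K.
print(compare_codons("CTTGAAAAA", "CTCGATAAA"))  # (1, 1)
```

Here the protein sequences differ at one position, while the DNA sequences reveal a second, synonymous change.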

16 of 28

1980-1990: Parallel advances in biology and computer science

Molecular methods to target & amplify specific genes

  • Introduction of gene cloning - isolation & amplification of genes from their source organism
  • Invention of the polymerase chain reaction (PCR) by Kary Mullis

Access to computers & specialized software

  • 1977 - The first wave of microcomputers hit the consumer market.
  • 1984 - The GCG package for manipulating DNA, RNA & protein sequences was published.
  • The DNASTAR package was developed in the same year.

17 of 28

1980-1990: Parallel advances in biology and computer science

Bioinformatics & the free software movement

  • 1985 - Richard Stallman published the GNU Manifesto, which outlined a free, Unix-like operating system.
  • EMBL, GenBank & DDBJ united to form the International Nucleotide Sequence Database Collaboration.
  • 1985 - The journal Computer Applications in the Biosciences (now Bioinformatics) was established.

18 of 28

1980-1990: Parallel advances in biology and computer science

Desktop Computers & new programming languages

  • Desktop computers emerged during this period.
  • 1987 - Larry Wall created the Perl scripting language, which was the lingua franca of bioinformatics until the late 2000s.
  • Perl’s flexibility, coupled with its heavily punctuated syntax, could easily result in low code readability, making maintenance difficult.
  • Python, first implemented in 1989 by Guido van Rossum, became a major player on the bioinformatics scene in the late 2000s.
  • Several non-scripting languages originated in the early 1990s.

20 of 28

1990-2000: Genomics, structural bioinformatics & the information superhighway

Dawn of the genomics era

  • 1995 - The 1st complete genome of a free-living organism (Haemophilus influenzae) was sequenced.
  • The turning point for the genomics era was the publication of the human genome at the start of the 21st century.
  • The globalized information network was another important player, as the NIH-funded human genome sequencing project made its data available through this network.

21 of 28

1990-2000: Genomics, structural bioinformatics & the information superhighway

Bioinformatics online

  • 1990 - Tim Berners-Lee initiated the World Wide Web.
  • 1993 - The EMBL Nucleotide Sequence Data Library was made available on the web.
  • 1994 - NCBI was made available online.
  • 1995 - The Genomes database was established.
  • 1997 - The PubMed database was established.
  • 1999 - The Human Genome database was established.

22 of 28

2000-2010: High throughput bioinformatics

Second generation sequencing

  • DNA sequencing was democratized with the advent of NGS.

Biological Big Data

  • Since 2008, Moore’s law has no longer been an accurate predictor of sequencing costs, which dropped by several orders of magnitude with the arrival of massively parallel sequencing technology.
  • This resulted in a surge of sequences in public databases & growing concerns about the big data issue.
  • Major computational resources are needed to handle all this information.

24 of 28

2000-2010: High throughput bioinformatics

High performance computing & collaborative computing

  • Several government-sponsored organizations specialized in HPC have emerged.
  • Collaborative platforms that allow users to make their computers available for distributed calculations across different projects have also emerged (e.g. BOINC).

25 of 28

2010-Today: Present and future perspectives

Clearly defining the bioinformatics profession

  • There is no clear definition of who a bioinformatician is.
  • The International Society for Computational Biology (ISCB) published guidelines & recommendations on the core competencies that a bioinformatician should have.

Is the term bioinformatics obsolete?

  • Today, when virtually all research requires using computers, one may question the relevance of the term.
  • C. Titus Brown, at the 15th Bioinformatics Open Source Conference, argued that there is no need to distinguish bioinformatics from biology, as the two are intertwined.

26 of 28

2010-Today: Present and future perspectives

Towards modelling life as a whole: systems biology

  • Instead of independently investigating whole genomes, whole transcriptomes, or whole metabolomes, whole living organisms & their environments will be computationally modelled, with all molecular categories taken into account simultaneously.

27 of 28

Conclusion

  • Bioinformatics began more than 50 years ago.
  • In the 1960s, the 1st de novo peptide sequence assembler, the first protein sequence database & the first amino acid substitution model were developed.
  • Through the 1970s & 1980s, parallel advances in molecular biology & computer science set the path for increasingly complex endeavours such as analyzing complete genomes.
  • In the 1990s & 2000s, use of the internet, coupled with NGS, led to an exponential influx of data & a rapid proliferation of bioinformatics tools.
  • Today, bioinformatics faces multiple challenges, such as handling big data, ensuring reproducibility of results & achieving proper integration into academic curricula.

28 of 28