1 of 19

BLAST: Basic Local Alignment Search Tool

Submitted by

Purnima Sharma

Department of Bioinformatics

2 of 19

CONTENTS

  • Definition
  • History
  • Types of BLAST
  • Algorithm
  • BLAST input and output format
  • BLAST Search
  • BLAST function
  • Uses of BLAST
  • Applications of BLAST

3 of 19

Definition

  • BLAST, or Basic Local Alignment Search Tool, is a powerful algorithm used in bioinformatics to compare biological sequences, such as DNA, RNA, or protein sequences, against a database of sequences.
  • It identifies regions of similarity between the query sequence and sequences in the database, which can help in determining the function, evolutionary relationships, or structure of the query sequence.
  • BLAST is widely used in biological research and is available as both a standalone software package and a web-based tool provided by the National Center for Biotechnology Information (NCBI) among other resources.

4 of 19

History of BLAST

  • BLAST dates to 1990, when it was published by Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman, in work centered at the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH).
  • Their aim was to create a more efficient and effective tool for comparing biological sequences than existing methods at the time.
  • The first version of BLAST was released in 1990. It was far faster than the rigorous alignment methods available at the time, while remaining sensitive enough for most database searches.
  • BLAST quickly gained popularity within the scientific community, becoming one of the most widely used bioinformatics tools.
  • Over the years, BLAST has undergone several iterations and improvements, with new versions being released to enhance its accuracy, speed, and usability.
  • Major milestones include gapped BLAST and PSI-BLAST (1997) and, more recently, the BLAST+ command-line suite (2009), which improved the handling of long queries and large databases.

5 of 19

Types of BLAST

6 of 19

Types of BLAST:

  • blastp: compares a protein sequence against a protein sequence database.
  • blastn: compares a nucleotide sequence against a nucleotide sequence database.
  • blastx: compares a six-frame translation of a nucleotide sequence against a protein database.
  • tblastn: compares a protein sequence against a six-frame translation of a nucleotide database.
  • tblastx: compares a six-frame translation of a nucleotide sequence against a six-frame translation of a nucleotide database.
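
The "six-frame translation" used by blastx, tblastn, and tblastx can be illustrated with a short Python sketch. It assumes the standard genetic code and an input containing only A, C, G, and T; the function names and the example sequence are hypothetical and are not part of any BLAST program.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical nine-base example sequence.
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six protein strings would then be compared in the same way as an ordinary protein query or database entry.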

7 of 19

Algorithm of BLAST

Some BLAST terminology

  • word – substring of a sequence.
  • word pair – pair of words of the same length.
  • score of a word pair – score of the gapless alignment of the two words: V A L M R and V A K N S.
  • Score = -4 + 3 + -4 + -3 + -1 = -9 (PAM120)
  • HSP – high-scoring segment pair.

8 of 19

Main steps of BLAST

Parameters: w = length of a hit; T = minimum score of a hit (for proteins: w = 3, T = 13 with BLOSUM62).

  • Step 1: Given a query sequence Q, compile the list of all words that form high-scoring word pairs (score of at least T) with words in Q.
  • Step 2: Scan the database for exact matches to the words compiled in step 1.
  • Step 3: Extend the hits from step 2.
  • Step 4: Evaluate the significance of the extended hits from step 3.
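
Step 2 can be sketched as a dictionary lookup: index every word of a database sequence by its position, then look up each word from the step 1 list. This is a minimal illustration in Python. The function names are hypothetical, and real BLAST implementations use more elaborate lookup structures.

```python
from collections import defaultdict


def index_words(subject, w):
    """Map every length-w word in the subject to the offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def find_hits(query_words, subject, w):
    """Return (query_offset, subject_offset) for every exact word match.

    query_words is an iterable of (query_offset, word) pairs, standing in
    for the word list compiled in step 1.
    """
    index = index_words(subject, w)
    return [(q_off, s_off)
            for q_off, word in query_words
            for s_off in index.get(word, [])]


if __name__ == "__main__":
    # Hypothetical word list and database sequence.
    words = [(0, "ACD"), (1, "CDE")]
    print(find_hits(words, "XXACDEACD", 3))
```

Each returned pair is a hit that step 3 would try to extend.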

9 of 19

How BLAST works

BLAST searches begin with a query sequence that is matched against the sequence databases specified by the user.

  • BLAST begins by breaking the query sequence into a series of short overlapping “words”.
  • The default word size for BLASTN is 28 nucleotides in megablast mode, the NCBI default; the classic blastn task uses 11.
  • The default word size for BLASTP is 3 amino acids.
  • The results obtained depend on the scoring matrix used.
  • The BLOSUM62 matrix is the default scoring matrix for BLASTP.
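
Breaking a query into overlapping words amounts to sliding a fixed-size window along the sequence one position at a time. The Python sketch below shows this for a protein query with a word size of 3. The function name and the query string are hypothetical.

```python
def overlapping_words(sequence, word_size):
    """Return (offset, word) for every overlapping word of the given size."""
    return [(i, sequence[i:i + word_size])
            for i in range(len(sequence) - word_size + 1)]


if __name__ == "__main__":
    query = "MKVLAAGIV"  # hypothetical protein query
    for offset, word in overlapping_words(query, 3):
        print(offset, word)
```

A sequence of length n yields n - w + 1 words of size w, so a larger word size gives fewer and more specific seeds.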

10 of 19

The basic strategy used by the BLAST algorithms

11 of 19

The BLASTP algorithm

  • The query sequence is broken into all possible 3-letter words using a moving window.
  • A numerical score is calculated for each word by adding up the values for its amino acids from the BLOSUM62 matrix.
  • Words with a score of 12 or more are collected into the initial BLASTP search set.
  • The search set is broadened by adding synonyms that differ from the words at one position.
  • Only synonyms with scores above a threshold value are added to the search set. NCBI BLASTP uses a default threshold of 10 for synonyms.
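
The word scoring and synonym expansion described above can be sketched in Python. To stay self-contained, the sketch uses a reduced six-letter alphabet with scores hand-copied from BLOSUM62, so treat the values as illustrative and check them against a full matrix before relying on them. The function names are hypothetical, and reading "above a threshold" as "at or above" is an assumption.

```python
ALPHABET = "LIVMKR"  # reduced alphabet, for illustration only

# Scores hand-copied from BLOSUM62 for these six residues (assumed correct).
_PAIRS = {
    "LL": 4, "LI": 2, "LV": 1, "LM": 2, "LK": -2, "LR": -2,
    "II": 4, "IV": 3, "IM": 1, "IK": -3, "IR": -3,
    "VV": 4, "VM": 1, "VK": -2, "VR": -3,
    "MM": 5, "MK": -1, "MR": -1,
    "KK": 5, "KR": 2,
    "RR": 5,
}
SCORES = {}
for pair, value in _PAIRS.items():
    SCORES[pair[0], pair[1]] = value
    SCORES[pair[1], pair[0]] = value  # the matrix is symmetric


def word_score(a, b):
    """Score of the gapless alignment of two words of equal length."""
    return sum(SCORES[x, y] for x, y in zip(a, b))


def synonyms(word, threshold):
    """Words differing from `word` at one position that score >= threshold."""
    found = {}
    for pos, original in enumerate(word):
        for residue in ALPHABET:
            if residue == original:
                continue
            candidate = word[:pos] + residue + word[pos + 1:]
            score = word_score(word, candidate)
            if score >= threshold:
                found[candidate] = score
    return found


if __name__ == "__main__":
    word = "LKM"  # hypothetical query word
    print("self score:", word_score(word, word))
    print("synonyms:", synonyms(word, threshold=10))
```

Conservative substitutions such as L to I survive the threshold, while substitutions between dissimilar residues such as L to K do not, which is how the search set stays small.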

12 of 19

13 of 19

Contd….

  • Using this search set, BLAST scans a database and identifies word hits/matches that score above the threshold.
  • These short matches serve as seeds. The BLAST algorithm attempts to extend the match in the immediate sequence neighborhood.
  • BLAST keeps a running raw score, using scoring matrices, as it extends the matches. Each new amino acid either increases or decreases the raw score.
  • Penalties are assigned for mismatches and for gaps between the two aligned sequences.
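
The running raw score kept during extension can be sketched as follows. This Python example extends a seed to the right only and without gaps, stopping once the running score has dropped more than a chosen amount below the best score seen. The function names, the toy match/mismatch scores, and the drop-off value are all assumptions made for illustration. A real search uses a scoring matrix and also extends to the left.

```python
def toy_score(a, b):
    """Toy scoring, not a real matrix: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3


def extend_right(query, subject, q_start, s_start, word_size, score, x_drop):
    """Extend a seed rightwards without gaps.

    Returns (best_score, best_length), where best_length counts aligned
    residues from the start of the seed.
    """
    # Start from the score of the seed itself.
    running = sum(score(query[q_start + k], subject[s_start + k])
                  for k in range(word_size))
    best, best_len = running, word_size
    q, s = q_start + word_size, s_start + word_size
    while q < len(query) and s < len(subject):
        running += score(query[q], subject[s])
        if running > best:
            best, best_len = running, q - q_start + 1
        elif best - running > x_drop:
            break  # score fell too far below the best; stop extending
        q += 1
        s += 1
    return best, best_len


if __name__ == "__main__":
    # Hypothetical sequences sharing the prefix ACDEFG.
    print(extend_right("ACDEFGHIK", "ACDEFGWWW", 0, 0, 3, toy_score, 5))
```

Allowing the score to dip before stopping lets the extension pass over an isolated mismatch when enough matching residues follow it.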

14 of 19

Contd….

  • In the NCBI default settings, a gap brings an initial penalty of 11, which increases by 1 for each missing amino acid.
  • Once the running score drops a set amount below the best score seen so far, the alignment ceases and BLAST stops trying to extend the alignment.
  • An extended sequence alignment that was initially seeded by a word hit is produced, called an HSP, or high-scoring segment pair.
  • All HSPs that have a cumulative score above the threshold score are reported in the BLAST results.
  • Raw scores are then converted into bit scores by correcting for the scoring matrix used.
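
The gap penalty and the raw-to-bit-score conversion can be written out as two small Python functions. The sketch assumes the usual BLAST convention that a gap of length k costs the initial penalty plus k times the extension penalty, and the standard Karlin-Altschul conversion S' = (lambda * S - ln K) / ln 2. The values lambda = 0.267 and K = 0.041 are assumed here as typical for BLOSUM62 with gap costs 11 and 1, and the function names are hypothetical.

```python
import math


def gap_cost(length, existence=11, extension=1):
    """Penalty for a gap of `length` residues (assumed: existence + length * extension)."""
    return existence + extension * length


def bit_score(raw_score, lam, k):
    """Convert a raw score to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    print("gap of 1 residue costs", gap_cost(1))
    print("gap of 3 residues costs", gap_cost(3))
    # Assumed statistical parameters, see the note above.
    print("bit score for raw score 100:", round(bit_score(100, 0.267, 0.041), 1))
```

Because lambda and K depend on the scoring matrix and gap costs, bit scores from searches with different scoring systems can be compared, whereas raw scores cannot.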

15 of 19

16 of 19

The BLAST output

  • Includes a table with the bit score (S) for each alignment and its E-value, or “expect value”.
  • The score (S) is a measure of the quality of an alignment (calculated as the sum of substitution and gap scores for each aligned residue).
  • The E-value (E), or expectation value, is a measure of the significance of the alignment. The E-value is the number of different alignments, with scores equivalent to or better than S, that are expected to occur in a database search by chance.
  • The lower the E-value, the more significant the alignment result.
  • Alignments with the highest bit scores and lowest E-values are listed at the top of the table.
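
The relationship between bit score and E-value can be made concrete with the standard approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The Python sketch below uses raw lengths for simplicity, whereas BLAST itself uses corrected "effective" lengths. The function name and the example numbers are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value: m * n * 2**(-bit_score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against 1e9 database residues.
    for bits in (30, 50, 100):
        print(bits, expect_value(bits, 300, 1_000_000_000))
```

Every additional bit halves the E-value, and the same bit score becomes less significant as the database grows.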

17 of 19

Uses of BLAST:

  • Search a database for sequences similar to an input sequence.
  • Identify previously characterized sequences.
  • Find phylogenetically related sequences.
  • Identify possible functions based on similarities to known sequences.

18 of 19

Applications of BLAST

BLAST (Basic Local Alignment Search Tool) has numerous applications in various fields of biology and bioinformatics:

1. Sequence Alignment: BLAST is primarily used for comparing biological sequences (DNA, RNA, or protein) against databases to find regions of similarity. This is essential for identifying homologous sequences, which can provide insights into evolutionary relationships and functional similarities.

2. Functional Annotation: BLAST results can be used to annotate the function of unknown sequences by identifying similar sequences with known functions in the database. This is crucial for interpreting the biological significance of newly sequenced genes or proteins.

3. Genomic and Proteomic Analysis: BLAST can be used to analyze entire genomes or proteomes to identify genes, regulatory regions, or protein domains. It helps in understanding the organization and structure of genomes and proteomes.

4. Phylogenetic Analysis: By identifying homologous sequences across different species, BLAST can aid in phylogenetic analysis to study evolutionary relationships and infer the evolutionary history of organisms.

19 of 19

5. Disease Research: BLAST is used in biomedical research to identify genetic variations associated with diseases, study gene expression patterns, and investigate the role of specific genes or proteins in disease pathways.

6. Drug Discovery: BLAST can be employed in drug discovery efforts to identify potential drug targets by comparing protein sequences of pathogens or disease-related genes to sequences of known drug targets.

7. Agricultural Biotechnology: BLAST is used in agricultural research to study crop genomes, identify genes related to desirable traits such as disease resistance or yield, and develop molecular markers for breeding purposes.

8. Microbial Ecology: BLAST is used to analyze microbial communities in various environments, such as soil, water, or the human microbiome, by comparing sequences obtained from environmental samples to reference databases.

9. Forensic Analysis: BLAST can be used in forensic biology to compare DNA sequences obtained from crime scenes to databases of known DNA sequences, for example to identify the species a biological sample came from.

10. Virology: BLAST is used to study the genetic diversity of viruses, identify novel viruses, and understand viral evolution and transmission dynamics.

Overall, BLAST is an indispensable tool in molecular biology and bioinformatics, playing a vital role in a wide range of research areas and applications.