1 of 20

A BLAST from the past: revisiting BLAST's E-value

Yang Lu

University of Waterloo

2 of 20

BLAST (Basic Local Alignment Search Tool) serves as a fundamental tool for comparing sequences and conducting database searches

Altschul et al. Journal of Molecular Biology (1990); Altschul et al. Nucleic Acids Research (1997)

  • #1 most widely used program in bioinformatics
  • 30+ years since publication
  • ~200K accumulated citations
  • Used daily in academia and industry

BLAST has been called the Google of biological research

https://www.nytimes.com/2008/02/21/us/21karlin.html

3 of 20

BLAST is extensively employed for studying functional and evolutionary relationships among DNA and protein sequences

  • Tracing COVID-19 evolution (image: https://www.rcsb.org/news/feature/5e74d55d2d410731e9944f52)
  • Constructing the tree of life (image: https://www.pinterest.com/pin/tree-of-life--303359724875658900/)
  • Predicting protein structures (Jumper et al. Nature (2021))

4 of 20

The workflow of BLAST for protein database search

Database

Query protein

Query protein:

DB protein Q28RX4:

Optimal Local alignment

Scoring scheme

  • Substitution matrix
  • Gap open/extend
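The "optimal local alignment" step can be illustrated with the textbook Smith-Waterman recurrence with affine gaps. This is a minimal sketch, not BLAST's own algorithm: BLAST finds alignments heuristically rather than filling the full dynamic-programming table. The toy `simple_score` function (+5 match, -4 mismatch) stands in for a real substitution matrix and is an assumption, as is the convention that a gap of length k costs `gap_open + k * gap_extend`.

```python
NEG_INF = float("-inf")


def simple_score(a: str, b: str) -> int:
    """Toy substitution score (assumption): stands in for BLOSUM/PAM."""
    return 5 if a == b else -4


def local_alignment_score(query, target, score=simple_score,
                          gap_open=11, gap_extend=1):
    """Best local alignment score under affine gap penalties.

    A gap of length k costs gap_open + k * gap_extend.
    Only two rows of the table are kept, so memory is O(len(target)).
    """
    n = len(target)
    best = 0
    prev_h = [0] * (n + 1)        # best score ending at (i-1, j)
    prev_f = [NEG_INF] * (n + 1)  # best score ending in a vertical gap
    for i in range(1, len(query) + 1):
        cur_h = [0] * (n + 1)
        cur_f = [NEG_INF] * (n + 1)
        e = NEG_INF               # best score ending in a horizontal gap
        for j in range(1, n + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + score(query[i - 1], target[j - 1])
            # The 0 lets an alignment restart anywhere (local alignment).
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

The returned score is the quantity whose statistical significance the following slides discuss.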

5 of 20

The E-value measures the statistical confidence of the alignment

[Equation: BLAST E-value formula, annotated with the terms below]

  • Query length
  • Location: a Gumbel distribution parameter
  • Rate: a Gumbel distribution parameter

Altschul et al. Journal of Molecular Biology (1990); Altschul et al. Nucleic Acids Research (1997)
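For reference, the standard Karlin-Altschul form of the E-value is sketched below. The symbols are assumptions and may differ from the slide's own notation: $S$ is the alignment score, $m$ the query length, $n$ the database length, and $K$, $\lambda$ the parameters pre-estimated for a given scoring scheme.

```latex
E = K\,m\,n\,e^{-\lambda S},
\qquad
P(S' \ge S) = 1 - e^{-E}

% Equivalent Gumbel (extreme value) form, with location \mu and rate \lambda:
P(S' \ge x) = 1 - \exp\!\left(-e^{-\lambda (x - \mu)}\right),
\qquad
\mu = \frac{\ln(K\,m\,n)}{\lambda}
```

In this form the location $\mu$ grows with the search-space size $m\,n$, while the rate $\lambda$ controls how fast the tail decays.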

6 of 20

The BLAST E-value may overestimate the significance!

Settings:

  • Query: 1M shuffles of a random protein (175 AA)
  • Database: SCOP (Murzin et al. Journal of Molecular Biology (1995))
  • Scoring scheme: BLOSUM45 (gap open 14, gap extend 2)

[Figure: result compared against the expectation; annotation: Significance overestimation (too liberal)]

Can we simply select a stricter E-value cutoff threshold?
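The shuffled-query experiment can be sketched as a calibration check: shuffled queries are null by construction, so the fraction of them reported below a cutoff should be close to that cutoff. The helper names below are hypothetical, the database search itself is left out, and the conversion p = 1 - exp(-E) is the standard relation between E-values and p-values.

```python
import math
import random


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq (same composition, no homology)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def evalue_to_pvalue(evalue: float) -> float:
    """Standard conversion p = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-evalue)


def calibration_ratio(null_pvalues, alpha: float) -> float:
    """Observed / expected fraction of null p-values at or below alpha.

    About 1: well calibrated.
    Above 1: significance overestimated (too liberal).
    Below 1: significance underestimated (too conservative).
    """
    observed = sum(p <= alpha for p in null_pvalues) / len(null_pvalues)
    return observed / alpha
```

A ratio that stays well above 1 across cutoffs corresponds to the "too liberal" case on this slide. A ratio well below 1 corresponds to the "too conservative" case.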

7 of 20

The BLAST E-value may also underestimate the significance

Settings:

  • Query: 1M shuffles of a random protein (175 AA)
  • Database: Swiss-Prot (Bairoch et al. Nucleic Acids Research (2000))
  • Scoring scheme: PAM70 (gap open 10, gap extend 1)

[Figure: result compared against the expectation; annotation: Significance underestimation (too conservative)]

We cannot predict whether BLAST is overestimating or underestimating!

8 of 20

The problem is universal across various implementations, extensions, and settings

AB-BLAST

blastn

PSI-BLAST

Diamond

MMseqs2

FASTA

9 of 20

Troubleshooting: The Gumbel distribution still applies to BLAST

A Gumbel distribution is fitted to 1M alignment scores to compute p-values

[Figure panels: Significance overestimation; Significance underestimation; each shown with Gumbel fitting]
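Fitting a Gumbel distribution to null alignment scores and reading off a p-value can be sketched as follows. The slide does not say which fitting procedure was used. This sketch uses the method of moments as a simple stand-in, and all function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def sample_gumbel(location: float, scale: float, rng: random.Random) -> float:
    """Draw one Gumbel(location, scale) value by inverse transform."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return location - scale * math.log(-math.log(u))


def fit_gumbel_moments(scores):
    """Method-of-moments fit (a stand-in for whatever fit the talk used).

    For a Gumbel distribution: variance = (pi * scale)^2 / 6 and
    mean = location + EULER_GAMMA * scale.
    """
    scale = statistics.stdev(scores) * math.sqrt(6) / math.pi
    location = statistics.fmean(scores) - EULER_GAMMA * scale
    return location, scale


def gumbel_pvalue(score: float, location: float, scale: float) -> float:
    """Upper-tail probability P(S >= score) under the fitted Gumbel."""
    z = (score - location) / scale
    # Cap the exponent so extremely low scores do not overflow exp().
    return -math.expm1(-math.exp(min(-z, 700.0)))
```

Fitting in this way has to be repeated for every query and database, which is the cost the next slide weighs against accuracy.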

 

 

 

 

10 of 20

Dilemma: statistical accuracy vs. computational feasibility

BLAST E-value:

  • Pre-estimated Gumbel parameters
  • Depends only on the scoring scheme

Gumbel-fitted p-value:

  • Gumbel parameters estimated in real time
  • Depends on the scoring scheme, query, and DB

Studentized Gumbel (SG) p-value

Criteria: computational feasibility, statistical accuracy, general applicability

11 of 20

Studentized Gumbel (SG) p-value is inspired by Student's t-test

[Figure: standardization followed by hypothesis testing, contrasted with studentization (with sample generation) followed by hypothesis testing]

Studentization applies to any family of distributions characterized by location and scale parameters, including Gumbel!
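The reason studentization removes the unknown parameters can be written out in a few lines. The notation is an assumption, not taken from the slide: $x$ is the observed score, $x_1, \dots, x_n$ are null samples, and every value is drawn from a location-scale family with location $\mu$ and scale $\sigma$.

```latex
% Standardization: parameters known
z = \frac{x - \mu}{\sigma}

% Studentization: parameters replaced by sample estimates
t = \frac{x - \bar{x}}{s},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

% Write x = \mu + \sigma\varepsilon and x_i = \mu + \sigma\varepsilon_i,
% where \varepsilon, \varepsilon_i follow the standard member of the family:
t = \frac{(\mu + \sigma\varepsilon) - (\mu + \sigma\bar{\varepsilon})}{\sigma\, s_{\varepsilon}}
  = \frac{\varepsilon - \bar{\varepsilon}}{s_{\varepsilon}}
```

Both $\mu$ and $\sigma$ cancel, so the null distribution of $t$ depends only on the family and on $n$. For the normal family this gives a scaled Student's t distribution, and for the Gumbel family it gives a studentized Gumbel distribution.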

12 of 20

The workflow of SG p-value

[Figure: Query protein → Shuffled query proteins → search against the Database]

13 of 20

The workflow of SG p-value (Cont.)

Studentized Gumbel distribution

Independent of the scoring scheme after studentization
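One way to read the workflow on these two slides is sketched below. It is a reconstruction under stated assumptions, not the author's implementation: the scores from shuffled queries are taken as given, the studentized Gumbel null distribution is approximated by Monte Carlo simulation from a standard Gumbel, and all function names are hypothetical.

```python
import bisect
import math
import random
import statistics


def sample_standard_gumbel(rng: random.Random) -> float:
    """Draw one standard Gumbel value by inverse transform."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def studentize(score: float, null_scores) -> float:
    """(score - sample mean) / sample standard deviation of the null scores."""
    return (score - statistics.fmean(null_scores)) / statistics.stdev(null_scores)


def sg_null_statistics(n_null: int, n_sim: int, rng: random.Random):
    """Simulate the studentized statistic under the null.

    Location and scale cancel under studentization, so a standard Gumbel
    is enough. n_null must equal the number of shuffled-query scores
    that will be used later.
    """
    stats = []
    for _ in range(n_sim):
        null = [sample_standard_gumbel(rng) for _ in range(n_null)]
        stats.append(studentize(sample_standard_gumbel(rng), null))
    return sorted(stats)


def sg_pvalue(score: float, null_scores, null_stats) -> float:
    """Monte Carlo upper-tail p-value of the studentized observed score."""
    t = studentize(score, null_scores)
    exceed = len(null_stats) - bisect.bisect_left(null_stats, t)
    # The +1 correction keeps the p-value strictly positive.
    return (exceed + 1) / (len(null_stats) + 1)
```

Because `sg_null_statistics` never sees the scoring scheme, the same simulated table can be reused across queries and scoring schemes for a fixed number of shuffles, which matches the independence claimed on the slide.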

14 of 20

The SG p-values seem to be valid

[Figure, significance-overestimation setting; panels: BLAST E-value, Gumbel p-value, SG p-value]

15 of 20

The SG p-values seem to be valid (Cont.)

[Figure, significance-underestimation setting; panels: BLAST E-value, Gumbel p-value, SG p-value]

16 of 20

The SG p-values identify more homologous sequences

Data:

  • Database: ASTRAL40, a protein domain DB with 15,178 domains (Brenner et al. Nucleic Acids Research (2000))
  • Query: every sequence in the ASTRAL40 DB

Task:

  • Search for homologous sequences
  • Homologs: proteins with the same superfamily label

Metric:

  • The number of reported homologous sequences subject to a cutoff threshold (1%)

[Figure: bar chart of reported homolog counts; bar labels: 113522, 116094, 146834, N/A]

17 of 20

Future work

Accelerate SG p-value without searching the shuffled query

  • Direct estimation of sample mean/variance using ML models

Extend SG p-value to different settings of BLAST

  • Extend to blastn, blastx, tblastn

Extend SG p-value to Position-Specific Iterated BLAST (PSI-BLAST)

  • PSI-BLAST is a more powerful search tool based on BLAST

18 of 20

Conclusion

We reported problems in BLAST's statistical estimation, the first such report in the past 30 years

Key idea:

  • The problem is universal
  • Studentized Gumbel (SG) p-value is a valid solution
  • Independent of the scoring scheme

Impact:

  • BLAST is a cornerstone in biomedical analysis
  • Potential to reshape many conclusions drawn by researchers

19 of 20

One more thing:

2024/7/16 Tuesday 8:40—9:00 MLCSB

20 of 20

Questions?