A BLAST from the past: revisiting BLAST's E-value
Yang Lu
University of Waterloo
BLAST (Basic Local Alignment Search Tool) serves as a fundamental tool for comparing sequences and conducting database searches
Altschul et al. Journal of Molecular Biology (1990); Altschul et al. Nucleic Acids Research (1997)
#1 most widely used program in bioinformatics
30+ years since publication
~200K accumulated citations
Used daily in academia and industry
BLAST has been called the Google of biological research
https://www.nytimes.com/2008/02/21/us/21karlin.html
BLAST is extensively employed for studying functional and evolutionary relationships among DNA and protein sequences
https://www.rcsb.org/news/feature/5e74d55d2d410731e9944f52
Tracing COVID-19 evolution
https://www.pinterest.com/pin/tree-of-life--303359724875658900/
Constructing tree of life
Jumper et al. Nature (2021)
Predicting protein structures
The workflow of BLAST for protein database search
Database
Query protein
Query protein:
DB protein Q28RX4:
Optimal Local alignment
Scoring scheme
The E-value measures the statistical confidence of the alignment
Query length
Location: A Gumbel distribution parameter
Rate: A Gumbel distribution parameter
Altschul et al. Journal of Molecular Biology (1990); Altschul et al. Nucleic Acids Research (1997)
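A minimal sketch of how an E-value of this kind can be computed, assuming the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length, n the database length, and lambda the Gumbel rate. The Gumbel location is then mu = ln(K * m * n) / lambda. The function names and the numeric values of lambda, K, m, and n are illustrative assumptions, not taken from the slides.

```python
import math

def karlin_altschul_evalue(score, query_len, db_len, lam, K):
    """Expected number of chance alignments scoring at least `score`."""
    return K * query_len * db_len * math.exp(-lam * score)

def gumbel_location(query_len, db_len, lam, K):
    """Location of the Gumbel distribution implied by the same parameters."""
    return math.log(K * query_len * db_len) / lam

def evalue_to_pvalue(evalue):
    """P(at least one chance hit), assuming a Poisson number of chance hits."""
    return -math.expm1(-evalue)  # 1 - exp(-E), accurate for tiny E

# Illustrative values only (assumed, not from the talk).
lam, K = 0.267, 0.041
m, n = 300, 5_000_000

e = karlin_altschul_evalue(100, m, n, lam, K)
p = evalue_to_pvalue(e)
print(f"E-value = {e:.3g}, p-value = {p:.3g}")
```

Written this way, the E-value depends on the score only through lambda * (S - mu), which is why errors in the assumed location or rate translate directly into over- or underestimated significance.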
The BLAST E-value may overestimate the significance!
Settings:
Murzin et al. Journal of Molecular Biology (1995)
Significance overestimation (too liberal)
Expectation:
Can we simply select a stricter E-value cutoff threshold?
The BLAST E-value may also underestimate the significance
Settings:
Expectation:
Significance underestimation (too conservative)
Bairoch et al. Nucleic Acids Research (2000)
We cannot predict whether BLAST is overestimating or underestimating!
The problem is universal across various implementations, extensions, and settings
AB-BLAST
Blastn
PSI-BLAST
Diamond
MMseqs2
FASTA
Troubleshooting: The Gumbel distribution still applies to BLAST
Significance overestimation
A Gumbel distribution is fitted to 1M alignment scores to compute p-values
Gumbel fitting
Significance underestimation
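The fitting step on this slide can be sketched as follows. This is a rough illustration, assuming a method-of-moments fit (the slides do not say which estimator is used) and using simulated scores in place of real alignment scores. The sample size is reduced from 1M to 100K to keep the example fast, and all names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel_moments(scores):
    """Method-of-moments estimates of Gumbel location and scale."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    scale = sd * math.sqrt(6) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale

def gumbel_pvalue(score, loc, scale):
    """Upper-tail probability P(S >= score) under the fitted Gumbel."""
    z = (score - loc) / scale
    return -math.expm1(-math.exp(-z))  # 1 - exp(-exp(-z))

def sample_gumbel(rng, loc, scale):
    """Draw one Gumbel variate by inverting the CDF."""
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return loc - scale * math.log(-math.log(u))

# Simulated stand-in for null alignment scores (assumed parameters).
rng = random.Random(0)
scores = [sample_gumbel(rng, 40.0, 5.0) for _ in range(100_000)]

loc, scale = fit_gumbel_moments(scores)
print(f"fitted location = {loc:.2f}, scale = {scale:.2f}")
print(f"p-value of score 80 = {gumbel_pvalue(80.0, loc, scale):.3g}")
```

The scale here is the reciprocal of the rate parameter, so a fitted scale of 5 corresponds to a rate of 0.2.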
Dilemma: statistical accuracy vs. computational feasibility
Computational feasibility
Statistical Accuracy
General applicability
The Studentized Gumbel (SG) p-value is inspired by Student's t-test
Standardization
Hypothesis testing
Studentization
Hypothesis testing
Sample generation
Studentization applies to any family of distributions characterized by location and scale parameters, including Gumbel!
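The reason studentization works for any location-scale family can be seen in a few lines: subtracting the sample mean and dividing by the sample standard deviation cancels any shift and positive rescaling of the scores. The sketch below uses made-up numbers and a hypothetical helper name.

```python
import statistics

def studentize(value, sample):
    """Center by the sample mean and scale by the sample standard deviation."""
    return (value - statistics.fmean(sample)) / statistics.stdev(sample)

# Made-up null scores and observed score, for illustration only.
sample = [31.0, 28.5, 35.2, 30.1, 33.7]
observed = 52.0

t_raw = studentize(observed, sample)

# Apply an arbitrary change of location and (positive) scale to every score.
a, b = 2.5, -7.0
t_rescaled = studentize(a * observed + b, [a * x + b for x in sample])

print(t_raw, t_rescaled)  # the two values agree up to rounding
```

Because the statistic is unchanged, its null distribution does not depend on the unknown location and scale of the underlying Gumbel.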
The workflow of SG p-value
Query protein
Shuffled query proteins
Database
…
…
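The shuffling step can be sketched as below, assuming each shuffled query is a uniform random permutation of the query's residues, which preserves length and amino acid composition. The slides do not specify the shuffling scheme, and the query sequence and function name are hypothetical.

```python
import random

def shuffled_queries(query, n_shuffles, seed=0):
    """Return n_shuffles random permutations of the query's residues."""
    rng = random.Random(seed)
    residues = list(query)
    shuffles = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # in-place uniform permutation
        shuffles.append("".join(residues))
    return shuffles

# Hypothetical query sequence, for illustration only.
query = "MKTAYIAKQRQISFVKSHFSRQ"
decoys = shuffled_queries(query, 5)
for decoy in decoys:
    print(decoy)
```

Each shuffled query would then be searched against the same database to collect null alignment scores.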
The workflow of SG p-value (Cont.)
…
Studentized Gumbel distribution
Independent of the scoring scheme after studentization
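One way to turn the studentized score into a p-value is sketched below. It assumes the reference distribution is built by Monte Carlo: a standard Gumbel draw is studentized against n other standard Gumbel draws, many times over. Whether the talk's method tabulates the distribution this way is an assumption, and every name and number here is hypothetical.

```python
import bisect
import math
import random
import statistics

def studentize(value, sample):
    return (value - statistics.fmean(sample)) / statistics.stdev(sample)

def draw_standard_gumbel(rng):
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return -math.log(-math.log(u))

def sg_reference(n_null, n_draws, seed=0):
    """Sorted Monte Carlo sample of the studentized Gumbel statistic."""
    rng = random.Random(seed)
    ref = []
    for _ in range(n_draws):
        null = [draw_standard_gumbel(rng) for _ in range(n_null)]
        ref.append(studentize(draw_standard_gumbel(rng), null))
    return sorted(ref)

def sg_pvalue(observed_score, shuffled_scores, reference):
    """Fraction of reference statistics at least as large as the observed one."""
    t = studentize(observed_score, shuffled_scores)
    exceed = len(reference) - bisect.bisect_left(reference, t)
    return (exceed + 1) / (len(reference) + 1)  # never returns exactly 0

# Simulated stand-in for scores from 20 shuffled queries (assumed values).
rng = random.Random(1)
shuffled_scores = [40.0 + 5.0 * draw_standard_gumbel(rng) for _ in range(20)]
reference = sg_reference(n_null=len(shuffled_scores), n_draws=5000)

print(sg_pvalue(60.0, shuffled_scores, reference))
```

The reference sample depends only on the number of shuffled queries, so it can be computed once and reused across queries and scoring schemes. Its resolution is limited to 1 / (n_draws + 1).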
The SG p-values seem to be valid
BLAST E-value
Significance overestimation
Gumbel p-value
SG p-value
The SG p-values seem to be valid (Cont.)
BLAST E-value
Significance underestimation
Gumbel p-value
SG p-value
The SG p-values identify more homologous sequences
Data:
Brenner et al. Nucleic Acids Research (2000)
Task:
Metric:
113522
116094
146834
N/A
Future work
Accelerate the SG p-value computation without searching with shuffled queries
Extend the SG p-value to different settings of BLAST
Extend the SG p-value to Position-Specific Iterated BLAST (PSI-BLAST)
Conclusion
We report problems with BLAST's statistical significance estimation, the first such report in the past 30 years
Key idea:
Impact:
One more thing:
2024/7/16 Tuesday 8:40-9:00 MLCSB
Questions?