Repeated KBand DP for Pairwise Alignments
with
Visualization
Repeated
K-Band DP for Pairwise Comparison
Table of Contents
Kband Algorithm
Repeated KBand Helper Functions
Scoring System
Alignments
Results and Visualization
Comparisons with other tools
Future Work
Conclusion
References
1. KBand Algorithm
If we are comparing two similar sequences, the best alignments are located near the main diagonal. It is not necessary to calculate the entire matrix to obtain the optimal score and alignments.
In this project, KBand is used to specify the narrow
band along the main diagonal. Thus, if the two sequences are similar,
KBand can be narrow and it's in the best case. In worst case, the
KBand is the same with traditional pairwise alignment.
Repeated KBand
Repeated
KBand is essentially using the same KBand algorithm. Instead of
having the static KBand, it dynamically adjusts Kband by computing
best possible score at each cell. The best possible score can be
calculated by the helper function: bestPossibleScore(),
and the cells within KBand can be identified with the helper
function: InsideStrip().
2. Repeated KBand Helper Functions
2.a InsideStrip
To check
if the cell is withing a Kband or not. It is useful because we don't
compute cells outside of the strip.
int
insideStrip(int
i, int j,
int k)
{
if (-k <= i - j && i - j <= k)
return 1;
else
return 0;
}
2.b BestPossibleScore
To know the possible score at the arbitrary cell.
If the best possible score is greater or equal to the main diagonal score, by convention, double the K.
The best possible score does not depend on the
neighbor cells.
n: distance from k band
M: Matching Score
g: Gap
double bestPossibleScore(int n, int k, int M, float g)
{
//M(n - k - 1) + 2(k+1)g where M > 0 and g <= 0 (Taken from Setubal's book)
double score = (n - k - 1.0) * M + (2.0 * k + 2.0) * g;
return score;
}
* Assuming M is positive and g is negative value
3.a
Scoring System
Case1:
a( i – 1, j – 1 ) + p( S[i], T[j] )
Case2: a( i – 1, j ) + p( S[i], - )
Case1: a( i, j – 1) + p( -, T[j] )
p(S[i], T[j]) returns matching score if S[i] == S[j]
p(S[i], T[j]) returns mismatching score if S[i] != S[j]
a(i, j) = max(Case1, Case2, Case3)
* Each cells takes the maximum of three cases.
3.b Penalties
Match and Mis-match:
User-specifiable fixed value is implemented in this program. User specify the match and mis-match penalties. More realistic scoring system should consider the distribution in real life such as BLOSOM.
Gap:
Gap starting and Gap extension are both implemented.
4. Alignments
4.a
Global Alignments
Global Alignments are shown
with the cells with blue background. I removed global alignment paths
whose score are lower than other possible paths (up, left, diagonal).
So, the paths whose score are largest are considered, but lower paths
are discarded. In the screen output, the cells with blue background
is the best path of global alignments.
4.b Local Alignments
Local Alignments are shown with the cells with yellow triangle at left bottom corner. By showing this way, we can see Local Alignments and Global alignments simultaneously.
4.c Semiglobal Alignment
Semiglobal Alignments are the same with Global Alignments with GAP penalty = 0. Other adjustments will be possible by changing gap extending penalty.
5. Results and Visualization
in OpenGL
5.a
Result of No KBand
The best global alignment path is shown in blue background. The yellow triangles at left bottom of cell means a part of local alignment.
5.b KBand with K = 1, then K adjusted to 4
The cells in black is not computed because it's outside of KBand.
This Kband = 4 is
enough to obtain the same result 1 with no KBand.
5.c Large sequence with
KBand
Large Sequences of 197 * 197
with KBand
5.d Close Look at the KBand
T
hese
sequences are taken from Setubal's book.
Initial KBand as k = 1, then adjusted k = 4 during the computation.
Cells above diagonal are calculated first and cells below diagonal are calculated next.
In the first row, k = 1 and best possible score is lower than the main diagonal cell (-1), so the Kband stays the same, but in the second row, k is doubled because the possible score is greater or equal to the diagonal cell (-1). Best Possible Score at [1][10] = 0, Best Possible Score at [1][15] = 0
6.
Comparison
Other tools are using BLOSOM for match and mismatch penalty. Resulting alignments are checked with this program, but it may have alternative results because of the variable match and mismatch are being used in other programs.
7. Future work
Most of available tools are using PAM or BLOSUM
for match and mismatch penalties. So, it will be reasonable to
implement the matrix to obtain more realistic results. But in this
project, the simple scoring system of match, and mismatch score are
used for the simplicity.
To improve Visualization
may be possible such as in 3-D. More flexible optimization will be
possible by utilizing vector and volumetric computations. It can
also be more interactive with user's input in the tool.
To use clusters or GPUGPU porting may be
challenging but it will improve computational time for large
sequences. But it will be beyond this course to learn
device-specific programming as in Nvidia's CUDA.
To implement another optimization techniques and combines them together.
8. Conclusion
The KBand algorithm used in this project will reduce computation cost, and produce correct results. Many other optimization are possible for pairwise alignments.
9. References
[1] Benjamin N Jackson, Handbook of Computational Molecular Biology
http://www.bgjackson.net/bcb597/supplement/alignment.pdf
[2] Daniel Huson, Algorithmic Bioinformatics
http://lectures.molgen.mpg.de/Algorithmische_Bioinformatik_WS0607/reinert1.pdf
[3] Prof. Khuri, Sequence Alignment, SJSU CS223 course slides
http://cs.sjsu.edu/~khuri/CS223/CS223_Seq_Alignment_2perpage.pdf
[4] Setubal, J. and
Medidanis J. Introduction to Computational
Molecular Biology. PWS
Publishing Company, 1997.
Tools
Pairwise
Align Protein
http://www.ebioinfogen.com/biotools/pairwise-align-protein.htm