Repeated KBand DP for Pairwise Alignments

with Visualization

Repeated K-Band DP for Pairwise Comparison



Table of Contents

  1. Kband Algorithm

  2. Repeated KBand Helper Functions

  3. Scoring System

  4. Alignments

  5. Results and Visualization

  6. Comparisons with other tools

  7. Future Work

  8. Conclusion

  9. References

1. KBand Algorithm

If we are comparing two similar sequences, the best alignments are located near the main diagonal. It is not necessary to calculate the entire matrix to obtain the optimal score and alignments.

In this project, KBand is used to specify the narrow band along the main diagonal. Thus, if the two sequences are similar, KBand can be narrow and it's in the best case. In worst case, the KBand is the same with traditional pairwise alignment.

Repeated KBand
Repeated KBand is essentially using the same KBand algorithm. Instead of having the static KBand, it dynamically adjusts Kband by computing best possible score at each cell. The best possible score can be calculated by the helper function: bestPossibleScore(), and the cells within KBand can be identified with the helper function: InsideStrip().


2. Repeated KBand Helper Functions

2.a InsideStrip
To check if the cell is withing a Kband or not. It is useful because we don't compute cells outside of the strip.
int insideStrip(int i, int j, int k)

{

if (-k <= i - j && i - j <= k)

return 1;

else

return 0;

}


2.b BestPossibleScore

To know the possible score at the arbitrary cell.

If the best possible score is greater or equal to the main diagonal score, by convention, double the K.

The best possible score does not depend on the neighbor cells.
n: distance from k band

M: Matching Score

g: Gap

double bestPossibleScore(int n, int k, int M, float g)

{

//M(n - k - 1) + 2(k+1)g where M > 0 and g <= 0 (Taken from Setubal's book)

double score = (n - k - 1.0) * M + (2.0 * k + 2.0) * g;

return score;

}

* Assuming M is positive and g is negative value

3.a Scoring System
Case1: a( i – 1, j – 1 ) + p( S[i], T[j] )

Case2: a( i – 1, j ) + p( S[i], - )

Case1: a( i, j – 1) + p( -, T[j] )

p(S[i], T[j]) returns matching score if S[i] == S[j]

p(S[i], T[j]) returns mismatching score if S[i] != S[j]

a(i, j) = max(Case1, Case2, Case3)

* Each cells takes the maximum of three cases.

3.b Penalties

Match and Mis-match:

User-specifiable fixed value is implemented in this program. User specify the match and mis-match penalties. More realistic scoring system should consider the distribution in real life such as BLOSOM.


Gap:

Gap starting and Gap extension are both implemented.




4. Alignments
4.a Global Alignments

Global Alignments are shown with the cells with blue background. I removed global alignment paths whose score are lower than other possible paths (up, left, diagonal). So, the paths whose score are largest are considered, but lower paths are discarded. In the screen output, the cells with blue background is the best path of global alignments.

4.b Local Alignments

Local Alignments are shown with the cells with yellow triangle at left bottom corner. By showing this way, we can see Local Alignments and Global alignments simultaneously.


4.c Semiglobal Alignment

Semiglobal Alignments are the same with Global Alignments with GAP penalty = 0. Other adjustments will be possible by changing gap extending penalty.


5. Results and Visualization in OpenGL
5.a Result of No KBand

The best global alignment path is shown in blue background. The yellow triangles at left bottom of cell means a part of local alignment.




5.b KBand with K = 1, then K adjusted to 4

The cells in black is not computed because it's outside of KBand.

This Kband = 4 is enough to obtain the same result 1 with no KBand.


5.c Large sequence with KBand
Large Sequences of 197 * 197 with KBand

5.d Close Look at the KBand

T
hese sequences are taken from Setubal's book.

Initial KBand as k = 1, then adjusted k = 4 during the computation.

Cells above diagonal are calculated first and cells below diagonal are calculated next.

In the first row, k = 1 and best possible score is lower than the main diagonal cell (-1), so the Kband stays the same, but in the second row, k is doubled because the possible score is greater or equal to the diagonal cell (-1). Best Possible Score at [1][10] = 0, Best Possible Score at [1][15] = 0


6. Comparison


Other tools are using BLOSOM for match and mismatch penalty. Resulting alignments are checked with this program, but it may have alternative results because of the variable match and mismatch are being used in other programs.



7. Future work


  1. Most of available tools are using PAM or BLOSUM for match and mismatch penalties. So, it will be reasonable to implement the matrix to obtain more realistic results. But in this project, the simple scoring system of match, and mismatch score are used for the simplicity.

  2. To improve Visualization may be possible such as in 3-D. More flexible optimization will be possible by utilizing vector and volumetric computations. It can also be more interactive with user's input in the tool.

  3. To use clusters or GPUGPU porting may be challenging but it will improve computational time for large sequences. But it will be beyond this course to learn device-specific programming as in Nvidia's CUDA.

  4. To implement another optimization techniques and combines them together.



8. Conclusion


The KBand algorithm used in this project will reduce computation cost, and produce correct results. Many other optimization are possible for pairwise alignments.



9. References


[1] Benjamin N Jackson, Handbook of Computational Molecular Biology

http://www.bgjackson.net/bcb597/supplement/alignment.pdf

[2] Daniel Huson, Algorithmic Bioinformatics

http://lectures.molgen.mpg.de/Algorithmische_Bioinformatik_WS0607/reinert1.pdf


[3] Prof. Khuri, Sequence Alignment, SJSU CS223 course slides

http://cs.sjsu.edu/~khuri/CS223/CS223_Seq_Alignment_2perpage.pdf

[4] Setubal, J. and Medidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.

Tools
Pairwise Align Protein

http://www.ebioinfogen.com/biotools/pairwise-align-protein.htm