CS481/CS583: Bioinformatics Algorithms
Can Alkan
EA509
calkan@cs.bilkent.edu.tr
http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/
MULTIPLE SEQUENCE ALIGNMENT
Multiple Alignment versus Pairwise Alignment
Generalizing the Notion of Pairwise Alignment
2-row matrix
A T _ G C G _
A _ C G T _ A
A T C A C _ A
Alignments = Paths in…
| A | A | T | -- | C |
| A | -- | T | G | C |
| -- | A | T | G | C |
Alignment Paths
0 | 1 | 1 | 2 | 3 | 4 |
| A | A | T | -- | C |
| A | -- | T | G | C |
| -- | A | T | G | C |
x coordinate
Alignment Paths
ATGC, AATC,ATGC
0 | 1 | 1 | 2 | 3 | 4 |
0 | 1 | 2 | 3 | 3 | 4 |
| A | A | T | -- | C |
| A | -- | T | G | C |
| -- | A | T | G | C |
x coordinate
y coordinate
Alignment Paths
0 | 1 | 1 | 2 | 3 | 4 |
0 | 1 | 2 | 3 | 3 | 4 |
| A | A | T | -- | C |
| A | -- | T | G | C |
0 | 0 | 1 | 2 | 3 | 4 |
| -- | A | T | G | C |
(0,0,0)→(1,1,0)→(1,2,1) →(2,3,2) →(3,3,3) →(4,4,4)
x coordinate
y coordinate
z coordinate
Aligning Three Sequences
source
sink
2-D vs 3-D Alignment Grid
V
W
2-D edit graph
3-D edit graph
Architecture of 3-D Alignment Cell
(i-1,j-1,k-1)
(i,j-1,k-1)
(i,j-1,k)
(i-1,j-1,k)
(i-1,j,k)
(i,j,k)
(i-1,j,k-1)
(i,j,k-1)
Multiple Alignment: Dynamic Programming
si-1,j-1,k-1 + δ(vi, wj, uk)
si-1,j-1,k + δ (vi, wj, _ )
si-1,j,k-1 + δ (vi, _, uk)
si,j-1,k-1 + δ (_, wj, uk)
si-1,j,k + δ (vi, _ , _)
si,j-1,k + δ (_, wj, _)
si,j,k-1 + δ (_, _, uk)
cube diagonal: no indels
face diagonal: one indel
edge diagonal: two indels
Multiple Alignment: Running Time
Multiple Alignment Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG
y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG
Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG
y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG
can we construct a multiple alignment that induces
them?
Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG
y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG
can we construct a multiple alignment that induces
them?
NOT ALWAYS
Pairwise alignments may be inconsistent
Inferring Multiple Alignment from Pairwise Alignments
Combining Optimal Pairwise Alignments into Multiple Alignment
Can combine pairwise alignments into multiple alignment
Can not combine pairwise alignments into multiple alignment
Profile Representation of Multiple Alignment
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 1 1 .8
C .6 1 .4 1 .6 .2
G 1 .2 .2 .4 1
T .2 1 .6 .2
- .2 .8 .4 .8 .4
Profile Representation of Multiple Alignment
In the past we were aligning a sequence against a sequence
Can we align a sequence against a profile?
Can we align a profile against a profile?
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 1 1 .8
C .6 1 .4 1 .6 .2
G 1 .2 .2 .4 1
T .2 1 .6 .2
- .2 .8 .4 .8 .4
Aligning alignments
x GGGCACTGCAT
y GGTTACGTC-- Alignment 1
z GGGAACTGCAG
w GGACGTACC-- Alignment 2
v GGACCT-----
Aligning alignments
x GGGCACTGCAT
y GGTTACGTC-- Combined Alignment
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
Sequence to profile alignment
Col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
C .6 0 0 0 1 0 0 .4 1 0 0 .6 .2 0 0
p(C, 1) = 0.6, p(C,2) = 0, p(C, 3) = 0, …. p(C, 9) = 1 …
Sequence S1 to Profile C
Sequence to profile alignment
O(|∑|nm). |∑|: alphabet size, n: sequence length, m: profile length
Sequence to profile
Profile | |||||||||||||||
| 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -10 | -11 | -12 | -13 | -14 |
C | -1 | 0.8 | -0.2 | -1.2 | -2.2 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -10 | -11 |
A | -2 | -0.2 | 2.8 | 1.8 | 0.8 | -0.2 | -1.2 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 |
G | -3 | -1.2 | 1.8 | 4.8 | 3.8 | 2.8 | 1.8 | 0.8 | -0.2 | -1.2 | -2.2 | -3.2 | -4.2 | -5.2 | -5 |
G | -4 | -2.2 | 0.8 | 3.8 | 4.4 | 3.4 | 2.4 | 1.4 | 0.4 | -0.6 | -1.6 | -2.6 | -3.6 | -4 | -3.2 |
T | -5 | -3.2 | -0.2 | 2.8 | 3.4 | 3.4 | 5.4 | 4.4 | 3.4 | 2.4 | 1.4 | 0.4 | -0.6 | -1.6 | -2.6 |
A | -6 | -4.2 | -1.2 | 1.8 | 2.4 | 2.4 | 4.4 | 7.4 | 6.4 | 5.4 | 4.4 | 3.4 | 2.4 | 1.4 | 0.4 |
C | -7 | -5.2 | -2.2 | 0.8 | 1.4 | 4.4 | 3.4 | 6.4 | 7.6 | 8.4 | 7.4 | 6.4 | 5.4 | 4.4 | 3.4 |
C | -8 | -6.2 | -3.2 | -0.2 | 0.4 | 3.4 | 3.4 | 5.4 | 6.6 | 9.6 | 8.6 | 8.2 | 7.2 | 6.2 | 5.2 |
A | -9 | -7.2 | -4.2 | -1.2 | -0.6 | 2.4 | 2.4 | 5.4 | 5.6 | 8.6 | 11 | 10 | 9 | 8 | 7 |
C | -10 | -8.2 | -5.2 | -2.2 | -1.6 | 1.4 | 1.4 | 4.4 | 5.6 | 7.6 | 10 | 11.8 | 10.8 | 9.8 | 8.8 |
G | -11 | -9.2 | -6.2 | -3.2 | -2.6 | 0.4 | 0.4 | 3.4 | 4.6 | 6.6 | 9 | 10.8 | 10.8 | 11 | 11.8 |
G | -12 | -10.2 | -7.2 | -4.2 | -3.6 | -0.6 | -0.6 | 2.4 | 3.6 | 5.6 | 8 | 9.8 | 9.8 | 11 | 13 |
C A G G - T A C C A C – G G
Match = 2
Mismatch = -1
Gap = -1
Sequence to profile alignment
New Profile C:
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
C A G G - T A C C A C – G G
A 1 1 .83
C .67 .83 .5 1 .67 .17
G 1 .33 .17 .50 1
T .16 1 .5 .17
- .17 .67 .17 .33 .83 .33
Multiple Alignment: Greedy Approach
u1= ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
u1= ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
k
k-1
Greedy Approach: Example
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC
Greedy Approach: Example (cont’d)
s2 GTCTGA
s4 GTCAGC (score = 2)
s1 GAT-TCA
s2 G-TCTGA (score = 1)
s1 GAT-TCA
s3 GATAT-T (score = 1)
s1 GATTCA--
s4 G—T-CAGC(score = 0)
s2 G-TCTGA
s3 GATAT-T (score = -1)
s3 GAT-ATT
s4 G-TCAGC (score = -1)
Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2 GTCTGA
s4 GTCAGC
s2,4 GTCt/aGa/cA (profile)
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
new set of 3 sequences:
Progressive Alignment
ClustalW
1.) Construct pairwise alignments
2.) Build Guide Tree
3.) Progressive Alignment guided by the tree
Step 1: Pairwise Alignment
v1 v2 v3 v4
v1 -
v2 .17 -
v3 .87 .28 -
v4 .59 .33 .62 -
(.17 means 17 % identical)
Step 2: Guide Tree
Step 2: Guide Tree (cont’d)
v1
v3
v4
v2
Calculate:�v1,3 = alignment (v1, v3)�v1,3,4 = alignment((v1,3),v4)�v1,2,3,4 = alignment((v1,3,4),v2)
v1 v2 v3 v4
v1 -
v2 .17 -
v3 .87 .28 -
v4 .59 .33 .62 -
Step 3: Progressive Alignment
FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
. . : ** . :.. *:.* * . * **:
Dots and stars show how well-conserved a column is.
SCORING ALIGNMENTS
Multiple Alignments: Scoring
Multiple LCS Score
AAA
AAA
AAT
ATC
Entropy
AAA
AAA
AAT
ATC
Entropy: Example
Best case
Worst case
Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the sum of entropies of its columns:
Σ over all columns Σ X=A,T,G,C pX logpX�
Entropy of an Alignment: Example
column entropy
-( pAlogpA + pClogpC + pGlogpG + pTlogpT)
c1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0]
= 0
c2 = -[(¼)*log(¼) + (¾)*log(¾) + 0*log0 + 0*log0]
= -[(¼)*(-2) + (¾)*(-0.415)]
= +0.811
c3 = -[(¼)*log(¼) + (¼)*log(¼) + (¼)*log(¼) + (¼)*log(¼)]
= 4*-[(¼)*(-2)]
= +2.0
Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
c1 | c2 | c3 |
A | A | A |
A | C | C |
A | C | G |
A | C | T |
Multiple Alignment Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG
y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG
Not necessarily optimal
Sum of Pairs Score(SP-Score)
ai and aj
imposed by a multiple alignment of k sequences
s*(ai, aj)
s(a1,…,ak) = Σi,j s*(ai, aj)
Computing SP-Score
Aligning 4 sequences: 6 pairwise alignments
Given a1,a2,a3,a4:
s(a1…a4) = Σ s*(ai,aj) = s*(a1,a2) +
s*(a1,a3) +
s*(a1,a4) +
s*(a2,a3) +
s*(a2,a4) +
s*(a3,a4)
SP-Score: Example
a1
.
ak
ATG-C-AAT
A-G-CATAT
ATCCCATTT
Pairs of Sequences
A
A
A
1
1
1
G
C
G
1
−μ
−μ
Score=3
Score = 1 – 2μ
Column 1
Column 3
s
s*(
To calculate each column:
Back to guide trees for MSA
Star alignments
Star alignment Algorithm
Star alignment example
Sc AA--CCTT
S1 AATGCC--
Sc A-ACC-TT
S2 AGACCGT-
Sc A-A--CC-TT
S1 A-ATGCC---
S2 AGA--CCGT-
Multiple Alignment: History 📜
1975 Sankoff
Formulated multiple alignment problem and gave dynamic programming solution
1988 Carrillo-Lipman
Branch and Bound approach for MSA
1990 Feng-Doolittle
Progressive alignment
1994 Thompson-Higgins-Gibson-ClustalW
Most popular multiple alignment program
1998 Morgenstern et al.-DIALIGN
Segment-based multiple alignment
2000 Notredame-Higgins-Heringa-T-coffee
Using the library of pairwise alignments
2004 MUSCLE
PARTIAL ORDER ALIGNMENT
Basics
Assume 2 sequences:
seq1 : CCGCTTTTCCGC
seq2 : CCGCAAAACCGC
Alignments:
CCGC----TTTTCGCG CCGCTTTT----CCGC CCGC-TT-TT--CGCG CCGC-T-T-T-TCCGC
CCGCAAAA----CGCG CCGC----AAAACCGC CCGCA--A--AACCGC CCGCA-A-A-A-CCGC
DAG representation
Alignments:
CCGC----TTTTCGCG CCGCTTTT----CCGC CCGC-TT-TT--CGCG CCGC-T-T-T-TCCGC
CCGCAAAA----CGCG CCGC----AAAACCGC CCGCA--A--AACCGC CCGCA-A-A-A-CCGC
“Regular” alignment
String to Graph Alignment
String to Graph Alignment
String to Graph Alignment
Problem: order of dependencies in the graph
Solution: topological sort
Steps:
Pairwise alignment example
CGATTACG
||.|||.
CGCTTAT-
POA implementations