JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.

1 of 61

CS481/CS583: Bioinformatics Algorithms

Can Alkan

EA509

calkan@cs.bilkent.edu.tr

http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/

2 of 61

MULTIPLE SEQUENCE ALIGNMENT

3 of 61

Multiple Alignment versus Pairwise Alignment

Up until now we have only tried to align two sequences.
What about more than two?
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

4 of 61

Generalizing the Notion of Pairwise Alignment

Alignment of 2 sequences is represented as a

2-row matrix

In a similar way, we represent alignment of 3 sequences as a 3-row matrix

A T _ G C G _

A _ C G T _ A

A T C A C _ A

Score: more conserved columns, better alignment

5 of 61

Alignments = Paths in_…

Align 3 sequences: ATGC, AATC,ATGC

	A	A	T	--	C

	A	--	T	G	C

	--	A	T	G	C

6 of 61

Alignment Paths

0	1	1	2	3	4

	A	A	T	--	C

	A	--	T	G	C

	--	A	T	G	C

x coordinate

7 of 61

Alignment Paths

Align the following 3 sequences:

ATGC, AATC,ATGC

0	1	1	2	3	4

0	1	2	3	3	4

	A	A	T	--	C

	A	--	T	G	C

	--	A	T	G	C

x coordinate

y coordinate

8 of 61

Alignment Paths

0	1	1	2	3	4

0	1	2	3	3	4

	A	A	T	--	C

	A	--	T	G	C

0	0	1	2	3	4

	--	A	T	G	C

Resulting path in (x,y,z) space:

(0,0,0)→(1,1,0)→(1,2,1) →(2,3,2) →(3,3,3) →(4,4,4)

x coordinate

y coordinate

z coordinate

9 of 61

Aligning Three Sequences

Same strategy as aligning two sequences
Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
For global alignments, go from source to sink

source

sink

10 of 61

2-D vs 3-D Alignment Grid

2-D edit graph

3-D edit graph

11 of 61

Architecture of 3-D Alignment Cell

(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k)

(i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)

12 of 61

Multiple Alignment: Dynamic Programming

s_i,j,k = max��

δ(x, y, z) is an entry in the 3-D scoring matrix

s_i-1,j-1,k-1 + δ(v_i, w_j, u_k)

s_i-1,j-1,k + δ (v_i, w_j, _ )

s_i-1,j,k-1 + δ (v_i, _, u_k)

s_i,j-1,k-1 + δ (_, w_j, u_k)

s_i-1,j,k + δ (v_i, _ , _)

s_i,j-1,k + δ (_, w_j, _)

s_i,j,k-1 + δ (_, _, u_k)

cube diagonal: no indels

face diagonal: one indel

edge diagonal: two indels

13 of 61

Multiple Alignment: Running Time

For 3 sequences of length n, the run time is 7n³; O(n³)

For k sequences, build a k-dimensional Manhattan, with run time (2^k-1)(n^k); O(2^kn^k)

Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

14 of 61

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C

y: AC-GC-GAG

z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG

y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

15 of 61

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG

y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that induces

them?

16 of 61

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG

y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that induces

them?

NOT ALWAYS

Pairwise alignments may be inconsistent

17 of 61

Inferring Multiple Alignment from Pairwise Alignments

From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal
It is difficult to infer a “good” multiple alignment from optimal pairwise alignments between all sequences

18 of 61

Combining Optimal Pairwise Alignments into Multiple Alignment

Can combine pairwise alignments into multiple alignment

Can not combine pairwise alignments into multiple alignment

19 of 61

Profile Representation of Multiple Alignment

- A G G C T A T C A C C T G

T A G – C T A C C A - - - G

C A G – C T A C C A - - - G

C A G – C T A T C A C – G G

C A G – C T A T C G C – G G

A 1 1 .8

C .6 1 .4 1 .6 .2

G 1 .2 .2 .4 1

T .2 1 .6 .2

- .2 .8 .4 .8 .4

20 of 61

Profile Representation of Multiple Alignment

In the past we were aligning a sequence against a sequence

Can we align a sequence against a profile?

Can we align a profile against a profile?

- A G G C T A T C A C C T G

T A G – C T A C C A - - - G

C A G – C T A C C A - - - G

C A G – C T A T C A C – G G

C A G – C T A T C G C – G G

A 1 1 .8

C .6 1 .4 1 .6 .2

G 1 .2 .2 .4 1

T .2 1 .6 .2

- .2 .8 .4 .8 .4

21 of 61

Aligning alignments

Given two alignments, can we align them?

x GGGCACTGCAT

y GGTTACGTC-- Alignment 1

z GGGAACTGCAG

w GGACGTACC-- Alignment 2

v GGACCT-----

22 of 61

Aligning alignments

Given two alignments, can we align them?
Hint: use alignment of corresponding profiles

x GGGCACTGCAT

y GGTTACGTC-- Combined Alignment

z GGGAACTGCAG

w GGACGTACC--

v GGACCT-----

23 of 61

Sequence to profile alignment

Col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

C .6 0 0 0 1 0 0 .4 1 0 0 .6 .2 0 0

p(C, 1) = 0.6, p(C,2) = 0, p(C, 3) = 0, …. p(C, 9) = 1 …

Sequence S₁ to Profile C

24 of 61

Sequence to profile alignment

O(|∑|nm). |∑|: alphabet size, n: sequence length, m: profile length

25 of 61

Sequence to profile

Profile
	0	-1	-2	-3	-4	-5	-6	-7	-8	-9	-10	-11	-12	-13	-14
C	-1	0.8	-0.2	-1.2	-2.2	-2	-3	-4	-5	-6	-7	-8	-9	-10	-11
A	-2	-0.2	2.8	1.8	0.8	-0.2	-1.2	-1	-2	-3	-4	-5	-6	-7	-8
G	-3	-1.2	1.8	4.8	3.8	2.8	1.8	0.8	-0.2	-1.2	-2.2	-3.2	-4.2	-5.2	-5
G	-4	-2.2	0.8	3.8	4.4	3.4	2.4	1.4	0.4	-0.6	-1.6	-2.6	-3.6	-4	-3.2
T	-5	-3.2	-0.2	2.8	3.4	3.4	5.4	4.4	3.4	2.4	1.4	0.4	-0.6	-1.6	-2.6
A	-6	-4.2	-1.2	1.8	2.4	2.4	4.4	7.4	6.4	5.4	4.4	3.4	2.4	1.4	0.4
C	-7	-5.2	-2.2	0.8	1.4	4.4	3.4	6.4	7.6	8.4	7.4	6.4	5.4	4.4	3.4
C	-8	-6.2	-3.2	-0.2	0.4	3.4	3.4	5.4	6.6	9.6	8.6	8.2	7.2	6.2	5.2
A	-9	-7.2	-4.2	-1.2	-0.6	2.4	2.4	5.4	5.6	8.6	11	10	9	8	7
C	-10	-8.2	-5.2	-2.2	-1.6	1.4	1.4	4.4	5.6	7.6	10	11.8	10.8	9.8	8.8
G	-11	-9.2	-6.2	-3.2	-2.6	0.4	0.4	3.4	4.6	6.6	9	10.8	10.8	11	11.8
G	-12	-10.2	-7.2	-4.2	-3.6	-0.6	-0.6	2.4	3.6	5.6	8	9.8	9.8	11	13

C A G G - T A C C A C – G G

Match = 2

Mismatch = -1

Gap = -1

26 of 61

Sequence to profile alignment

New Profile C:

- A G G C T A T C A C C T G

T A G – C T A C C A - - - G

C A G – C T A C C A - - - G

C A G – C T A T C A C – G G

C A G – C T A T C G C – G G

C A G G - T A C C A C – G G

A 1 1 .83

C .67 .83 .5 1 .67 .17

G 1 .33 .17 .50 1

T .16 1 .5 .17

- .17 .67 .17 .33 .83 .33

27 of 61

Multiple Alignment: Greedy Approach

Choose most similar pair of strings and combine into a profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat
This is a heuristic greedy method

u₁= ACGTACGTACGT…

u₂ = TTAATTAATTAA…

u₃ = ACTACTACTACT…

…

u_k = CCGGCCGGCCGG

u₁= ACg/tTACg/tTACg/cT…

u₂ = TTAATTAATTAA…

…

u_k = CCGGCCGGCCGG…

k-1

28 of 61

Greedy Approach: Example

Consider these 4 sequences

s1 GATTCA

s2 GTCTGA

s3 GATATT

s4 GTCAGC

29 of 61

Greedy Approach: Example (cont’d)

There are = 6 possible alignments

s2 GTCTGA

s4 GTCAGC (score = 2)

s1 GAT-TCA

s2 G-TCTGA (score = 1)

s1 GAT-TCA

s3 GATAT-T (score = 1)

s1 GATTCA--

s4 G—T-CAGC(score = 0)

s2 G-TCTGA

s3 GATAT-T (score = -1)

s3 GAT-ATT

s4 G-TCAGC (score = -1)

30 of 61

Greedy Approach: Example (cont’d)

s₂ and s₄ are closest; combine:

s2 GTCTGA

s4 GTCAGC

s_2,4 GTCt/aGa/cA (profile)

s₁ GATTCA

s₃ GATATT

s_2,4 GTCt/aGa/c

new set of 3 sequences:

31 of 61

Progressive Alignment

Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.
Progressive alignment works well for close sequences, but deteriorates for distant sequences

Gaps in consensus string are permanent
Use profiles to compare sequences

32 of 61

ClustalW

Popular multiple alignment tool
‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently).
Three-step process

1.) Construct pairwise alignments

2.) Build Guide Tree

3.) Progressive Alignment guided by the tree

33 of 61

Step 1: Pairwise Alignment

Aligns each sequence again each other giving a similarity matrix
Similarity = exact matches / sequence length (percent identity)

v₁ v₂ v₃ v₄

v₁ -

v₂ .17 -

v₃ .87 .28 -

v₄ .59 .33 .62 -

(.17 means 17 % identical)

34 of 61

Step 2: Guide Tree

Create Guide Tree using the similarity matrix

ClustalW uses the neighbor-joining method

Guide tree roughly reflects evolutionary relations

35 of 61

Step 2: Guide Tree (cont’d)

v₁

v₃

v₄

v₂

Calculate:�v_1,3 = alignment (v₁, v₃)�v_1,3,4 = alignment((v_1,3),v₄)�v_1,2,3,4 = alignment((v_1,3,4),v₂)

v₁ v₂ v₃ v₄

v₁ -

v₂ .17 -

v₃ .87 .28 -

v₄ .59 .33 .62 -

36 of 61

Step 3: Progressive Alignment

Start by aligning the two most similar sequences
Following the guide tree, add in the next sequences, aligning to the existing alignment
Insert gaps as necessary

FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD

FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD

FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD

FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ

FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

. . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.

37 of 61

SCORING ALIGNMENTS

38 of 61

Multiple Alignments: Scoring

Number of matches (multiple longest common subsequence score)

Entropy score

Sum of pairs (SP-Score)

39 of 61

Multiple LCS Score

A column is a “match” if all the letters in the column are the same

Only good for very similar sequences

AAA

AAT

ATC

40 of 61

Entropy

Define frequencies for the occurrence of each letter in each column of multiple alignment

p_A = 1, p_T=p_G=p_C=0 (1^st column)
p_A = 0.75, p_T = 0.25, p_G=p_C=0 (2^nd column)
p_A = 0.50, p_T = 0.25, p_C=0.25 p_G=0 (3^rd column)

Compute entropy of each column

AAA

AAT

ATC

41 of 61

Entropy: Example

Best case

Worst case

42 of 61

Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

Σ _{over all columns} Σ _X=A,T,G,C p_Xlogp_X�

43 of 61

Entropy of an Alignment: Example

column entropy

-( p_Alogp_A+ p_Clogp_C+ p_Glogp_G+ p_Tlogp_T)

c1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0]

= 0

c2 = -[(¼)*log(¼) + (¾)*log(¾) + 0*log0 + 0*log0]

= -[(¼)*(-2) + (¾)*(-0.415)]

= +0.811

c3 = -[(¼)*log(¼) + (¼)*log(¼) + (¼)*log(¼) + (¼)*log(¼)]

= 4*-[(¼)*(-2)]

= +2.0

Alignment Entropy = 0 + 0.811 + 2.0 = +2.811

c1	c2	c3
A	A	A
A	C	C
A	C	G
A	C	T

44 of 61

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C

y: AC-GC-GAG

z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG

y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Not necessarily optimal

45 of 61

Sum of Pairs Score(SP-Score)

Consider pairwise alignment of sequences

a_i and a_j

imposed by a multiple alignment of k sequences

Denote the score of this suboptimal (not necessarily optimal) pairwise alignment as

s*(a_i, a_j)

Sum up the pairwise scores for a multiple alignment:

s(a₁,…,a_k) = Σ_i,j s*(a_i, a_j)

46 of 61

Computing SP-Score

Aligning 4 sequences: 6 pairwise alignments

Given a₁,a₂,a₃,a₄:

s(a₁…a₄) = Σ s*(a_i,a_j) = s*(a₁,a₂) +

s*(a₁,a₃) +

s*(a₁,a₄) +

s*(a₂,a₃) +

s*(a₂,a₄) +

s*(a₃,a₄)

47 of 61

SP-Score: Example

a₁

a_k

ATG-C-AAT

A-G-CATAT

ATCCCATTT

Pairs of Sequences

−μ

Score=3

Score = 1 – 2μ

Column 1

Column 3

s^*(

To calculate each column:

48 of 61

Back to guide trees for MSA

Guide tree construction

UPGMA
Neighbor Joining
….

Easy MSA: Center Star

49 of 61

Star alignments

Construct multiple alignments using pair-wise alignment relative to a fixed sequence
Out of a set S = {S₁, S₂, . . . , S_r} of sequences, pick sequence S_c that maximizes��star_score(c) = ∑ {sim(S_c, S_i) : 1 ≤ i ≤ r, i ≠ c}��where sim(S_i, S_j) is the optimal score of a pair-wise alignment between S_i and S_j

50 of 61

Star alignment Algorithm

Compute sim(S_i, S_j) for every pair (i,j)
Compute star_score(i) for every i
Choose the index c that minimizes star_score(c) and make it the center of the star
Produce a multiple alignment M such that, for every i, the induced pairwise alignment of S_c and S_i is the same as the optimum alignment of S_c and S_i.

51 of 61

Star alignment example

S_c AA--CCTT

S₁ AATGCC--

S_c A-ACC-TT

S₂ AGACCGT-

S_c A-A--CC-TT

S₁ A-ATGCC---

S₂ AGA--CCGT-

52 of 61

Multiple Alignment: History 📜

1975 Sankoff

Formulated multiple alignment problem and gave dynamic programming solution

1988 Carrillo-Lipman

Branch and Bound approach for MSA

1990 Feng-Doolittle

Progressive alignment

1994 Thompson-Higgins-Gibson-ClustalW

53 of 61

PARTIAL ORDER ALIGNMENT

54 of 61

Basics

Assume 2 sequences:

seq1 : CCGCTTTTCCGC

seq2 : CCGCAAAACCGC

Alignments:

CCGC----TTTTCGCG CCGCTTTT----CCGC CCGC-TT-TT--CGCG CCGC-T-T-T-TCCGC

CCGCAAAA----CGCG CCGC----AAAACCGC CCGCA--A--AACCGC CCGCA-A-A-A-CCGC

55 of 61

DAG representation

Alignments:

CCGC----TTTTCGCG CCGCTTTT----CCGC CCGC-TT-TT--CGCG CCGC-T-T-T-TCCGC

CCGCAAAA----CGCG CCGC----AAAACCGC CCGCA--A--AACCGC CCGCA-A-A-A-CCGC

56 of 61

“Regular” alignment

57 of 61

String to Graph Alignment

58 of 61

String to Graph Alignment

The primary difference for the purposes of dynamic programming is that while a base in a sequence has exactly one predecessor, a base in a graph can have two or more. Thus, the score may have come from one of several previous locations for the same (graph) Insert or Align moves being considered; and thus those scores must be considered too in determining the best previous position. (Note that insertions from the sequence are unchanged).
The only difference deep inside the dynamic programming loop is that multiple previous scores (and any associated gap-open information) must be considered for insertions or alignments of the graph base. This is implemented by a loop over predecessors for the current base, and all else remains the same.

59 of 61

String to Graph Alignment

Problem: order of dependencies in the graph

Solution: topological sort

Steps:

Perform a topological sort if the graph has been updated
Do the dynamic programming step as usual, with:

The graph nodes visited in the order of the topological sort, and
Considering all valid predecessors for align/insert moves.

60 of 61

Pairwise alignment example

CGATTACG

||.|||.

CGCTTAT-

61 of 61

POA implementations

Original version: https://sourceforge.net/projects/poamsa/
Poapy: https://github.com/ljdursi/poapy
abPOA: https://github.com/Xinglab/abPOA
SIMD-accelerated POA: https://github.com/rvaser/spoa