1 of 61

CS481/CS583: Bioinformatics Algorithms

Can Alkan

EA509

calkan@cs.bilkent.edu.tr

http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/

2 of 61

MULTIPLE SEQUENCE ALIGNMENT

3 of 61

Multiple Alignment versus Pairwise Alignment

  • Up until now we have only tried to align two sequences.
  • What about more than two?
  • A faint similarity between two sequences becomes significant if present in many
  • Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

4 of 61

Generalizing the Notion of Pairwise Alignment

  • Alignment of 2 sequences is represented as a

2-row matrix

  • In a similar way, we represent alignment of 3 sequences as a 3-row matrix

A T _ G C G _

A _ C G T _ A

A T C A C _ A

  • Score: more conserved columns, better alignment

5 of 61

Alignments = Paths in

  • Align 3 sequences: ATGC, AATC,ATGC

A

A

T

--

C

A

--

T

G

C

--

A

T

G

C

6 of 61

Alignment Paths

0

1

1

2

3

4

A

A

T

--

C

A

--

T

G

C

--

A

T

G

C

x coordinate

7 of 61

Alignment Paths

  • Align the following 3 sequences:

ATGC, AATC,ATGC

0

1

1

2

3

4

0

1

2

3

3

4

A

A

T

--

C

A

--

T

G

C

--

A

T

G

C

x coordinate

y coordinate

8 of 61

Alignment Paths

0

1

1

2

3

4

0

1

2

3

3

4

A

A

T

--

C

A

--

T

G

C

0

0

1

2

3

4

--

A

T

G

C

  • Resulting path in (x,y,z) space:

(0,0,0)→(1,1,0)→(1,2,1) →(2,3,2) →(3,3,3) →(4,4,4)

x coordinate

y coordinate

z coordinate

9 of 61

Aligning Three Sequences

  • Same strategy as aligning two sequences
  • Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
  • For global alignments, go from source to sink

source

sink

10 of 61

2-D vs 3-D Alignment Grid

V

W

2-D edit graph

3-D edit graph

11 of 61

Architecture of 3-D Alignment Cell

(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k)

(i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)

12 of 61

Multiple Alignment: Dynamic Programming

  • si,j,k = max����

  • δ(x, y, z) is an entry in the 3-D scoring matrix

si-1,j-1,k-1 + δ(vi, wj, uk)

si-1,j-1,k + δ (vi, wj, _ )

si-1,j,k-1 + δ (vi, _, uk)

si,j-1,k-1 + δ (_, wj, uk)

si-1,j,k + δ (vi, _ , _)

si,j-1,k + δ (_, wj, _)

si,j,k-1 + δ (_, _, uk)

cube diagonal: no indels

face diagonal: one indel

edge diagonal: two indels

13 of 61

Multiple Alignment: Running Time

  • For 3 sequences of length n, the run time is 7n3; O(n3)

  • For k sequences, build a k-dimensional Manhattan, with run time (2k-1)(nk); O(2knk)

  • Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

14 of 61

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C

y: AC-GC-GAG

z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG

y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

15 of 61

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG

y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that induces

them?

16 of 61

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAG

y: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that induces

them?

NOT ALWAYS

Pairwise alignments may be inconsistent

17 of 61

Inferring Multiple Alignment from Pairwise Alignments

  • From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal
  • It is difficult to infer a “good” multiple alignment from optimal pairwise alignments between all sequences

18 of 61

Combining Optimal Pairwise Alignments into Multiple Alignment

Can combine pairwise alignments into multiple alignment

Can not combine pairwise alignments into multiple alignment

19 of 61

Profile Representation of Multiple Alignment

- A G G C T A T C A C C T G

T A G – C T A C C A - - - G

C A G – C T A C C A - - - G

C A G – C T A T C A C – G G

C A G – C T A T C G C – G G

A 1 1 .8

C .6 1 .4 1 .6 .2

G 1 .2 .2 .4 1

T .2 1 .6 .2

- .2 .8 .4 .8 .4

20 of 61

Profile Representation of Multiple Alignment

In the past we were aligning a sequence against a sequence

Can we align a sequence against a profile?

Can we align a profile against a profile?

- A G G C T A T C A C C T G

T A G – C T A C C A - - - G

C A G – C T A C C A - - - G

C A G – C T A T C A C – G G

C A G – C T A T C G C – G G

A 1 1 .8

C .6 1 .4 1 .6 .2

G 1 .2 .2 .4 1

T .2 1 .6 .2

- .2 .8 .4 .8 .4

21 of 61

Aligning alignments

  • Given two alignments, can we align them?

x GGGCACTGCAT

y GGTTACGTC-- Alignment 1

z GGGAACTGCAG

w GGACGTACC-- Alignment 2

v GGACCT-----

22 of 61

Aligning alignments

  • Given two alignments, can we align them?
  • Hint: use alignment of corresponding profiles

x GGGCACTGCAT

y GGTTACGTC-- Combined Alignment

z GGGAACTGCAG

w GGACGTACC--

v GGACCT-----

23 of 61

Sequence to profile alignment

 

Col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

C .6 0 0 0 1 0 0 .4 1 0 0 .6 .2 0 0

p(C, 1) = 0.6, p(C,2) = 0, p(C, 3) = 0, …. p(C, 9) = 1 …

Sequence S1 to Profile C

24 of 61

Sequence to profile alignment

  •  

O(|∑|nm). |∑|: alphabet size, n: sequence length, m: profile length

25 of 61

Sequence to profile

Profile

0

-1

-2

-3

-4

-5

-6

-7

-8

-9

-10

-11

-12

-13

-14

C

-1

0.8

-0.2

-1.2

-2.2

-2

-3

-4

-5

-6

-7

-8

-9

-10

-11

A

-2

-0.2

2.8

1.8

0.8

-0.2

-1.2

-1

-2

-3

-4

-5

-6

-7

-8

G

-3

-1.2

1.8

4.8

3.8

2.8

1.8

0.8

-0.2

-1.2

-2.2

-3.2

-4.2

-5.2

-5

G

-4

-2.2

0.8

3.8

4.4

3.4

2.4

1.4

0.4

-0.6

-1.6

-2.6

-3.6

-4

-3.2

T

-5

-3.2

-0.2

2.8

3.4

3.4

5.4

4.4

3.4

2.4

1.4

0.4

-0.6

-1.6

-2.6

A

-6

-4.2

-1.2

1.8

2.4

2.4

4.4

7.4

6.4

5.4

4.4

3.4

2.4

1.4

0.4

C

-7

-5.2

-2.2

0.8

1.4

4.4

3.4

6.4

7.6

8.4

7.4

6.4

5.4

4.4

3.4

C

-8

-6.2

-3.2

-0.2

0.4

3.4

3.4

5.4

6.6

9.6

8.6

8.2

7.2

6.2

5.2

A

-9

-7.2

-4.2

-1.2

-0.6

2.4

2.4

5.4

5.6

8.6

11

10

9

8

7

C

-10

-8.2

-5.2

-2.2

-1.6

1.4

1.4

4.4

5.6

7.6

10

11.8

10.8

9.8

8.8

G

-11

-9.2

-6.2

-3.2

-2.6

0.4

0.4

3.4

4.6

6.6

9

10.8

10.8

11

11.8

G

-12

-10.2

-7.2

-4.2

-3.6

-0.6

-0.6

2.4

3.6

5.6

8

9.8

9.8

11

13

C A G G - T A C C A C – G G

Match = 2

Mismatch = -1

Gap = -1

26 of 61

Sequence to profile alignment

New Profile C:

- A G G C T A T C A C C T G

T A G – C T A C C A - - - G

C A G – C T A C C A - - - G

C A G – C T A T C A C – G G

C A G – C T A T C G C – G G

C A G G - T A C C A C – G G

A 1 1 .83

C .67 .83 .5 1 .67 .17

G 1 .33 .17 .50 1

T .16 1 .5 .17

- .17 .67 .17 .33 .83 .33

27 of 61

Multiple Alignment: Greedy Approach

  • Choose most similar pair of strings and combine into a profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat
  • This is a heuristic greedy method

u1= ACGTACGTACGT…

u2 = TTAATTAATTAA…

u3 = ACTACTACTACT…

uk = CCGGCCGGCCGG

u1= ACg/tTACg/tTACg/cT…

u2 = TTAATTAATTAA…

uk = CCGGCCGGCCGG…

k

k-1

28 of 61

Greedy Approach: Example

  • Consider these 4 sequences

s1 GATTCA

s2 GTCTGA

s3 GATATT

s4 GTCAGC

29 of 61

Greedy Approach: Example (cont’d)

  • There are = 6 possible alignments

s2 GTCTGA

s4 GTCAGC (score = 2)

s1 GAT-TCA

s2 G-TCTGA (score = 1)

s1 GAT-TCA

s3 GATAT-T (score = 1)

s1 GATTCA--

s4 GT-CAGC(score = 0)

s2 G-TCTGA

s3 GATAT-T (score = -1)

s3 GAT-ATT

s4 G-TCAGC (score = -1)

30 of 61

Greedy Approach: Example (cont’d)

s2 and s4 are closest; combine:

s2 GTCTGA

s4 GTCAGC

s2,4 GTCt/aGa/cA (profile)

s1 GATTCA

s3 GATATT

s2,4 GTCt/aGa/c

new set of 3 sequences:

31 of 61

Progressive Alignment

  • Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.
  • Progressive alignment works well for close sequences, but deteriorates for distant sequences
    • Gaps in consensus string are permanent
    • Use profiles to compare sequences

32 of 61

ClustalW

  • Popular multiple alignment tool
  • ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently).
  • Three-step process

1.) Construct pairwise alignments

2.) Build Guide Tree

3.) Progressive Alignment guided by the tree

33 of 61

Step 1: Pairwise Alignment

  • Aligns each sequence again each other giving a similarity matrix
  • Similarity = exact matches / sequence length (percent identity)

v1 v2 v3 v4

v1 -

v2 .17 -

v3 .87 .28 -

v4 .59 .33 .62 -

(.17 means 17 % identical)

34 of 61

Step 2: Guide Tree

  • Create Guide Tree using the similarity matrix

    • ClustalW uses the neighbor-joining method

    • Guide tree roughly reflects evolutionary relations

35 of 61

Step 2: Guide Tree (cont’d)

v1

v3

v4

v2

Calculate:�v1,3 = alignment (v1, v3)�v1,3,4 = alignment((v1,3),v4)�v1,2,3,4 = alignment((v1,3,4),v2)

v1 v2 v3 v4

v1 -

v2 .17 -

v3 .87 .28 -

v4 .59 .33 .62 -

36 of 61

Step 3: Progressive Alignment

  • Start by aligning the two most similar sequences
  • Following the guide tree, add in the next sequences, aligning to the existing alignment
  • Insert gaps as necessary

FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD

FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD

FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD

FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ

FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

. . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.

37 of 61

SCORING ALIGNMENTS

38 of 61

Multiple Alignments: Scoring

  • Number of matches (multiple longest common subsequence score)

  • Entropy score

  • Sum of pairs (SP-Score)

39 of 61

Multiple LCS Score

  • A column is a “match” if all the letters in the column are the same

  • Only good for very similar sequences

AAA

AAA

AAT

ATC

40 of 61

Entropy

  • Define frequencies for the occurrence of each letter in each column of multiple alignment
    • pA = 1, pT=pG=pC=0 (1st column)
    • pA = 0.75, pT = 0.25, pG=pC=0 (2nd column)
    • pA = 0.50, pT = 0.25, pC=0.25 pG=0 (3rd column)
  • Compute entropy of each column

AAA

AAA

AAT

ATC

41 of 61

Entropy: Example

Best case

Worst case

42 of 61

Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

Σ over all columns Σ X=A,T,G,C pX logpX�

43 of 61

Entropy of an Alignment: Example

column entropy

-( pAlogpA + pClogpC + pGlogpG + pTlogpT)

c1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0]

= 0

c2 = -[(¼)*log(¼) + (¾)*log(¾) + 0*log0 + 0*log0]

= -[(¼)*(-2) + (¾)*(-0.415)]

= +0.811

c3 = -[(¼)*log(¼) + (¼)*log(¼) + (¼)*log(¼) + (¼)*log(¼)]

= 4*-[(¼)*(-2)]

= +2.0

Alignment Entropy = 0 + 0.811 + 2.0 = +2.811

c1

c2

c3

A

A

A

A

C

C

A

C

G

A

C

T

44 of 61

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C

y: AC-GC-GAG

z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG

y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Not necessarily optimal

45 of 61

Sum of Pairs Score(SP-Score)

  • Consider pairwise alignment of sequences

ai and aj

imposed by a multiple alignment of k sequences

  • Denote the score of this suboptimal (not necessarily optimal) pairwise alignment as

s*(ai, aj)

  • Sum up the pairwise scores for a multiple alignment:

s(a1,…,ak) = Σi,j s*(ai, aj)

46 of 61

Computing SP-Score

Aligning 4 sequences: 6 pairwise alignments

Given a1,a2,a3,a4:

s(a1…a4) = Σ s*(ai,aj) = s*(a1,a2) +

s*(a1,a3) +

s*(a1,a4) +

s*(a2,a3) +

s*(a2,a4) +

s*(a3,a4)

47 of 61

SP-Score: Example

a1

.

ak

ATG-C-AAT

A-G-CATAT

ATCCCATTT

Pairs of Sequences

A

A

A

1

1

1

G

C

G

1

−μ

−μ

Score=3

Score = 1 –

Column 1

Column 3

s

s*(

To calculate each column:

48 of 61

Back to guide trees for MSA

  • Guide tree construction
    • UPGMA
    • Neighbor Joining
    • ….
  • Easy MSA: Center Star

49 of 61

Star alignments

  • Construct multiple alignments using pair-wise alignment relative to a fixed sequence
  • Out of a set S = {S1, S2, . . . , Sr} of sequences, pick sequence Sc that maximizes��star_score(c) = ∑ {sim(Sc, Si) : 1 ≤ i ≤ r, i ≠ c}��where sim(Si, Sj) is the optimal score of a pair-wise alignment between Si and Sj

50 of 61

Star alignment Algorithm

  1. Compute sim(Si, Sj) for every pair (i,j)
  2. Compute star_score(i) for every i
  3. Choose the index c that minimizes star_score(c) and make it the center of the star
  4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of Sc and Si is the same as the optimum alignment of Sc and Si.

51 of 61

Star alignment example

Sc AA--CCTT

S1 AATGCC--

Sc A-ACC-TT

S2 AGACCGT-

Sc A-A--CC-TT

S1 A-ATGCC---

S2 AGA--CCGT-

52 of 61

Multiple Alignment: History 📜

1975 Sankoff

Formulated multiple alignment problem and gave dynamic programming solution

1988 Carrillo-Lipman

Branch and Bound approach for MSA

1990 Feng-Doolittle

Progressive alignment

1994 Thompson-Higgins-Gibson-ClustalW

Most popular multiple alignment program

1998 Morgenstern et al.-DIALIGN

Segment-based multiple alignment

2000 Notredame-Higgins-Heringa-T-coffee

Using the library of pairwise alignments

2004 MUSCLE

53 of 61

PARTIAL ORDER ALIGNMENT

54 of 61

Basics

Assume 2 sequences:

seq1 : CCGCTTTTCCGC

seq2 : CCGCAAAACCGC

Alignments:

CCGC----TTTTCGCG CCGCTTTT----CCGC CCGC-TT-TT--CGCG CCGC-T-T-T-TCCGC

CCGCAAAA----CGCG CCGC----AAAACCGC CCGCA--A--AACCGC CCGCA-A-A-A-CCGC

55 of 61

DAG representation

Alignments:

CCGC----TTTTCGCG CCGCTTTT----CCGC CCGC-TT-TT--CGCG CCGC-T-T-T-TCCGC

CCGCAAAA----CGCG CCGC----AAAACCGC CCGCA--A--AACCGC CCGCA-A-A-A-CCGC

56 of 61

“Regular” alignment

57 of 61

String to Graph Alignment

58 of 61

String to Graph Alignment

  • The primary difference for the purposes of dynamic programming is that while a base in a sequence has exactly one predecessor, a base in a graph can have two or more. Thus, the score may have come from one of several previous locations for the same (graph) Insert or Align moves being considered; and thus those scores must be considered too in determining the best previous position. (Note that insertions from the sequence are unchanged).
  • The only difference deep inside the dynamic programming loop is that multiple previous scores (and any associated gap-open information) must be considered for insertions or alignments of the graph base. This is implemented by a loop over predecessors for the current base, and all else remains the same.

59 of 61

String to Graph Alignment

Problem: order of dependencies in the graph

Solution: topological sort

Steps:

  • Perform a topological sort if the graph has been updated
  • Do the dynamic programming step as usual, with:
    • The graph nodes visited in the order of the topological sort, and
    • Considering all valid predecessors for align/insert moves.

60 of 61

Pairwise alignment example

CGATTACG

||.|||.

CGCTTAT-

61 of 61

POA implementations