1 of 71

CS481/CS583: Bioinformatics Algorithms

Can Alkan

EA509

calkan@cs.bilkent.edu.tr

http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/

2 of 71

RNA STRUCTURE

3 of 71

RNA Basics

  • RNA bases A,C,G,U
  • Canonical Base Pairs
    • A-U
    • G-C
    • G-U

“wobble” pairing

    • Bases can only pair with one other base.

2 Hydrogen Bonds

3 Hydrogen Bonds – more stable

4 of 71

RNA Basics

  • transfer RNA (tRNA)
  • messenger RNA (mRNA)
  • ribosomal RNA (rRNA)
  • small interfering RNA (siRNA)
  • micro RNA (miRNA)
  • small nucleolar RNA (snoRNA)
  • Long non-coding RNA (lncRNA)
  • ……

5 of 71

RNA folding

  • Prediction of secondary structure of an RNA given its sequence
  • General problem is NP-hard due to “difficult” substructures, like pseudoknots
  • Most existing algorithms require too much memory (≥O(n2)), and run time (≥O(n3)) thus limited to smaller RNA sequences

6 of 71

RNA Structural Levels�

Primary

AAUCG...CUUCUUCCA

Primary

Secondary

Tertiary

7 of 71

RNA families

  • Rfam : General non-coding RNA database

(most of the data is taken from specific databases)

http://www.sanger.ac.uk/Software/Rfam/

Includes many families of non coding RNAs and functional

Motifs, as well as their alignment and their secondary structures

8 of 71

RNA Secondary Structure

Hairpin loop

Junction (Multiloop)

Bulge Loop

Single-Stranded

Interior Loop

Stem

Pseudoknot

9 of 71

Example: 5S rRNA

E. coli 5S

120 bases

T. thermophilus 5S

120 bases

10 of 71

Example: E. coli 16S rRNA

1542 bases

11 of 71

Example: E. coli 23S rRNA

2904 bases

12 of 71

Example: HIV

9173 bases

Watts et al., Nature, 2009

13 of 71

Example: SARS-CoV and SARS-CoV-2

Simmonds, 2020

14 of 71

Binary Tree Representation of RNA Secondary Structure

  • Representation of RNA structure using Binary tree
  • Nodes represent
    • Base pair if two bases are shown
    • Loop if base and “gap” (dash) are shown
  • Pseudoknots still not represented
  • Tree does not permit varying sequences
    • Mismatches
    • Insertions & Deletions

Images – Eddy et al.

15 of 71

Circular Representation

Images – David Mount

16 of 71

Examples of known interactions of RNA secondary structural elements

Pseudoknot

Kissing hairpins

Hairpin-bulge contact

These patterns are excluded from the prediction schemes as their computation is too intensive.

17 of 71

Predicting RNA secondary structure

  • Base pair maximization
  • Minimum free energy (most common)
    • Fold, Mfold (Zuker & Stiegler)
    • RNAfold (Hofacker)
  • Multiple sequence alignment
    • Use known structure of RNA with similar sequence
  • Covariance
  • Stochastic Context-Free Grammars

18 of 71

Sequence Alignment as a method to determine structure

  • Bases pair in order to form backbones and determine the secondary structure
  • Aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure

19 of 71

Simplifying Assumptions

  • RNA folds into one minimum free-energy structure.
  • There are no knots (base pairs never cross).
  • The energy of a particular base pair in a double stranded regions is sequence independent
    • Neighbors do not influence the energy.
  • Was solved by dynamic programming, Zuker and Stiegler 1981

20 of 71

Base Pair Maximization

U

C

C

A

G

G

A

C

Zuker (1981) Nucleic Acids Research 9(1) 133-149

21 of 71

Base Pair Maximization – Dynamic Programming Algorithm

Simple Example:

Maximizing Base Pairing

http://bix.ucsd.edu/bioalgorithms/slides.php

S(i,j) is the folding of the subsequence of the RNA strand from index i to index j which results in the highest number of base pairs

22 of 71

Base Pair Maximization – Dynamic Programming Algorithm

  • Alignment Method
    • Align RNA strand to itself
    • Score increases for feasible base pairs
  • Each score independent of overall structure
  • Bifurcation adds extra dimension

Initialize first two diagonal arrays to 0

Fill in squares sweeping diagonally

Images – Sean Eddy

Bases cannot pair, similar

to unmatched alignment

Bases can pair, similar

to matched alignment

S(i + 1, j)

Dynamic Programming – possible paths

S(i + 1, j – 1) +1

S(i, j – 1)

23 of 71

Base Pair Maximization – Dynamic Programming Algorithm

  • Alignment Method
    • Align RNA strand to itself
    • Score increases for feasible base pairs
  • Each score independent of overall structure
  • Bifurcation adds extra dimension

Initialize first two diagonal arrays to 0

Fill in squares sweeping diagonally

Images – Sean Eddy

Reminder:

For all k

S(i,k) + S(k + 1, j)

k = 0 : Bifurcation max in this case

S(i,k) + S(k + 1, j)

Reminder:

For all k

S(i,k) + S(k + 1, j)

Bases cannot pair, similar

Bases can pair, similar

to matched alignment

Dynamic Programming – possible paths

Bifurcation – add values for all k

24 of 71

Base Pair Maximization - Drawbacks

  • Base pair maximization will not necessarily lead to the most stable structure
    • May create structure with many interior loops or hairpins which are energetically unfavorable
  • Comparable to aligning sequences with scattered matches – not biologically reasonable

25 of 71

Energy Minimization

  • Thermodynamic Stability
    • Estimated using experimental techniques
    • Theory : Most Stable is the Most likely
  • No Pseudoknots due to algorithm limitations
  • Uses Dynamic Programming alignment technique
  • Attempts to maximize the score taking into account thermodynamics
  • MFOLD and ViennaRNA

26 of 71

Free energy model

Free energy of a structure is the sum of all interactions energies

Each interaction energy can be calculated thermodynamically

Free Energy(E) = E(CG)+E(CG)+…..

27 of 71

Why is MFE secondary structure prediction hard?

  • MFE structure can be found by calculating free energy of all possible structures

  • BUT the number of potential structures grows exponentially with the number, n, of bases

28 of 71

RNA folding with Dynamic programming �(Zuker and Stiegler)

  • W(i,j): MFE structure of substrand from i to j

i

j

W(i,j)

29 of 71

RNA folding with dynamic programming

  • Assume a function W(i,j) which is the MFE for the sequence starting at i and ending at j (i<j)

  • Define scores, for example base pair (CG) =-1 non-pair(CA)=1 (we want a negative score )
  • Consider 4 possibilities:
    • i,j are a base pair, added to the structure for i+1..j-1
    • i is unpaired, added to the structure for i+1..j
    • j is unpaired, added to the structure for i..j-1
    • i,j are paired, but not to each other;
  • Choose the minimal energy

i (i+1)

W(i,j)

(j-1) j

30 of 71

Energy Minimization Results

  • Linear RNA strand folded back on itself to create secondary structure
  • Circularized representation uses this requirement
    • Arcs represent base pairing

Images – David Mount

  • All loops must have at least 3 bases in them
    • Equivalent to having 3 base pairs between all arcs

Exception: Location where the beginning and end of RNA come together in circularized representation

31 of 71

Trouble with Pseudoknots

  • Pseudoknots cause a breakdown in the Dynamic Programming Algorithm.
  • In order to form a pseudoknot, checks must be made to ensure base is not already paired – this breaks down the recurrence relations

Images – David Mount

32 of 71

Sequence dependent free-energy �Nearest Neighbor Model

U U

C G

G C

A U

G C

A UCGAC 3’

5’

U U

C G

U A

A U

G C

A UCGAC 3’

5’

Energy is influenced by the previous base pair

(not by the base pairs further down).

33 of 71

Sequence dependent free-energy values of the base pairs

U U

C G

G C

A U

G C

A UCGAC 3’

5’

U U

C G

U A

A U

G C

A UCGAC 3’

5’

Example values:

GC GC GC GC

AU GC CG UA

-2.3 -2.9 -3.4 -2.1

These energies are estimated experimentally from small synthetic RNAs.

34 of 71

Adding Complexity to Energy Calculations

  • Stacking energy - Assign negative energies to these between base pair regions.
    • Energy is influenced by the previous base pair (not by the base pairs further down).
    • These energies are estimated experimentally from small synthetic RNAs.
  • Positive energy - added for destabilizing regions such as bulges, loops, etc.
  • More than one structure can be predicted

35 of 71

Mfold

  • Positive energy - added for destabilizing regions such as bulges, loops, etc.
  • More than one structure can be predicted

36 of 71

Free energy computation

U U

A A

G C

G C

A

G C

U A

A U

C G

A U

A 3’

A

5’

-0.3

-0.3

-1.1 mismatch of hairpin

-2.9 stacking

+3.3 1nt bulge

-2.9 stacking

-1.8 stacking

5’ dangling

-0.9 stacking

-1.8 stacking

-2.1 stacking

G= -4.6 KCAL/MOL

+5.9 4 nt loop

37 of 71

Mfold

  • Positive energy - added for destabilizing regions such as bulges, loops, etc.
  • More than one structure can be predicted

38 of 71

Frey U H et al. Clin Cancer Res 2005;11:5071-5077

©2005 by American Association for Cancer Research

More than one structure can be predicted for the

same RNA

39 of 71

Energy Minimization Drawbacks

  • Compute only one optimal structure
  • Usual drawbacks of purely mathematical approaches
    • Similar difficulties in other algorithms
      • Protein structure
      • Exon finding

40 of 71

RNA fold prediction based on Multiple Alignment

Information from multiple sequence alignment (MSA) can help to predict the probability of positions i,j to be base-paired.

G C C U U C G G G C

G A C U U C G G U C

G G C U U C G G C C

41 of 71

Compensatory Substitutions

U U

C G

U A

A U

G C

A UCGAC 3’

G

C

5’

Mutations that maintain the secondary structure can help predict the fold

42 of 71

RNA secondary structure can be revealed by identification of compensatory mutations

G C C U U C G G G C

G A C U U C G G U C

G G C U U C G G C C

U C

U G

C G

N N’

G C

43 of 71

Insight from Multiple Alignment

Information from multiple sequence alignment (MSA) can help to predict the

probability of positions i,j to be base-paired.

  • Conservation – no additional information
  • Consistent mutations (GC🡪 GU) – support stem
  • Inconsistent mutations – does not support stem.
  • Compensatory mutations – support stem.

44 of 71

  • Predicts the consensus secondary structure for a set of aligned RNA sequences by using modified dynamic programming algorithm that add alignment information to the standard energy model
  • Improvement in prediction accuracy

RNAalifold

45 of 71

LOCAL MINIMUM FREE ENERGY: DENSITYFOLD

46 of 71

E.coli 5S rRNA

Energy Density Landscape

47 of 71

E.coli 5S rRNA predictions

mFold

RNAalifold

rnaScf

48 of 71

Densityfold (alteRNA)

  • Instead of finding minimum global free energy, find local minimum free energies
  • Emulate the folding process of RNA folding by aiming to keep locally stable substructures
  • Energy density seen by a basepair: the free energy of the “optimal substructure” normalized by distance
  • Energy density of an unpaired base: energy density of the nearest encapsulating basepair
  • Densityfold optimizes a linear combination of free energy and total energy density
  • For every potential basepair, compute the optimal contribution of the implied substructure
  • The optimization function is non linear

Hill climbing process for approximating the contributions of unpaired bases

49 of 71

Substructures

Hairpin loop

Junction (Multiloop)

Bulge Loop

Single-Stranded

Interior Loop

Pseudoknot

50 of 71

Densityfold energy types

  • eH(i,j,): free energy of a hairpin loop enclosed by the base pair S[i].S[j]
  • eS(i,j,): free energy of the base pair S[i].S[j] provided that it forms a stacking pair with S[i+1].S[j-1]
  • eBI(i,j,i’,j’): free energy of an internal loop or a bulge that starts with S[i].S[j] and ends with S[i’].S[j’]

51 of 71

Densityfold energy types

  • eM(i,j,i1,j1,…,ik,jk): free energy of multibranch loop that starts with S[i].S[j] and branches out S[i1].S[j1], S[i2].S[j2], …, S[ik].S[jk]
  • eDA(j,j-1): free energy of an unpaired dangling base S[j] when S[j-1] forms a base pair with another base

52 of 71

Densityfold energy tables

  • ED(j): minimum total free energy density of a secondary structure for substring S[1, j].
  • E(j): free energy of the energy density minimized secondary structure for substring S[1, j].
  • EDS(i, j): minimum total free energy density of a secondary structure for S[i, j], provided that S[i].S[j] is a base pair.
  • ES(i, j): free energy of the energy density minimized secondary structure for the substring S[i, j], provided that S[i].S[j] is a base pair.

53 of 71

Densityfold energy tables

  • EDBI (i, j): minimum total free energy density of a secondary structure for S[i, j], provided that there is a bulge or an internal loop starting with base pair S[i].S[j].
  • EBI (i, j): free energy of an energy density minimized structure for S[i, j], provided that a bulge or an internal loop starting with base pair S[i].S[j].
  • EDM(i, j): minimum total free energy density of a secondary structure for S[i, j], such that there is a multibranch loop starting with base pair S[i].S[j].
  • EM(i, j): free energy of an energy density minimized structure for S[i, j], provided there is a multibranch loop starting with base pair S[i].S[j].

54 of 71

Calculating energy tables

  • Similar calculations for other tables
  • O(nk+2) time and O(n2) space

55 of 71

Linear combination of MFE and ED

  • Similar formulations for ELCBI and ELCM
  • O(n4) running time

For any x ε {S,BI,M} let ELCx(i, j) = EDx(i, j) + Ex(i, j).

Optimize ELC(n) = ED(n) + E(n).

56 of 71

Densityfold prediction: E.coli 5S rRNA

Known Structure

Densityfold Prediction

57 of 71

STOCHASTIC CONTEXT-FREE GRAMMARS

58 of 71

SCFG

  • RNA folding can be represented as context-free grammars

59 of 71

Chomsky hierarchy

unrestricted grammars

context-sensitive grammars

context-free grammars

regular grammars

(equivalent to finite automata & HMM’s)

(equivalent to SCFG’s & pushdown automata)

(equivalent to Turing machines & recursively enumerable sets)

(equivalent to linear bounded automata)

B. Majoros

60 of 71

Context-free grammars

A context-free grammar is a generative model denoted by a 4-tuple:

G = (V, Σ, S, P)

where:

Σ is a terminal alphabet, (e.g., {a, c, g, u} )

V is a nonterminal alphabet, (e.g., {A, B, C, D, E, ...} )

SV is a special start symbol, and

P is a set of rewriting rules called productions.

Productions in P are rules of the form:

X λ

where XV, λ∈(Vα)*

B. Majoros

61 of 71

Context “freeness”

The context-freeness is imposed by the requirement that the l.h.s of each production rule may contain only a single symbol, and that symbol must be a nonterminal:

X λ

Thus, a CFG cannot specify context-sensitive rules such as:

wXz wλz

B. Majoros

62 of 71

Derivations

Suppose a CFG G has generated a terminal string x Σ *. A derivation S*x denotes a possible derivation for generating x.

A derivation (or parse) consists of a series of applications of productions from P, beginning with the start symbol S and ending with the terminal string x:

S s1 s2 s3 L x

where si∈(V Σ)*.

We’ll concentrate of leftmost derivations where the leftmost nonterminal is always replaced first.

B. Majoros

63 of 71

Context-free vs. regular

The advantage of CFGs over HMMs lies in their ability to model arbitrary runs of matching pairs of elements, such as matching pairs of parentheses:

L((((((((L))))))))L

When the number of matching pairs is unbounded, a finite-state model such as a DFA or an HMM is inadequate to enforce the constraint that all left elements must have a matching right element.

In contrast, in a CFG we can use rules such as X(X). A sample derivation using such a rule is:

X (X) ((X)) (((X))) ((((X)))) (((((X)))))

An additional rule such as Xε is necessary to terminate the recursion.

B. Majoros

64 of 71

A CFG for an RNA

  • RNA hairpin with 3 bp stem and a 4-base loop (GAAA or GCAA)

S-> aXu | cXg | gXc | uXa

X-> aYu | cYg | gYc | uYa

Y-> aZu | cZg | gZc | uZa

Z->gaaa | gcaa

R. Shamir & R. Sharan

65 of 71

Parse trees

  • A representation of a parse of a string by a CFG
  • Root – start nonterminal S
  • Leaves – terminal symbols in the given string
  • Internal nodes - nonterminals
  • The children of an internal node are the productions of that nonterminal (left-to-right order

R. Shamir & R. Sharan

66 of 71

Stochastic CFG

A stochastic context-free grammar (SCFG) is a CFG plus a probability distribution on productions:

G = (V, Σ, S, P, Rp)

where Pp : P a ¡, and probabilities are normalized at the level of each l.h.s. symbol X:

[ Rp(Xλ)=1 ]

XV Xλ

Thus, we can compute the probability of a single derivation S*x by multiplying the probabilities for all productions used in the derivation:

i R(Xiλi)

We can sum over all possible (leftmost) derivations of a given string x to get the probability that G will generate x at random:

R(x | G) = ∑ R(Sj*x | G).

j

B. Majoros

67 of 71

An example

As an example, consider G=(VG, α, S, PG, RG), for VG={S, L, N}, Σ ={a,c,g,t}, and PG the set consisting of:

S a S u | u S a | c S g | g S c | L

L N N N N

N a | c | g | u

Then the probability of the sequence acguacguacgu is given by:

P(acguacguacgu) =

P( SaSuacSguacgScguacguSacgu

acguLacguacguNNNNacguacguaNNNacgu

acguacNNacguacguacgNacguacguacguacgu) =

0.2 × 0.2 × 0.2 × 0.2 × 0.2 × 1 × 0.25 × 0.25 × 0.25 × 0.25 = 1.25×10-6

because this sequence has only one possible (leftmost) derivation under grammar G.

(P=0.2)

(P=1.0)

(P=0.25)

B. Majoros

68 of 71

Structure using SFCG

  • Grammar rules with associated probabilities

acuSag

S 🡪 aSu | cSg | aS | uS | … | Su | SS | ε

P .21 .15 .11 .08 .03 .22 .02

S

aS

acSg

acuSuag

acugScuag

acuguScuag

acuguaScuag

acuguauScuag

acuguaucuag

acuguaucuag

.(((...).))

  • Let’s generate a structure for the sequence acuguaucuag

acuguacuag

.(((..).))

acugucuag

.(((.).))

acugcuag

.((().))

acuuag

.((.))

acuag

.(())

acg

.()

a

.

  • We select the set of transformations that highest probability of generating the input sequence. This set gives us our structure.

69 of 71

Chomsky Normal Form

Non-CNF:

S a S t | t S a | c S g | g S c | L

L N N N N

N a | c | g | u

CNF:

S A ST | T SA | C SG | G SC | N L1

SA S A

ST S T

SC S C

SG S G

L1 N L2

L2 N N

N a | c | g | u

A a

C c

G g

T u

A CNF grammar is one in which all productions are of the form:

X Y Z

or:

X a

B. Majoros

70 of 71

Parsing CFG

Two questions for a CFG:

  • Can a grammar G derive string x?
  • If so, what series of productions would be used during the derivation? (there may be multiple answers!)

Additional questions for an SCFG:

  • What is the probability that G derives string x?
  • What is the most probable derivation of x via G?

B. Majoros

71 of 71

Parsing CFG

  • CYK Algorithm (Cocke-Younger-Kasami)
    • Dynamic Programming method
  • Modified CYK for SCFG
    • “Inside algorithm”
    • Training similar to HMM
      • If parses are known for training data sequences, simply count the number of times for each production, calculate probabilities (labeled sequence training for HMM)
      • If parses are not known, apply an EM algorithm called “Inside-Outside” (“forward-backward” for HMM)