3 of 71

RNA Basics

RNA bases A,C,G,U
Canonical Base Pairs

“wobble” pairing

Bases can only pair with one other base.

2 Hydrogen Bonds

3 Hydrogen Bonds – more stable

4 of 71

RNA Basics

transfer RNA (tRNA)
messenger RNA (mRNA)
ribosomal RNA (rRNA)
small interfering RNA (siRNA)
micro RNA (miRNA)
small nucleolar RNA (snoRNA)
Long non-coding RNA (lncRNA)
……

http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

5 of 71

RNA folding

Prediction of secondary structure of an RNA given its sequence
General problem is NP-hard due to “difficult” substructures, like pseudoknots
Most existing algorithms require too much memory (≥O(n²)), and run time (≥O(n³)) thus limited to smaller RNA sequences

6 of 71

RNA Structural Levels�

Primary

AAUCG...CUUCUUCCA

Primary

Secondary

Tertiary

7 of 71

RNA families

Rfam : General non-coding RNA database

(most of the data is taken from specific databases)

http://www.sanger.ac.uk/Software/Rfam/

Includes many families of non coding RNAs and functional

Motifs, as well as their alignment and their secondary structures

8 of 71

RNA Secondary Structure

Hairpin loop

Junction (Multiloop)

Bulge Loop

Single-Stranded

Interior Loop

Stem

Pseudoknot

9 of 71

Example: 5S rRNA

E. coli 5S

120 bases

T. thermophilus 5S

120 bases

10 of 71

Example: E. coli 16S rRNA

1542 bases

11 of 71

Example: E. coli 23S rRNA

2904 bases

12 of 71

Example: HIV

9173 bases

Watts et al., Nature, 2009

13 of 71

Example: SARS-CoV and SARS-CoV-2

Simmonds, 2020

14 of 71

Binary Tree Representation of RNA Secondary Structure

Representation of RNA structure using Binary tree
Nodes represent

Base pair if two bases are shown
Loop if base and “gap” (dash) are shown

Pseudoknots still not represented
Tree does not permit varying sequences

Mismatches
Insertions & Deletions

Images – Eddy et al.

15 of 71

Circular Representation

Images – David Mount

16 of 71

Examples of known interactions of RNA secondary structural elements

Pseudoknot

Kissing hairpins

Hairpin-bulge contact

These patterns are excluded from the prediction schemes as their computation is too intensive.

17 of 71

Predicting RNA secondary structure

Base pair maximization
Minimum free energy (most common)

Fold, Mfold (Zuker & Stiegler)
RNAfold (Hofacker)

Multiple sequence alignment

Use known structure of RNA with similar sequence

Covariance
Stochastic Context-Free Grammars

18 of 71

Sequence Alignment as a method to determine structure

Bases pair in order to form backbones and determine the secondary structure
Aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure

19 of 71

Simplifying Assumptions

RNA folds into one minimum free-energy structure.
There are no knots (base pairs never cross).
The energy of a particular base pair in a double stranded regions is sequence independent

Neighbors do not influence the energy.

Was solved by dynamic programming, Zuker and Stiegler 1981

20 of 71

Base Pair Maximization

Zuker (1981) Nucleic Acids Research 9(1) 133-149

21 of 71

Base Pair Maximization – Dynamic Programming Algorithm

Simple Example:

Maximizing Base Pairing

http://bix.ucsd.edu/bioalgorithms/slides.php

S(i,j) is the folding of the subsequence of the RNA strand from index i to index j which results in the highest number of base pairs

22 of 71

Base Pair Maximization – Dynamic Programming Algorithm

Alignment Method

Align RNA strand to itself
Score increases for feasible base pairs

Each score independent of overall structure
Bifurcation adds extra dimension

Initialize first two diagonal arrays to 0

Fill in squares sweeping diagonally

Images – Sean Eddy

Bases cannot pair, similar

to unmatched alignment

Bases can pair, similar

to matched alignment

S(i + 1, j)

Dynamic Programming – possible paths

S(i + 1, j – 1) +1

S(i, j – 1)

23 of 71

Base Pair Maximization – Dynamic Programming Algorithm

Alignment Method

Align RNA strand to itself
Score increases for feasible base pairs

Each score independent of overall structure
Bifurcation adds extra dimension

Initialize first two diagonal arrays to 0

Fill in squares sweeping diagonally

Images – Sean Eddy

Reminder:

For all k

S(i,k) + S(k + 1, j)

k = 0 : Bifurcation max in this case

S(i,k) + S(k + 1, j)

Reminder:

For all k

S(i,k) + S(k + 1, j)

Bases cannot pair, similar

Bases can pair, similar

to matched alignment

Dynamic Programming – possible paths

Bifurcation – add values for all k

24 of 71

Base Pair Maximization - Drawbacks

Base pair maximization will not necessarily lead to the most stable structure

May create structure with many interior loops or hairpins which are energetically unfavorable

Comparable to aligning sequences with scattered matches – not biologically reasonable

25 of 71

Energy Minimization

Thermodynamic Stability

Estimated using experimental techniques
Theory : Most Stable is the Most likely

No Pseudoknots due to algorithm limitations
Uses Dynamic Programming alignment technique
Attempts to maximize the score taking into account thermodynamics
MFOLD and ViennaRNA

26 of 71

Free energy model

Free energy of a structure is the sum of all interactions energies

Each interaction energy can be calculated thermodynamically

Free Energy(E) = E(CG)+E(CG)+…..

27 of 71

Why is MFE secondary structure prediction hard?

MFE structure can be found by calculating free energy of all possible structures

BUT the number of potential structures grows exponentially with the number, n, of bases

28 of 71

RNA folding with Dynamic programming �(Zuker and Stiegler)

W(i,j): MFE structure of substrand from i to j

W(i,j)

29 of 71

RNA folding with dynamic programming

Assume a function W(i,j) which is the MFE for the sequence starting at i and ending at j (i<j)

Define scores, for example base pair (CG) =-1 non-pair(CA)=1 (we want a negative score )
Consider 4 possibilities:

i,j are a base pair, added to the structure for i+1..j-1
i is unpaired, added to the structure for i+1..j
j is unpaired, added to the structure for i..j-1
i,j are paired, but not to each other;

Choose the minimal energy

i (i+1)

W(i,j)

(j-1) j

30 of 71

Energy Minimization Results

Linear RNA strand folded back on itself to create secondary structure
Circularized representation uses this requirement

Arcs represent base pairing

Images – David Mount

All loops must have at least 3 bases in them

Equivalent to having 3 base pairs between all arcs

Exception: Location where the beginning and end of RNA come together in circularized representation

31 of 71

Trouble with Pseudoknots

Pseudoknots cause a breakdown in the Dynamic Programming Algorithm.
In order to form a pseudoknot, checks must be made to ensure base is not already paired – this breaks down the recurrence relations

Images – David Mount

32 of 71

Sequence dependent free-energy �Nearest Neighbor Model

U U

C G

G C

A U

G C

A UCGAC 3’

5’

U U

C G

U A

A U

G C

A UCGAC 3’

5’

Energy is influenced by the previous base pair

(not by the base pairs further down).

33 of 71

Sequence dependent free-energy values of the base pairs �

U U

C G

G C

A U

G C

A UCGAC 3’

5’

U U

C G

U A

A U

G C

A UCGAC 3’

5’

Example values:

GC GC GC GC

AU GC CG UA

-2.3 -2.9 -3.4 -2.1

These energies are estimated experimentally from small synthetic RNAs.

34 of 71

Adding Complexity to Energy Calculations

Stacking energy - Assign negative energies to these between base pair regions.

Energy is influenced by the previous base pair (not by the base pairs further down).
These energies are estimated experimentally from small synthetic RNAs.

Positive energy - added for destabilizing regions such as bulges, loops, etc.
More than one structure can be predicted

35 of 71

Mfold

Positive energy - added for destabilizing regions such as bulges, loops, etc.
More than one structure can be predicted

36 of 71

Free energy computation

U U

A A

G C

U A

A U

C G

A U

A 3’

5’

-0.3

-1.1 mismatch of hairpin

-2.9 stacking

+3.3 1nt bulge

-2.9 stacking

-1.8 stacking

5’ dangling

-0.9 stacking

-1.8 stacking

-2.1 stacking

G= -4.6 KCAL/MOL

+5.9 4 nt loop

37 of 71

Mfold

Positive energy - added for destabilizing regions such as bulges, loops, etc.
More than one structure can be predicted

38 of 71

Frey U H et al. Clin Cancer Res 2005;11:5071-5077

More than one structure can be predicted for the

same RNA

39 of 71

Energy Minimization Drawbacks

Compute only one optimal structure
Usual drawbacks of purely mathematical approaches

Similar difficulties in other algorithms

Protein structure
Exon finding

40 of 71

RNA fold prediction based on Multiple Alignment

Information from multiple sequence alignment (MSA) can help to predict the probability of positions i,j to be base-paired.

G C C U U C G G G C

G A C U U C G G U C

G G C U U C G G C C

41 of 71

Compensatory Substitutions

U U

C G

U A

A U

G C

A UCGAC 3’

5’

Mutations that maintain the secondary structure can help predict the fold

42 of 71

RNA secondary structure can be revealed by identification of compensatory mutations

G C C U U C G G G C

G A C U U C G G U C

G G C U U C G G C C

U C

U G

C G

N N’

G C

43 of 71

Insight from Multiple Alignment

Information from multiple sequence alignment (MSA) can help to predict the

probability of positions i,j to be base-paired.

Conservation – no additional information
Consistent mutations (GC🡪 GU) – support stem
Inconsistent mutations – does not support stem.
Compensatory mutations – support stem.

44 of 71

Predicts the consensus secondary structure for a set of aligned RNA sequences by using modified dynamic programming algorithm that add alignment information to the standard energy model
Improvement in prediction accuracy

RNAalifold

45 of 71

LOCAL MINIMUM FREE ENERGY: DENSITYFOLD

46 of 71

E.coli 5S rRNA

Energy Density Landscape

47 of 71

E.coli 5S rRNA predictions

mFold

RNAalifold

rnaScf

48 of 71

Densityfold (alteRNA)

Instead of finding minimum global free energy, find local minimum free energies
Emulate the folding process of RNA folding by aiming to keep locally stable substructures
Energy density seen by a basepair: the free energy of the “optimal substructure” normalized by distance
Energy density of an unpaired base: energy density of the nearest encapsulating basepair
Densityfold optimizes a linear combination of free energy and total energy density
For every potential basepair, compute the optimal contribution of the implied substructure
The optimization function is non linear

Hill climbing process for approximating the contributions of unpaired bases

49 of 71

Substructures

Hairpin loop

Junction (Multiloop)

Bulge Loop

Single-Stranded

Interior Loop

Pseudoknot

50 of 71

Densityfold energy types

eH(i,j,): free energy of a hairpin loop enclosed by the base pair S[i].S[j]
eS(i,j,): free energy of the base pair S[i].S[j] provided that it forms a stacking pair with S[i+1].S[j-1]
eBI(i,j,i’,j’): free energy of an internal loop or a bulge that starts with S[i].S[j] and ends with S[i’].S[j’]

51 of 71

Densityfold energy types

eM(i,j,i₁,j₁,…,i_k,j_k): free energy of multibranch loop that starts with S[i].S[j] and branches out S[i₁].S[j₁], S[i₂].S[j₂], …, S[i_k].S[j_k]
eDA(j,j-1): free energy of an unpaired dangling base S[j] when S[j-1] forms a base pair with another base

52 of 71

Densityfold energy tables

ED(j): minimum total free energy density of a secondary structure for substring S[1, j].
E(j): free energy of the energy density minimized secondary structure for substring S[1, j].
ED_S(i, j): minimum total free energy density of a secondary structure for S[i, j], provided that S[i].S[j] is a base pair.
E_S(i, j): free energy of the energy density minimized secondary structure for the substring S[i, j], provided that S[i].S[j] is a base pair.

53 of 71

Densityfold energy tables

ED_BI (i, j): minimum total free energy density of a secondary structure for S[i, j], provided that there is a bulge or an internal loop starting with base pair S[i].S[j].
E_BI (i, j): free energy of an energy density minimized structure for S[i, j], provided that a bulge or an internal loop starting with base pair S[i].S[j].
ED_M(i, j): minimum total free energy density of a secondary structure for S[i, j], such that there is a multibranch loop starting with base pair S[i].S[j].
E_M(i, j): free energy of an energy density minimized structure for S[i, j], provided there is a multibranch loop starting with base pair S[i].S[j].

54 of 71

Calculating energy tables

Similar calculations for other tables
O(n^k+2) time and O(n²) space

55 of 71

Linear combination of MFE and ED

Similar formulations for ELC_BI and ELC_M
O(n⁴) running time

For any x ε {S,BI,M} let ELC_x(i, j) = ED_x(i, j) + E_x(i, j).

Optimize ELC(n) = ED(n) + E(n).

56 of 71

Densityfold prediction: E.coli 5S rRNA

Known Structure

Densityfold Prediction

57 of 71

STOCHASTIC CONTEXT-FREE GRAMMARS

58 of 71

SCFG

RNA folding can be represented as context-free grammars

59 of 71

Chomsky hierarchy

unrestricted grammars

context-sensitive grammars

context-free grammars

regular grammars

(equivalent to finite automata & HMM’s)

(equivalent to SCFG’s & pushdown automata)

(equivalent to Turing machines & recursively enumerable sets)

(equivalent to linear bounded automata)

B. Majoros

60 of 71

Context-free grammars

A context-free grammar is a generative model denoted by a 4-tuple:

G = (V, Σ, S, P)

where:

Σ is a terminal alphabet, (e.g., {a, c, g, u} )

V is a nonterminal alphabet, (e.g., {A, B, C, D, E, ...} )

S∈V is a special start symbol, and

P is a set of rewriting rules called productions.

Productions in P are rules of the form:

X → λ

where X∈V, λ∈(V∪α)^*

B. Majoros

61 of 71

Context “freeness”

The “context-freeness” is imposed by the requirement that the l.h.s of each production rule may contain only a single symbol, and that symbol must be a nonterminal:

X → λ

Thus, a CFG cannot specify context-sensitive rules such as:

wXz → wλz

B. Majoros

62 of 71

Derivations

Suppose a CFG G has generated a terminal string x∈ Σ ^*. A derivation S ⇒ ^*x denotes a possible derivation for generating x.

A derivation (or parse) consists of a series of applications of productions from P, beginning with the start symbol S and ending with the terminal string x:

S ⇒ s₁ ⇒ s₂ ⇒ s₃ ⇒ L ⇒ x

where s_i∈(V∪ Σ)^*.

We’ll concentrate of leftmost derivations where the leftmost nonterminal is always replaced first.

B. Majoros

63 of 71

Context-free vs. regular

The advantage of CFG’s over HMM’s lies in their ability to model arbitrary runs of matching pairs of elements, such as matching pairs of parentheses:

L((((((((L))))))))L

When the number of matching pairs is unbounded, a finite-state model such as a DFA or an HMM is inadequate to enforce the constraint that all left elements must have a matching right element.

In contrast, in a CFG we can use rules such as X→(X). A sample derivation using such a rule is:

X ⇒ (X) ⇒ ((X)) ⇒ (((X))) ⇒ ((((X)))) ⇒ (((((X)))))

An additional rule such as X→ε is necessary to terminate the recursion.

B. Majoros

64 of 71

A CFG for an RNA

RNA hairpin with 3 bp stem and a 4-base loop (GAAA or GCAA)

S-> aXu | cXg | gXc | uXa

X-> aYu | cYg | gYc | uYa

Y-> aZu | cZg | gZc | uZa

Z->gaaa | gcaa

R. Shamir & R. Sharan

65 of 71

Parse trees

A representation of a parse of a string by a CFG
Root – start nonterminal S
Leaves – terminal symbols in the given string
Internal nodes - nonterminals
The children of an internal node are the productions of that nonterminal (left-to-right order

R. Shamir & R. Sharan

66 of 71

Stochastic CFG

A stochastic context-free grammar (SCFG) is a CFG plus a probability distribution on productions:

G = (V, Σ, S, P, R_p)

where P_p: P a ¡, and probabilities are normalized at the level of each l.h.s. symbol X:

∀[ ∑R_p(X→λ)=1 ]

X∈VX→λ

Thus, we can compute the probability of a single derivation S⇒^*x by multiplying the probabilities for all productions used in the derivation:

∏ _i R(X_i→λ_i)

We can sum over all possible (leftmost) derivations of a given string x to get the probability that G will generate x at random:

R(x | G) = ∑ R(S⇒_j^*x | G).

B. Majoros

67 of 71

An example

As an example, consider G=(V_G, α, S, P_G, R_G), for V_G={S, L, N}, Σ ={a,c,g,t}, and P_G the set consisting of:

S → a S u | u S a | c S g | g S c | L

L → N N N N

N → a | c | g | u

Then the probability of the sequence acguacguacgu is given by:

P(acguacguacgu) =

P( S ⇒ aSu ⇒ acSgu ⇒ acgScgu ⇒ acguSacgu ⇒

acguLacgu ⇒ acguNNNNacgu ⇒ acguaNNNacgu ⇒

acguacNNacgu ⇒ acguacgNacgu ⇒ acguacguacgu) =

0.2 × 0.2 × 0.2 × 0.2 × 0.2 × 1 × 0.25 × 0.25 × 0.25 × 0.25 = 1.25×10^-⁶

because this sequence has only one possible (leftmost) derivation under grammar G.

(P=0.2)

(P=1.0)

(P=0.25)

B. Majoros

68 of 71

Structure using SFCG

Grammar rules with associated probabilities

acuSag

S 🡪 aSu | cSg | aS | uS | … | Su | SS | ε

P .21 .15 .11 .08 .03 .22 .02

acSg

acuSuag

acugScuag

acuguScuag

acuguaScuag

acuguauScuag

acuguaucuag

.(((...).))

Let’s generate a structure for the sequence acuguaucuag

acuguacuag

.(((..).))

acugucuag

.(((.).))

acugcuag

.((().))

acuuag

.((.))

acuag

.(())

acg

.()

We select the set of transformations that highest probability of generating the input sequence. This set gives us our structure.

69 of 71

Chomsky Normal Form

Non-CNF:

S → a S t | t S a | c S g | g S c | L

L → N N N N

N → a | c | g | u

CNF:

S → A S_T | T S_A | C S_G | G S_C | N L₁

S_A → S A

S_T → S T

S_C → S C

S_G → S G

L₁ → N L₂

L₂ → N N

N → a | c | g | u

A → a

C → c

G → g

T → u

A CNF grammar is one in which all productions are of the form:

X → Y Z

or:

X → a

B. Majoros

70 of 71

Parsing CFG

Two questions for a CFG:

Can a grammar G derive string x?
If so, what series of productions would be used during the derivation? (there may be multiple answers!)

Additional questions for an SCFG:

What is the probability that G derives string x?
What is the most probable derivation of x via G?

B. Majoros

71 of 71

Parsing CFG

CYK Algorithm (Cocke-Younger-Kasami)

Dynamic Programming method

Modified CYK for SCFG

“Inside algorithm”
Training similar to HMM

If parses are known for training data sequences, simply count the number of times for each production, calculate probabilities (labeled sequence training for HMM)
If parses are not known, apply an EM algorithm called “Inside-Outside” (“forward-backward” for HMM)