CS481/CS583: Bioinformatics Algorithms
Can Alkan
EA509
calkan@cs.bilkent.edu.tr
http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/
RNA STRUCTURE
RNA Basics
A–U pairing: 2 hydrogen bonds
G–C pairing: 3 hydrogen bonds – more stable
G–U: “wobble” pairing
RNA Basics
RNA folding
RNA Structural Levels
Primary: AAUCG...CUUCUUCCA
Secondary
Tertiary
RNA families
(most of the data is taken from specific databases)
Rfam: http://www.sanger.ac.uk/Software/Rfam/
Includes many families of non-coding RNAs and functional motifs, as well as their alignments and secondary structures
RNA Secondary Structure
Hairpin loop
Junction (Multiloop)
Bulge Loop
Single-Stranded
Interior Loop
Stem
Pseudoknot
Example: 5S rRNA
E. coli 5S
120 bases
T. thermophilus 5S
120 bases
Example: E. coli 16S rRNA
1542 bases
Example: E. coli 23S rRNA
2904 bases
Example: HIV
9173 bases
Watts et al., Nature, 2009
Example: SARS-CoV and SARS-CoV-2
Simmonds, 2020
Binary Tree Representation of RNA Secondary Structure
Images – Eddy et al.
Circular Representation
Images – David Mount
Examples of known interactions of RNA secondary structural elements
Pseudoknot
Kissing hairpins
Hairpin-bulge contact
These patterns are excluded from most prediction schemes because computing them is too expensive.
Predicting RNA secondary structure
Sequence Alignment as a method to determine structure
Simplifying Assumptions
Base Pair Maximization
Example sequence: UCCAGGAC
Zuker (1981) Nucleic Acids Research 9(1) 133-149
Base Pair Maximization – Dynamic Programming Algorithm
Simple Example:
Maximizing Base Pairing
http://bix.ucsd.edu/bioalgorithms/slides.php
S(i,j) is the folding of the subsequence of the RNA strand from index i to index j which results in the highest number of base pairs
Base Pair Maximization – Dynamic Programming Algorithm
Initialize first two diagonal arrays to 0
Fill in squares sweeping diagonally
Images – Sean Eddy
Dynamic programming – possible paths:
S(i + 1, j) — bases i and j cannot pair; base i is left unpaired (similar to an unmatched position in alignment)
S(i, j − 1) — base j is left unpaired
S(i + 1, j − 1) + 1 — bases i and j can pair (similar to a matched position in alignment)
Bifurcation: the optimal structure of the subsequence may split into two independent substructures.
For all k with i ≤ k < j, also consider S(i, k) + S(k + 1, j) and take the maximum over k.
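The base pair maximization recurrence above (usually credited to Nussinov) can be sketched directly in Python. This is a minimal illustration, not the slides' code: the function name, the allowed-pair set (including G–U wobble), and the min_loop parameter are our choices.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: S(i, j) = max number of base pairs in seq[i..j].

    min_loop is the minimum number of unpaired bases in a hairpin loop
    (3 is a common choice; the simplest formulation uses 0).
    """
    can_pair = {("A", "U"), ("U", "A"), ("C", "G"),
                ("G", "C"), ("G", "U"), ("U", "G")}  # incl. wobble pairs
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    # Diagonals of width 0 and 1 stay 0; fill squares sweeping diagonally.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])       # i or j unpaired
            if (seq[i], seq[j]) in can_pair and span > min_loop:
                best = max(best, S[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):                  # bifurcation
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1] if n else 0
```

For example, `max_base_pairs("GGGAAACCC")` finds the three nested G–C pairs closing an AAA loop.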
Base Pair Maximization – Drawbacks
Energy Minimization
Free energy model
Free energy of a structure is the sum of all interaction energies
Each interaction energy can be determined thermodynamically
Free energy: E = E(CG) + E(CG) + …
Why is MFE secondary structure prediction hard?
RNA folding with dynamic programming (Zuker and Stiegler)
W(i, j): optimal (minimum) folding energy of the subsequence from index i to index j
RNA folding with dynamic programming
The recurrence for W(i, j) considers the subproblems at indices (i + 1) and (j − 1), as in base pair maximization.
Energy Minimization Results
Images – David Mount
Exception: the location where the 5’ and 3’ ends of the RNA come together in the circular representation
Trouble with Pseudoknots
Images – David Mount
Sequence-dependent free energy: Nearest Neighbor Model
[Figure: two hairpins that are identical except for one stem pair (G–C vs. U–A), closed by the same UU loop, on a 5’ A…UCGAC 3’ backbone]
Energy is influenced by the adjacent (previous) base pair, not by base pairs further away.
Sequence-dependent free energy values of the base pairs
Example stacking energies (kcal/mol), for a GC pair stacked on a neighboring pair:
GC/AU = -2.3    GC/GC = -2.9    GC/CG = -3.4    GC/UA = -2.1
These energies are estimated experimentally from small synthetic RNAs.
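Under the nearest-neighbor model, a stem's energy is the sum of stacking terms for consecutive base pairs. A minimal sketch using the example values above; the dictionary encoding and helper name are ours, and real tables (e.g., Turner parameters) cover all pair combinations:

```python
# Stacking energies in kcal/mol for a GC pair stacked on a neighboring
# pair, taken from the example values above (illustrative subset only).
STACK = {
    ("GC", "AU"): -2.3,
    ("GC", "GC"): -2.9,
    ("GC", "CG"): -3.4,
    ("GC", "UA"): -2.1,
}

def stem_energy(pairs):
    """Sum nearest-neighbor stacking terms over consecutive base pairs,
    listed 5' to 3' along one strand."""
    return sum(STACK[(pairs[k], pairs[k + 1])] for k in range(len(pairs) - 1))
```

For instance, a stem of GC stacked on GC stacked on AU contributes -2.9 + (-2.3) = -5.2 kcal/mol.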
Adding Complexity to Energy Calculations
Mfold
Free energy computation
[Figure: example hairpin containing a 4-nt loop and a 1-nt bulge]
Free energy contributions (kcal/mol):
-0.3
-0.3
-1.1 mismatch of hairpin
-2.9 stacking
+3.3 1-nt bulge
-2.9 stacking
-1.8 stacking
5’ dangling
-0.9 stacking
-1.8 stacking
-2.1 stacking
+5.9 4-nt loop
ΔG = -4.6 kcal/mol
Mfold
Frey U H et al. Clin Cancer Res 2005;11:5071-5077
©2005 by American Association for Cancer Research
More than one structure can be predicted for the same RNA.
Energy Minimization Drawbacks
RNA fold prediction based on Multiple Alignment
Information from multiple sequence alignment (MSA) can help to predict the probability of positions i,j to be base-paired.
G C C U U C G G G C
G A C U U C G G U C
G G C U U C G G C C
Compensatory Substitutions
[Figure: hairpin stem in which one base pair is replaced by a different complementary pair; the compensatory substitution preserves the structure]
Mutations that maintain the secondary structure can help predict the fold
RNA secondary structure can be revealed by identification of compensatory mutations
G C C U U C G G G C
G A C U U C G G U C
G G C U U C G G C C
The covarying alignment columns always yield a valid pair (e.g., C·G, U·G, G·C), supporting a conserved base pair N·N’.
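Compensatory substitutions like these can be detected computationally: scan pairs of alignment columns for positions that always form a Watson–Crick pair yet vary across sequences. A simplified sketch (function name and the all-Watson–Crick criterion are ours; real tools use covariance or mutual-information scores):

```python
WC = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def covarying_columns(msa):
    """Return 0-based column index pairs (i, j) where every sequence
    shows a Watson-Crick pair and at least two distinct pairs occur,
    i.e., compensatory substitutions support pairing of i with j."""
    length = len(msa[0])
    hits = []
    for i in range(length):
        for j in range(i + 1, length):
            pairs = [(s[i], s[j]) for s in msa]
            if all(p in WC for p in pairs) and len(set(pairs)) > 1:
                hits.append((i, j))
    return hits

# The alignment from the slides.
msa = ["GCCUUCGGGC",
       "GACUUCGGUC",
       "GGCUUCGGCC"]
```

On this alignment, only columns 1 and 8 (0-based) covary as C·G, A·U, G·C across the three sequences.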
Insight from Multiple Alignment
Information from multiple sequence alignment (MSA) can help to predict the probability of positions i,j to be base-paired.
RNAalifold
LOCAL MINIMUM FREE ENERGY: DENSITYFOLD
E.coli 5S rRNA
Energy Density Landscape
E.coli 5S rRNA predictions
mFold
RNAalifold
rnaScf
Densityfold (alteRNA)
Hill climbing process for approximating the contributions of unpaired bases
Substructures
Hairpin loop
Junction (Multiloop)
Bulge Loop
Single-Stranded
Interior Loop
Pseudoknot
Densityfold energy types
Densityfold energy tables
Calculating energy tables
Linear combination of MFE and ED
For any x ∈ {S, BI, M}, let ELC_x(i, j) = ED_x(i, j) + E_x(i, j).
Optimize ELC(n) = ED(n) + E(n).
Densityfold prediction: E.coli 5S rRNA
Known Structure
Densityfold Prediction
STOCHASTIC CONTEXT-FREE GRAMMARS
SCFG
Chomsky hierarchy
unrestricted grammars (equivalent to Turing machines & recursively enumerable sets)
context-sensitive grammars (equivalent to linear bounded automata)
context-free grammars (equivalent to SCFG’s & pushdown automata)
regular grammars (equivalent to finite automata & HMM’s)
B. Majoros
Context-free grammars
A context-free grammar is a generative model denoted by a 4-tuple:
G = (V, Σ, S, P)
where:
Σ is a terminal alphabet, (e.g., {a, c, g, u} )
V is a nonterminal alphabet, (e.g., {A, B, C, D, E, ...} )
S∈V is a special start symbol, and
P is a set of rewriting rules called productions.
Productions in P are rules of the form:
X → λ
where X∈V, λ∈(V∪Σ)*
B. Majoros
Context “freeness”
The “context-freeness” is imposed by the requirement that the l.h.s of each production rule may contain only a single symbol, and that symbol must be a nonterminal:
X → λ
Thus, a CFG cannot specify context-sensitive rules such as:
wXz → wλz
B. Majoros
Derivations
Suppose a CFG G has generated a terminal string x∈Σ*. S ⇒* x denotes that x is derivable from S.
A derivation (or parse) consists of a series of applications of productions from P, beginning with the start symbol S and ending with the terminal string x:
S ⇒ s1 ⇒ s2 ⇒ s3 ⇒ ⋯ ⇒ x
where si∈(V∪Σ)*.
We’ll concentrate on leftmost derivations, where the leftmost nonterminal is always replaced first.
B. Majoros
Context-free vs. regular
The advantage of CFG’s over HMM’s lies in their ability to model arbitrary runs of matching pairs of elements, such as matching pairs of parentheses:
…((((((((…))))))))…
When the number of matching pairs is unbounded, a finite-state model such as a DFA or an HMM is inadequate to enforce the constraint that all left elements must have a matching right element.
In contrast, in a CFG we can use rules such as X→(X). A sample derivation using such a rule is:
X ⇒ (X) ⇒ ((X)) ⇒ (((X))) ⇒ ((((X)))) ⇒ (((((X)))))
An additional rule such as X→ε is necessary to terminate the recursion.
B. Majoros
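A minimal sketch of what the rules X → ( X ) and X → ε recognize: a checker for the language they generate must track unbounded nesting depth, which recursion handles naturally but no finite-state machine can. The function name is ours.

```python
def derivable(s):
    """True iff s can be derived from X using X -> ( X ) | epsilon,
    i.e., s has the form "(" * n + ")" * n for some n >= 0."""
    if s == "":
        return True  # X -> epsilon terminates the recursion
    # X -> ( X ): strip one matching outer pair and recurse on the inside
    return s.startswith("(") and s.endswith(")") and derivable(s[1:-1])
```

Note that a string like "()()" is rejected: this particular grammar generates only fully nested pairs, mirroring how nested RNA base pairs enclose one another.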
A CFG for an RNA
S-> aXu | cXg | gXc | uXa
X-> aYu | cYg | gYc | uYa
Y-> aZu | cZg | gZc | uZa
Z->gaaa | gcaa
R. Shamir & R. Sharan
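The toy grammar above generates a small finite language of hairpin-like strings. A sketch that enumerates it by exhaustively expanding the leftmost nonterminal (the dictionary encoding, with nonterminals as uppercase letters, is our own representation):

```python
# The CFG for an RNA from the slides; uppercase letters are nonterminals.
RULES = {
    "S": ["aXu", "cXg", "gXc", "uXa"],
    "X": ["aYu", "cYg", "gYc", "uYa"],
    "Y": ["aZu", "cZg", "gZc", "uZa"],
    "Z": ["gaaa", "gcaa"],
}

def expand(form):
    """Yield every terminal string derivable from a sentential form."""
    for pos, ch in enumerate(form):
        if ch.isupper():  # leftmost nonterminal
            for rhs in RULES[ch]:
                yield from expand(form[:pos] + rhs + form[pos + 1:])
            return
    yield form  # no nonterminals left: a terminal string

language = sorted(set(expand("S")))
```

Every derived string is 10 bases long: three nested complementary pairs (from S, X, Y) around a 4-base core from Z, so the language has 4 × 4 × 4 × 2 = 128 strings.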
Parse trees
R. Shamir & R. Sharan
Stochastic CFG
A stochastic context-free grammar (SCFG) is a CFG plus a probability distribution on productions:
G = (V, Σ, S, P, ρ)
where ρ : P → [0, 1], and probabilities are normalized at the level of each l.h.s. symbol X:
∀X∈V [ Σ_(X→λ)∈P ρ(X→λ) = 1 ]
Thus, we can compute the probability of a single derivation S ⇒* x by multiplying the probabilities for all productions used in the derivation:
∏_i ρ(X_i → λ_i)
We can sum over all possible (leftmost) derivations of a given string x to get the probability that G will generate x at random:
P(x | G) = Σ_j P(S ⇒*_j x | G)
B. Majoros
An example
As an example, consider G = (VG, Σ, S, PG, ρG), for VG = {S, L, N}, Σ = {a, c, g, u}, and PG the set consisting of:
S → a S u | u S a | c S g | g S c | L    (each P = 0.2)
L → N N N N    (P = 1.0)
N → a | c | g | u    (each P = 0.25)
Then the probability of the sequence acguacguacgu is given by:
P(acguacguacgu) =
P( S ⇒ aSu ⇒ acSgu ⇒ acgScgu ⇒ acguSacgu ⇒
acguLacgu ⇒ acguNNNNacgu ⇒ acguaNNNacgu ⇒
acguacNNacgu ⇒ acguacgNacgu ⇒ acguacguacgu ) =
0.2 × 0.2 × 0.2 × 0.2 × 0.2 × 1 × 0.25 × 0.25 × 0.25 × 0.25 = 1.25 × 10⁻⁶
because this sequence has only one possible (leftmost) derivation under grammar G.
B. Majoros
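The product above can be checked directly by multiplying the rule probabilities along the derivation. A small sketch (the string encoding of rules is ours):

```python
from math import isclose, prod

# Rule probabilities from the example grammar.
rule_prob = {"S->aSu": 0.2, "S->uSa": 0.2, "S->cSg": 0.2,
             "S->gSc": 0.2, "S->L": 0.2, "L->NNNN": 1.0,
             "N->a": 0.25, "N->c": 0.25, "N->g": 0.25, "N->u": 0.25}

# Productions used in the single leftmost derivation of acguacguacgu.
derivation = ["S->aSu", "S->cSg", "S->gSc", "S->uSa", "S->L",
              "L->NNNN", "N->a", "N->c", "N->g", "N->u"]

# P(derivation) = product of the probabilities of the rules applied.
p = prod(rule_prob[r] for r in derivation)
```

This reproduces 0.2⁵ × 1 × 0.25⁴ = 1.25 × 10⁻⁶.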
Structure using SCFG
S → aSu | cSg | aS | uS | … | Su | SS | ε
(P = .21, .15, .11, .08, …, .03, .22, .02)

Example leftmost derivation:
S ⇒ aS ⇒ acSg ⇒ acuSag ⇒ acuSuag ⇒ acugScuag ⇒ acuguScuag ⇒ acuguaScuag ⇒ acuguauScuag ⇒ acuguaucuag

Resulting string and secondary structure:
acuguaucuag
.(((...).))
Pairing rules (aSu, cSg, …) emit base pairs (brackets); unpaired-emission rules (aS, uS, Su) emit unpaired bases (dots).
Chomsky Normal Form
Non-CNF:
S → a S u | u S a | c S g | g S c | L
L → N N N N
N → a | c | g | u
CNF:
S → A ST | T SA | C SG | G SC | N L1
SA → S A
ST → S T
SC → S C
SG → S G
L1 → N L2
L2 → N N
N → a | c | g | u
A → a
C → c
G → g
T → u
A CNF grammar is one in which all productions are of the form:
X → Y Z
or:
X → a
B. Majoros
Parsing CFG
Two questions for a CFG:
Can the grammar derive a given string x (membership)?
If so, what is a parse (derivation) of x?
Additional questions for an SCFG:
What is the probability P(x | G)?
What is the most probable parse of x?
B. Majoros
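Membership for a grammar in Chomsky Normal Form is answered by the CYK algorithm. A sketch using the CNF grammar from the slides; the rule encoding as tuples is our own representation:

```python
def cyk(word, unary, binary, start="S"):
    """CYK membership test for a grammar in Chomsky Normal Form.
    unary:  list of (X, terminal) for rules X -> a
    binary: list of (X, (Y, Z)) for rules X -> Y Z"""
    n = len(word)
    # table[i][L] = set of nonterminals deriving word[i:i+L]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, a in enumerate(word):
        table[i][1] = {X for X, t in unary if t == a}
    for L in range(2, n + 1):            # span length
        for i in range(n - L + 1):       # span start
            for split in range(1, L):    # split point within the span
                for X, (Y, Z) in binary:
                    if Y in table[i][split] and Z in table[i + split][L - split]:
                        table[i][L].add(X)
    return start in table[0][n]

# The CNF grammar from the slides.
UNARY = [("N", "a"), ("N", "c"), ("N", "g"), ("N", "u"),
         ("A", "a"), ("C", "c"), ("G", "g"), ("T", "u")]
BINARY = [("S", ("A", "ST")), ("S", ("T", "SA")), ("S", ("C", "SG")),
          ("S", ("G", "SC")), ("S", ("N", "L1")),
          ("SA", ("S", "A")), ("ST", ("S", "T")),
          ("SC", ("S", "C")), ("SG", ("S", "G")),
          ("L1", ("N", "L2")), ("L2", ("N", "N"))]
```

For example, "acgu" is derivable (a four-base loop via N L1), and wrapping it in a complementary a…u pair gives another derivable string, "aacguu". The inside (and inside-outside) algorithms for SCFGs replace CYK's set union with probability sums over the same table.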