1 of 31

Motif finding, part II

CSE 180

Yana Safonova

2 of 31

Back to profile matrix

Each column is a probability space with at most 4 outcomes
We consider that all positions corresponds to independent events
Thus, we can apply the formula for computing a probability of complex events

3 of 31

We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

4 of 31

We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) = ½ + ½ + ½ = 1.5 > 1

What did we do wrong?

5 of 31

We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH

6 of 31

We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH
Only 1 out of 8 results does not contain tails: HHH

7 of 31

We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH
Only 1 out of 8 results does not contain tails: HHH
P(at least one T after 3 flippings) = P(TTT + … + HHT) = P(TTT) + … + P(HHT) = ⅛ * 7 = 7 / 8

8 of 31

We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH
Only 1 out of 8 results does not contain tails: HHH
P(at least one T after 3 flippings) = 1 - P(no tails after 3 flippings) = 1 - P(HHH) = 1 - ⅛ = 7 / 8

9 of 31

Random variables

It is often useful to look at aspects of the experiment’s outcomes
A random variable X is a function from the Ω to some new space S

Example: random variable “rolling a die” = {1, 2, 5, 6, 6, 6, 3, 2, 1, 4}, where values are numbers on the die

What are Ω and S here?

10 of 31

Samples of random variables

Age / height of people in this room
Number of tails after flipping a coin 3 times
Number of errors in sequencing reads
Positions of sequencing reads in the reference genome

Question: what are Ω and S in each of these cases?

Sample is a set of collected data that was selected from a statistical population by a defined procedure
Sample is a collection of values of a random variable X

11 of 31

Rolling a die...

Sample presents a collection of numbers from 1 to 6

Let’s assume that we rolled the die N times

What is the fraction of 1s in the sample?

What is the fraction of other numbers?

12 of 31

Histogram for 1000 trials

13 of 31

Histograms in everyday life

14 of 31

Histograms in everyday life

15 of 31

Histograms in everyday life

16 of 31

Uniform distribution

When we roll a die, all outcomes have equal probability (⅙)
In this case, we say that X follows uniform distribution

17 of 31

Uniform distribution

When we roll a die, all outcomes have equal probability (⅙)
In this case, we say that X follows uniform distribution
Question: let X be the sum of numbers on two dice. Is X distributed uniformly?

18 of 31

Uniform distribution

When we roll a die, all outcomes have equal probability (⅙)
In this case, we say that X follows uniform distribution
Question: let X be the sum of numbers on two dice. Is X distributed uniformly? NO

2 = 1 + 1

3 = 1 + 2 = 2 + 1

4 = 1 + 3 = 2 + 2 = 3 + 1

…

7 = 1 + 6 = 2 + 5 = 3 + 4 = 4 + 3 = 5 + 2 = 6 + 1

…

12 = 6 + 6

19 of 31

Random numbers

Modeling uniform distribution is important mathematical problem
Applications: simulation and programm debugging, statistical sampling, randomized algorithms, cryptography
Random numbers are not truly random:

R1 = seed, R2 = f(R1), R3 = f(R2), ...

20 of 31

Random numbers

Modeling uniform distribution is important mathematical problem
Applications: simulation and programm debugging, statistical sampling, randomized algorithms, cryptography
Random numbers are not truly random:

R1 = seed, R2 = f(R1), R3 = f(R2), ...

Question: what do you think - are people good at generating random numbers?

21 of 31

Random package

import random

random.randint(1, 1000)

>979

import random

random.seed(1)

random.randint(1, 1000)

>135

import random

random.randint(1, 1000)

>375

import random

random.seed(1)

random.randint(1, 1000)

>135

22 of 31

Not all distributions are uniform

X is the sum of numbers on two dice

2 = 1 + 1

3 = 1 + 2 = 2 + 1

4 = 1 + 3 = 2 + 2 = 3 + 1

…

7 = 1 + 6 = 2 + 5 = 3 + 4 = 4 + 3 = 5 + 2 = 6 + 1

…

11 = 5 + 6 = 6 + 5

12 = 6 + 6

23 of 31

Not all distributions are uniform

X is the sum of numbers on two dice (1000 trials)

24 of 31

Gaussian distribution

used to represent real-valued random variables whose distributions are not known
probability of X = x

mu - average value, sigma - standard deviation (how wide is your distribution)
implemented in many packages including random
random.gauss(mu, sigma)

25 of 31

Visualization of samples

import matplotlib as mplt

import matplotlib.pyplot as plt

import random

x = []

n = 10000

for i in range(n):

x.append(random.gauss(1, 10))

plt.hist(x, bins = 100)

plt.xlabel(‘X’)

plt.ylabel(‘# outcomes’)

plt.savefig(“my_hist.png”)

26 of 31

Visualization of samples

27 of 31

Visualization of samples

n_1_100 = []

n_1_50 = []

n_1_10 = []

for i in range(n):

n_1_100.append(random.gauss(1, 100))

n_1_50.append(random.gauss(1, 50))

n_1_10.append(random.gauss(1, 10))

plt.hist([n_1_100, n_1_50, n_1_10], bins = 100, label = ['mu = 1, sigma = 100', 'mu = 1, sigma = 50', 'mu = 1, sigma = 10'])

plt.legend()

plt.savefig(“my_hist.png”)

28 of 31

Visualization of samples

29 of 31

Helpful links

Other types of distributions: http://www.ams.jhu.edu/~dan/550.435/notes/COURSENOTES435.pdf

Matplotlib package: https://matplotlib.org/
Random library: https://docs.python.org/2/library/random.html
Seaborn visualization: https://seaborn.pydata.org/

30 of 31

How to measure significance of a motif

String S of length L and motif M of length K
Probability to see a random sequence of length K = 1 / 4^K
Computing probability to see M in S is tricky!
However, expected number of “random” occurrences of M in S = L / 4^K

Examples:

L = 1M, K = 5. EN = 1,000,000 / 4⁵ = 1,000,000 / 1024 ~ 976
L = 1M, K = 8. EN = 1,000,000 / 4⁸ = 1,000,000 / 65,536 ~ 15

31 of 31

Practice

Generate a random k-mer assuming that all nucleotides have equally probability

Generate a random k-mer assuming that probabilities of nucleotides are P(A) = 0.15, P(C) = 0.35, P(G) = 0.4, P(T) = 0.1