1 of 31

Motif finding, part II

CSE 180

Yana Safonova

2 of 31

Back to profile matrix

  • Each column is a probability space with at most 4 outcomes
  • We consider that all positions corresponds to independent events
  • Thus, we can apply the formula for computing a probability of complex events

3 of 31

  • We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

4 of 31

  • We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) = ½ + ½ + ½ = 1.5 > 1

What did we do wrong?

5 of 31

  • We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

  • All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH

6 of 31

  • We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

  • All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH
  • Only 1 out of 8 results does not contain tails: HHH

7 of 31

  • We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

  • All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH
  • Only 1 out of 8 results does not contain tails: HHH
  • P(at least one T after 3 flippings) = P(TTT + … + HHT) = P(TTT) + … + P(HHT) = ⅛ * 7 = 7 / 8

8 of 31

  • We flip a coin three times. What is the probability to get tails at least once?

P(T) = ½

P(at least one T after 3 flippings) ! = ½ + ½ + ½ = 1.5 > 1

  • All possible flipping results: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH
  • Only 1 out of 8 results does not contain tails: HHH
  • P(at least one T after 3 flippings) = 1 - P(no tails after 3 flippings) = 1 - P(HHH) = 1 - ⅛ = 7 / 8

9 of 31

Random variables

  • It is often useful to look at aspects of the experiment’s outcomes
  • A random variable X is a function from the Ω to some new space S

  • Example: random variable “rolling a die” = {1, 2, 5, 6, 6, 6, 3, 2, 1, 4}, where values are numbers on the die

What are Ω and S here?

S

10 of 31

Samples of random variables

  • Age / height of people in this room
  • Number of tails after flipping a coin 3 times
  • Number of errors in sequencing reads
  • Positions of sequencing reads in the reference genome

Question: what are Ω and S in each of these cases?

  • Sample is a set of collected data that was selected from a statistical population by a defined procedure
  • Sample is a collection of values of a random variable X

11 of 31

Rolling a die...

  • Sample presents a collection of numbers from 1 to 6

  • Let’s assume that we rolled the die N times

  • What is the fraction of 1s in the sample?

  • What is the fraction of other numbers?

12 of 31

Histogram for 1000 trials

13 of 31

Histograms in everyday life

14 of 31

Histograms in everyday life

15 of 31

Histograms in everyday life

16 of 31

Uniform distribution

  • When we roll a die, all outcomes have equal probability (⅙)
  • In this case, we say that X follows uniform distribution

17 of 31

Uniform distribution

  • When we roll a die, all outcomes have equal probability (⅙)
  • In this case, we say that X follows uniform distribution
  • Question: let X be the sum of numbers on two dice. Is X distributed uniformly?

18 of 31

Uniform distribution

  • When we roll a die, all outcomes have equal probability (⅙)
  • In this case, we say that X follows uniform distribution
  • Question: let X be the sum of numbers on two dice. Is X distributed uniformly? NO

2 = 1 + 1

3 = 1 + 2 = 2 + 1

4 = 1 + 3 = 2 + 2 = 3 + 1

7 = 1 + 6 = 2 + 5 = 3 + 4 = 4 + 3 = 5 + 2 = 6 + 1

12 = 6 + 6

19 of 31

Random numbers

  • Modeling uniform distribution is important mathematical problem
  • Applications: simulation and programm debugging, statistical sampling, randomized algorithms, cryptography
  • Random numbers are not truly random:

R1 = seed, R2 = f(R1), R3 = f(R2), ...

20 of 31

Random numbers

  • Modeling uniform distribution is important mathematical problem
  • Applications: simulation and programm debugging, statistical sampling, randomized algorithms, cryptography
  • Random numbers are not truly random:

R1 = seed, R2 = f(R1), R3 = f(R2), ...

  • Question: what do you think - are people good at generating random numbers?

21 of 31

Random package

import random

random.randint(1, 1000)

>979

import random

random.seed(1)

random.randint(1, 1000)

>135

import random

random.randint(1, 1000)

>375

import random

random.seed(1)

random.randint(1, 1000)

>135

22 of 31

Not all distributions are uniform

  • X is the sum of numbers on two dice

2 = 1 + 1

3 = 1 + 2 = 2 + 1

4 = 1 + 3 = 2 + 2 = 3 + 1

7 = 1 + 6 = 2 + 5 = 3 + 4 = 4 + 3 = 5 + 2 = 6 + 1

11 = 5 + 6 = 6 + 5

12 = 6 + 6

23 of 31

Not all distributions are uniform

  • X is the sum of numbers on two dice (1000 trials)

24 of 31

Gaussian distribution

  • used to represent real-valued random variables whose distributions are not known
  • probability of X = x

  • mu - average value, sigma - standard deviation (how wide is your distribution)
  • implemented in many packages including random
  • random.gauss(mu, sigma)

25 of 31

Visualization of samples

import matplotlib as mplt

import matplotlib.pyplot as plt

import random

x = []

n = 10000

for i in range(n):

x.append(random.gauss(1, 10))

plt.hist(x, bins = 100)

plt.xlabel(‘X’)

plt.ylabel(‘# outcomes’)

plt.savefig(“my_hist.png”)

26 of 31

Visualization of samples

27 of 31

Visualization of samples

n_1_100 = []

n_1_50 = []

n_1_10 = []

for i in range(n):

n_1_100.append(random.gauss(1, 100))

n_1_50.append(random.gauss(1, 50))

n_1_10.append(random.gauss(1, 10))

plt.hist([n_1_100, n_1_50, n_1_10], bins = 100, label = ['mu = 1, sigma = 100', 'mu = 1, sigma = 50', 'mu = 1, sigma = 10'])

plt.legend()

plt.savefig(“my_hist.png”)

28 of 31

Visualization of samples

29 of 31

Helpful links

30 of 31

How to measure significance of a motif

  • String S of length L and motif M of length K
  • Probability to see a random sequence of length K = 1 / 4K
  • Computing probability to see M in S is tricky!
  • However, expected number of “random” occurrences of M in S = L / 4K

Examples:

  • L = 1M, K = 5. EN = 1,000,000 / 45 = 1,000,000 / 1024 ~ 976
  • L = 1M, K = 8. EN = 1,000,000 / 48 = 1,000,000 / 65,536 ~ 15

31 of 31

Practice

  • Generate a random k-mer assuming that all nucleotides have equally probability

  • Generate a random k-mer assuming that probabilities of nucleotides are P(A) = 0.15, P(C) = 0.35, P(G) = 0.4, P(T) = 0.1