Compling II - Scribe Notes #8

Lecture 8 (Scribe: Naho Orita)

Agenda

- Wrap up grammar induction stuff that we talked about last time

- Distribution refresher

- Properties of distributions

- Hypothesis testing

- Collocation application

About mid-term

- We haven’t covered mutual information, likelihood ratio, etc., so KL-divergence won’t be in the mid-term.

Wrap up grammar induction

- One problem of grammar induction: We don't have negative evidence.

- One way of coping with the lack of negative evidence is to change your objective function slightly.

- If you are doing EM, you have an objective function based on your parameters θ.

- This is a product over all of your examples: for each example, you marginalize out the grammar/parse y given your parameters, and then you normalize.
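
A reconstruction of the objective just described (the notation here is assumed: p(x, y | θ) is the model's score for example x with parse y, and N(·) is the neighborhood function discussed next):

    L(θ) = Π_i [ Σ_y p(x_i, y | θ) ] / [ Σ_{x' ∈ N(x_i)} Σ_y p(x', y | θ) ]

Maximizing this pushes probability mass toward the observed x_i and away from the other members of N(x_i).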

Q: What's this neighborhood function N?

A: In EM, this is just all of X, the space of all possible examples.

- Instead of taking the space of all possible examples, you can treat N as a function of the actual data point that you are working with.

- You can do "contrastive estimation" (Smith and Eisner 2005), a way of introducing negative evidence into this unsupervised problem.

- You don't have any negative evidence, so you approximate it.

- What this does: you take your example x and mutate it in a "bad" way, so that the mutation no longer looks like a real data example. For example, if the input sequence of words is "the can sat on the mat", the neighborhood might look something like this:

Input:   the   can   sat   on    the   mat

N(x):    can   the   sat   on    the   mat
         the   can   sat   on    mat   the
         the   can   sat   the   on    mat
         ...

- Intuition: these will (mostly) be ungrammatical. You should learn not to give preference to those. You induce negative examples from positive examples; it's one way of getting more data from the data you actually have.

- In a linguistic sense, it's a way of getting negative examples. In a machine learning sense, this is like importance sampling for your E-step based on the neighborhood of x (see the sketch below).

- Notice that the normalization now depends on the particular data point.
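
A minimal sketch (Python; not from the lecture) of a transposition-style neighborhood function like the one in the table above. The function name and the choice of swapping adjacent words are illustrative assumptions:

    def trans1_neighborhood(words):
        """Return 'mutated' versions of the sentence, one per adjacent-word swap."""
        neighbors = []
        for i in range(len(words) - 1):
            mutated = list(words)
            mutated[i], mutated[i + 1] = mutated[i + 1], mutated[i]
            neighbors.append(" ".join(mutated))
        return neighbors

    for n in trans1_neighborhood("the can sat on the mat".split()):
        print(n)
    # can the sat on the mat
    # the sat can on the mat
    # the can on sat the mat
    # the can sat the on mat
    # the can sat on mat the

In contrastive estimation the denominator of the objective sums over these neighbors (the neighborhood also includes the original sentence itself).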

- Another way of improving grammar induction techniques

=> tie parameters across languages

- Historical and cross-linguistic justification: some languages look like each other and emerge from each other, so they share some characteristics.

- We want the model for one language to look like the model for a related language.

- use a family tree like this;

Indo-European
    |             |             |
Romance       Germanic       Slavic
  |    |        |    |        |    |
  es   fr       en   de       ru   cz

- for our log-linear model;

θ ~ N(0, σ²I)

--- Recap log-linear model ------------------------------------------------------------------------------------------

- What does having a zero-mean prior tell us about what these parameters should look like for the log-linear model?

=> It doesn't tell us what they should look like.

        - This prevents over-fitting the data.

        - Unless you have good reason to believe otherwise, your parameters are zero.

        => called "regularization"
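
As a concrete reminder (a standard identity, not spelled out in the notes): with a zero-mean Gaussian prior on the parameters, the MAP objective is the log-likelihood minus an L2 penalty,

    max_θ  Σ_i log p(x_i | θ)  −  ||θ||² / (2σ²)

so "unless the data gives you a reason, stay at zero" is exactly L2 regularization.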

----------------------------------------------------------------------------------------------------------------------------

For example;

θ_English ~ N(θ_Germanic, σ²I)

- I know what the parameters for English should look like: they should look something like the parameters for Germanic.

- θ_English comes from a normal distribution with mean θ_Germanic and some variance.

θ_Germanic ~ N(θ_Indo-European, σ²I)

- θ_Germanic comes from a normal distribution with mean θ_Indo-European and some variance σ².

θ_Indo-European ~ N(0, σ²I)

- Unless you have good reason to believe otherwise, English grammatical structure looks like Germanic grammatical structure; unless you have good evidence otherwise, Germanic grammatical structure looks like Indo-European grammatical structure; and unless you have reason to believe otherwise, the Indo-European parameters look like the zero vector.
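
Putting the three statements together (this follows directly from the normal densities above; the shared variance σ² is an assumption made here for simplicity), the log-prior added to the training objective is

    log p(θ_English, θ_Germanic, θ_Indo-European) =
        − ||θ_English − θ_Germanic||² / (2σ²)
        − ||θ_Germanic − θ_Indo-European||² / (2σ²)
        − ||θ_Indo-European||² / (2σ²)  +  const

i.e., each node in the family tree is regularized toward its parent rather than toward zero.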

- Cohen and Smith 2009 (use logistic normal distribution)

- Berg-Kirkpatrick and Klein 2010 (use a phylogenetic tree with a Gaussian prior)

Refresher on distribution

Dirichlet distribution

- The Dirichlet distribution's density function is written out below.

- This is a distribution over vectors of length k, where k is known in advance (θ = the random variables).

- You parametrize it by another vector of length k (α = the parameters; each α_i can be any number > 0).

- θ is the thing that comes out of the distribution; α is the thing that goes into the distribution.

- The density for this θ is

    p(θ | α) = [ Γ(α_1 + ··· + α_k) / (Γ(α_1) ··· Γ(α_k)) ] · θ_1^(α_1 − 1) ··· θ_k^(α_k − 1)

- This looks scary, but Γ is just the extension of the factorial function: you can basically think of Γ(n) as (n − 1)!.

- (If k = 2, then this is the Beta distribution, but it's parametrized slightly differently.)

- The expectation of the Dirichlet distribution: E[θ_i] = α_i / (α_1 + ··· + α_k).

[Figures: samples from Dirichlet distributions with α = [1 1 1], α = [2 2 2], and α = [10 10 10], plotted on the probability simplex]

- The above is an example with α = [0.1 0.1 0.1].

- This is a way of drawing a vector of length 3.

- Each of the 3 coordinates corresponds to one corner of the simplex.

- Each point corresponds to a vector that sums to 1, e.g., [1 0 0], [0 1 0].

- The simplex represents all vectors that sum to 1. This is what your θ looks like given these inputs.

- The above is the example with α = [1 1 1].

- If all the α_i are 1, then the exponent of θ_i in the density (α_i − 1) is always 0, so no matter what θ you plug in, you always get the same value.

- That explains why you get a uniform distribution.
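
A minimal sketch (NumPy; not from the lecture) checking two of the facts above: the empirical mean of Dirichlet draws approaches α / Σ_j α_j, and every draw lies on the simplex (sums to 1):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([2.0, 2.0, 2.0])

    # Each draw is a length-3 vector of non-negative numbers summing to 1.
    samples = rng.dirichlet(alpha, size=100_000)
    print(samples[0], samples[0].sum())   # a point on the simplex
    print(samples.mean(axis=0))           # close to alpha / alpha.sum() = [1/3, 1/3, 1/3]

    # With alpha = [1, 1, 1] the density is constant, so draws are uniform over the simplex.
    uniform_samples = rng.dirichlet(np.ones(3), size=100_000)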

Multinomial distribution

- Discrete distribution. The density (probability mass function) is

    p(x_1, ..., x_k | N, θ) = [ N! / (x_1! ··· x_k!) ] · θ_1^(x_1) ··· θ_k^(x_k)

- The sum of the x_i comes to N.

- Intuitively, each bucket i has the probability θ_i.

- The expectation of the multinomial distribution (N things and parameters θ): E[x_i] = N·θ_i.

- The parameters of the multinomial distribution look like the random variables of the Dirichlet distribution: the multinomial's parameter vector θ can itself be drawn from a Dirichlet.

- This comes from combining the θ_i^(α_i − 1) terms in the Dirichlet with the θ_i^(x_i) terms in the density of the multinomial distribution.

- The Dirichlet prior has the same form as the posterior over θ after observing multinomial counts.

- They have the same form => a conjugate relationship. The conjugate form makes the inference problem much easier.
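
Written out (a standard derivation, consistent with the densities above): multiplying the Dirichlet prior by the multinomial likelihood and keeping only the factors involving θ gives

    p(θ | x, α)  ∝  Π_i θ_i^(x_i)  ·  Π_i θ_i^(α_i − 1)  =  Π_i θ_i^(x_i + α_i − 1)

so the posterior is again a Dirichlet, with parameters α_i + x_i: the prior pseudo-counts plus the observed counts.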

Continuous distribution

- Distribution over continuous space.

- Normal distribution, Gaussian.

- Another distribution over a continuous space is the t-distribution.

- The (standardized) sample mean is distributed according to t, where

    t = (x̄ − μ) / sqrt(s² / N)

- and s² is the sample variance:

    s² = (1 / (N − 1)) · Σ_i (x_i − x̄)²

- This looks like the definition of variance, except that instead of dividing by N you divide by N − 1.
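
A small sketch (NumPy; the data values are made up for illustration) of the two formulas above:

    import numpy as np

    x = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # made-up sample
    mu_0 = 2.0                                 # mean under the null hypothesis
    N = len(x)

    xbar = x.mean()
    s2 = ((x - xbar) ** 2).sum() / (N - 1)     # sample variance: divide by N - 1
    t = (xbar - mu_0) / np.sqrt(s2 / N)
    print(xbar, s2, t)                         # s2 matches np.var(x, ddof=1)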

- The normal distribution is nice because it is its own conjugate prior, at least for the mean.

- If we assume that the log-linear parameters are distributed normally, we can say that the parameters of the English log-linear model are distributed according to a normal distribution whose mean is the Germanic parameter vector, which is itself distributed according to a normal distribution whose mean is the Indo-European parameter vector, and so on.

Hypothesis testing

- We want to make a statement about distributions: whether they are the same or different.

(e.g.) A log-linear model gives us a way of constructing a distribution, but by itself it doesn't tell us whether the distribution has some property of interest or not.

- We assume some null hypothesis (which might be true or not) about the distribution of some quantity of interest.

- We ask: what is the probability of observing the value of that variable, given that our null hypothesis is true?

- If that probability is small enough, we reject our null hypothesis.

Collocation

- e.g. idioms (kick the bucket), New York, etc.

- Collocations have many properties; see the M&S textbook.

- Our null hypothesis: 2 words are going to be independent

- Alternative hypothesis: the 2 words appear together (collocation)

- Under the null hypothesis, P(w1 w2) = P(w1) · P(w2): the probability of w1 followed by w2 is the probability of w1 times the probability of w2.

- Suppose that we want to know whether "new companies" is a collocation or not.

- The null hypothesis we are going to assume is very simple: p = P(new) · P(companies).

- The count of "new companies" among the N bigrams is then distributed ~ Binomial(N, p) ≈ Normal distribution.

- We are going to assume that this is approximately a normal distribution.

- This induces a normal distribution with the binomial's mean and variance: N(np, np(1 − p)).

- We can assume that the variance np(1 − p) is about equal to the mean np, because the probability of "new companies" is very small, so we have N(np, np).

- In order to get x̄ we need to know how many times we actually got "new companies".

x̄ = C(new companies) / N. (If this value is higher than expected under the null hypothesis, we can say it's a collocation.)

- We are going to assume a t-distribution. Plug the values into t = (x̄ − μ) / sqrt(s² / N):

- where N = the number of bigrams, x̄ = C(new companies) / N, μ = P(new) · P(companies), and s² ≈ x̄.

(* in the textbook, one of these values is wrong)

t ≈ 0.97

- You ask whether t is greater than the critical value of about 2.5 (2.45?) or not. If so, it's a collocation.
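
A minimal sketch (Python) of the calculation above, plugging in the counts from the contingency table in the next section; the exact value depends on which counts are used, so it comes out close to, but not exactly, the 0.97 recorded here:

    N = 14_307_676        # total number of bigrams
    c_new = 15_828        # C(new) = 8 + 15,820
    c_companies = 4_675   # C(companies) = 8 + 4,667
    c_bigram = 8          # C(new companies)

    mu = (c_new / N) * (c_companies / N)   # expected relative frequency under independence
    xbar = c_bigram / N                    # observed relative frequency
    s2 = xbar                              # variance ≈ mean, since the probability is tiny

    t = (xbar - mu) / (s2 / N) ** 0.5
    print(t)   # ≈ 1.0, well below the ~2.5 critical value, so we cannot reject independence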

Chi square 

- Another way of determining whether the words are a collocation or not.

- χ² is defined from a contingency table.

- The χ² distribution (its degrees of freedom) depends on the number of rows and columns.

                     w1 = new      w1 ≠ new
w2 = companies           8            4,667
w2 ≠ companies       15,820       14,287,181

- C(w1 = new, w2 = companies) = 8

- C(w1 ≠ new, w2 = companies) = 4,667

- C(w1 ≠ new, w2 ≠ companies) = 14,287,181

- C(w1 = new, w2 ≠ companies) = 15,820

- O_ij is just the number of times these things appear (the observed counts above).

- E_ij is what you would expect to observe if the two words were independent: E_ij = (row_i sum) · (column_j sum) / N.

- Sum of rows: 8 + 4,667 = 4,675 and 15,820 + 14,287,181 = 14,303,001.

- Sum of columns: 8 + 15,820 = 15,828 and 4,667 + 14,287,181 = 14,291,848.

- The statistic itself is χ² = Σ_ij (O_ij − E_ij)² / E_ij.

- N = 14,307,676

- The table below is what we would expect if these things in the above table are independent.

                     w1 = new      w1 ≠ new
w2 = companies         5.17          4,669.8
w2 ≠ companies       15,822.8     14,287,178.2

- Plugging the observed and expected counts into the χ² formula gives χ² ≈ 1.55.
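
A minimal sketch (Python) of the whole χ² computation from the observed counts above:

    # Rows are w2 = companies / w2 ≠ companies; columns are w1 = new / w1 ≠ new.
    O = [[8, 4_667],
         [15_820, 14_287_181]]

    N = sum(sum(row) for row in O)                              # 14,307,676
    row_sums = [sum(row) for row in O]
    col_sums = [sum(O[i][j] for i in range(2)) for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / N            # E_ij under independence
            chi2 += (O[i][j] - expected) ** 2 / expected

    print(round(chi2, 2))   # ≈ 1.55 — too small to reject independence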

Equations we haven’t covered in the class #8