
Comp Ling II (Boyd-Graber)

Assignment 3: Topic Modeling

Due: 18 April 2012


Data

For the sake of consistency (and because you’ve already gained some experience with it), we’ll use the Europarl corpus:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.plaintext.EuroparlCorpusReader-class.html

Treat each chapter as a document.  First test on English before moving to another language.  You do not have to use the entire corpus; feel free to choose a reasonable subset.  Be sure to state what subset you’re using in your code.
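
Loading the corpus with NLTK is straightforward. Here is a minimal sketch, assuming the europarl_raw corpus has been downloaded via nltk.download(); the 200-chapter cutoff is just one example of a reasonable subset:

from nltk.corpus import europarl_raw

# chapters() returns each chapter as a list of tokenized sentences;
# flatten each chapter into a single list of words (one document).
documents = [[word for sentence in chapter for word in sentence]
             for chapter in europarl_raw.english.chapters()[:200]]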

Code

Download code here:

http://terpconnect.umd.edu/~ying/cl2/hw3/

Unlike previous homeworks, there’s no “Base” class you have to extend.  Feel free to add functionality, but please stick as closely as possible to the interface laid out in the provided code.

What to do

  1. Create a “vocabulary” that maps strings to integers.  You’ll want to exclude stopwords and perhaps very short words.  Use the stopwords corpus:

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]

        It’s okay to look at the corpus as you’re creating your vocabulary.
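
A minimal sketch of one way to build the mapping, assuming documents is a list of token lists as in the loading sketch above (the four-character length cutoff is an arbitrary choice):

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
vocabulary = {}
for document in documents:
    for token in document:
        word = token.lower()
        # Skip stopwords, very short words, and non-alphabetic tokens.
        if word in stop or len(word) < 4 or not word.isalpha():
            continue
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)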

  2. Implement the add_doc function in the code
  3. Implement the change_topic function in the code
  4. Implement the probability function in the code
  5. Implement the sample_word function in the code
  6. Implement the training_lhood function in the code (if it isn’t generally increasing with each iteration, you’re doing something wrong; a sketch of the relevant quantities follows the SciPy link below)

You might find functions in SciPy helpful:

http://www.scipy.org/doc/api_docs/SciPy.special.basic.html
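
In particular, scipy.special.gammaln is useful for the training likelihood.  Below is a hedged sketch of the standard collapsed Gibbs quantities for LDA with symmetric Dirichlet priors; the count arrays n_dk (documents by topics), n_kw (topics by vocabulary), and n_k (per-topic totals), and the hyperparameter names alpha and beta, are assumptions and will likely differ from the names in the provided code:

import numpy as np
from scipy.special import gammaln

def topic_posterior(d, w, n_dk, n_kw, n_k, alpha, beta):
    """Unnormalized P(z = k | everything else) for word w in document d,
    assuming the word's current assignment was subtracted from the counts."""
    V = n_kw.shape[1]
    return (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)

def word_log_lhood(n_kw, n_k, beta):
    """log P(w | z) under symmetric Dirichlet(beta) topics; the
    document-topic term with alpha is computed analogously."""
    K, V = n_kw.shape
    return (K * gammaln(V * beta) - K * V * gammaln(beta)
            + gammaln(n_kw + beta).sum() - gammaln(n_k + V * beta).sum())

Sampling a word’s new topic then amounts to normalizing topic_posterior and drawing from the resulting distribution.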

  7. Implement the topic_top_words function in the code
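
One way to realize this, again with the hypothetical n_kw count matrix from the sketch above and an id2word list inverting the vocabulary:

import numpy as np

def topic_top_words(n_kw, id2word, n=10):
    """Return the n highest-count words for each topic."""
    return [[id2word[w] for w in np.argsort(row)[::-1][:n]]
            for row in n_kw]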

What to Turn In

  1. Your code in a gzipped tarball called USERNAME.tar.gz which, when extracted, produces USERNAME/hw3/lda.py
  2. A writeup of your process for designing and testing your code (you’re encouraged to write unit tests; if you do, please turn them in!)
  3. Choose a corpus and a number of topics (> 5).  Show the top words from those topics.
  4. Also show a plot of the training likelihood for the model you trained in (3).
  5. Include in your readme a command line that runs your code.
  6. If you attempted extra credit, provide examples, documentation, and a README so that I can run the code.  If I cannot run the code, you will not get extra credit.

Extra Credit

  1. Run your model on another language
  2. Sample hyperparameters
  3. Use a stemmer or lemmatizer, e.g. NLTK’s SnowballStemmer:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball.SnowballStemmer-class.html

  4. Significantly improve the speed of inference, for example following:

http://people.cs.umass.edu/~lmyao/papers/fast-topic-model10.pdf

  5. Implement a more complicated model
  6. Use collocation detection to find bigrams as a preprocessing step (see the sketch after this list)
  7. Use tf-idf scores to filter words from documents instead of a stop list
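
For the collocation option, NLTK’s collocation finders do most of the work.  A sketch, assuming documents as in the loading example above (PMI and the frequency filter of 5 are arbitrary choices):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = [word for document in documents for word in document]
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)   # ignore pairs seen fewer than 5 times
bigram_measures = BigramAssocMeasures()
top_bigrams = finder.nbest(bigram_measures.pmi, 200)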

Grading

This homework (like all homeworks) is out of 100 possible points, allocated as follows.

Points        Aspect

40        Correctness of the algorithm

30        Description of what you did
          • What did you do
          • How did you do it
          • What problems did you have
          • How did you verify your implementation

10        Documentation / readability of the code
          • Comments
          • Function / variable names

10        Performance
          • How fast is your code
          • Quality of the topics

10        Good programming practice
          • Handling errors
          • Efficiency of implementation
          • Reuse of code
          • Tests / assertions