
Comp Ling II (Boyd-Graber)

Assignment 3: Topic Modeling

Due: 18 April 2012


Data

For the sake of consistency (and because you’ve already gained some experience with it), we’ll use the Europarl corpus:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.plaintext.EuroparlCorpusReader-class.html

Treat each chapter as a document.  First test on English before moving to another language.  You do not have to use the entire corpus; feel free to choose a reasonable subset.  Be sure to state what subset you’re using in your code.
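
Loading the corpus with NLTK is straightforward. Here is a minimal sketch, assuming the europarl_raw corpus has been downloaded via nltk.download(); the 200-chapter cutoff is just one example of a reasonable subset:

from nltk.corpus import europarl_raw

# chapters() returns each chapter as a list of tokenized sentences;
# flatten each chapter into a single list of words (one document).
documents = [[word for sentence in chapter for word in sentence]
             for chapter in europarl_raw.english.chapters()[:200]]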

Code

Download code here:

http://terpconnect.umd.edu/~ying/cl2/hw3/

Unlike previous homeworks, there’s no “Base” class you have to extend.  Feel free to add functionality, but please stick as closely as possible to the interface laid out in the provided code.

What to do

  1. Create a “vocabulary” that maps strings to integers.  You’ll want to exclude stopwords and perhaps very short words.  Use the stopwords corpus:

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]

        It’s okay to look at the corpus as you’re creating your vocabulary.
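
A minimal sketch of one way to build the mapping, assuming documents is a list of token lists as in the loading sketch above (the four-character length cutoff is an arbitrary choice):

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
vocabulary = {}
for document in documents:
    for token in document:
        word = token.lower()
        # Skip stopwords, very short words, and non-alphabetic tokens.
        if word in stop or len(word) < 4 or not word.isalpha():
            continue
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)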

  2. Implement the add_doc function in the code
  3. Implement the change_topic function in the code
  4. Implement the probability function in the code
  5. Implement the sample_word function in the code
  6. Implement the training_lhood function in the code (if it isn’t generally increasing with each iteration, you’re doing something wrong; a sketch of the relevant quantities follows the SciPy link below)

You might find functions in SciPy helpful:

http://www.scipy.org/doc/api_docs/SciPy.special.basic.html
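
In particular, scipy.special.gammaln is useful for the training likelihood.  Below is a hedged sketch of the standard collapsed Gibbs quantities for LDA with symmetric Dirichlet priors; the count arrays n_dk (documents by topics), n_kw (topics by vocabulary), and n_k (per-topic totals), and the hyperparameter names alpha and beta, are assumptions and will likely differ from the names in the provided code:

import numpy as np
from scipy.special import gammaln

def topic_posterior(d, w, n_dk, n_kw, n_k, alpha, beta):
    """Unnormalized P(z = k | everything else) for word w in document d,
    assuming the word's current assignment was subtracted from the counts."""
    V = n_kw.shape[1]
    return (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)

def word_log_lhood(n_kw, n_k, beta):
    """log P(w | z) under symmetric Dirichlet(beta) topics; the
    document-topic term with alpha is computed analogously."""
    K, V = n_kw.shape
    return (K * gammaln(V * beta) - K * V * gammaln(beta)
            + gammaln(n_kw + beta).sum() - gammaln(n_k + V * beta).sum())

Sampling a word’s new topic then amounts to normalizing topic_posterior and drawing from the resulting distribution.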

  7. Implement the topic_top_words function in the code
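
One way to realize this, again with the hypothetical n_kw count matrix from the sketch above and an id2word list inverting the vocabulary:

import numpy as np

def topic_top_words(n_kw, id2word, n=10):
    """Return the n highest-count words for each topic."""
    return [[id2word[w] for w in np.argsort(row)[::-1][:n]]
            for row in n_kw]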

What to Turn In

  1. Your code in a gzipped tarball called USERNAME.tar.gz which, when extracted, produces USERNAME/hw3/lda.py
  2. A writeup of your process for designing and testing your code (you’re encouraged to write unit tests; if you do, please turn them in!)
  3. Choose a corpus and a number of topics (> 5).  Show the top words from those topics.
  4. Also show a plot of the training likelihood for the model you trained in (3).
  5. Include in your readme a command line that runs your code.
  6. If you attempted extra credit, provide examples, documentation, and a README so that I can run the code.  If I cannot run the code, you will not get extra credit.

Extra Credit

  1. Run your model on another language
  2. Sample hyperparameters
  3. Use a stemmer or lemmatizer, e.g. NLTK’s SnowballStemmer:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball.SnowballStemmer-class.html

  4. Significantly improve the speed of inference, for example following:

http://people.cs.umass.edu/~lmyao/papers/fast-topic-model10.pdf

  5. Implement a more complicated model
  6. Use collocation detection to find bigrams as a preprocessing step (see the sketch after this list)
  7. Use tf-idf scores to filter words from documents instead of a stop list
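
For the collocation option, NLTK’s collocation finders do most of the work.  A sketch, assuming documents as in the loading example above (PMI and the frequency filter of 5 are arbitrary choices):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = [word for document in documents for word in document]
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)   # ignore pairs seen fewer than 5 times
bigram_measures = BigramAssocMeasures()
top_bigrams = finder.nbest(bigram_measures.pmi, 200)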

Grading

This homework (like all homeworks) is out of 100 possible points, allocated as follows.

Points        Aspect

40        Correctness of the algorithm

30        Description of what you did
          • What did you do
          • How did you do it
          • What problems did you have
          • How did you verify your implementation

10        Documentation / readability of the code
          • Comments
          • Function / variable names

10        Performance
          • How fast is your code
          • Quality of the topics

10        Good programming practice
          • Handling errors
          • Efficiency of implementation
          • Reuse of code
          • Tests / assertions