
Comp Ling II (Boyd-Graber)

Assignment 1: Machine Translation

Due: 15. February 2012


In this homework assignment, you will implement two basic machine translation models: IBM Model 1 and IBM Model 2.  Check this page often, as I will update it as people ask questions.

What to Download

You can download data, sample output, and stub source code from this location:

http://terpconnect.umd.edu/~ying/cl2/hw1/


Unit Tests

Unit testing is a good way to develop code.  Make sure that your code passes the (basic) unit tests provided; you can see how you're doing by running the command:

python hw1/ibm_trans_test.py

Note that many of the tests will fail before you've started modifying the code; as you modify it, more and more of the tests should go from failing to passing.  It's possible that different assumptions might lead to correct code failing a unit test - if so, bring up the issue on the class mailing list.

The tests (and example alignments) defined in the unit tests depend on the translation direction (as should be obvious).  You may need to tweak the tests to make them useful to your assumptions.  This isn't part of the assignment, but it might be helpful during development.
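If you do adapt the tests to your own assumptions, a small toy check can be a useful template.  The following is a sketch only: the class name is made up, and the single hand-rolled EM step is illustrative rather than part of the provided test suite.

```python
import unittest
from collections import defaultdict

class ToyModel1Test(unittest.TestCase):
    """Illustrative check: after one EM step on a trivial corpus, all the
    translation probability mass should land on the only co-occurring word."""

    def test_single_pair(self):
        # toy corpus: one one-word sentence pair
        corpus = [(["house"], ["maison"])]
        # uniform initialization of t(f|e)
        t = defaultdict(lambda: 1.0)
        counts = defaultdict(float)   # expected (e, f) link counts
        totals = defaultdict(float)   # expected counts of e
        # one E-step
        for e_sent, f_sent in corpus:
            for f in f_sent:
                norm = sum(t[(e, f)] for e in e_sent)
                for e in e_sent:
                    counts[(e, f)] += t[(e, f)] / norm
                    totals[e] += t[(e, f)] / norm
        # one M-step
        t_new = {ef: c / totals[ef[0]] for ef, c in counts.items()}
        self.assertAlmostEqual(t_new[("house", "maison")], 1.0)
```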

Data

The data are gzipped samples from Europarl: http://www.statmt.org/europarl/

Each line is an aligned sentence pair.  A Python class is provided that iterates through the aligned sentence pairs.
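The provided corpus class handles this iteration for you, but for reference, reading line-aligned gzipped parallel files might look like the following sketch (the function name and file layout here are hypothetical; use whatever the provided class expects):

```python
import gzip

def aligned_sentences(e_path, f_path):
    """Yield (english_tokens, foreign_tokens) pairs from two parallel,
    line-aligned gzipped files."""
    with gzip.open(e_path, "rt", encoding="utf-8") as e_file, \
         gzip.open(f_path, "rt", encoding="utf-8") as f_file:
        for e_line, f_line in zip(e_file, f_file):
            yield e_line.split(), f_line.split()
```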

What to Do

  1. IBM Model 1: In class, we discussed how to learn IBM Model 1 using EM.  This is what you will implement.
  2. Noisy Channel Scoring: Take a language model (a subclass of nltk.model.ngram.NgramModel) and score a translation / source pair.
  3. IBM Model 2: After you've implemented Model 1, add an alignment model to implement Model 2.  The code is designed to make this easy using an object-oriented approach.
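The Model 1 EM loop from lecture can be sketched as follows.  This is a minimal standalone sketch, assuming a corpus of (English, foreign) token-list pairs and modeling p(f|e); it omits the NULL word and does not match the stub code's class interfaces.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """EM for IBM Model 1: estimate t(f|e) from (e_tokens, f_tokens) pairs.
    Minimal sketch - no NULL word, no smoothing."""
    # uniform initialization over the foreign vocabulary
    f_vocab = {f for _, f_sent in corpus for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (e, f) links
        total = defaultdict(float)   # expected count of e
        for e_sent, f_sent in corpus:
            for f in f_sent:
                # E-step: distribute f's count over possible e alignments
                norm = sum(t[(e, f)] for e in e_sent)
                for e in e_sent:
                    p = t[(e, f)] / norm
                    count[(e, f)] += p
                    total[e] += p
        # M-step: renormalize expected counts into probabilities
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t

# toy usage: t should come to prefer house<->maison, book<->livre
corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "book"], ["un", "livre"])]
t = train_model1(corpus)
```

Running this on the ToyCorpus first (question 3 below) is a good way to sanity-check the updates before touching the Europarl data.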

What to Turn In

  1. Your code in a gzipped tarball called USERNAME.tar.gz which, when extracted, produces USERNAME/hw1/ibm_trans.py
  2. A writeup of your process for designing and testing your code
  3. Show alignments and noisy channel probabilities for the first 20 sentences in the corpus for Model 1 and Model 2
  4. Show the top translations for the words in devwords.txt
  5. If you attempted extra credit, provide examples, documentation, and a README so that I can run the code.  If I cannot run the code, you will not get extra credit.

Grading

This homework (and all homeworks) will be graded out of 100 points, allocated as follows.

Points  Aspect

40      Correctness of Algorithm

30      Description of what you did
          • What did you do
          • How did you do it
          • What problems did you have
          • How did you verify your implementation

10      Documentation/readability of the code
          • Comments
          • Function / variable names

10      Example outputs (as required above)

10      Good programming practice
          • Handling errors
          • Efficiency of implementation
          • Reuse of code
          • Tests / assertions

Questions

  1. Your functions can take additional arguments.  For example, you may want to supply an alignment distribution to Model 1 to support an object-oriented design.  Make sure that your code uses sensible defaults so that it works as it should when given the “minimal” arguments.
  2. Do not break the interface!  I should be able to import your code and run tests on it.
  3. Try the ToyCorpus before moving on to real data; you should get reasonable results on it first.
  4. Make sure that your results look something like the example outputs provided.
  5. Should I model p(e|f) or p(f|e)?  It’s fine to do either, but make sure you state your assumption clearly in your writeup.  Use either p(e|f)p(e) or p(f|e)p(e) to score your translation candidate.
  6. Can I return log probabilities?  Yes, but if you do, do so consistently across all your code.
  7. Do I have to return a new alignment model for accumulate_counts?  Yes, even for Model 1.  But you can return None or AlignmentBase if you’d like - it won’t be used.
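Tying questions 5 and 6 together, noisy channel scoring in log space might be sketched as follows.  This assumes a Model 1 translation table t over (e, f) word pairs and uses a stand-in callable for the language model (the actual assignment takes an nltk.model.ngram.NgramModel subclass); it also drops the constant length-normalization terms of the Model 1 likelihood.

```python
import math

def noisy_channel_logscore(e_sent, f_sent, t, lm_logprob):
    """Score candidate translation e of source f in log space:
    log p(f|e) + log p(e).  t maps (e, f) word pairs to Model 1
    probabilities; lm_logprob is a stand-in language model callable."""
    log_tm = 0.0
    for f in f_sent:
        # p(f | e_sent) under Model 1: average over word alignments,
        # with a small floor for unseen pairs
        p = sum(t.get((e, f), 1e-12) for e in e_sent) / len(e_sent)
        log_tm += math.log(p)
    return log_tm + lm_logprob(e_sent)
```

Because everything stays in logs, this scorer is consistent with returning log probabilities elsewhere in your code (question 6).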

Extra Credit

It’s possible to get more than the total number of points by attempting these extra credit problems.  I think they’re doable, but it’s possible that there are hidden subtleties.  Be sure to document which extra credit you attempt (and how you verified that you did them correctly).  The number of points of extra credit will depend on how well your extensions work and how much effort you put into implementing, verifying, and explaining them.

  1. Implement IBM Model 3 (as a separate file)
  2. Add priors to the probability distributions
  3. Handle out of vocabulary words in a sensible way
  4. Find the argmax translation (e.g. using a FST) in a noisy channel model
  5. NLTK incorporates Europarl as one of its corpora.  Make your code use the Europarl data.

References

This type of assignment is a rite of passage in NLP graduate courses.  There are many versions of this assignment out there.  You may find their explanations useful or more detailed than mine.

You may find the following paper useful for the extra credit: