CS 288: Statistical Natural Language Processing, Spring 2011

Assignment 1: Language Modeling
Due: February 3rd

Setup

First, make sure you can access the course materials. The components are:

code1.tar.gz : the Java source code provided for this course
data1.tar.gz : the data sets used in this assignment

The authentication restrictions are due to licensing terms. The username and password should have been mailed to the account you listed with the Berkeley registrar. If for any reason you did not get it, please let us know.

The source archive contains four files: assign1.jar contains the provided classes and source code (most classes have source attached, but some do not). build_assign1.xml is an ant script that you will use to compile the .jar file you submit for grading. The other two files are stubs of the classes you will need to implement. You may wish to use an IDE such as Eclipse to link the .jar file and browse the source (we recommend it). In general, you are expected to be able to set up your development environment yourself, but for this first assignment, we will provide setup instructions using Eclipse:

Create a new Java Project.

Right click on the project and choose "Import".

The import type is "Archive File" (under General). Point to the downloaded source file.

Use the project Properties dialog to add assign1.jar to the Java Build Path.

At this point, your code should all compile, and if you are using Eclipse, you can browse the classes provided in assign1.jar by looking under "Referenced Libraries". You can also run a simple test by running

java -cp assign1.jar edu.berkeley.nlp.Test

You should get a confirmation message back.

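The testing harness we will be using is LanguageModelTester (in the edu.berkeley.nlp.assignments.assign1 package). To run it, first unzip the data archive to a local directory. In the commands below, $PATH stands for that directory; it is a placeholder, not your shell's PATH environment variable. Then, build the submission jar using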

   ant -f build_assign1.xml

Then, try running

    java -cp assign1.jar:assign1-submit.jar -server -mx500m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType STUB

You will see the tester do some translation, which will take a couple of minutes and take up about 130M of memory, printing out translations along the way (note that printing of translations can be turned off with -noprint). The tester will also print BLEU, an automatic metric for measuring translation quality (bigger numbers are better; 60 is about human-level accuracy). For the stub, accuracy should be terrible (15-16). The next step is to include an actual language model. We've provided a model, which you can use by running

    java -cp assign1.jar:assign1-submit.jar -server -mx500m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType UNIGRAM

Now, you'll see the tester read in around 9,000,000 sentences of monolingual data and build an LM. Unfortunately, the unigram model doesn't really help, so you'll need to improve on it by writing a higher-order language model.

Description

In this assignment, you will construct two different language models and test them with the provided harness.

Take a look at the main method of LanguageModelTester.java, and its output.

Training: Several data objects are loaded by the harness. First, it loads about 250M words of monolingual English text. These sentences have been tokenized for you. In addition to the monolingual text, the harness loads the data necessary to run a phrase-based statistical translation system and a set of sentence pairs to test the system on. The data for the MT system consists of a phrase table (a set of scored translation rules) and some pre-tuned weights to trade off between the scores from the phrase table (known as the translation model) and the scores from the language model. Once all the data is loaded, a language model is built from the monolingual English text. Then, we test how well it works by incorporating the language model into an MT system and measuring translation quality.

Experiments: You will need to implement two language models: an exact language model that directly reports the (appropriately smoothed) scores computed from the training data, and a noisy language model that uses some sort of approximation technique to create a language model that takes less memory, but may not work quite as well. You should modify the classes ExactLmFactory and NoisyLmFactory, respectively, to generate language models of this type. You are welcome to create as many additional classes as you like, so long as those two classes retain their names and continue to implement LanguageModelFactory.

For both types of language models, the basic scoring method should be a Kneser-Ney trigram model. The particular implementation details are mostly up to you, though you are encouraged to experiment with different options and see what works best. For the noisy model, you are free to try any of the approximation methods discussed in class (or if you're feeling particularly ambitious, ones you find in outside literature). In your write-up you should include a discussion of the tradeoffs between memory usage and BLEU that you found in your experiments.
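As a starting point for thinking about the scoring method, here is a minimal sketch of interpolated Kneser-Ney smoothing. To keep it short it is a bigram model rather than the required trigram model, it uses a single fixed discount of 0.75 (an assumed value, not one we prescribe), and it stores counts in ordinary hash maps, which will not fit the memory budget at full scale. The class and method names are hypothetical and are not part of assign1.jar.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy class: interpolated Kneser-Ney for bigrams only.
class KneserNeyBigramSketch {
    static final double DISCOUNT = 0.75; // assumed fixed discount

    // c(prev, word): how often each word follows each context
    final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    // c(prev): how often each word occurs as a context
    final Map<String, Integer> contextTotals = new HashMap<>();
    // number of distinct words that precede each word
    final Map<String, Integer> distinctLeftContexts = new HashMap<>();
    // number of distinct bigram types in the data
    int distinctBigramTypes = 0;

    KneserNeyBigramSketch(List<List<String>> sentences) {
        for (List<String> sentence : sentences) {
            for (int i = 1; i < sentence.size(); i++) {
                String prev = sentence.get(i - 1);
                String word = sentence.get(i);
                Map<String, Integer> followers =
                        bigramCounts.computeIfAbsent(prev, k -> new HashMap<>());
                // merge returns the new count; 1 means this bigram type is new
                if (followers.merge(word, 1, Integer::sum) == 1) {
                    distinctLeftContexts.merge(word, 1, Integer::sum);
                    distinctBigramTypes++;
                }
                contextTotals.merge(prev, 1, Integer::sum);
            }
        }
    }

    // Lower-order distribution: share of bigram types that end in this word.
    double continuationProbability(String word) {
        return distinctLeftContexts.getOrDefault(word, 0) / (double) distinctBigramTypes;
    }

    double probability(String prev, String word) {
        Map<String, Integer> followers = bigramCounts.get(prev);
        if (followers == null) {
            return continuationProbability(word); // context never seen
        }
        double total = contextTotals.get(prev);
        double discounted = Math.max(followers.getOrDefault(word, 0) - DISCOUNT, 0.0) / total;
        // Mass removed by discounting is handed to the lower-order distribution.
        double backoffWeight = DISCOUNT * followers.size() / total;
        return discounted + backoffWeight * continuationProbability(word);
    }

    public static void main(String[] args) {
        KneserNeyBigramSketch lm = new KneserNeyBigramSketch(Arrays.asList(
                Arrays.asList("<s>", "the", "cat", "sat", "</s>"),
                Arrays.asList("<s>", "the", "dog", "sat", "</s>"),
                Arrays.asList("<s>", "a", "cat", "ran", "</s>")));
        System.out.println("P(cat | the) = " + lm.probability("the", "cat"));
    }
}
```

Note that a word that was seen in the data but never after a given context still gets nonzero probability through the continuation term, while a word never seen in second position gets zero here and would need separate handling.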

Evaluation: Each language model is primarily tested by providing its scores to a standard MT decoder and measuring the quality of the resulting translations. An MT decoder takes a French sentence and attempts to find the highest-scoring English sentence, taking both the translation and language models into account (for this assignment, you should just consider the decoder to be a black box, although you'll have to implement one yourself later). The resulting translations are then compared to the human-annotated reference English translations using BLEU. Since the translation model for this assignment is fixed, the only way to boost your BLEU score is by improving the quality of the language model. For reference, our exact Kneser-Ney trigram model gets a BLEU score of about 23.4; you must be able to at least get a score of 23 to pass this assignment. Note that the monolingual English data contains every English word in the phrase table. However, sometimes when translating a French word that hasn't been seen before, the decoder will need to make up a translation rule that includes an unknown English word and the language model will need to return some score. All translations of that sentence will include that rule, so the score you return doesn't really matter so long as it's consistent (for example, always returning a constant should be fine).

In addition to translation quality, your language model will be evaluated for its speed and memory usage. We will expect your exact language model to fit into about 900M, and the noisy one to fit into about 600M. Note that around 300M is used by the phrase table and the vocabulary (when we run the unigram model, total memory usage is 348M), so at worst, you should aim to make your exact language model fit in 1.2G of memory. However, we will allow the JVM to use up to 2G of memory since some implementations may require additional scratch space during language model construction.

For speed, we are measuring the speed of decoding with your language model, not building it (this is a standard metric, the idea being that you only have to build a language model once, but you have to decode with it for as many sentences as you wish to translate). Note that decoding speed depends heavily on the language model order, so it's typical for decoding with a trigram language model to be dramatically slower than decoding with a unigram model. For reference, on one particular testbed machine, decoding all 2000 sentences with the unigram language model took 24 seconds, but decoding with the exact trigram took 582 seconds.
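One common way to approach the memory budget is to avoid boxed objects entirely. The sketch below packs three word ids into a single long and stores counts in parallel primitive arrays with linear probing, so each slot costs 12 bytes (one long plus one int). It assumes word ids are non-negative and below 2^21, which you would need to check against the actual vocabulary size; the class name, the load factor of 0.75, and the hash mixing step are illustrative choices, not part of the provided code.

```java
import java.util.Arrays;

// Hypothetical sketch: trigram counts in an open-addressing table of primitives.
class PackedTrigramCounts {
    private static final int BITS_PER_WORD = 21; // assumption: word ids < 2^21
    private static final long EMPTY = -1L;       // packed keys are never negative

    private long[] keys;
    private int[] counts;
    private int size;

    PackedTrigramCounts(int initialCapacity) {
        keys = new long[Math.max(initialCapacity, 4)];
        counts = new int[keys.length];
        Arrays.fill(keys, EMPTY);
    }

    // Three 21-bit ids use 63 bits, so the sign bit stays clear.
    static long pack(int w1, int w2, int w3) {
        return ((long) w1 << (2 * BITS_PER_WORD)) | ((long) w2 << BITS_PER_WORD) | w3;
    }

    void increment(int w1, int w2, int w3) {
        if ((size + 1) * 4L > keys.length * 3L) {
            resize(); // keep the load factor at or below 0.75
        }
        long key = pack(w1, w2, w3);
        int slot = findSlot(keys, key);
        if (keys[slot] == EMPTY) {
            keys[slot] = key;
            size++;
        }
        counts[slot]++;
    }

    int getCount(int w1, int w2, int w3) {
        int slot = findSlot(keys, pack(w1, w2, w3));
        return keys[slot] == EMPTY ? 0 : counts[slot];
    }

    // Linear probing: returns the slot holding the key, or the first empty slot.
    private static int findSlot(long[] table, long key) {
        int slot = (int) ((mix(key) & Long.MAX_VALUE) % table.length);
        while (table[slot] != EMPTY && table[slot] != key) {
            slot = (slot + 1) % table.length;
        }
        return slot;
    }

    // Scrambles the bits so that similar trigrams land in different slots.
    private static long mix(long key) {
        key ^= key >>> 33;
        key *= 0xff51afd7ed558ccdL;
        key ^= key >>> 33;
        return key;
    }

    private void resize() {
        long[] oldKeys = keys;
        int[] oldCounts = counts;
        keys = new long[oldKeys.length * 2];
        counts = new int[keys.length];
        Arrays.fill(keys, EMPTY);
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                int slot = findSlot(keys, oldKeys[i]);
                keys[slot] = oldKeys[i];
                counts[slot] = oldCounts[i];
            }
        }
    }

    public static void main(String[] args) {
        PackedTrigramCounts table = new PackedTrigramCounts(16);
        table.increment(1, 2, 3);
        table.increment(1, 2, 3);
        System.out.println("count(1,2,3) = " + table.getCount(1, 2, 3));
    }
}
```

Doubling the arrays during a resize briefly needs both the old and new tables in memory, which is one reason construction may need more scratch space than the finished model.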

When we autograde your submitted code we will do two things. First, we will measure BLEU, memory usage and decoding speed using the same testing harness as you by running the two commands

java -cp assign1.jar:assign1-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType EXACT
java -cp assign1.jar:assign1-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType NOISY

In addition, we will programmatically spot-check the stored counts for various n-grams. The exact model should return all correct counts, whereas the noisy model should still be mostly correct, but is permitted to make errors on at most 2% of the trigrams we query.
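If you want to estimate this error rate yourself before submitting, one option is to compare your noisy counts against your exact counts on a sample of trigrams. The sketch below is a hypothetical helper, not the autograder: it takes the two count lookups as plain functions, so it makes no assumption about how your models expose counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper for comparing noisy counts against exact counts.
class CountSpotCheck {
    /** Returns the fraction of queried n-grams on which the two lookups disagree. */
    static double errorRate(List<int[]> queries,
                            ToLongFunction<int[]> exactCount,
                            ToLongFunction<int[]> noisyCount) {
        if (queries.isEmpty()) {
            return 0.0;
        }
        int mismatches = 0;
        for (int[] ngram : queries) {
            if (exactCount.applyAsLong(ngram) != noisyCount.applyAsLong(ngram)) {
                mismatches++;
            }
        }
        return mismatches / (double) queries.size();
    }

    public static void main(String[] args) {
        List<int[]> queries = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            queries.add(new int[] {i, i + 1, i + 2});
        }
        // Stand-in lookups: the "noisy" one is wrong whenever the first id is 0.
        ToLongFunction<int[]> exact = ngram -> ngram[0] + 1;
        ToLongFunction<int[]> noisy = ngram -> ngram[0] == 0 ? 0 : ngram[0] + 1;
        double rate = errorRate(queries, exact, noisy);
        System.out.println("error rate = " + rate + ", within 2%: " + (rate <= 0.02));
    }
}
```

Sampling only trigrams that occur in the training data will miss errors on unseen trigrams, so it is worth including some queries that should return zero.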

Write-ups: For this assignment, in addition to submitting a compiled jar for autograding according to the standard instructions, you should turn in a write-up of the work you've done. The write-up should specify what models you implemented and what significant choices you made. It should include tables or graphs of BLEU, runtime, memory, etc., of your systems. It should also include some error analysis - enough to convince us that you looked at the specific behavior of your systems and thought about what they're doing wrong and how you'd fix it. There is no set length for write-ups, but a ballpark length might be 3-4 pages, including your evaluation results, a graph or two, and some interesting examples. We're more interested in knowing what observations you made about the models or data than having a reiteration of the formal definitions of the various models.

What will impact your grade is the degree to which you can present what you did clearly and make sense of what's going on in your experiments using thoughtful error analysis. When you do see improvements in BLEU, where are they coming from, specifically? Try to localize the improvements as much as possible. Some example questions you might consider: Do the errors that are corrected by a given change to the language model make any sense? Why? You should also include some discussion of what approximation techniques you tried for your noisy model, and what worked and what didn't. What kind of tradeoffs did you observe?

Submission: You will submit assign1-submit.jar to an online system. Note that this jar must contain implementations of NoisyLmFactory and ExactLmFactory, but must not contain any modifications of the source code provided in assign1.jar. To check that everything is in order, we will run a small "sanity" check when you submit the jar. Specifically, we will run the commands

java -cp assign1.jar:assign1-submit.jar -server -mx50m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType EXACT -sanityCheck
java -cp assign1.jar:assign1-submit.jar -server -mx50m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType NOISY -sanityCheck

The -sanityCheck flag will run the test harness with a tiny amount of data just to make sure no exceptions are thrown. Please ensure that these commands return successfully before submitting your jar.

You will also submit a write-up in class on the due date.

Grading: For this assignment, the following are required for successful completion of the project:

The memory usage printed before decoding must be no more than 1.3G for the exact LM and 1G for the noisy LM.
Decoding must be no more than 30x slower than decoding with the unigram model.
The BLEU score must be at least 23 for the exact LM and at least 22 for the noisy LM.
The noisy LM must return the correct counts for at least 98% of trigrams in the data.

These are hard limits; additional improvements in memory usage, BLEU score, error rate, and decoding speed will also affect your grade, as will your write-up. The highest-scoring submissions will be those that perform substantially better than the minimum requirements or do substantial investigation or extension.
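To keep track of where your own measurements stand against the limits listed above, a small check like the following may be handy. It is a hypothetical convenience for your own experiments, not the autograder; it assumes you read the memory figure (in the same G units the harness prints) and the decoding times off the harness output yourself, and it covers only the noisy-count requirement, not the exact model's requirement that all counts be correct.

```java
// Hypothetical self-check against the hard limits listed above.
class HardLimitCheck {
    /** How many times slower decoding is than with the unigram model. */
    static double slowdown(double decodeSeconds, double unigramDecodeSeconds) {
        return decodeSeconds / unigramDecodeSeconds;
    }

    static boolean meetsHardLimits(boolean exact,
                                   double memoryG,
                                   double decodeSeconds,
                                   double unigramDecodeSeconds,
                                   double bleu,
                                   double noisyCountAccuracy) {
        double maxMemoryG = exact ? 1.3 : 1.0;
        double minBleu = exact ? 23.0 : 22.0;
        // The 98% count requirement applies to the noisy model only.
        boolean countsOk = exact || noisyCountAccuracy >= 0.98;
        return memoryG <= maxMemoryG
                && slowdown(decodeSeconds, unigramDecodeSeconds) <= 30.0
                && bleu >= minBleu
                && countsOk;
    }

    public static void main(String[] args) {
        // Reference timings quoted earlier: 24 s unigram, 582 s exact trigram.
        System.out.println("slowdown = " + slowdown(582.0, 24.0));
    }
}
```

Because the speed limit is relative, measure the unigram and trigram decoding times on the same machine.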

Updates:

01/28/11: Some students have noticed an exception with a stack trace that looks like this:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
    at edu.berkeley.nlp.mt.decoder.internal.BeamDecoder.decode(BeamDecoder.java:96)
    at edu.berkeley.nlp.assignments.assign1.LanguageModelTester.doDecoding(LanguageModelTester.java:255)
    at edu.berkeley.nlp.assignments.assign1.LanguageModelTester.evaluateLanguageModel(LanguageModelTester.java:223)
    at edu.berkeley.nlp.assignments.assign1.LanguageModelTester.main(LanguageModelTester.java:197)

This can happen if your language model returns Double.NaN or Double.NEGATIVE_INFINITY. Please ensure that your language model always returns a finite real number.
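One simple way to guard against this is to funnel every score through a single method that replaces non-finite values with a fixed floor. The sketch below assumes your model computes a probability and then takes its log; the class name and the floor value of -100 are arbitrary illustrative choices.

```java
// Hypothetical guard that never lets a non-finite score escape.
class FiniteScoreGuard {
    static final double FLOOR_LOG_PROB = -100.0; // assumed floor, pick your own

    static double toFiniteLogProb(double probability) {
        double logProb = Math.log(probability);
        // Math.log gives -Infinity for 0 and NaN for negative or NaN input.
        if (Double.isNaN(logProb) || Double.isInfinite(logProb)) {
            return FLOOR_LOG_PROB;
        }
        return logProb;
    }

    public static void main(String[] args) {
        System.out.println(toFiniteLogProb(0.25));
        System.out.println(toFiniteLogProb(0.0));
        System.out.println(toFiniteLogProb(0.0 / 0.0));
    }
}
```

A guard like this hides the symptom rather than the cause, so if it fires often it is worth checking for zero denominators in your smoothing code.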

01/29/11: Some students have asked how long LM construction can take. While we don't have a hard limit on this, we do expect LM construction to happen in a reasonable amount of time. We will not accept code that takes more than 30 minutes to run from start to finish (including LM construction and decoding).

01/30/11: A small bug in one of the support classes (StringToIntOpenHashMap) was fixed, which should save a little bit of memory. Please download the most recent version of the code to find the up-to-date assign1.jar.