CS Topic: Natural Language Processing

Stony Brook University
CSE392-01 - Spring 2019

Assignment 2.

Assigned: 3/24/2019;   Due: 4/3/2019 11:59pm

Goals.

Requirements. You must use Python version 3.5 or later. The only data science libraries you may use are:

numpy

sklearn

scipy.stats

tensorflow

(If you are interested in other libraries that are not related to the main goals of the assignment, please ask.)

All template methods must use the names provided (e.g. “tokenize(sent)”). However, you may also use additional methods to keep your code clean. Copying chunks of code from other students, online or other resources is prohibited. Please see syllabus for additional policies.

Submission. All code must be placed in a single file titled “A2_lastname.py” where lastname is your last name. It must be submitted to Blackboard as a .py file. Submitting a zip file or multiple files will be considered an invalid submission -- late penalties or the deadline will apply if the single .py file is submitted afterward. If you are using API keys for your personal Twitter account, please remove the keys before submitting and make it clear where the keys should be set for testing.

Task Description. Your task is to summarize what is being said about a particular named entity (person, place, or thing) on Twitter over a certain period of time by generating random phrases that could be said after the named entity. To do this, we will first build a named entity recognizer to find named entities. Then, we will build a tier of language models that will help us to generate words, given a named entity.

Once everything is trained for a given named entity (words xi-k, …, xi) then the language model should generate up to 5 words to appear next: xi+1, xi+2, xi+3, xi+4, xi+5.  For example, given xi-k, …, xi = “United Nations” the model might produce xi+1, xi+2, xi+3, xi+4, xi+5 = “is considering global affairs .”.

Step 1: Turn your POS Tagger into a (partial) Named Entity Recognizer (25 points). Named entity recognition (NER) is the task of identifying mentions of people, places, and things. Typically, it also involves labeling a class for the named entity (person, organization, place, etc.), but for this work we will stick to simply identifying them. To do this, modify the tagger from assignment one so it simply classifies whether a word is a proper noun or not. This may sound easy, but proper nouns are one of the harder tags for a POS tagger, so you will need to update your system to work better for proper nouns. Specifically, the following features should be helpful additions:

  1. A feature that indicates if it is the first word of the sentence or not
  2. Adjust your one-hot representation so it treats words that only appear once in the training data as “out of vocabulary” -- a special index reserved for words that it hasn’t seen before. An easy way to do this is simply to update wordToIndex so that it only includes words that appear at least twice, and then make sure to add a feature for out of vocabulary. For example, if you have 550 words in wordToIndex, then reserve an additional index, 551, to indicate any word that isn’t in wordToIndex.
  3. A feature indicating whether the target word (and the previous and next words) is commonly a proper noun (this is called using a “gazetteer”). Thankfully there are lists available for this, such as that from Alan Ritter’s Twitter NLP package: https://github.com/aritter/twitter_nlp/blob/master/data/dictionaries/cap.1000
    Have the feature return 0 or 1 depending on whether the word is in the list of common proper nouns (a slightly more advanced version uses the reverse rank of the word in the list -- the entries are listed in order of frequency, as of 2012).
  4. Optionally, use vector representation of context instead of one-hot.
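The vocabulary and gazetteer features above can be sketched as follows. This is a minimal illustration, not the required implementation: the helper names (build_vocab, word_index, load_gazetteer, gazetteer_feature) are hypothetical, and it assumes the gazetteer file has one lowercase entry per line, as in cap.1000.

```python
from collections import Counter

def build_vocab(train_tokens):
    """Map words appearing at least twice to indices; reserve one extra
    index (the next unused one) for out-of-vocabulary words."""
    counts = Counter(train_tokens)
    wordToIndex = {}
    for word, c in counts.items():
        if c >= 2:
            wordToIndex[word] = len(wordToIndex)
    oov_index = len(wordToIndex)  # special index for unseen/rare words
    return wordToIndex, oov_index

def word_index(word, wordToIndex, oov_index):
    """Look a word up, falling back to the out-of-vocabulary index."""
    return wordToIndex.get(word, oov_index)

def load_gazetteer(path):
    """Read a newline-separated gazetteer file (e.g. Ritter's cap.1000)."""
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

def gazetteer_feature(word, gazetteer):
    """1 if the word appears in the gazetteer, else 0."""
    return 1 if word.lower() in gazetteer else 0
```

The same gazetteer_feature call can be applied to the previous and next words to produce the three context features.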

Also, go ahead and add additional training data from the same Ark Tweet NLP dataset: https://github.com/brendano/ark-tweet-nlp/blob/master/data/twpos-data-v0.3/oct27.conll?raw=true

Your final accuracy should be > 96% (note that simply predicting that every word is not a proper noun already yields a high accuracy, so make sure your tagger beats that baseline).

Take any contiguous chunk of proper nouns as a single named entity. For example, if your tagger produces the following labels for
“He told Barak Obama to read Newsday to learn about Stony Brook University” => tagger =>
He/0 told/0 Barak/1 Obama/1 to/0 read/0 Newsday/1 to/0 learn/0 about/0 Stony/1 Brook/1 University/1 ./0

Then you should mark “Barak Obama”, “Newsday”, and “Stony Brook University” as named entities.
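The chunking step above can be sketched as a short helper. This is one possible implementation (the function name is hypothetical); it assumes the tagger output is a parallel list of 0/1 labels as in the example.

```python
def extract_named_entities(tokens, labels):
    """Group each maximal run of tokens labeled 1 into one named entity."""
    entities, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:  # handle an entity that ends the sentence
        entities.append(" ".join(current))
    return entities
```

Running it on the example sentence yields “Barak Obama”, “Newsday”, and “Stony Brook University”.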

Step 2: Train a Generic Trigram Language Model on the Same Data (25 points).

Using the same data, daily547.conll and oct27.conll, train a trigram model (ignore the tags in the data). The model can be thought of as dictionaries that store word counts (unigram counts) and bigram counts, as well as a dictionary of bigrams, each storing a dictionary of words that might follow and their counts: trigramCounts[(word1, word2)][word3] = count of word1, word2, word3. This way, trigram probabilities can easily be computed on the fly, even with add-one smoothing:
p(w3 | w1, w2) = (trigramCounts[(w1, w2)][w3] + 1) / (bigramCounts[(w1, w2)] + VOCAB_SIZE)
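A minimal sketch of this counting scheme and the smoothed probability is below. The function names are hypothetical, and the “<s>” padding and “END” sentence-end token are assumptions (the “END” token matches the generation loop in Step 5).

```python
from collections import defaultdict

def train_trigram_lm(sentences):
    """Count unigrams, bigrams, and trigrams over tokenized sentences,
    padding each sentence with two start tokens and one END token."""
    unigramCounts = defaultdict(int)
    bigramCounts = defaultdict(int)
    trigramCounts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        toks = ["<s>", "<s>"] + list(sent) + ["END"]
        for w in toks:
            unigramCounts[w] += 1
        for w1, w2 in zip(toks, toks[1:]):
            bigramCounts[(w1, w2)] += 1
        for w1, w2, w3 in zip(toks, toks[1:], toks[2:]):
            trigramCounts[(w1, w2)][w3] += 1
    return unigramCounts, bigramCounts, trigramCounts

def trigram_prob(w1, w2, w3, bigramCounts, trigramCounts, vocab_size):
    """Add-one smoothed p(w3 | w1, w2); note the parentheses around
    both the numerator and the denominator."""
    return (trigramCounts[(w1, w2)][w3] + 1) / (bigramCounts[(w1, w2)] + vocab_size)
```

Because of the smoothing, unseen trigrams still receive a small nonzero probability.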

Let’s call this “baseLM”, for base language model. We will now update it for a specific named entity.

Step 3: Pull 1000 tweets with a given named entity from the Twitter stream (10 points).

There are several libraries available for Python to “listen to” the public Twitter stream. The tweepy package is one of the better ones. Be sure to read the tutorial on how to set up OAuth and create a stream listener. No matter what package you use, you will need to sign up for access to the stream with Twitter (choose personal use, since this is for your own education).

Once you’ve figured out how to access the stream, you’ll want to limit it to tweets mentioning a given named entity (e.g. “United Nations”) and wait until you receive at least 1000 tweets (you might test this using a lower threshold). Note: for rare named entities this may be slow (e.g. “Port Jefferson”), while for others it will go very fast (e.g. “Trump”).

In the final step, these tweets will be used to update the counts for the language model.

Step 4: Update Language Model (20 Points)

Process the tweets and update the counts of the “base” language model such that they reflect the tweets you collected for the given named entity. It’s a good idea to make a copy of base, and then simply add the counts you derive from the new tweets. The idea is that the base model represents language in general, while this update will “specialize” the model on what was said about the given named entity. When the model is prompted with the named entity, it will generate language based on the tweets you’ve just acquired but it can always “backoff” to a more general language model. You will need to limit to tweets that mention the named entity as a proper noun (e.g. “I will trump you” should not match on “trump”.)
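The copy-then-add update can be sketched as follows. This is an illustration under assumptions: the function name specialize_lm is hypothetical, the counts are stored as plain dicts in the (unigrams, bigrams, trigrams) layout from Step 2, the tweets are assumed to be already tokenized and already filtered to those where the entity was tagged as a proper noun, and the “<s>”/“END” padding matches the base model.

```python
import copy

def specialize_lm(base, tweet_token_lists):
    """Deep-copy the base (unigram, bigram, trigram) count dicts and add
    counts from the entity tweets, leaving the base model untouched so it
    can be reused for other named entities."""
    unigrams, bigrams, trigrams = (copy.deepcopy(c) for c in base)
    for toks in tweet_token_lists:
        toks = ["<s>", "<s>"] + list(toks) + ["END"]
        for w in toks:
            unigrams[w] = unigrams.get(w, 0) + 1
        for w1, w2 in zip(toks, toks[1:]):
            bigrams[(w1, w2)] = bigrams.get((w1, w2), 0) + 1
        for w1, w2, w3 in zip(toks, toks[1:], toks[2:]):
            followers = trigrams.setdefault((w1, w2), {})
            followers[w3] = followers.get(w3, 0) + 1
    return unigrams, bigrams, trigrams
```

Since the base counts dominate for contexts that rarely occur in the collected tweets, the specialized model naturally “backs off” toward general language there.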

Step 5: Generate 5 Different Phrases appearing after the named entity. (20 Points)

Now that you have all the pieces, use your tools from the previous steps to generate likely phrases that appear after the named entity. Specifically, you should generate five different phrases that follow the named_entity. The phrases can have up to 5 words, not counting the named entity. For example:

  1. United Nations is encouraging new legislation.
  2. United Nations introduced sustainable development goals.
  3. United Nations meeting was effective.
  4. United Nations is an unnecessary organization.
  5. United Nations to elect a new representative.

The algorithm to do this simply follows:

i = 0  # current index into generated phrase

new_word = named_entity_tokens[-1]

last_bigram = named_entity_tokens[-2:]  # last two words of named entity

while i < 5 and new_word != 'END':
        # get probability distribution for next word based on last_bigram
        # pick new_word from the probability distribution
        # update last_bigram = (last_bigram[1], new_word)

        i += 1
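One way to make the loop above concrete is sketched below. It is an illustration under assumptions: the function name generate_phrase is hypothetical, trigramCounts is the nested dict from Step 2, sampling is proportional to raw counts rather than the full smoothed distribution (a simplification), and a single-token entity is padded with “<s>”. It uses random.choices, which requires Python 3.6+.

```python
import random

def generate_phrase(named_entity_tokens, trigramCounts, max_words=5, seed=None):
    """Sample up to max_words words to follow the named entity, stopping
    early on the 'END' token or when no continuation has been observed."""
    rng = random.Random(seed)
    # pad so a one-word entity still yields a bigram context
    last_bigram = tuple((["<s>"] + list(named_entity_tokens))[-2:])
    phrase = []
    for _ in range(max_words):
        followers = trigramCounts.get(last_bigram)
        if not followers:
            break  # no observed continuation; a fuller version would back off
        words = list(followers.keys())
        weights = list(followers.values())
        new_word = rng.choices(words, weights=weights)[0]
        if new_word == 'END':
            break
        phrase.append(new_word)
        last_bigram = (last_bigram[1], new_word)
    return " ".join(phrase)
```

Calling this five times (with different random draws) produces the five different phrases required for the summary.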

Place everything into the following method such that we can call your method on any named entity:

def NamedEntityGenerativeSummary(named_entity,
                        twitter_access_token, twitter_access_token_secret):

        #train the named entity recognizer; save it to an object

        #train the generic (base) trigram language model; save it

        #pull 1000 tweets that contain named_entity

        #Limit to tweets with the named entity being classified as a named entity

        #generate five different phrases that follow the named_entity.


   

HINTS

This section will be updated as questions come in.