Assigned: 3/24/2019; Due: 4/3/2019 11:59pm
Requirements. You must use Python version 3.5 or later. The only data science libraries you may use are:
(if you are interested in other libraries that are not related to the main goals of the assignment, please ask).
All template methods must use the names provided (e.g. “tokenize(sent)”). However, you may add additional methods to keep your code clean. Copying chunks of code from other students, from online sources, or from other resources is prohibited. Please see the syllabus for additional policies.
Submission. All code must be placed in a single file titled “A2_lastname.py” where lastname is your last name. It must be submitted to Blackboard as a .py file. Submitting a zip file or multiple files will be considered an invalid submission -- late penalties or the deadline will apply if the single .py file is submitted afterward. If you are using API keys for your personal Twitter account, please remove the keys before submitting and make it clear where the keys should be set for testing.
Task Description. Your task is to summarize what is being said about a particular named entity (person, place, or thing) on Twitter over a certain period of time by generating random phrases that could be said after the named entity. To do this, we will first build a named entity recognizer to find named entities. Then, we will build a tier of language models that will help us to generate words, given a named entity.
Once everything is trained, then for a given named entity (words x_{i-k}, …, x_i) the language model should generate up to 5 words to appear next: x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}, x_{i+5}. For example, given x_{i-k}, …, x_i = “United Nations” the model might produce x_{i+1}, …, x_{i+5} = “is considering global affairs .”.
Step 1: Turn your POS Tagger into a (partial) Named Entity Recognizer (25 points). Named entity recognition (NER) is the task of identifying mentions of people, places, and things. Typically, it also involves labeling a class for each named entity (person, organization, place, etc.), but for this assignment we will stick to simply identifying named entities. To do this, modify the tagger from Assignment 1 so it simply classifies whether a word is a proper noun or not. This may sound easy, but proper nouns are among the harder tags for a POS tagger, so you will need to update your system to work better on them. Specifically, the following features should be helpful additions:
Also, go ahead and add additional training data from the same Ark Tweet NLP repository: https://github.com/brendano/ark-tweet-nlp/blob/master/data/twpos-data-v0.3/oct27.conll?raw=true
Your final accuracy should be > 96% (note that a degenerate tagger that simply labels every word as not a proper noun is not an acceptable solution, even if its accuracy is high).
Take any contiguous chunk of proper nouns as a single named entity. For example, if your tagger produces the following labels for
“He told Barak Obama to read Newsday to learn about Stony Brook University” => tagger =>
He/0 told/0 Barak/1 Obama/1 to/0 read/0 Newsday/1 to/0 learn/0 about/0 Stony/1 Brook/1 University/1 ./0
Then you should mark “Barak Obama”, “Newsday”, and “Stony Brook University” as named entities.
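The chunking step above can be sketched as follows. This is one possible approach, not a required implementation; the function name is ours, and the tagger is assumed to return (token, label) pairs with label 1 marking a proper noun:

```python
def extract_named_entities(tagged_tokens):
    """Group contiguous proper-noun tokens (label 1) into named entities.

    tagged_tokens: list of (token, label) pairs, where label 1 means proper noun.
    Returns a list of named-entity strings.
    """
    entities = []
    current = []  # proper-noun tokens of the entity being built
    for token, label in tagged_tokens:
        if label == 1:
            current.append(token)
        elif current:  # a non-proper-noun token closes the current chunk
            entities.append(" ".join(current))
            current = []
    if current:  # handle an entity that ends the sentence
        entities.append(" ".join(current))
    return entities

tagged = [("He", 0), ("told", 0), ("Barak", 1), ("Obama", 1), ("to", 0),
          ("read", 0), ("Newsday", 1), ("to", 0), ("learn", 0), ("about", 0),
          ("Stony", 1), ("Brook", 1), ("University", 1), (".", 0)]
print(extract_named_entities(tagged))
# -> ['Barak Obama', 'Newsday', 'Stony Brook University']
```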
Step 2: Train a Generic Trigram Language Model on the Same Data (25 points).
Using the same data, daily547.conll and oct27.conll, train a trigram model (ignore the tags in the data). The model can be thought of as a dictionary that stores word counts (unigram counts) and bigram counts, as well as a dictionary of bigrams, each storing a dictionary of words that might follow and their counts: trigramCounts[(word1, word2)][word3] = count of word1, word2, word3. This way, trigram probabilities can easily be computed on the fly, even with add-one smoothing:
p(w3 | w1, w2) = (trigramCounts[(w1, w2)][w3] + 1) / (bigramCounts[(w1, w2)] + VOCAB_SIZE)
Let’s call this “baseLM”, for base language model. We will now update it for specific named entities.
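A minimal sketch of the counting scheme and smoothed probability described above, using the dictionary layout from the formula. The class name and the sentence-padding tokens (“&lt;s&gt;” at the start, “END” at the end) are our own choices, not requirements:

```python
from collections import defaultdict

class TrigramLM:
    """Minimal trigram language model with add-one smoothing (a sketch)."""

    def __init__(self):
        self.unigramCounts = defaultdict(int)
        self.bigramCounts = defaultdict(int)
        # trigramCounts[(w1, w2)][w3] = count of the trigram w1, w2, w3
        self.trigramCounts = defaultdict(lambda: defaultdict(int))

    def train(self, sentences):
        for tokens in sentences:
            toks = ["<s>", "<s>"] + tokens + ["END"]  # pad each sentence
            for w in toks:
                self.unigramCounts[w] += 1
            for w1, w2 in zip(toks, toks[1:]):
                self.bigramCounts[(w1, w2)] += 1
            for w1, w2, w3 in zip(toks, toks[1:], toks[2:]):
                self.trigramCounts[(w1, w2)][w3] += 1

    def prob(self, w1, w2, w3):
        """Add-one smoothed p(w3 | w1, w2), computed on the fly."""
        vocab_size = len(self.unigramCounts)
        return ((self.trigramCounts[(w1, w2)][w3] + 1)
                / (self.bigramCounts[(w1, w2)] + vocab_size))

baseLM = TrigramLM()
baseLM.train([["the", "cat", "sat"]])
```

With this tiny training set the vocabulary is {&lt;s&gt;, the, cat, sat, END} (size 5), so p(sat | the, cat) = (1 + 1) / (1 + 5) = 1/3, while any unseen continuation gets 1/6.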
Step 3: Pull 1000 tweets with a given named entity from the Twitter stream (10 points).
There are several Python libraries for “listening to” the public Twitter stream. The tweepy package is one of the better ones. Be sure to read its tutorial on how to set up OAuth and create a stream listener. No matter which package you use, you will need to sign up with Twitter for access to the stream (choose personal use, since this is for your own education).
Once you’ve figured out how to access the stream, you’ll want to limit it to tweets mentioning a given named entity (e.g. “United Nations”) and wait until you receive at least 1000 tweets (you might test this using a lower threshold). Note: for rare named entities this may go slowly (e.g. “Port Jefferson”), while for others it will go very fast (e.g. “Trump”).
In the final step, these tweets will be used to update the counts for the language model.
Step 4: Update Language Model (20 Points)
Process the tweets and update the counts of the “base” language model so that they reflect the tweets you collected for the given named entity. It’s a good idea to make a copy of the base model and then simply add the counts you derive from the new tweets. The idea is that the base model represents language in general, while this update “specializes” the model on what was said about the given named entity. When the model is prompted with the named entity, it will generate language based on the tweets you’ve just acquired, but it can always “back off” to the more general language model. You will need to limit to tweets that mention the named entity as a proper noun (e.g. “I will trump you” should not match on “trump”).
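The copy-then-add idea can be sketched as below, here shown on just the trigram counts for brevity (the unigram and bigram counts would be updated the same way). The function name is ours, and filtering to proper-noun mentions is assumed to have already happened via the Step 1 tagger:

```python
import copy

def specialize(base_trigram_counts, tweet_token_lists):
    """Copy the base trigram counts, then add counts from the entity tweets.

    base_trigram_counts: dict mapping (w1, w2) -> {w3: count}
    tweet_token_lists: list of token lists, one per collected tweet
    Returns a new counts dict; the base counts are left untouched.
    """
    counts = copy.deepcopy(base_trigram_counts)  # don't mutate the base model
    for toks in tweet_token_lists:
        for w1, w2, w3 in zip(toks, toks[1:], toks[2:]):
            counts.setdefault((w1, w2), {}).setdefault(w3, 0)
            counts[(w1, w2)][w3] += 1
    return counts

base = {("United", "Nations"): {"meets": 2}}
tweets = [["United", "Nations", "is", "considering"]]
specialized = specialize(base, tweets)
# specialized now mixes base counts with tweet counts; base is unchanged
```

Because the tweet counts are simply added on top of copies of the base counts, contexts never seen in the tweets fall back to the base model's statistics, which is the “backoff” behavior described above.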
Step 5: Generate 5 Different Phrases appearing after the named entity. (20 Points)
Now that you have all the pieces, use your tools from the previous steps to generate likely phrases that appear after the named entity. Specifically, you should generate five different phrases that follow the named_entity. The phrases can have up to 5 words, not counting the named entity. For example:
The algorithm to do this is roughly as follows:
i = 0  # current index into the generated phrase
new_word = named_entity_tokens[-1]
last_bigram = named_entity_tokens[-2:]  # last two words of the named entity
while i < 5 and new_word != 'END':
    # get the probability distribution for the next word based on last_bigram
    # pick new_word from that distribution and append it to the phrase
    # update last_bigram = (last_bigram[1], new_word)
    i += 1
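One way to flesh out the loop above into runnable code is sketched here. The helper `next_word_distribution(bigram)` is a stand-in for the specialized model from Step 4 (it should return a word-to-probability dict for the given bigram context); the toy table at the bottom exists only so the example runs. Note that `random.choices` requires Python 3.6+, and the sketch assumes the named entity has at least two tokens:

```python
import random

def generate_phrase(named_entity_tokens, next_word_distribution, max_words=5):
    """Generate up to max_words words to follow the named entity."""
    phrase = []
    last_bigram = tuple(named_entity_tokens[-2:])
    while len(phrase) < max_words:
        dist = next_word_distribution(last_bigram)
        words, probs = zip(*dist.items())
        new_word = random.choices(words, weights=probs)[0]  # sample next word
        if new_word == "END":
            break
        phrase.append(new_word)
        last_bigram = (last_bigram[-1], new_word)  # shift the context window
    return phrase

# Toy distribution table standing in for the real trigram model (illustration only).
toy = {
    ("United", "Nations"): {"is": 1.0},
    ("Nations", "is"): {"considering": 1.0},
    ("is", "considering"): {"global": 1.0},
    ("considering", "global"): {"affairs": 1.0},
    ("global", "affairs"): {".": 0.5, "END": 0.5},
}
print(generate_phrase(["United", "Nations"],
                      lambda bigram: toy.get(bigram, {"END": 1.0})))
```

Calling `generate_phrase` five times, as Step 5 asks, yields five (possibly different) phrases, since each next word is sampled from the distribution rather than taken greedily.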
Place everything into the following method such that we can call your method on any named entity:
#train the named entity recognizer; save it to an object
#train the generic (base) trigram language model; save it
#pull 1000 tweets that contain named_entity
#limit to tweets where the named entity is tagged as a proper noun
#generate five different phrases that follow the named_entity.
This section will be updated as questions come in.