Everything You Always Wanted to Know About NLP but Were Afraid to Ask
Steven Butler and Max Schwartz
PyGotham 2016
Assumptions
Python
Some Applications of NLP
Why Linguistics?
Theoretical Linguistics Concepts
Today we will look at:
Other important areas that we are ignoring:
What Even Is a Word
Morphology: the study of morphemes, the smallest linguistic units that carry meaning.
Words are made up of one or more morphemes:
“words” => [“word”, “s”]
“smallest” => [“small”, “est”]
Some morphemes are free (they can function as words independently); others are bound (they must attach to other morphemes). Many lie somewhere in between.
Task: Stemming
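As a rough illustration of the task, the sketch below strips a suffix from a word using a tiny hand-written suffix list. The list, the `split_suffix` name, and the `min_stem` cutoff are all assumptions made for this example; this is not how any particular stemming library works.

```python
# Toy suffix list (an assumption for illustration, not a real stemmer's rules).
SUFFIXES = ("est", "ing", "ed", "s")


def split_suffix(word, min_stem=3):
    """Return [stem, suffix] if a known suffix can be stripped, else [word]."""
    # Try longer suffixes first so "est" wins over "s" where both could apply.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return [stem, suffix]
    return [word]


print(split_suffix("words"))
print(split_suffix("smallest"))
print(split_suffix("the"))
```

A rule list this small over-strips: it would also split "forest" into "for" + "est". Practical stemmers use more elaborate rules to limit mistakes like that.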
A Parliament of Words (an aside)
The definition of "word" gets tricky, especially once you move beyond English:
English: copyeditor (copy editor?), the
Spanish: lo (as in lo siento), desafortunadamente
Indonesian: ketidakbertanggungjawaban (=> [ke-, tidak, [ber-, tanggung], jawab, -an])
Turkish: uygarlaştıramadıklarımızdanmışsınızcasına ("behaving as if you are among those whom we could not civilize")
Task: Inserting Word Breaks
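One simple way to approach this task, assuming a known vocabulary is available, is dynamic programming over every possible split point. The sketch below prefers the segmentation with the fewest words; the `segment` function and the small vocabulary are hypothetical and only for illustration.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no segmentation exists.
    """
    # best[i] holds the fewest-words segmentation of text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Hypothetical vocabulary for the example.
vocabulary = {"copy", "cop", "edit", "editor", "or"}
print(segment("copyeditor", vocabulary))
```

Preferring the fewest words is only one possible tie-breaker; a statistical approach would instead score each candidate with a language model.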
N-Grams
Here are some bigrams.
(Here, are) (are, some) (some, bigrams) (bigrams, .)
And here are some trigrams.
(And, here, are) (here, are, some) (are, some, trigrams) (some, trigrams, .)
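The bigrams and trigrams above can be produced by sliding a window of size n across a token list. This is a minimal sketch; the `ngrams` helper is written here for illustration, and the token list is assumed to be already split.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigram_tokens = ["Here", "are", "some", "bigrams", "."]
trigram_tokens = ["And", "here", "are", "some", "trigrams", "."]

print(ngrams(bigram_tokens, 2))
print(ngrams(trigram_tokens, 3))
```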
Statistical Language Models
Mini corpus: “I really really really really really really like you
And I want you, do you want me, do you want me, too?”
P(want | you) = 0.5
P(me | want) = 0.667
P(really | really) = 0.833
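Relies on the Markov assumption: the next word depends only on the previous n-1 words (for a bigram model, only on the previous word).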
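The three probabilities for the mini corpus can be reproduced by counting bigrams. The sketch below assumes that text is lowercased and punctuation is dropped before counting, which is what makes the numbers come out as shown; the `prob` helper is a name chosen for this example.

```python
import re
from collections import Counter

CORPUS = (
    "I really really really really really really like you "
    "And I want you, do you want me, do you want me, too?"
)

# Assumption: lowercase everything and ignore punctuation.
tokens = re.findall(r"[a-z]+", CORPUS.lower())

# Count each (previous, next) pair and each word that has a successor.
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])


def prob(next_word, prev_word):
    """Maximum-likelihood estimate of P(next_word | prev_word)."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, next_word)] / context_counts[prev_word]


print(round(prob("want", "you"), 3))
print(round(prob("me", "want"), 3))
print(round(prob("really", "really"), 3))
```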
Task: Text Generation
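A bigram model can generate text by repeatedly sampling a next word given the current one. The sketch below does this for the mini corpus, again assuming lowercasing and no punctuation; `generate`, `max_words`, and `seed` are names invented for this example.

```python
import random
import re
from collections import defaultdict

CORPUS = (
    "I really really really really really really like you "
    "And I want you, do you want me, do you want me, too?"
)
tokens = re.findall(r"[a-z]+", CORPUS.lower())

# Map each word to the list of words that followed it in the corpus.
# Keeping duplicates means random.choice samples in proportion to the counts.
successors = defaultdict(list)
for prev_word, next_word in zip(tokens, tokens[1:]):
    successors[prev_word].append(next_word)


def generate(start, max_words=10, seed=None):
    """Generate up to max_words words, beginning with start."""
    rng = random.Random(seed)
    words = [start]
    # Stop early if the current word was never followed by anything.
    while len(words) < max_words and words[-1] in successors:
        words.append(rng.choice(successors[words[-1]]))
    return words


print(" ".join(generate("i", max_words=10, seed=0)))
```

Every adjacent pair in the output is a bigram that occurs in the corpus, so the result is locally plausible even when the whole sentence is not.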
Syntax and Parts of Speech
Task: Part of Speech Tagging
Q: Why not just look up a word’s part of speech in the dictionary?
A: Time flies like an arrow, fruit flies like a banana.
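To see why dictionary lookup alone falls short, the sketch below lists every tag sequence a lookup would allow for the two clauses. The lexicon and its tag names are a small hand-made assumption for this example, not taken from a real dictionary or tagset.

```python
from itertools import product

# Hypothetical mini-lexicon: each word maps to the tags a dictionary might list.
LEXICON = {
    "time": {"NOUN", "VERB"},
    "fruit": {"NOUN"},
    "flies": {"NOUN", "VERB"},
    "like": {"VERB", "ADP"},
    "an": {"DET"},
    "a": {"DET"},
    "arrow": {"NOUN"},
    "banana": {"NOUN"},
}


def candidate_taggings(sentence):
    """Return every tag sequence that dictionary lookup alone permits."""
    options = [sorted(LEXICON[word]) for word in sentence.lower().split()]
    return list(product(*options))


for sentence in ("Time flies like an arrow", "Fruit flies like a banana"):
    print(sentence, "->", len(candidate_taggings(sentence)), "candidate taggings")
```

Lookup leaves several candidates per clause, and choosing among them requires context: "flies" is a verb in the first clause and a noun in the second.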
Semantics and WordNet
Bag-Of-Words Model
S1: “This is sentence one.”
S2: “This sentence is the second sentence.”
[this, is, sentence, one, ., the, second]
S1: [1, 1, 1, 1, 1, 0, 0]
S2: [1, 1, 2, 0, 1, 1, 1]
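The two vectors above can be rebuilt with a few lines of code. This sketch assumes lowercased tokens, punctuation kept as its own token, and a vocabulary ordered by first appearance; the `tokenize` and `bag_of_words` helpers are written for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def bag_of_words(sentences):
    """Return (vocabulary, vectors) with one count vector per sentence."""
    tokenized = [tokenize(sentence) for sentence in sentences]

    # Build the vocabulary in order of first appearance.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # Count how often each vocabulary entry occurs in each sentence.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(
    ["This is sentence one.", "This sentence is the second sentence."]
)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is discarded: S2 would get the same vector if its words were shuffled.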
Resources ‘n’ Refs
Thank You!
Steven: srbutler
Max: maxwell-schwartz
@DeathAndMaxes
Thanks for coming! Questions and comments are welcome.