
Everything You Always Wanted to Know About NLP but Were Afraid to Ask

Steven Butler and Max Schwartz

PyGotham 2016


Assumptions

  • You have an interest in natural language processing (NLP), but don’t have much experience with it
  • You’re okay with us mostly focusing on English for this relatively short talk (even though we like working on multilingual data too!)

Python

  • Python 3 is strongly recommended for NLP (or really all) tasks, because of native Unicode support (you’re gonna need Unicode support)
  • Get the Jupyter notebook here and follow along


Some Applications of NLP

  • Search
  • Spam Filters
  • Chatbots
  • Autocorrect, Predictive Text
  • Document Classification, Authorship Attribution
  • Sentiment Analysis
  • Translation
  • Speech Recognition
  • Automatic Summarization
  • So many others


Why Linguistics?

  • Natural language processing is hard.
  • Language is messy, and questions that seem trivial at first become enormously complicated very quickly.
  • Linguistics is the field of study focused on language: its form, its usage, its sound systems, its interaction with different systems in the brain, its social context...
  • Using the body of knowledge about language developed by linguists can help make NLP problems clearer and easier to approach


Theoretical Linguistics Concepts

Today we will look at:

  • Morphology
  • Syntax
  • Semantics

Other important areas that we are ignoring:

  • Phonetics, Phonology
  • Prosody
  • Pragmatics


What Even Is a Word

Morphology: the study of morphemes, the smallest linguistic units that carry meaning.

Words are made up of one or more morphemes:

“words” => [“word”, “s”]

“smallest” => [“small”, “est”]

Some morphemes are free (they can function as words independently), others are bound (they must attach to other morphemes). Many lie somewhere in between.


Task: Stemming

  • In English (and a lot of other languages), affixes tend to show up in a certain order, and the orthographic (=spelling) changes that come with them follow general rules and patterns
  • With a few groups of potential suffixes, you can build an extremely simple, crude stemmer that gets a decent number of stems correct
  • For English, this has been done extensively (see the Porter stemmer and its variations; a quick demo follows below)
  • Of course, you can try an (unsupervised) machine learning approach as well, to deal with a wider variety of data
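
For instance, with NLTK installed (pip install nltk), a Porter stemmer is one import away. A minimal demo; exact outputs can vary by stemmer version:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["words", "ponies", "running", "relational", "smallest"]:
        print(word, "=>", stemmer.stem(word))
    # Typical output: word, poni, run, relat, smallest
    # Note "poni" and "relat": stems need not be real words, and
    # "smallest" is untouched because Porter ignores superlative -est.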


A Parliament of Words (an aside)

The definition of word gets tricky, especially once you step outside English:

English: copyeditor (copy editor?), the

Spanish: lo (as in lo siento, “I’m sorry”), desafortunadamente (“unfortunately”)

Indonesian: ketidakbertanggungjawaban (“irresponsibility” => [ke-, tidak, [ber-, tanggung], jawab, -an])

Turkish: uygarlaştıramadıklarımızdanmışsınızcasına ("behaving as if you are among those whom we could not civilize")


Task: Inserting Word Breaks

  • Word breaks are conventions of writing systems, and not everyone agrees about where they should be.
  • English’s writing system has fairly strong conventions; other languages’ systems have few, none, or a different concept altogether (cf. Thai, Chinese).
  • We can make use of those conventions to solve some of our tasks by looking at the places where breaks are most likely to occur.
  • Techniques like dynamic programming can make this search run much faster than brute force (see the sketch below)
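
A minimal sketch of dictionary-driven segmentation with memoization; the vocabulary here is made up, and a real segmenter would score candidate splits by word frequency instead of taking the first match:

    from functools import lru_cache

    VOCAB = {"the", "cat", "sat", "on", "mat", "them", "at"}  # toy word list

    def segment(text):
        """Return one segmentation of text into VOCAB words, or None."""
        @lru_cache(maxsize=None)  # memoize: each suffix is solved only once
        def solve(i):
            if i == len(text):
                return []
            for j in range(i + 1, len(text) + 1):
                if text[i:j] in VOCAB:
                    rest = solve(j)
                    if rest is not None:
                        return [text[i:j]] + rest
            return None
        return solve(0)

    print(segment("thecatsatonthemat"))
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']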


N-Grams

Here are some bigrams.

(Here, are) (are, some) (some, bigrams) (bigrams, .)

And here are some trigrams.

(And, here, are) (here, are, some) (are, some, trigrams) (some, trigrams, .)
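
In Python, zipping staggered copies of the token list gives you n-grams in one line (nltk.ngrams does the same thing):

    def ngrams(tokens, n):
        # zip n staggered copies of the token list together
        return list(zip(*(tokens[i:] for i in range(n))))

    tokens = ["Here", "are", "some", "bigrams", "."]
    print(ngrams(tokens, 2))
    # [('Here', 'are'), ('are', 'some'), ('some', 'bigrams'), ('bigrams', '.')]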


Statistical Language Models

Mini corpus: “I really really really really really really like you

And I want you, do you want me, do you want me, too?”

P(want | you) = 0.5

P(me | want) = 0.667

P(really | really) = 0.833

Relies on the Markov Assumption: the next word depends only on the previous n - 1 words, not the whole history. (The snippet below recomputes these numbers.)
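
A quick check of those numbers in plain Python, with punctuation and case dropped for simplicity:

    from collections import Counter

    tokens = ("i really really really really really really like you "
              "and i want you do you want me do you want me too").split()

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(word, prev):
        # maximum-likelihood estimate: count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev]

    print(prob("want", "you"))       # 0.5
    print(prob("me", "want"))        # 0.666...
    print(prob("really", "really"))  # 0.8333...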


Task: Text Generation

  • Tokens: the units of language being investigated. In most circumstances, these correspond to “words.”
  • First tokenize the corpus. Some tricky parts: Punctuation? Contractions?
  • From there we create a list of n-grams.
  • Using the Markov Assumption, we guess a potential next word based on the probability that each candidate word follows the previous n - 1 words in our model.
  • We can find all words that follow an initial “seed,” choose one of those words, and continue looping until a predetermined endpoint (sketched below).
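
A minimal sketch of that loop for a bigram (n = 2) model, reusing the mini corpus from the previous slide; the seed is arbitrary:

    import random
    from collections import defaultdict

    def build_model(tokens, n=2):
        # map each (n-1)-word history to every word observed after it;
        # keeping duplicates means random.choice samples by frequency
        model = defaultdict(list)
        for gram in zip(*(tokens[i:] for i in range(n))):
            model[gram[:-1]].append(gram[-1])
        return model

    def generate(model, seed, max_len=10):
        out = list(seed)
        while len(out) < max_len:
            nexts = model.get(tuple(out[-len(seed):]))
            if not nexts:        # unseen history: stop early
                break
            out.append(random.choice(nexts))
        return " ".join(out)

    tokens = ("i really really really really really really like you "
              "and i want you do you want me do you want me too").split()
    print(generate(build_model(tokens), seed=("i",)))  # e.g. "i want me too"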


Syntax and Parts of Speech

  • Parts of speech are things like nouns, verbs, adjectives, etc.
  • They are a way of categorizing words based on the syntactic and semantic roles that they play.
  • You can expand your notion of a language model by including these categories alongside words; categories let the model generalize.
  • This allows probabilities to be generated for how often a particular category or POS follows a given sequence of words, rather than having to store probabilities for each word.


Task: Part of Speech Tagging

Q: Why not just look up a word’s part of speech in the dictionary?

A: Time flies like an arrow, fruit flies like a banana.

  • You can build a POS tagger that assigns parts of speech from bigram statistics, decoding with the Viterbi algorithm.
  • There are two key pieces to our bigram POS tagger (both sketched in code below):
    • A word::POS_tag conditional frequency distribution
    • A POS_tag::previous_POS_tag conditional frequency distribution
  • For each word in an input, the tagger will ask:
    • How often does this word appear with this POS tag?
    • How often does this word’s POS tag follow the previous word’s POS tag?
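
A sketch of those two distributions built from NLTK’s tagged Brown corpus (after nltk.download("brown") and nltk.download("universal_tagset")); a Viterbi decoder would walk over exactly these counts:

    import nltk
    from nltk.corpus import brown

    # piece 1: the word::POS_tag distribution
    word_tag = nltk.ConditionalFreqDist(
        (w.lower(), t) for w, t in brown.tagged_words(tagset="universal"))

    # piece 2: the POS_tag::previous_POS_tag distribution
    tag_tag = nltk.ConditionalFreqDist(
        (prev[1], cur[1])
        for sent in brown.tagged_sents(tagset="universal")
        for prev, cur in zip(sent, sent[1:]))

    print(word_tag["flies"].most_common())  # both NOUN and VERB appear
    print(tag_tag["DET"].most_common(3))    # tags that most often follow DET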


Semantics and WordNet

  • After words (morphology) and context (syntax), we can now move into one of the trickiest parts of NLP: meaning (semantics)
  • WordNet is an ambitious project that attempts to map out the semantic relations (the ontology) of the vast majority of English’s vocabulary
  • It is built around the concept of synsets, groups of synonyms, which are connected to one another by a variety of semantic relationships (hypernyms, hyponyms, etc.)
  • These are mapped out as graphs, which allow you to navigate from word to word to explore their relationships
  • It also includes a lot of morphological information, like stems and lemmas, and provides usage examples (see the tour below)
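
A short tour in NLTK (after nltk.download("wordnet")); the synsets and lemmas shown are what WordNet 3.0 returns:

    from nltk.corpus import wordnet as wn

    dog = wn.synsets("dog")[0]      # Synset('dog.n.01')
    print(dog.definition())         # "a member of the genus Canis ..."
    print(dog.hypernyms())          # more general synsets above it
    print(dog.hyponyms()[:3])       # more specific kinds of dog
    print([lem.name() for lem in dog.lemmas()])
    # ['dog', 'domestic_dog', 'Canis_familiaris']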


Bag-Of-Words Model

S1: “This is sentence one.”

S2: “This sentence is the second sentence.”

Vocabulary: [this, is, sentence, one, ., the, second]

S1: [1, 1, 1, 1, 1, 0, 0]

S2: [1, 1, 2, 0, 1, 1, 1]
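
scikit-learn’s CountVectorizer (see the Feature Extraction link in the references) builds these vectors for you; note that its default tokenizer only keeps words of two or more letters, so the “.” column from the slide disappears:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["This is sentence one.", "This sentence is the second sentence."]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
    # ['is', 'one', 'second', 'sentence', 'the', 'this']
    print(X.toarray())
    # [[1 1 0 1 0 1]
    #  [1 0 1 2 1 1]]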


Resources ‘n’ Refs

  • Natural Language Processing with Python; Bird, Klein, and Loper
    • This is essentially the documentation to the Natural Language Toolkit (NLTK)
  • Speech and Language Processing; Jurafsky and Martin
  • WordNet
  • Scikit-Learn Feature Extraction


Thank You!

Steven: srbutler

srbutler@gmail.com

Max: maxwell-schwartz

@DeathAndMaxes

maxwell.schwartz11@gmail.com

Thanks for coming! Questions and comments are welcome.