CS 263:
Advanced NLP
Saadia Gabriel
Lecture 1
NLP has a growing and diverse research community
Check out our Friday research talks
From 2-3pm in Eng 6 289
Association for Computational Linguistics
(Founded in the 1960s)
PTE Requests
I will start addressing these after the first class and will give an update on Wednesday.
What is Natural Language Processing?
Why are we all in this room?
Most people use NLP in their daily lives…
Launched 2006
Most people use NLP in their daily lives…
credit: ifunny.com
Siri: Launched 2011
Alexa: Launched 2014
Most people use NLP in their daily lives…
Claude, Copilot: Launched 2023
Gemini: Launched 2024
NLP
NLP ≠ LLM Science
LLMs
But we will focus heavily on modern NLP and large-scale language modeling
10 or 11 years ago…
Zipf’s Law:
The frequency of a word is inversely proportional to its rank; for example, the most frequent word appears about twice as often as the second most frequent.
You can visualize Zipf’s Law as a power-law distribution dominated by common words.
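This is easy to check empirically: count word frequencies in any text and inspect the rank-frequency relationship. A minimal sketch on a toy corpus (a large corpus shows the pattern much more clearly):

```python
from collections import Counter

# Toy corpus; any large text exhibits Zipf's law more cleanly.
text = ("the cat sat on the mat the dog saw the cat "
        "and the dog chased the cat over the mat").split()

counts = Counter(text)
# most_common sorts by frequency, so position = rank - 1.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq stays roughly constant.
    print(rank, word, freq, rank * freq)
```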
10 or 11 years ago…
You might have been parsing corpora with regular expressions (Regex)
Find sentences beginning with a pronoun:
^ : the search phrase starts at the beginning of the sentence
(…|…) : a word group listing the words to match, with | as the OR symbol
\b : a word boundary
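A minimal sketch of such a pattern with Python's `re` module (the pronoun list here is illustrative, not exhaustive):

```python
import re

# ^       anchors the match to the start of the sentence
# (..|..) word group, with | as the OR symbol between the words to match
# \b      word boundary, so "It" does not also match "Italy"
pattern = re.compile(r"^(He|She|It|They|We|You|I)\b")

sentences = ["He went to the store.", "The dog barked.", "They left early."]
starts_with_pronoun = [s for s in sentences if pattern.match(s)]
```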
NLP bridges
Computer Science
Computational
Linguistics
Modern NLP focuses heavily on AI
Levels of Language
Courtesy of Nanyun (Violet) Peng
Levels of Language
In English, why is “hangry” one word while “digging a foundation” is three words?
Levels of Language
Papa learns computer programming with an AI assistant.
Each part has a semantic role within the larger sentence.
Papa is an Agent, an entity that carries out an action (e.g. learning).
“Learns” is the Predicate/Action.
The AI assistant is the tool used to perform the action.
Programming is a Patient, an entity that undergoes a state of change due to the action, or is possessed/acquired/exchanged.
Let’s think more formally about meaning and relatedness
Terminology: lemma and wordform
Lemmas have senses
Sense 1:
…a bank can hold the investments in a custodial account…
Sense 2:
“…as agriculture burgeons on the east bank the river will shrink even more”
A sense is a discrete representation of an aspect of a word’s meaning.
Hyponymy and Hypernymy
One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
mango is a hyponym of fruit
Conversely hypernym/superordinate (“hyper is super”)
vehicle is a hypernym of car
fruit is a hypernym of mango
Meronymy
A leg is part of a chair; a wheel is part of a car.
Homonymy:
multi-sense as an artifact
Homonyms: words that share a form (spelling or pronunciation) but have unrelated, distinct meanings:
bat1: club for hitting a ball, bat2: nocturnal flying mammal
Homonymy:
multi-sense as an artifact
A related multilingual concept is “false friends”: words with identical or similar forms in two languages but different meanings across languages.
Think “pain” in French (bread) vs. “pain” in English.
Homonymy causes problems for NLP applications
“bat care”
bat: murciélago (animal) or bate (for baseball)
bass (stringed instrument) vs. bass (fish)
Polysemy: related multi-sense
Sense 2: “A financial institution”
Sense 1: “The building belonging to a financial institution”
Synonyms
Antonyms
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
Levels of Language
You should get a hammer and hit it?
Do you have the ability to crack the window?
Can you open the window right now?
Can you crack the window a little bit?
When this is important in the real world
5 years ago…
Content moderation traditionally uses blocklists.
What could go wrong with this?
Offensive Content Warning
Breaking Content Filtering Systems
OpenAI content filter
HateBERT
Incorrectly given 73% probability of toxicity
Incorrectly given 8% probability of toxicity
Child abuse is wrong, racism is wrong, sexism is wrong
If you have ever been to a restaurant, you have probably noticed that the service is much better if the waiter is white…
Toxicity detectors use loaded words and group-mentions as signals
African-American English tweets are considerably more likely to be misclassified as offensive compared to White-aligned tweets
Risk of Racial Bias in Hate Speech Detection
Systemic risk of silencing already marginalized populations
- Sap, Card and Gabriel et al. (2019)
Data collection (benign and toxic examples)
Prompt GPT-3 to generate novel examples
Prompt engineering: sample examples with group mentions and pass them to GPT-3 as a prompt, e.g.
“I think he said his parents were Jewish”
“Nissenbaum is historically a Jewish name”
“Even as a Jew I think we can all have fun”
GPT-3 responds with a novel example, e.g. “Passover is an annual Jewish holiday”
Use classifier-in-the-loop decoding for adversarial behavior
The “ALICE” method pairs a pretrained language model (PLM) like GPT-3 with a toxicity/hate speech classifier like BERT
During decoding, a search algorithm proposes candidate (sub)sequences from the PLM
The classifier scores each candidate; candidates are reranked and decoding continues
Toxigen (Hartvigsen and Gabriel et al., 2022)
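The reranking step can be sketched as follows. Here `lm_score` and `toxicity_score` are toy stand-ins with made-up values, not the real PLM and classifier, and the actual ALICE method operates on model logits during beam search rather than whole strings:

```python
def lm_score(candidate):
    # Pretend log-probability from the language model (hypothetical values).
    scores = {"candidate A": -1.0, "candidate B": -1.5, "candidate C": -3.0}
    return scores[candidate]

def toxicity_score(candidate):
    # Pretend classifier probability of toxicity (hypothetical values).
    scores = {"candidate A": 0.9, "candidate B": 0.1, "candidate C": 0.2}
    return scores[candidate]

def rerank(candidates, weight=2.0, want_toxic=False):
    """Combine LM fluency with the classifier signal.

    Flipping the sign of the classifier term steers decoding toward or
    away from text the classifier flags, enabling adversarial generation.
    """
    sign = 1.0 if want_toxic else -1.0
    return sorted(
        candidates,
        key=lambda c: lm_score(c) + sign * weight * toxicity_score(c),
        reverse=True,
    )

best = rerank(["candidate A", "candidate B", "candidate C"])[0]
```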
Detecting Toxicity is Highly Subjective
Modeling Communication
Language exists within certain contexts…
Linguistic, e.g. interpreting the meaning of “Do you want to go there?” based on the previous conversational history “Hey, the park reopened.”
Extra-linguistic (social, spatial or temporal factors),
e.g. whether it is offensive to say something in a given context
Implicature is what lies beneath explicit statements
Ann: “Do you sell paste?”
Bill: “I sell rubber cement”
Implication: Bill doesn’t sell paste, or he would have said so
Hirschberg (1985)
We can describe this as an instance of a “speech act”
A speech act is an utterance that not only conveys information but also performs an action (Austin, 1975).
Rational Speech Act (RSA)
Literal Listener (Lit): interprets utterance u at face value, Lit(w | u) ∝ [[u]](w) · P(w), where P(w) is the object prior and [[u]] is the meaning function returning 1 if u is indicative of w
Pragmatic Speaker (S1): the speaker chooses their utterance (u) to maximize the probability of the literal listener (Lit) interpreting object w, S1(u | w) ∝ exp(α · log Lit(w | u)), where α is a hyperparameter controlling the speaker’s rationality
Pragmatic Listener (L1): interprets S1’s utterance and uses Bayes’ rule to identify the object w and state of the world (s) based on S1’s utterance, L1(w | u) ∝ S1(u | w) · P(w)
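A worked toy example makes the three levels concrete. This is the classic reference-game setup (the objects and utterances here are illustrative, with a uniform prior and α = 1):

```python
import math

objects = ["blue_square", "blue_circle", "green_square"]
utterances = ["blue", "square", "circle", "green"]
# Meaning function [[u]](w): 1 if utterance u is true of object w.
meaning = {
    "blue":   {"blue_square": 1, "blue_circle": 1, "green_square": 0},
    "square": {"blue_square": 1, "blue_circle": 0, "green_square": 1},
    "circle": {"blue_square": 0, "blue_circle": 1, "green_square": 0},
    "green":  {"blue_square": 0, "blue_circle": 0, "green_square": 1},
}
prior = {w: 1 / len(objects) for w in objects}
alpha = 1.0  # speaker rationality

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def Lit(u):
    # Literal listener: P(w | u) proportional to [[u]](w) * P(w)
    return normalize({w: meaning[u][w] * prior[w] for w in objects})

def S1(w):
    # Pragmatic speaker: P(u | w) proportional to exp(alpha * log Lit(w | u))
    scores = {}
    for u in utterances:
        p = Lit(u)[w]
        scores[u] = math.exp(alpha * math.log(p)) if p > 0 else 0.0
    return normalize(scores)

def L1(u):
    # Pragmatic listener: P(w | u) proportional to S1(u | w) * P(w)
    return normalize({w: S1(w)[u] * prior[w] for w in objects})
```

Hearing “blue”, the pragmatic listener favors the blue square: if the speaker had meant the blue circle, they would more likely have said the unambiguous “circle”.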
How do we represent words and concepts?
How do we represent a word?
Representing words as discrete symbols
egg student talk university happy buy
Issues?
v_happy = [0 0 0 1 0 ... 0]
v_sad = [0 0 1 0 0 ... 0]
v_milk = [1 0 0 0 0 ... 0]
v_happy · v_sad = v_happy · v_milk = 0
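The problem is easy to verify: every pair of distinct one-hot vectors is orthogonal, so dot products carry no notion of similarity. A sketch over a toy 5-word vocabulary:

```python
# One-hot vectors over a toy 5-word vocabulary (index = word id).
def one_hot(index, size=5):
    v = [0] * size
    v[index] = 1
    return v

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v_happy, v_sad, v_milk = one_hot(3), one_hot(2), one_hot(0)
# "happy" is no closer to "sad" than to "milk": both dot products are 0.
```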
Word meanings that can help decide:
Distributional Hypothesis (Harris, 1954):
A word is characterized by the company it keeps. In other words, words that occur in the same contexts tend to have similar meanings.
Theoretical foundation of distributional semantics
Intuitions: Zellig Harris (1954):
Word-context matrix for word similarity
Note: Very sparse! (~ 50,000 x 50,000)
Word-word matrix
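Building such a matrix from a corpus is straightforward: slide a context window over the text and count which words co-occur. A sketch with a window of ±2 words on a toy corpus:

```python
from collections import defaultdict

corpus = "I like deep learning . I like NLP . I enjoy flying .".split()
window = 2  # words within +/- 2 positions count as context

# cooc[target][context] = number of times context appears near target.
cooc = defaultdict(lambda: defaultdict(int))
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[word][corpus[j]] += 1
```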
Problem with raw counts
Two classes of vector representation
Pointwise Mutual Information
Now how do we compute similarity?
We can use cosine similarity to quantify how close two word vectors are to each other
Courtesy of pyimagesearch
Measuring similarity: cosine
Cosine for computing similarity
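Cosine similarity is the dot product of the two vectors divided by the product of their lengths: cos(v, w) = (v · w) / (|v| |w|). A sketch with toy count vectors (the counts below are illustrative, not from a real corpus):

```python
import math

def cosine(v, w):
    # cos(v, w) = (v . w) / (|v| * |w|)
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Toy count vectors over two contexts, e.g. ("pie", "computer").
cherry = [442, 8]
strawberry = [60, 2]
digital = [5, 1683]
# cherry is far more similar to strawberry than to digital.
```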
Two classes of vector representation
Mutual Information (X; Y): accounts for chance word co-occurrences by capturing how much knowing X decreases our uncertainty about Y
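Pointwise mutual information makes this concrete: PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ], comparing the observed co-occurrence probability against what independence would predict. A sketch over a toy co-occurrence table (counts are illustrative):

```python
import math

# Toy co-occurrence counts between target words and context words.
counts = {
    ("cherry", "pie"): 40, ("cherry", "data"): 1,
    ("digital", "pie"): 1, ("digital", "data"): 50,
}
total = sum(counts.values())

def p_joint(x, y):
    return counts[(x, y)] / total

def p_word(x):
    return sum(c for (w, _), c in counts.items() if w == x) / total

def p_context(y):
    return sum(c for (_, w), c in counts.items() if w == y) / total

def pmi(x, y):
    # Positive: co-occur more than chance; negative: less than chance.
    return math.log2(p_joint(x, y) / (p_word(x) * p_context(y)))
```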
What’s ahead
Next Time…
There are now many types of embeddings:
https://huggingface.co/blog/mteb
Overview of the Class
Representation Learning, Semantics, Pragmatics
Scaling Language Modeling Pipelines
Evaluation
Ethics and Applications
Basics of Language Modeling
Goals of the Course
Spoiler:
Not AGI
Main Course Website
Weeks 1 and 2:
Sophie Klitgaard (UCLA)
In-class Paper
Presentation Guidelines
Presenting in groups of up to 3
(sign up using the spreadsheet linked on the website and Bruin Learn)
Each presentation is ~7 minutes with 3 minutes for Q&A
Select a peer-reviewed paper
(you can take a look at ACL Anthology or other conference proceedings)
Weeks 3 and 4:
Weeks 5-8:
John Thickstun (Cornell)
Alisa Liu (UW)
Weeks 9 and 10
4 days of in-person final project presentations
Students will individually provide feedback for other project groups
If you are registered for the course and you are not added by tomorrow night, email me.
Piazza
For general course and homework discussion.
Bruin Learn
The website will be up by next Monday.
This is primarily for submitting homework assignments and viewing grades.
Grading
Cheating
Cheating
Prerequisites
The course uses basic probability, statistics, linear algebra, and machine learning (CM146, CS260, or equivalent), which will be reviewed as needed.
Questions?