1 of 86

CS 263:

Advanced NLP

Saadia Gabriel

Lecture 1

2 of 86

NLP has a growing and diverse research community

Check out our Friday research talks

From 2-3pm in Eng 6 289

Association for Computational Linguistics

(Founded in the 1960s)

3 of 86

PTE Requests

I will start addressing these after the first class and will give an update on Wednesday.

4 of 86

What is Natural Language Processing?

Why are we all in this room?

5 of 86

6 of 86

Most people use NLP in their daily lives…

Launched 2006

7 of 86

Most people use NLP in their daily lives…

credit: ifunny.com

Siri: Launched 2011

Alexa: Launched 2014

8 of 86

Most people use NLP in their daily lives…

Claude, Copilot: Launched 2023

Gemini: Launched 2024

9 of 86

NLP

NLP ≠ LLM Science

LLMs

But we will focus heavily on modern NLP and large-scale language modeling

10 of 86

10 or 11 years ago…

Zipf’s Law:

The frequency of a word is inversely proportional to its rank, for example the most frequent word appears twice as often as the second most frequent.

You can visualize Zipf’s Law as a power-law distribution dominated by common words.
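As a quick sketch (on a toy text, so the law will only hold loosely), you can rank word counts and inspect the frequency drop-off:

```python
from collections import Counter

# Count word frequencies in a small toy text, then rank them.
# On a real corpus, frequency(rank r) is roughly frequency(rank 1) / r.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the cat and the dog watched the door"
)
counts = Counter(text.split())
ranked = counts.most_common()  # [(word, freq), ...] sorted by frequency

for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(rank, word, freq)
```

The most frequent word ("the") dominates, and counts fall off quickly down the ranks.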

11 of 86

10 or 11 years ago…

You might have been parsing corpora with regular expressions (Regex)

Find sentences beginning with a pronoun:

The pattern is built from: an anchor so the search phrase starts at the beginning of the sentence, a word group containing the words to match separated by the OR symbol (|), and a word boundary.
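Putting those pieces together, a pattern like the following is one plausible reconstruction (the exact pronoun list on the slide is an assumption):

```python
import re

# Match sentences that begin with a pronoun:
#   ^          - anchor: start of the sentence
#   ( | )      - word group, with OR (|) between the words to match
#   \b         - word boundary, so "Hector" doesn't match via "He"
pattern = re.compile(r"^(He|She|They|We|I|It|You)\b")

sentences = ["He went to the park.", "The park reopened.", "Hector arrived."]
matches = [s for s in sentences if pattern.match(s)]
print(matches)  # ['He went to the park.']
```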

12 of 86

NLP bridges

Computer Science

Computational

Linguistics

Modern NLP focuses heavily on AI

13 of 86

Levels of Language

14 of 86

Courtesy of Nanyun (Violet) Peng

Levels of Language

In English, why is “hangry” one word while “digging a foundation” is three words?

15 of 86

Levels of Language

16 of 86

Papa learns computer programming with an AI assistant.

Each part has a semantic role within the larger sentence.

Papa is an Agent, an entity that carries out an action (e.g. learning). Learns is the Predicate/Action, and the AI assistant is the Tool used to perform the action.

Computer programming is a Patient, an entity that undergoes a state of change due to the action, or is possessed/acquired/exchanged.

17 of 86

Let’s think more formally about meaning and relatedness

18 of 86

Terminology: lemma and wordform

  • A lemma or citation form
    • Representation of all forms with the same stem, part of speech, rough semantics
  • A wordform
    • The inflected word as it appears in text

19 of 86

Lemmas have senses

  • One lemma “bank” can have many meanings:

Sense 1: “…a bank can hold the investments in a custodial account…”

Sense 2: “…as agriculture burgeons on the east bank the river will shrink even more…”

  • Sense (or word sense)
    • A discrete representation of an aspect of a word’s meaning.

20 of 86

Hyponymy and Hypernymy

One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other

    • car is a hyponym of vehicle

mango is a hyponym of fruit

Conversely, hypernym/superordinate (“hyper is super”)

vehicle is a hypernym of car

fruit is a hypernym of mango
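These relations can be modeled as edges in a toy taxonomy (a hypothetical hand-built dictionary, not a real lexical resource like WordNet):

```python
# Hypothetical toy taxonomy: hyponym -> hypernym edges.
hypernym_of = {
    "car": "vehicle",
    "mango": "fruit",
    "vehicle": "artifact",
    "fruit": "food",
}

def is_hyponym(word, ancestor):
    """True if `word` is a (transitive) hyponym of `ancestor`."""
    while word in hypernym_of:
        word = hypernym_of[word]
        if word == ancestor:
            return True
    return False

print(is_hyponym("car", "vehicle"))    # True
print(is_hyponym("car", "artifact"))   # True (transitive)
print(is_hyponym("mango", "vehicle"))  # False
```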

21 of 86

Meronymy

  • The part-whole relation

A leg is part of a chair; a wheel is part of a car.

  • Wheel is a meronym of car, and car is a holonym of wheel

22 of 86

Homonymy:

multi-sense as an artifact

Homonyms: words that share a form (spelling or pronunciation) but have unrelated, distinct meanings:

  • bank1: financial institution,  bank2: sloping land
  • bat1: club for hitting a ball,  bat2: nocturnal flying mammal

23 of 86

Homonymy:

multi-sense as an artifact

A related multilingual concept is “false friends,” which have identical or similar forms in 2 languages but have different meanings across languages.

Think “pain” in French vs. “pain” in English.

24 of 86

Homonymy causes problems for NLP applications

  • Information retrieval

“bat care”

  • Machine Translation

bat → murciélago (animal) or bate (for baseball)

  • Text-to-Speech

bass (stringed instrument) vs. bass (fish)

25 of 86

Polysemy: related multi-sense

  • 1. The bank was constructed in 1875 out of local red brick.

  • 2. I withdrew the money from the bank 

  • Are those the same sense?

Sense 2: “A financial institution”

Sense 1: “The building belonging to a financial institution”

  • A polysemous word has related meanings. Most non-rare words have multiple meanings.

26 of 86

Synonyms

  • Words (different forms) that have the same meaning in some or all contexts.

    • couch / sofa
    • big / large
    • automobile / car
    • vomit / throw up
    • water / H2O

27 of 86

Antonyms

  • Senses that are opposites with respect to one feature of meaning
  • Otherwise, they are very similar!

dark/light   short/long fast/slow rise/fall

hot/cold     up/down       in/out

28 of 86

Levels of Language

29 of 86

You should get a hammer and hit it?

Do you have the ability to crack the window?

Can you open the window right now?

Can you crack the window a little bit?

30 of 86

When this is important in the real world

31 of 86

5 years ago…

Content moderation traditionally uses blocklists.

What could go wrong with this?
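One classic failure mode is naive substring matching, which flags innocuous words that happen to contain a blocked string (a hypothetical blocklist, for illustration):

```python
# Hypothetical blocklist-based filter using naive substring matching.
blocklist = ["ass"]

def is_flagged(text):
    lowered = text.lower()
    return any(bad in lowered for bad in blocklist)

# False positive: harmless words contain the blocked substring.
print(is_flagged("The assassin appears in classic literature."))  # True
print(is_flagged("A perfectly friendly sentence."))               # False
```

Naive matching both over-flags benign words and misses implicit toxicity that is expressed without using any blocked word at all.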

32 of 86

Offensive Content Warning

33 of 86

Breaking Content Filtering Systems

OpenAI content filter: the benign statement “Child abuse is wrong, racism is wrong, sexism is wrong” is incorrectly given a 73% probability of toxicity.

HateBERT: the covertly biased statement “If you have ever been to a restaurant, you have probably noticed that the service is much better if the waiter is white…” is incorrectly given only an 8% probability of toxicity.

34 of 86


Toxicity detectors use loaded words and group-mentions as signals

35 of 86


Toxicity detectors use loaded words and group-mentions as signals

36 of 86


African-American English tweets are considerably more likely to be misclassified as offensive compared to White-aligned tweets

Risk of Racial Bias in Hate Speech Detection

Systemic risk of silencing already marginalized populations

- Sap, Card and Gabriel et al. (2019)

37 of 86


Data collection (benign and toxic examples): prompt GPT-3 to generate novel examples.

Prompt engineering: sample examples with group mentions to form a prompt, e.g.

  • Passover is an annual Jewish holiday\n
  • I think he said his parents were Jewish\n
  • Nissenbaum is historically a Jewish name\n
  • Even as a Jew I think we can all have fun\n
  • I’ve really come to embrace my Judaism\n
  • Jewish weddings can be a lot of fun\n

Pass the prompt to GPT-3; GPT-3 responds with a novel example, e.g. “Passover is an annual Jewish holiday”, which can be added to the pool and sampled again.

Use classifier-in-the-loop decoding for adversarial behavior: the “ALICE” method pairs a pretrained language model (PLM) like GPT-3 with a toxicity/hate speech classifier like BERT. During decoding, a search algorithm proposes candidate (sub)sequences, the classifier scores them, and the candidates are reranked before generation continues.

Toxigen (Hartvigsen and Gabriel et al., 2022)
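A minimal sketch of the classifier-in-the-loop reranking idea (the scoring functions below are toy stand-ins, not the actual ALICE scoring or real model APIs):

```python
# Toy stand-ins for the real models (hypothetical scores, for illustration).
def lm_logprob(sequence):
    # Pretend PLM score: slightly prefers shorter sequences.
    return -0.1 * len(sequence)

def toxicity_score(sequence):
    # Pretend classifier: fires on sequences containing a loaded word.
    return 1.0 if "loaded" in sequence else 0.0

def rerank(candidates, weight=2.0):
    """Combine the LM score with the classifier signal, then rerank.
    Steering the combined score with the classifier is how decoding can be
    pushed toward examples that are adversarial for that classifier."""
    scored = [
        (lm_logprob(c) + weight * toxicity_score(c), c) for c in candidates
    ]
    scored.sort(reverse=True)
    return [c for _, c in scored]

candidates = ["a neutral phrase", "a loaded phrase", "hi"]
print(rerank(candidates))
```

In the real pipeline this reranking happens over partial sequences at each decoding step, with a PLM and a BERT-style classifier in place of the toy functions.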

38 of 86


39 of 86


Detecting Toxicity is Highly Subjective

40 of 86


41 of 86


42 of 86


43 of 86


44 of 86

Modeling Communication

45 of 86

46 of 86

Language exists within certain contexts…

Linguistic, e.g. interpreting the meaning of “Do you want to go there?” based on the previous conversational history “Hey, the park reopened.”

Extra-linguistic (social, spatial or temporal factors), e.g. is it offensive to say something?

47 of 86

Implicature is what lies beneath explicit statements

Ann

Bill

Hirschberg (1985)

Do you sell paste?

I sell rubber cement

Implication: Bill doesn’t sell paste, or he would have said so

48 of 86

Ann

Do you sell paste?

We can describe this as an instance of a “speech act”

A speech act is an utterance that not only conveys information but also performs an action (Austin, 1975).

49 of 86

Rational Speech Act (RSA)

Pragmatic Speaker (S1): the speaker chooses their utterance (u) to maximize the probability of a literal listener (Lit) interpreting object w.

Pragmatic Listener (L1): the pragmatic listener interprets S1’s utterance and uses Bayes’ rule to identify w and the state of the world (s) based on S1’s utterance.

50 of 86

Rational Speech Act (RSA)

Pragmatic Listener (L1) and Pragmatic Speaker (S1)

The literal listener’s rule weights each object by its prior and by a meaning function returning 1 if u is indicative of w.

51 of 86

Rational Speech Act (RSA)

Pragmatic Listener (L1) and Pragmatic Speaker (S1)

The speaker’s rule includes a hyperparameter controlling the speaker’s rationality.

52 of 86

Rational Speech Act (RSA)

Pragmatic Listener (L1) and Pragmatic Speaker (S1)
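These rules can be sketched concretely on the classic reference-game setup (an illustrative implementation of the standard RSA equations, assuming a uniform prior and rationality hyperparameter alpha = 1):

```python
# Toy RSA model on a reference game: three objects, four utterances.
objects = ["blue_square", "blue_circle", "green_square"]
utterances = ["blue", "green", "square", "circle"]

# Meaning function: the set of objects each utterance is true of.
meaning = {
    "blue": {"blue_square", "blue_circle"},
    "green": {"green_square"},
    "square": {"blue_square", "green_square"},
    "circle": {"blue_circle"},
}
prior = {w: 1 / len(objects) for w in objects}  # uniform prior over objects
alpha = 1.0  # speaker rationality

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def literal_listener(u):
    # L0(w|u) proportional to [[u]](w) * P(w)
    return normalize({w: (w in meaning[u]) * prior[w] for w in objects})

def speaker(w):
    # S1(u|w) proportional to L0(w|u) ** alpha
    return normalize({u: literal_listener(u)[w] ** alpha for u in utterances})

def pragmatic_listener(u):
    # L1(w|u) proportional to S1(u|w) * P(w)  (Bayes' rule over the speaker)
    return normalize({w: speaker(w)[u] * prior[w] for w in objects})

post = pragmatic_listener("blue")
print(post)  # blue_square: 0.6, blue_circle: 0.4, green_square: 0.0
```

Hearing “blue”, the pragmatic listener favors the square: had the speaker meant the circle, the unambiguous “circle” would have been the better utterance.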

53 of 86

How do we represent words and concepts?

54 of 86

How do we represent a word?

  • How do we “understand” a word?

  • How can we know the relation/distance/similarity between words computationally?

55 of 86

Representing words as discrete symbols

  • Naïve way: represent words as atomic symbols: student, talk, university (BoW)

  • Represent a word as a “one-hot” vector, e.g. for the vocabulary (egg, student, talk, university, happy, buy, …):

university → [ 0  0  0  1  0  …  0 ]

  • How large is (what’s the dimension of) this vector?
    • Vector dimension = number of words in vocabulary 
      • PTB data: ~50k
      • Google 1T data: 13M
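A minimal sketch with a six-word toy vocabulary:

```python
vocab = ["egg", "student", "talk", "university", "happy", "buy"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = [0] * len(vocab)  # dimension = vocabulary size
    v[index[word]] = 1
    return v

print(one_hot("university"))  # [0, 0, 0, 1, 0, 0]

# Any two distinct words are orthogonal under this representation.
dot = sum(a * b for a, b in zip(one_hot("happy"), one_hot("buy")))
print(dot)  # 0
```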


56 of 86

Issues?

  • Dimensionality is large; vector is sparse

  • No similarity

   v_happy = [0  0  0  1  0 ... 0]
   v_sad   = [0  0  1  0  0 ... 0]
   v_milk  = [1  0  0  0  0 ... 0]

   v_happy · v_sad = v_happy · v_milk = 0

  • Cannot represent new words


57 of 86

Word meanings that can help decide:

  • Word Similarity 
    • Distributional (Vector) Models of Meaning

  • Word Relations

  • Word Sense Disambiguation

  • Semantic Roles

Distributional Hypothesis (Harris, 1954):

A word is characterized by the company it keeps. In other words, words that occur in the same contexts tend to have similar meanings.

58 of 86

Theoretical foundation of distributional semantics

Intuitions:  Zellig Harris (1954):

  • “oculist and eye-doctor … occur in almost the same environments”

  • “If A and B have almost identical environments we say that they are synonyms.”

59 of 86

Word-context matrix for word similarity

  • Two words are similar in meaning if their context vectors are similar

Note: Very sparse! (~ 50,000 x 50,000)
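A word–context matrix can be sketched as window-based counts over a toy corpus (the corpus and window size here are illustrative):

```python
from collections import Counter, defaultdict

corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]

window = 1  # context = immediate neighbors on either side
cooc = defaultdict(Counter)
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[word][sent[j]] += 1

print(dict(cooc["like"]))  # {'i': 2, 'deep': 1, 'nlp': 1}
```

Each row (e.g. `cooc["like"]`) is that word’s context vector; similar rows mean similar words.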

60 of 86

Word-word matrix

61 of 86

Problem with raw counts

62 of 86

Two classes of vector representation

  • Sparse vector representations
  • Mutual information weighted co-occurrence matrices

63 of 86

Pointwise Mutual Information

64 of 86

Pointwise Mutual Information
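The standard definition, PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ], can be computed directly from counts (toy numbers for illustration):

```python
import math

# Toy counts (illustrative): N total pairs, c_x and c_y marginal counts,
# c_xy the joint count of x and y occurring together.
N, c_x, c_y, c_xy = 100, 10, 20, 5

p_x, p_y, p_xy = c_x / N, c_y / N, c_xy / N
pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 3))  # log2(0.05 / 0.02) = log2(2.5) ≈ 1.322

# Positive PMI (PPMI) clips negatives, which are unreliable
# estimates from sparse counts.
ppmi = max(pmi, 0.0)
```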

65 of 86

Now how do we compute similarity?

66 of 86

We can use cosine similarity to quantify how close 2 word vectors are to each other

Courtesy of pyimagesearch

67 of 86

Measuring similarity: cosine

68 of 86

Cosine for computing similarity
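A self-contained sketch of the cosine measure, cos(v, w) = (v · w) / (|v| |w|), on toy count vectors (the numbers are illustrative):

```python
import math

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Toy co-occurrence count vectors over two context words.
cherry = [442, 8]
digital = [5, 1683]
pie = [432, 2]

print(round(cosine(cherry, pie), 3))      # high: similar contexts
print(round(cosine(cherry, digital), 3))  # low: different contexts
```

Because counts are non-negative, cosine here ranges from 0 (orthogonal) to 1 (identical direction), independent of vector length.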

69 of 86

Two classes of vector representation

  • Sparse vector representations
    • Mutual information weighted co-occurrence matrices
  • Dense vector representations
    • In the past, skip-grams or CBOW
    • More likely these days, pretrained transformers like BERT

Mutual Information I(X; Y): accounts for chance word co-occurrences by capturing how much knowing X decreases our uncertainty about Y

70 of 86

What’s ahead

71 of 86

Next Time…

There are now many types of embeddings:

https://huggingface.co/blog/mteb

72 of 86

Overview of the Class

Representation Learning, Semantics, Pragmatics

Scaling Language Modeling Pipelines

Evaluation

Ethics and Applications

Basics of Language Modeling

73 of 86

Goals of the Course

  • Deeply delve into modern NLP problems & solutions: 
  • Models, algorithms, and tools that are out there to solve language-related problems you want to tackle
  • How to design and evaluate your own task/model/algorithm/tool
    • Using data and statistics
    • Using your own creativity
    • Using the latest advances

  • At the end you should:
    • Agree that language is subtle & interesting 
    • Feel some ownership over the models/algorithms/tools
    • Understand research papers in the field

Spoiler:

Not AGI

74 of 86

Main Course Website

75 of 86

Weeks 1 and 2:

Sophie Klitgaard (UCLA)

76 of 86

In-class Paper

Presentation Guidelines

Presenting in groups of up to 3

(sign up using the spreadsheet linked on the website and Bruin Learn)

Each presentation is ~7 minutes with 3 minutes for Q&A

Select a peer-reviewed paper

(you can take a look at ACL Anthology or other conference proceedings)

77 of 86

Weeks 3 and 4:

78 of 86

Weeks 5-8:

John Thickstun (Cornell)

Alisa Liu (UW)

79 of 86

Weeks 9 and 10

4 days of in-person final project presentations

Students will individually provide feedback for other project groups

80 of 86

If you are registered for the course and you are not added by tomorrow night, email me.

Piazza

For general course and homework discussion.

81 of 86

Bruin Learn

The website will be up by next Monday.

This is primarily for submitting homework assignments and viewing grades.

82 of 86

Grading

83 of 86

Cheating

  • You MAY
    • talk with other students, friends, or others about your homework assignments IF you acknowledge such discussion in your submission
    • ask questions about the homework and subject material in the forums

  • You MAY NOT
    • copy code or answers from any source including friends, homework/test services, NLP or other software libraries. This includes making slight changes to previously written code
    • find solutions to these problems online
    • share solutions to these problems online
    • hack the scoring servers
    • allow your code to be copied, even if unintentionally
    • attempt to communicate with or read from any other person or device while taking exams

84 of 86

Cheating

  • Unfortunately, about 5% of students will be caught cheating, based on previous experience.

  • Suspected cheaters (including those whose work was copied) will be reported to the University. Punishment includes but is not limited to:
    • Zero on assignment, exam, or class
    • Loss of career services privileges
    • Loss of CPT rights

85 of 86

Prerequisites

  • This is an advanced graduate-level course, so it is expected you have prior knowledge of machine learning. While we review basic NLP concepts, this is faster-paced than an undergraduate NLP course.

  • I expect you to program (in Python) at the level of a CS first-year PhD or better.

  • I expect you to know or be comfortable learning PyTorch, which is a deep learning framework based on Python.

The course draws on basic probability, statistics, linear algebra, and machine learning (CM146, CS260 or equivalent), which will be reviewed as needed.

86 of 86

Questions?