1 of 86

CS 263:

Advanced NLP

Saadia Gabriel

Lecture 1

2 of 86

NLP has a growing and diverse research community

Check out our Friday research talks

From 2-3pm in Eng 6 289

Association for Computational Linguistics

(Founded in the 1960s)

3 of 86

PTE Requests

I will start addressing these after the first class and will give an update on Wednesday.

4 of 86

What is Natural Language Processing?

Why are we all in this room?

5 of 86

6 of 86

Most people use NLP in their daily lives…

Launched 2006

7 of 86

Most people use NLP in their daily lives…

credit: ifunny.com

Siri: Launched 2011

Alexa: Launched 2014

8 of 86

Most people use NLP in their daily lives…

Claude, Copilot: Launched 2023

Gemini: Launched 2024

9 of 86

NLP

NLP ≠ LLM Science

LLMs

But we will focus heavily on modern NLP and large-scale language modeling

10 of 86

10 or 11 years ago…

Zipf’s Law:

The frequency of a word is inversely proportional to its rank, for example the most frequent word appears twice as often as the second most frequent.

You can visualize Zipf’s Law as a power-law distribution dominated by common words.
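As a quick sketch (on a toy text, so the law will only hold loosely), you can rank word counts and inspect the frequency drop-off:

```python
from collections import Counter

# Count word frequencies in a small toy text, then rank them.
# On a real corpus, frequency(rank r) is roughly frequency(rank 1) / r.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the cat and the dog watched the door"
)
counts = Counter(text.split())
ranked = counts.most_common()  # [(word, freq), ...] sorted by frequency

for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(rank, word, freq)
```

The most frequent word ("the") dominates, and counts fall off quickly down the ranks.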

11 of 86

10 or 11 years ago…

You might have been parsing corpora with regular expressions (Regex)

Find sentences beginning with a pronoun:

The pattern is built from: an anchor so the search phrase starts at the beginning of the sentence, a word group containing the words to match separated by the OR symbol (|), and a word boundary.
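Putting those pieces together, a pattern like the following is one plausible reconstruction (the exact pronoun list on the slide is an assumption):

```python
import re

# Match sentences that begin with a pronoun:
#   ^          - anchor: start of the sentence
#   ( | )      - word group, with OR (|) between the words to match
#   \b         - word boundary, so "Hector" doesn't match via "He"
pattern = re.compile(r"^(He|She|They|We|I|It|You)\b")

sentences = ["He went to the park.", "The park reopened.", "Hector arrived."]
matches = [s for s in sentences if pattern.match(s)]
print(matches)  # ['He went to the park.']
```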

12 of 86

NLP bridges

Computer Science

Computational

Linguistics

Modern NLP focuses heavily on AI

13 of 86

Levels of Language

14 of 86

Courtesy of Nanyun (Violet) Peng

Levels of Language

In English, why is “hangry” one word while “digging a foundation” is three words?

15 of 86

Levels of Language

16 of 86

Papa learns computer programming with an AI assistant.

Each part has a semantic role within the larger sentence.

Papa is an Agent, an entity that carries out an action (e.g. learning). Learns is the Predicate/Action, and the AI assistant is the Tool used to perform the action.

Computer programming is a Patient, an entity that undergoes a state of change due to the action, or is possessed/acquired/exchanged.

17 of 86

Let’s think more formally about meaning and relatedness

18 of 86

Terminology: lemma and wordform

  • A lemma or citation form
    • Representation of all forms with the same stem, part of speech, rough semantics
  • A wordform
    • The inflected word as it appears in text

19 of 86

Lemmas have senses

  • One lemma “bank” can have many meanings:

Sense 1: “…a bank can hold the investments in a custodial account…”

Sense 2: “…as agriculture burgeons on the east bank the river will shrink even more…”

  • Sense (or word sense)
    • A discrete representation of an aspect of a word’s meaning.

20 of 86

Hyponymy and Hypernymy

One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other

    • car is a hyponym of vehicle

mango is a hyponym of fruit

Conversely, hypernym/superordinate (“hyper is super”)

vehicle is a hypernym of car

fruit is a hypernym of mango
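These relations can be modeled as edges in a toy taxonomy (a hypothetical hand-built dictionary, not a real lexical resource like WordNet):

```python
# Hypothetical toy taxonomy: hyponym -> hypernym edges.
hypernym_of = {
    "car": "vehicle",
    "mango": "fruit",
    "vehicle": "artifact",
    "fruit": "food",
}

def is_hyponym(word, ancestor):
    """True if `word` is a (transitive) hyponym of `ancestor`."""
    while word in hypernym_of:
        word = hypernym_of[word]
        if word == ancestor:
            return True
    return False

print(is_hyponym("car", "vehicle"))    # True
print(is_hyponym("car", "artifact"))   # True (transitive)
print(is_hyponym("mango", "vehicle"))  # False
```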

21 of 86

Meronymy

  • The part-whole relation

A leg is part of a chair; a wheel is part of a car.

  • Wheel is a meronym of car, and car is a holonym of wheel

22 of 86

Homonymy:

multi-sense as an artifact

Homonyms: words that share a form (spelling or pronunciation) but have unrelated, distinct meanings:

  • bank1: financial institution,  bank2: sloping land
  • bat1: club for hitting a ball,  bat2: nocturnal flying mammal

23 of 86

Homonymy:

multi-sense as an artifact

A related multilingual concept is “false friends,” which have identical or similar forms in 2 languages but have different meanings across languages.

Think “pain” in French vs. “pain” in English.

24 of 86

Homonymy causes problems for NLP applications

  • Information retrieval

“bat care”

  • Machine Translation

bat → murciélago (animal) or bate (for baseball)

  • Text-to-Speech

bass (stringed instrument) vs. bass (fish)

25 of 86

Polysemy: related multi-sense

  • 1. The bank was constructed in 1875 out of local red brick.

  • 2. I withdrew the money from the bank 

  • Are those the same sense?

Sense 2: “A financial institution”

Sense 1: “The building belonging to a financial institution”

  • A polysemous word has related meanings. Most non-rare words have multiple meanings.

26 of 86

Synonyms

  • Words (different forms) that have the same meaning in some or all contexts.

    • couch / sofa
    • big / large
    • automobile / car
    • vomit / throw up
    • water / H2O

27 of 86

Antonyms

  • Senses that are opposites with respect to one feature of meaning
  • Otherwise, they are very similar!

dark/light   short/long fast/slow rise/fall

hot/cold     up/down       in/out

28 of 86

Levels of Language

29 of 86

You should get a hammer and hit it?

Do you have the ability to crack the window?

Can you open the window right now?

Can you crack the window a little bit?

30 of 86

When this is important in the real world

31 of 86

5 years ago…

Content moderation traditionally uses blocklists.

What could go wrong with this?
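One classic failure mode is naive substring matching, which flags innocuous words that happen to contain a blocked string (a hypothetical blocklist, for illustration):

```python
# Hypothetical blocklist-based filter using naive substring matching.
blocklist = ["ass"]

def is_flagged(text):
    lowered = text.lower()
    return any(bad in lowered for bad in blocklist)

# False positive: harmless words contain the blocked substring.
print(is_flagged("The assassin appears in classic literature."))  # True
print(is_flagged("A perfectly friendly sentence."))               # False
```

Naive matching both over-flags benign words and misses implicit toxicity that is expressed without using any blocked word at all.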

32 of 86

Offensive Content Warning

33 of 86

Breaking Content Filtering Systems

OpenAI content filter: the benign statement “Child abuse is wrong, racism is wrong, sexism is wrong” is incorrectly given a 73% probability of toxicity.

HateBERT: the covertly biased statement “If you have ever been to a restaurant, you have probably noticed that the service is much better if the waiter is white…” is incorrectly given only an 8% probability of toxicity.

34 of 86


Toxicity detectors use loaded words and group-mentions as signals

35 of 86


Toxicity detectors use loaded words and group-mentions as signals

36 of 86


African-American English tweets are considerably more likely to be misclassified as offensive compared to White-aligned tweets

Risk of Racial Bias in Hate Speech Detection

Systemic risk of silencing already marginalized populations

- Sap, Card and Gabriel et al. (2019)

37 of 86


Data collection (benign and toxic examples): prompt GPT-3 to generate novel examples.

Prompt engineering: sample examples with group mentions to form a prompt, e.g.

  • Passover is an annual Jewish holiday\n
  • I think he said his parents were Jewish\n
  • Nissenbaum is historically a Jewish name\n
  • Even as a Jew I think we can all have fun\n
  • I’ve really come to embrace my Judaism\n
  • Jewish weddings can be a lot of fun\n

Pass the prompt to GPT-3; GPT-3 responds with a novel example, e.g. “Passover is an annual Jewish holiday”, which can be added to the pool and sampled again.

Use classifier-in-the-loop decoding for adversarial behavior: the “ALICE” method pairs a pretrained language model (PLM) like GPT-3 with a toxicity/hate speech classifier like BERT. During decoding, a search algorithm proposes candidate (sub)sequences, the classifier scores them, and the candidates are reranked before generation continues.

Toxigen (Hartvigsen and Gabriel et al., 2022)
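A minimal sketch of the classifier-in-the-loop reranking idea (the scoring functions below are toy stand-ins, not the actual ALICE scoring or real model APIs):

```python
# Toy stand-ins for the real models (hypothetical scores, for illustration).
def lm_logprob(sequence):
    # Pretend PLM score: slightly prefers shorter sequences.
    return -0.1 * len(sequence)

def toxicity_score(sequence):
    # Pretend classifier: fires on sequences containing a loaded word.
    return 1.0 if "loaded" in sequence else 0.0

def rerank(candidates, weight=2.0):
    """Combine the LM score with the classifier signal, then rerank.
    Steering the combined score with the classifier is how decoding can be
    pushed toward examples that are adversarial for that classifier."""
    scored = [
        (lm_logprob(c) + weight * toxicity_score(c), c) for c in candidates
    ]
    scored.sort(reverse=True)
    return [c for _, c in scored]

candidates = ["a neutral phrase", "a loaded phrase", "hi"]
print(rerank(candidates))
```

In the real pipeline this reranking happens over partial sequences at each decoding step, with a PLM and a BERT-style classifier in place of the toy functions.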

38 of 86


39 of 86


Detecting Toxicity is Highly Subjective

40 of 86


41 of 86


42 of 86


43 of 86


44 of 86

Modeling Communication

45 of 86

46 of 86

Language exists within certain contexts…

Linguistic, e.g. interpreting the meaning of “Do you want to go there?” based on the previous conversational history “Hey, the park reopened.”

Extra-linguistic (social, spatial or temporal factors), e.g. is it offensive to say something?

47 of 86

Implicature is what lies beneath explicit statements

Ann

Bill

Hirschberg (1985)

Do you sell paste?

I sell rubber cement

Implication: Bill doesn’t sell paste, or he would have said so

48 of 86

Ann

Do you sell paste?

We can describe this as an instance of a “speech act”

A speech act is an utterance that not only conveys information but also performs an action (Austin, 1975).

49 of 86

Rational Speech Act (RSA)

Pragmatic Speaker (S1): the speaker chooses their utterance (u) to maximize the probability of a literal listener (Lit) interpreting object w.

Pragmatic Listener (L1): the pragmatic listener interprets S1’s utterance and uses Bayes’ rule to identify w and the state of the world (s) based on S1’s utterance.

50 of 86

Rational Speech Act (RSA)

Pragmatic Listener (L1) and Pragmatic Speaker (S1)

The literal listener’s rule weights each object by its prior and by a meaning function returning 1 if u is indicative of w.

51 of 86

Rational Speech Act (RSA)

Pragmatic Listener (L1) and Pragmatic Speaker (S1)

The speaker’s rule includes a hyperparameter controlling the speaker’s rationality.

52 of 86

Rational Speech Act (RSA)

Pragmatic Listener (L1) and Pragmatic Speaker (S1)
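These rules can be sketched concretely on the classic reference-game setup (an illustrative implementation of the standard RSA equations, assuming a uniform prior and rationality hyperparameter alpha = 1):

```python
# Toy RSA model on a reference game: three objects, four utterances.
objects = ["blue_square", "blue_circle", "green_square"]
utterances = ["blue", "green", "square", "circle"]

# Meaning function: the set of objects each utterance is true of.
meaning = {
    "blue": {"blue_square", "blue_circle"},
    "green": {"green_square"},
    "square": {"blue_square", "green_square"},
    "circle": {"blue_circle"},
}
prior = {w: 1 / len(objects) for w in objects}  # uniform prior over objects
alpha = 1.0  # speaker rationality

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def literal_listener(u):
    # L0(w|u) proportional to [[u]](w) * P(w)
    return normalize({w: (w in meaning[u]) * prior[w] for w in objects})

def speaker(w):
    # S1(u|w) proportional to L0(w|u) ** alpha
    return normalize({u: literal_listener(u)[w] ** alpha for u in utterances})

def pragmatic_listener(u):
    # L1(w|u) proportional to S1(u|w) * P(w)  (Bayes' rule over the speaker)
    return normalize({w: speaker(w)[u] * prior[w] for w in objects})

post = pragmatic_listener("blue")
print(post)  # blue_square: 0.6, blue_circle: 0.4, green_square: 0.0
```

Hearing “blue”, the pragmatic listener favors the square: had the speaker meant the circle, the unambiguous “circle” would have been the better utterance.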

53 of 86

How do we represent words and concepts?

54 of 86

How do we represent a word?

  • How do we “understand” a word?

  • How can we know the relation/distance/similarity between words computationally?

55 of 86

Representing words as discrete symbols

  • Naïve way: represent words as atomic symbols: student, talk, university (BoW)

  • Represent a word as a “one-hot” vector, e.g. for the vocabulary (egg, student, talk, university, happy, buy, …):

university → [ 0  0  0  1  0  …  0 ]

  • How large is (what’s the dimension of) this vector?
    • Vector dimension = number of words in vocabulary 
      • PTB data: ~50k
      • Google 1T data: 13M
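A minimal sketch with a six-word toy vocabulary:

```python
vocab = ["egg", "student", "talk", "university", "happy", "buy"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = [0] * len(vocab)  # dimension = vocabulary size
    v[index[word]] = 1
    return v

print(one_hot("university"))  # [0, 0, 0, 1, 0, 0]

# Any two distinct words are orthogonal under this representation.
dot = sum(a * b for a, b in zip(one_hot("happy"), one_hot("buy")))
print(dot)  # 0
```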


56 of 86

Issues?

  • Dimensionality is large; vector is sparse

  • No similarity

   v_happy = [0  0  0  1  0 ... 0]
   v_sad   = [0  0  1  0  0 ... 0]
   v_milk  = [1  0  0  0  0 ... 0]

   v_happy · v_sad = v_happy · v_milk = 0

  • Cannot represent new words


57 of 86

Word meanings that can help decide:

  • Word Similarity 
    • Distributional (Vector) Models of Meaning

  • Word Relations

  • Word Sense Disambiguation

  • Semantic Roles

Distributional Hypothesis (Harris, 1954):

A word is characterized by the company it keeps. In other words, words that occur in the same contexts tend to have similar meanings.

58 of 86

Theoretical foundation of distributional semantics

Intuitions:  Zellig Harris (1954):

  • “oculist and eye-doctor … occur in almost the same environments”

  • “If A and B have almost identical environments we say that they are synonyms.”

59 of 86

Word-context matrix for word similarity

  • Two words are similar in meaning if their context vectors are similar

Note: Very sparse! (~ 50,000 x 50,000)
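A word–context matrix can be sketched as window-based counts over a toy corpus (the corpus and window size here are illustrative):

```python
from collections import Counter, defaultdict

corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]

window = 1  # context = immediate neighbors on either side
cooc = defaultdict(Counter)
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[word][sent[j]] += 1

print(dict(cooc["like"]))  # {'i': 2, 'deep': 1, 'nlp': 1}
```

Each row (e.g. `cooc["like"]`) is that word’s context vector; similar rows mean similar words.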

60 of 86

Word-word matrix

61 of 86

Problem with raw counts

62 of 86

Two classes of vector representation

  • Sparse vector representations
  • Mutual information weighted co-occurrence matrices

63 of 86

Pointwise Mutual Information

64 of 86

Pointwise Mutual Information
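The standard definition, PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ], can be computed directly from counts (toy numbers for illustration):

```python
import math

# Toy counts (illustrative): N total pairs, c_x and c_y marginal counts,
# c_xy the joint count of x and y occurring together.
N, c_x, c_y, c_xy = 100, 10, 20, 5

p_x, p_y, p_xy = c_x / N, c_y / N, c_xy / N
pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 3))  # log2(0.05 / 0.02) = log2(2.5) ≈ 1.322

# Positive PMI (PPMI) clips negatives, which are unreliable
# estimates from sparse counts.
ppmi = max(pmi, 0.0)
```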

65 of 86

Now how do we compute similarity?

66 of 86

We can use cosine similarity to quantify how close 2 word vectors are to each other

Courtesy of pyimagesearch

67 of 86

Measuring similarity: cosine

68 of 86

Cosine for computing similarity
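A self-contained sketch of the cosine measure, cos(v, w) = (v · w) / (|v| |w|), on toy count vectors (the numbers are illustrative):

```python
import math

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Toy co-occurrence count vectors over two context words.
cherry = [442, 8]
digital = [5, 1683]
pie = [432, 2]

print(round(cosine(cherry, pie), 3))      # high: similar contexts
print(round(cosine(cherry, digital), 3))  # low: different contexts
```

Because counts are non-negative, cosine here ranges from 0 (orthogonal) to 1 (identical direction), independent of vector length.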

69 of 86

Two classes of vector representation

  • Sparse vector representations
    • Mutual information weighted co-occurrence matrices
  • Dense vector representations
    • In the past, skip-grams or CBOW
    • More likely these days, pretrained transformers like BERT

Mutual Information I(X; Y): accounts for chance word co-occurrences by capturing how much knowing X decreases our uncertainty about Y

70 of 86

What’s ahead

71 of 86

Next Time…

There are now many types of embeddings:

https://huggingface.co/blog/mteb

72 of 86

Overview of the Class

Representation Learning, Semantics, Pragmatics

Scaling Language Modeling Pipelines

Evaluation

Ethics and Applications

Basics of Language Modeling

73 of 86

Goals of the Course

  • Deeply delve into modern NLP problems & solutions: 
  • Models, algorithms, and tools that are out there to solve language-related problems you want to tackle
  • How to design and evaluate your own task/model/algorithm/tool
    • Using data and statistics
    • Using your own creativity
    • Using the latest advances

  • At the end you should:
    • Agree that language is subtle & interesting 
    • Feel some ownership over the models/algorithms/tools
    • Understand research papers in the field

Spoiler:

Not AGI

74 of 86

Main Course Website

75 of 86

Weeks 1 and 2:

Sophie Klitgaard (UCLA)

76 of 86

In-class Paper

Presentation Guidelines

Presenting in groups of up to 3

(sign up using the spreadsheet linked on the website and Bruin Learn)

Each presentation is ~7 minutes with 3 minutes for Q&A

Select a peer-reviewed paper

(you can take a look at ACL Anthology or other conference proceedings)

77 of 86

Weeks 3 and 4:

78 of 86

Weeks 5-8:

John Thickstun (Cornell)

Alisa Liu (UW)

79 of 86

Weeks 9 and 10

4 days of in-person final project presentations

Students will individually provide feedback for other project groups

80 of 86

If you are registered for the course and you are not added by tomorrow night, email me.

Piazza

For general course and homework discussion.

81 of 86

Bruin Learn

The website will be up by next Monday.

This is primarily for submitting homework assignments and viewing grades.

82 of 86

Grading

83 of 86

Cheating

  • You MAY
    • talk with other students, friends, or others about your homework assignments IF you acknowledge such discussion in your submission
    • ask questions about the homework and subject material in the forums

  • You MAY NOT
    • copy code or answers from any source including friends, homework/test services, NLP or other software libraries. This includes making slight changes to previously written code
    • find solutions to these problems online
    • share solutions to these problems online
    • hack the scoring servers
    • allow your code to be copied, even if unintentionally
    • attempt to communicate with or read from any other person or device while taking exams

84 of 86

Cheating

  • Unfortunately, about 5% of students will be caught cheating, based on previous experience.

  • Suspected cheaters (including those whose work was copied) will be reported to the University. Punishment includes but is not limited to:
    • Zero on assignment, exam, or class
    • Loss of career services privileges
    • Loss of CPT rights

85 of 86

Prerequisites

  • This is an advanced graduate-level course, so it is expected you have prior knowledge of machine learning. While we review basic NLP concepts, this is faster-paced than an undergraduate NLP course.

  • I expect you to program (in Python) at the level of a CS first-year PhD or better.

  • I expect you to know or be comfortable learning PyTorch, which is a deep learning framework based on Python.

The course draws on basic probability, statistics, linear algebra, and machine learning (CM146, CS260 or equivalent), which will be reviewed as needed.

86 of 86

Questions?