Dialogue systems
and Transformers
Williams College
Spring 2023
CSCI 375: Natural Language Processing
Announcements
HW 5 is the last assignment as individuals
Tips:
Groups for final project (and HW 6)
Let me know any potential issues by this Wednesday.
Final Projects
All info on the project tab of the class website
First half of the semester:
Second half of the semester:
Final project learning objectives
Bonus: NLP is changing so fast. This ensures we’re spending the second half of the class on relevant topics.
Final project outcomes
Goal of “research”: Try to find a place no one has been before.
Expansion of human knowledge.
Fill out by Sunday
“Topic” scope
Ideas are not created in a vacuum…
Places to start if you’re looking for an NLP topic
Browse potential topics from ACL workshops
Linguistic, e.g., semantic evaluation
Applications, e.g., NLP for Educational Applications
Machine learning, e.g., representation learning for NLP
“Flash talks” of 4 ideas
AfriSenti-SemEval Shared Task 12
Linguistic focus
Many open NLP problems for languages that are not English!
Opportunity to use your liberal arts education!
Shared tasks
NLP model distillation and “on-device” models
Machine learning focus
Scientific Information Extraction
NLP for Computational Social Science
Computational measure of framing: how topics are discussed can influence the way audiences understand them
Logistics
“Tutorial” style small group meetings
Please let me know by Friday if you have conflicts with May 11.
Grading
Final report graded on a rubric
All deliverables except the final report are graded all-or-nothing on completion.
You’ll receive qualitative feedback to improve your final report.
Dialogue systems
J&M Textbook
This week
Other application chapters could potentially make great project topics!
Dialogue systems
Back to Deep Learning!
Definitions
What has driven recent interest in dialogue systems?
Commercial!
Amazon’s Alexa
Apple’s Siri
Google Assistant
Historical motivations: Operationalizing “intelligence”
Alan Turing (1912-1954)
Turing Test (1950)
Definitions
Review: Joseph Weizenbaum and ELIZA (1966)
1923-2008
Professor at MIT
ELIZA: Rule-based program that resembled a Rogerian psychologist (a type of counseling in which the therapist is nondirective but supportive)
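A minimal ELIZA-style sketch in Python. The patterns and responses below are illustrative assumptions, not Weizenbaum’s actual script; they just show the pattern-match-and-reflect idea:

```python
import re
import random

# Hypothetical ELIZA-style rules: regex pattern -> response templates.
# \1 is filled with the captured text, mimicking Rogerian "reflection".
RULES = [
    (r"I need (.*)",      [r"Why do you need \1?", r"Would it really help you to get \1?"]),
    (r"I am (.*)",        [r"How long have you been \1?", r"Why do you think you are \1?"]),
    (r"(.*) mother(.*)",  [r"Tell me more about your family."]),
]

def eliza_respond(user_input: str) -> str:
    for pattern, responses in RULES:
        match = re.match(pattern, user_input, re.IGNORECASE)
        if match:
            # expand() substitutes \1, \2, ... with the captured groups
            return match.expand(random.choice(responses))
    return "Please go on."          # fallback when no rule matches

print(eliza_respond("I need a vacation"))   # e.g. "Why do you need a vacation?"
```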
PARRY (1971)
First known system to pass the Turing test (in 1972!)
Definitions
You’ll implement this approach in Homework 6!
Frame-based dialogue systems (J&M Section 15.3)
Example frame: travel
Other frames: car or hotel reservations, calling another person
Old system (1977!) but frame-based approach still used in modern applications
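A minimal sketch of frame-based slot filling for a travel frame, in the spirit of HW 6. The slot names, regexes, and questions are illustrative assumptions:

```python
import re

# Hypothetical travel frame: slot name -> regex that extracts a filler.
TRAVEL_FRAME = {
    "origin":      r"from ([A-Z][a-z]+)",
    "destination": r"to ([A-Z][a-z]+)",
    "date":        r"on ([A-Za-z]+ \d{1,2})",
}

# Questions the dialogue manager asks for any slot that is still empty.
SLOT_QUESTIONS = {
    "origin": "Where are you flying from?",
    "destination": "Where would you like to go?",
    "date": "What day would you like to travel?",
}

def update_frame(utterance: str, frame: dict) -> dict:
    """Fill any empty slots whose pattern matches the user's utterance."""
    for slot, pattern in TRAVEL_FRAME.items():
        match = re.search(pattern, utterance)
        if match and frame.get(slot) is None:
            frame[slot] = match.group(1)
    return frame

frame = {slot: None for slot in TRAVEL_FRAME}
frame = update_frame("I want to fly from Boston to Denver", frame)
missing = [s for s, v in frame.items() if v is None]
print(frame)   # {'origin': 'Boston', 'destination': 'Denver', 'date': None}
print(SLOT_QUESTIONS[missing[0]] if missing else "Booking your trip!")
```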
2017 “Alexa Prize”: $2.5 million challenge
Task: Build a conversational agent to converse coherently with humans on popular topics (e.g. sports or politics) for 20 minutes
Tested with real Alexa users
2017! Rapid change in NLP state of the art in ~6 years
UW’s “Sounding Board” team won
Used frame-based approach
(what they call a “hierarchical dialogue manager”)
Frame-based dialogue systems
Other example modules:
Think-pair-share 1
Brainstorm some advantages and disadvantages of modular dialogue systems like these
Weds – Transformers
Announcements
Review: Deep Learning for NLP – modern tech stack
Python and numpy
Pytorch
transformers (Hugging Face)
You implement gradients and computation graph
Modular pieces of a deep learning architecture with gradients automatically calculated
NLP models pre-trained on existing collections of text documents
Review: Dialogue systems
Advantages: Interpretability, Ease of Evaluation
Disadvantages: Low generalization, “top down” design
Review: Dialogue systems
This lecture!
End-to-end
Real-valued vector
Two running applications in this lecture
Machine translation
Dialogue systems (aka, question-answering, prompt-response)
Stacked encoders and decoders
Figure credit: Jay Alammar
Number of decoder blocks distinguishes models
Figure credit: Jay Alammar
Review: Fixed-window language model
e.g. fixed window = 4
Feed-forward deep learning model
Thought experiment
Want hidden layer output to be the same size as the input so we can stack these in our encoder-decoder architecture.
Input layer
(“unravelled” into 1-d vector)
Size: Num. tokens x embedding dimension
Single hidden layer
Weights applied linearly to input + non-linear activation
Think-pair-share 1
In the hidden layer whose output is the same size as the input, what is the size of the weights (number of model parameters that will be trained by SGD)?
Calculate for GPT model (variant):
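A back-of-envelope calculation for the think-pair-share, assuming GPT-2-small-style sizes (context window 1024, embedding dimension 768); the exact numbers depend on which GPT variant you pick:

```python
# Parameter count for the thought experiment: a single dense hidden layer
# whose input and output are both the "unravelled" window
# (num_tokens * embedding_dim values).
num_tokens = 1024        # context window length (assumed)
embedding_dim = 768      # embedding dimension (assumed)

input_size = num_tokens * embedding_dim      # 786,432
weight_params = input_size * input_size      # square weight matrix

print(f"input size:    {input_size:,}")
print(f"weight params: {weight_params:,}")   # ~6.2e11 parameters for ONE layer
```

The point of the exercise: a fully connected layer over the whole unravelled window blows up quadratically, which is part of why we need a different architecture.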
Feed-forward deep learning model
Transformers
No, not these kinds of transformers :)
Generative pretrained transformer (GPT) type of transformer
Citation count is a proxy for what the research community deems important
Transformers
Self-attention intuition
In this example, how do you know what “its” refers to?
“A robot must protect its own existence”
Attention intuition: Machine Translation alignment
Attention high-level goal: Compare a token of interest to other tokens to reveal what is relevant in context.
More white = greater attention score
J&M Fig. 10.1
Attends to each previous element (but not future elements).
Each time step independent of other time steps so can be performed in parallel
Transformers were a response to computational inefficiencies of recurrent neural networks (RNNs)
Figure credit: Chris Olah
Represent prior context with recurrent connections
Won’t go in-depth
Can’t parallelize training. Inefficient
Self-attention board work
Suppose we’ve already trained the model and are now doing inference for y3 (see the sketch below).
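A minimal numpy sketch of the board work: computing y3 by attending over x1..x3. The weights here are random stand-ins for trained parameters, just to make the shapes concrete:

```python
import numpy as np

d = 4                                  # toy model / embedding dimension
rng = np.random.default_rng(0)

# Toy "trained" weights (random here, only to make the shapes concrete).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(3, d))            # x1, x2, x3: inputs seen so far

# Computing y3: compare x3's query against the keys of x1..x3.
q3 = x[2] @ W_q                        # query for position 3
K  = x @ W_k                           # keys for positions 1..3
V  = x @ W_v                           # values for positions 1..3

scores = K @ q3 / np.sqrt(d)                      # scaled dot-product scores
alpha  = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
y3     = alpha @ V                                # weighted sum of the values
print(alpha, y3)
```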
Self-attention: efficient, vectorized version
Q: Why are we setting these to negative infinity?
Why is this important?
J&M Fig 10.3
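A minimal numpy sketch of the efficient, vectorized version with a causal mask. The answer to the question on the slide: after the softmax, the -inf scores become exactly zero attention weight, so no position can attend to the future:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Vectorized self-attention over all positions at once, with a causal
    mask: scores for future positions are set to -inf so that, after the
    softmax, they receive exactly zero attention weight."""
    n, d = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) score matrix
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores[mask] = -np.inf                              # block attention to the future
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V
```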
Important b/c different words can relate to each other in many ways simultaneously (e.g. syntactic, semantic, discourse relations)
J&M Fig 10.5
Each “head” has unique set of key, query and value weight matrices
Stacked many times over
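A sketch of multi-head attention built on plain (unmasked) scaled dot-product attention; the per-head weight triples and the output projection W_o are toy assumptions:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Plain (unmasked) scaled dot-product self-attention for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads, W_o):
    """Each head has its OWN (W_q, W_k, W_v) triple; head outputs are
    concatenated and then mixed by an output projection W_o."""
    head_outputs = [attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_o
```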
Residual connection: Pass info from lower layer to higher layer without going through intermediate layer
Layer normalization
In statistics, “z-scores”
Weights learned by the model
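A minimal sketch of layer normalization and a residual connection; gamma and beta are the weights learned by the model:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to mean 0, std 1 (a "z-score"), then
    rescale and shift with the learned weights gamma and beta."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def transformer_sublayer(x, sublayer, gamma, beta):
    """Residual connection: add the sublayer's input back to its output
    (info passes from the lower layer without going through the
    intermediate layer), then layer-normalize."""
    return layer_norm(x + sublayer(x), gamma, beta)
```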
Stacked encoders and decoders
Figure credit: Jay Alammar
Notes on inputs
Review from week 1: Tokenization for GPT-3, GPT-2… most likely ChatGPT
In practice, the vocabulary is not characters but bytes (recall 8 bits=1 byte).
BPE Algorithm Inputs
Input: Embedding look-up
Figure credit: Jay Alammar
Must truncate and pad
(like HW5)!
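A sketch using the Hugging Face GPT-2 tokenizer to show byte-level BPE plus truncation and padding; max_length=8 is just for illustration, and since GPT-2 has no pad token by default we reuse the EOS token:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token    # GPT-2 has no pad token by default

print(tokenizer.tokenize("A robot must protect its own existence"))
# Byte-level BPE subword tokens, e.g. ['A', 'Ġrobot', 'Ġmust', ...]

batch = tokenizer(
    ["A robot must protect its own existence", "Hi"],
    padding="max_length", truncation=True, max_length=8,
)
print(batch["input_ids"])   # IDs that index rows of the embedding look-up table
```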
Input: Positional encodings
Tells the order of the sequence to the model.
Initialized randomly and then learned by the model
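A sketch of learned positional encodings added to the token embeddings; the sizes are GPT-2-style assumptions and the IDs are toy values:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50257, 1024, 768   # assumed GPT-2-style sizes
token_emb = nn.Embedding(vocab_size, d_model)     # one vector per vocabulary item
pos_emb = nn.Embedding(max_len, d_model)          # one vector per position,
                                                  # randomly initialized, then learned

input_ids = torch.tensor([[32, 9379, 1276]])      # (batch=1, seq_len=3), toy IDs
positions = torch.arange(input_ids.size(1))       # [0, 1, 2]
x = token_emb(input_ids) + pos_emb(positions)     # order information is added in
print(x.shape)                                    # torch.Size([1, 3, 768])
```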
Autoregression
Figure credit: Jay Alammar
After each token is generated, that token is added to the sequence of inputs.
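A sketch of autoregressive (greedy) generation with the Hugging Face GPT-2 model; each generated token is appended to the input before the next step:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("A robot must", return_tensors="pt")["input_ids"]
with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids).logits          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()          # greedy: most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(input_ids[0]))
```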
Transformers
Friday
Katie: Record on Zoom
Announcements
Topics
Review: Dialogue system paradigms
Rule-based
Frame-based
End-to-end
Examples: ELIZA (1966), PARRY (1971)
Examples: UW “Sounding Board” (Alexa Prize winner 2017)
Examples: ChatGPT (2022)
Review: Sequence-to-sequence
Machine translation
Dialogue systems (aka, question-answering, prompt-response)
Trained “end to end”
Review: Self-attention intuition
A robot must protect its own existence
Review: Self-attention
Each time step independent of other time steps so can be performed in parallel
Important b/c different words can relate to each other in many ways simultaneously (e.g. syntactic, semantic, discourse relations)
J&M Fig 10.5
Each “head” has unique set of key, query and value weight matrices
Example from Vaswani et al. 2017
Learned self-attention weights for Layer 5 (of 6)
Vaswani et al. 2017, Fig 4
Isolating “its”
Vaswani et al. 2017, Fig 5
Different “heads” learn different relations between tokens
Review: Stacked encoders and decoders
Figure credit: Jay Alammar
Varies by model
Figure credit: Jay Alammar
Only up to current item in sequence
Different models will put masked vs unmasked self-attention in decoders
Supervised Models
Training data, e.g. class labels (y), typically come from humans
Self-Supervised Models
Training data, e.g. class labels (y), comes from raw data (no humans!)
Variants on self-supervision for training
Masked language modeling
(bidirectional)
“Causal” language modeling
(unidirectional)
poor terminology choice by community
Figure credit: Prakhar Mishra
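A toy illustration contrasting the two objectives on the same sentence (the masked positions are arbitrary):

```python
sentence = ["a", "robot", "must", "protect", "its", "own", "existence"]

# Masked LM (bidirectional): hide some tokens, predict them from BOTH sides.
mlm_input  = ["a", "robot", "[MASK]", "protect", "its", "[MASK]", "existence"]
mlm_target = {2: "must", 5: "own"}

# "Causal" LM (unidirectional): predict each token from the tokens BEFORE it.
clm_input  = sentence[:-1]        # a robot must protect its own
clm_target = sentence[1:]         # robot must protect its own existence
```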
Training for unidirectional LMs
Teacher forcing = during training, at each time step in decoding we force the system to use the gold target token from the training data as the next input
Negative log probability the model assigns to the next word in the training sequence
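A sketch of one teacher-forced training step for a unidirectional LM, using the Hugging Face GPT-2 model; the token IDs are toy values:

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Teacher forcing: at every position the model is fed the GOLD previous
# tokens (not its own predictions) and is scored on the gold next token.
gold = torch.tensor([[32, 9379, 1276, 1805, 663]])   # toy gold token IDs
inputs, targets = gold[:, :-1], gold[:, 1:]          # shift by one position

logits = model(inputs).logits                        # (1, seq_len, vocab_size)
loss = F.cross_entropy(                              # mean negative log probability
    logits.reshape(-1, logits.size(-1)),             # assigned to each gold next token
    targets.reshape(-1),
)
loss.backward()                                      # gradients for SGD
```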
Review: Lots of compute!
Floating point operations (e.g. +, -, *, /)
Apple M1 Pro 16-Core-GPU = 5.3 x 10^12 FLOPS
GPT-3 total train compute =
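A rough back-of-envelope, assuming the commonly cited ~3.14 x 10^23 total training FLOPs for GPT-3 and the M1 Pro figure above:

```python
# How long would GPT-3's training compute take on a single M1 Pro GPU?
gpt3_flops = 3.14e23            # assumed total training compute (FLOPs)
m1_flops_per_sec = 5.3e12       # from the slide above
seconds = gpt3_flops / m1_flops_per_sec
print(seconds / (60 * 60 * 24 * 365))   # roughly 1,900 years
```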
Synthesis & reflection activity
Midterm
Indicates to me this test was doable.
4.2: 70% mean (lowest)
[Diagram: what tests measure beyond the grade — concept understanding, problem solving, and performing under time pressure (which exists in jobs and/or grad school)]
Gentle reminder: Your grade is neither a reflection of your worth nor your ability.
(Optional) Extra Credit Midterm Revisions
Encouraged to talk to classmates & Katie
Go over together
Midterm guide
Lecture, Slide 54