1 of 101

Dialogue systems

and Transformers

Williams College

Spring 2023

CSCI 375: Natural Language Processing

2 of 101

Announcements

  • Midterm returned on Wednesday after lecture
  • HW 5 released
    • Long and potentially challenging
  • Late days: last opportunity with HW5
    • Syllabus corrections: No HW7, only HW6 (last homework)

HW 5 is the last assignment as individuals

Tips:

  • Start early (i.e. today)
  • Review the lecture materials
  • Work with friends on the pseudocode before you begin typing.

3 of 101

Groups for final project (and HW 6)

Let me know any potential issues by this Wednesday.

4 of 101

Final Projects

All info on the project tab of the class website

5 of 101

First half of the semester:

  • Breadth
  • Individual work
  • Learn and practice foundational NLP concepts
    • Rule-based systems (regular expressions)
    • Naive Bayes
    • Logistic Regression
    • Word embeddings
    • Deep learning
    • One popular application: Dialogue systems

Second half of the semester:

  • Depth
  • Group work
  • Deep and sustained creative problem-solving to build complex projects around one topic

6 of 101

Final project learning objectives

  • Apply your foundational knowledge in NLP to a new area that you are excited and curious about!
  • Wade through complexity
  • Write complex code
  • Scratch the surface of what “creative problem-solving” is for a technical field like NLP

Bonus: NLP is changing so fast. This ensures we’re spending the second half of the class on relevant topics.

7 of 101

Final project outcomes

  • Outcome 1: Create an NLP project (code + data + report) that can become part of your professional portfolio
  • Outcome 2: Become a “class expert” on a particular topic
  • Outcome 3: Produce a less-extensive version of a “workshop paper” in NLP
    • Smaller in scope than a full-fledged research paper
    • Negative results ok

8 of 101

Goal of “research”: Try to find a place no one has been before.

Expansion of human knowledge.

9 of 101

Fill out by Sunday

10 of 101

“Topic” scope

  • Equivalent to a textbook chapter or a published review/survey paper
    • Example 1: Machine Translation (J&M Ch. 13)
    • Example 2: Social Bias in NLP systems (survey paper by Blodgett et al. 2020)

11 of 101

Ideas are not created in a vacuum…

12 of 101

Places to start if you’re looking for an NLP topic

  • Other chapters in J&M
  • Other chapters in other NLP textbooks
    • Jacob Eisenstein’s textbook
  • ACL, EMNLP and NAACL are the main academic conferences in NLP
    • Look at their keynote speakers
    • Look at their list of workshops
      • Workshop = smaller community within NLP working on the same (narrower) topic

13 of 101

Browse potential topics from ACL workshops

Linguistics, e.g., semantic evaluation

Applications, e.g., NLP for Educational Applications

Machine learning, e.g., representation learning for NLP

14 of 101

“Flash talks” of 4 ideas

15 of 101

AfriSenti-SemEval Shared Task 12

  • Website and data
  • According to UNESCO (2003), 30% of all living languages, around 2,058, are African languages. However, most of these languages do not have curated datasets for developing such AI applications.
  • Task: Sentiment Classification in one of the following languages:
    • Track 1: Hausa
    • Track 2: Yoruba
    • Track 3: Igbo
    • Track 4: Nigerian Pidgin
    • Track 5: Amharic
    • Track 6: Algerian Arabic
    • Track 7: Moroccan Arabic/Darija
    • Track 8: Swahili
    • Track 9: Kinyarwanda
    • Track 10: Twi
    • Track 11: Mozambican Portuguese
    • Track 12: Xitsonga (Mozambique Dialect)

Linguistic focus

16 of 101

Many open NLP problems for languages that are not English!

Opportunity to use your liberal arts education!

17 of 101

Shared tasks

  • Solicit new methods for a clearly specified goal and evaluation
  • Task defined and data provided (usually a hard part of research)
  • Organizers typically hold out the test set
  • Central to “progress” in NLP

18 of 101

NLP model distillation and “on-device” models

  • CS 11-767: On-Device Machine Learning, taught at CMU
  • Broader Goal: build, train, and deploy models that can run on low-power devices (e.g. smartphones, refrigerators, and mobile robots)

Machine learning focus

19 of 101

Scientific Information Extraction

  • Website and data
  • Task: Given several scientific sentences, extract the spans of tokens that denote the population, intervention and outcome
  • Important for meta-analysis of science

20 of 101

NLP for Computational Social Science

  • Paper and data
  • Data: They collected 38M+ posts from Russian media outlets on Twitter and VKontakte (Russian social media) immediately preceding and during the 2022 Russia-Ukraine war
  • Many open questions with this dataset
  • One of the author’s main findings:

Computational measure of framing = how those topics are discussed can influence the way audiences understand them

21 of 101

Logistics

22 of 101

“Tutorial” style small group meetings

Please let me know by Friday if you have conflicts with May 11.

23 of 101

Grading

Final report graded on a rubric

All deliverables except the final report are graded all-or-nothing on completion.

You’ll receive qualitative feedback to improve your final report.

24 of 101

25 of 101

Dialogue systems

26 of 101

J&M Textbook

This week

Other application chapters could potentially make great project topics!

27 of 101

Dialogue systems

  • Today: “historical” approaches
  • Rest of the week: towards “modern” approaches

Back to Deep Learning!

28 of 101

Definitions

  • “Dialogue systems” also known as:
    • Conversational agents, Dialogue agents, Chatbots
  • Types:
    • Task-oriented dialogue systems: help user complete a task
    • Chatbots: “systems that can carry on extended conversations with the goal of mimicking the unstructured conversations” (often for entertainment value)
    • Hybrids
  • Approaches:
    • Rule-based
    • Frame-based systems: statistical but modular
    • “End-to-end” (deep learning)

29 of 101

What has driven recent interest in dialogue systems?

Commercial!

Amazon’s Alexa

Apple’s Siri

Google Assistant

30 of 101

Historical motivations: Operationalizing “intelligence”

Alan Turing (1912-1954)

Turing Test (1950)

  • Can a human judge distinguish between a machine and a human based on their responses?

31 of 101

Definitions

  • “Dialogue systems” also known as:
    • Conversational agents, Dialogue agents, Chatbots
  • Types:
    • Task-oriented dialogue systems: help user complete a task
    • Chatbots: “systems that can carry on extended conversations with the goal of mimicking the unstructured conversations” (often for entertainment value)
    • Hybrids
  • Approaches:
    • Rule-based
    • Frame-based systems: statistical but modular
    • “End-to-end” (deep learning)

32 of 101

Review: Joseph Weizenbaum and ELIZA (1966)

1923-2008

Professor at MIT

ELIZA: Rule-based program that resembled a Rogerian psychologist (a type of counseling in which the therapist is nondirective but supportive)
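
To make the rule-based idea concrete, here is a minimal sketch of an ELIZA-style rule in Python. The patterns and responses are invented for illustration and are not Weizenbaum’s actual script: a regular expression matches part of the user’s utterance and a template reflects it back as a nondirective question.

```python
import re

# Invented ELIZA-style rules (not Weizenbaum's actual script): each rule pairs
# a regex pattern with a response template that reuses the matched text.
rules = [
    (re.compile(r".*\bI am (.*)", re.IGNORECASE), r"Why do you say you are \1?"),
    (re.compile(r".*\bI feel (.*)", re.IGNORECASE), r"How long have you felt \1?"),
]

def respond(utterance):
    for pattern, template in rules:
        match = pattern.match(utterance)
        if match:
            return match.expand(template)
    return "Please tell me more."  # nondirective fallback, in the Rogerian spirit

print(respond("I am stressed about the midterm"))
# -> Why do you say you are stressed about the midterm?
```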

33 of 101

PARRY (1971)

  • Simulated a schizophrenic patient to help train psychiatrists
  • ELIZA-like regular expressions + rule-based processing of internal system affect (emotions)

First known system to pass the Turing test (in 1972!)

34 of 101

Definitions

  • “Dialogue systems” also known as:
    • Conversational agents, Dialogue agents, Chatbots
  • Types:
    • Task-oriented dialogue systems: help user complete a task
    • Chatbots: “systems that can carry on extended conversations with the goal of mimicking the unstructured conversations” (often for entertainment value)
    • Hybrids
  • Approaches:
    • Rule-based
    • Frame-based systems: statistical but modular
    • “End-to-end” (deep learning)

You’ll implement this approach in Homework 6!

35 of 101

Frame-based dialogue systems (J&M Section 15.3)

  • Domain ontology
    • Frames > Slots > Values

Example frame: travel

Other frames: car or hotel reservations, calling another person
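
A minimal sketch of the frame > slot > value idea in Python, using hypothetical slot names for the travel frame; a real system would fill slots by parsing user utterances rather than by hand.

```python
# Hypothetical travel frame: slot names map to values; None means "not yet filled".
travel_frame = {
    "origin_city": None,
    "destination_city": None,
    "departure_date": None,
}

# Toy dialogue policy: ask about the first unfilled slot.
prompts = {
    "origin_city": "What city are you leaving from?",
    "destination_city": "Where would you like to go?",
    "departure_date": "What day do you want to leave?",
}

def next_question(frame):
    for slot, value in frame.items():
        if value is None:
            return prompts[slot]
    return None  # all slots filled; hand off to the next module (e.g. booking)

travel_frame["destination_city"] = "Boston"
print(next_question(travel_frame))  # -> What city are you leaving from?
```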

36 of 101

Old system (1977!) but frame-based approach still used in modern applications

37 of 101

2017 “Alexa Prize”: 2.5 Million dollar challenge

Task: Build a conversational agent to converse coherently with humans on popular topics (e.g. sports or politics) for 20 minutes

Tested with real Alexa users

2017! Rapid change in the NLP state of the art over ~6 years

38 of 101

UW’s “Sounding Board” team won

Used frame-based approach

(what they call a “hierarchical dialogue manager”)

39 of 101

Frame-based dialogue systems

Other example modules:

  • Intent detection/classification
  • Domain detection/classification
  • Error handling: high versus low confidence cases

Think-pair-share 1

Brainstorm some advantages and disadvantages of modular dialogue systems like these

40 of 101

Weds – Transformers

41 of 101

Announcements

  • Gentle reminder: HW5 due Friday, 3:59pm ET
  • Please fill out the two project forms by Sunday
  • Office hours this week
    • Katie: Wednesday, 1-3pm, TPL 207
    • Thomas: Wednesday, 7-9pm, TPL 207
    • Katie: Thursday, 1-3pm, TPL 207
    • Rachel: Thursday, 7:45-9:45pm, TPL 207
  • Midterm return shifted to Friday

42 of 101

Review: Deep Learning for NLP – modern tech stack

Python and numpy

PyTorch

transformers (Hugging Face)

You implement gradients and computation graph

Modular pieces of a deep learning architecture with gradients automatically calculated

NLP models pre-trained on existing collections of text documents

43 of 101

Review: Dialogue systems

  • “Dialogue systems” also known as:
    • Conversational agents, Dialogue agents, Chatbots
  • Types:
    • Task-oriented dialogue systems: help user complete a task
    • Chatbots: “systems that can carry on extended conversations with the goal of mimicking the unstructured conversations” (often for entertainment value)
    • Hybrids
  • Approaches:
    • Rule-based
    • Frame-based systems: statistical but modular
    • “End-to-end” (deep learning)

Advantages: Interpretability, Ease of Evaluation

Disadvantages: Low generalization, “top down” design

44 of 101

Review: Dialogue systems

  • “Dialogue systems” also known as:
    • Conversational agents, Dialogue agents, Chatbots
  • Types:
    • Task-oriented dialogue systems: help user complete a task
    • Chatbots: “systems that can carry on extended conversations with the goal of mimicking the unstructured conversations” (often for entertainment value)
    • Hybrids
  • Approaches:
    • Rule-based
    • Frame-based systems: statistical but modular
    • “End-to-end” (deep learning)

This lecture!

45 of 101

End-to-end

  • Definition: training regimen that uses the loss from a downstream application to adjust the weights all the way through the network
  • (not “modular”)
  • Most common architectures:
    • Sequence-to-sequence
    • Encoder-decoder

Real-valued vector

46 of 101

Two running applications in this lecture

Machine translation

Dialogue systems (aka, question-answering, prompt-response)

47 of 101

Stacked encoders and decoders

Figure credit: Jay Alammar

48 of 101

Number of decoder blocks distinguishes models

Figure credit: Jay Alammar

49 of 101

Review: Fixed-window language model

e.g. fixed window = 4

Feed-forward deep learning model

50 of 101

Thought experiment

Want hidden layer output to be the same size as the input so we can stack these in our encoder-decoder architecture.

Input layer

(“unravelled” into 1-d vector)

Size: Num. tokens x embedding dimension

Single hidden layer

Weights applied linearly to input + non-linear activation

Think-pair-share 1

In the hidden layer whose output is the same size as the input, what is the size of the weights (number of model parameters that will be trained by SGD)?

Calculate for GPT model (variant):

  • Input tokens = 512
  • Embedding dimension = 768

Feed-forward deep learning model
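
One way to set up the calculation, as a sketch that ignores bias terms: the unravelled input has 512 × 768 entries, and a fully connected hidden layer whose output is the same size needs a square weight matrix over that input.

```python
# Back-of-the-envelope parameter count for the thought experiment (ignoring biases).
num_tokens = 512
embedding_dim = 768

input_size = num_tokens * embedding_dim   # 393,216 entries after unravelling
weight_params = input_size ** 2           # square weight matrix: input -> same-size output
print(f"{input_size:,}")                  # 393,216
print(f"{weight_params:,}")               # 154,618,822,656 (~1.5 x 10^11)
```

That single fully connected layer alone would need on the order of 10^11 weights, which is part of the motivation for the more parameter-efficient transformer architecture introduced next.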

51 of 101

Transformers

No, not these kinds of transformers :)

Generative Pre-trained Transformer (GPT) type of transformer

52 of 101

Citation count is a proxy for what the research community deems important

53 of 101

Transformers

  • Proposes new pieces of the neural architectures
    • Self-attention
    • Positional encodings
  • Motivations:
    • Fewer parameters than feed-forward networks
    • Easily parallelizable (computationally efficient)
    • Can model long-range dependencies

54 of 101

Self-attention intuition

In this example, how do you know what “its” refers to?

“A robot must protect its own existence”

55 of 101

Attention intuition: Machine Translation alignment

Attention high-level goal: Compare a token of interest to other tokens to reveal what is relevant in context.

More white = greater attention score

56 of 101

J&M Fig. 10.1

Attends to each previous element (but not future elements).

Each time step is independent of the other time steps, so they can be computed in parallel

57 of 101

Transformers were a response to computational inefficiencies of recurrent neural networks (RNNs)

Figure credit: Chris Olah

Represent prior context with recurrent connections

Won’t go in-depth

Can’t parallelize training. Inefficient

58 of 101

Self-attention board work

59 of 101

60 of 101

Suppose we’ve already trained the model and are now doing inference for y3.

61 of 101

Self-attention: efficient, vectorized version

Q: Why are we setting these to negative infinity?

Why is this important?

J&M Fig 10.3
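
A minimal numpy sketch of the masked (causal) self-attention computation, with toy dimensions, a single head, and random stand-ins for the learned weight matrices:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head masked self-attention over one sequence (toy sketch)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)                    # (seq_len, seq_len)
    # Set future positions to -inf so the softmax gives them zero weight
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))                  # toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(causal_self_attention(X, W_q, W_k, W_v).shape)     # (5, 8)
```

Because the whole sequence is handled with a few matrix multiplications, every time step’s output can be computed in parallel during training, unlike in an RNN.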

62 of 101

Important b/c different words can relate to each other in many ways simultaneously (e.g. syntactic, semantic, discourse relations)

J&M Fig 10.5

Each “head” has a unique set of key, query, and value weight matrices

63 of 101

Stacked many times over

Residual connection: Pass info from lower layer to higher layer without going through intermediate layer

64 of 101

Layer normalization

In statistics, “z-scores”

Weights learned by the model

65 of 101

Stacked encoders and decoders

Figure credit: Jay Alammar

66 of 101

Notes on inputs

67 of 101

Review from week 1: Tokenization for GPT-3, GPT-2… most likely ChatGPT

In practice, the vocabulary is not characters but bytes (recall 8 bits=1 byte).

BPE Algorithm Inputs
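
A minimal sketch of one BPE training step on a toy, character-level corpus (GPT-2-style tokenizers run the same idea over bytes): count adjacent symbol pairs, merge the most frequent one, and repeat to build the merge list used at tokenization time.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: (word split into symbols) -> frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "g"): 3}
pair = most_frequent_pair(corpus)   # ('l', 'o'), which occurs 10 times
corpus = merge_pair(corpus, pair)
print(pair, corpus)                 # {('lo', 'w'): 5, ('lo', 'w', 'e', 'r'): 2, ('lo', 'g'): 3}
```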

68 of 101

Input: Embedding look-up

Figure credit: Jay Alammar

69 of 101

Must truncate and pad

(like HW5)!
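
A minimal sketch of preparing inputs for a fixed-length model: truncate or pad every sequence to the same length, then look up each id in the embedding table. The length, pad id, and dimensions below are toy values.

```python
import numpy as np

max_len = 8          # the model's fixed input length (toy value)
pad_id = 0           # hypothetical id reserved for padding
vocab_size, d_model = 100, 4
embedding_table = np.random.default_rng(0).normal(size=(vocab_size, d_model))

def prepare(token_ids):
    """Truncate long sequences, pad short ones, then look up embeddings."""
    ids = token_ids[:max_len]                      # truncate
    ids = ids + [pad_id] * (max_len - len(ids))    # pad on the right
    return embedding_table[ids]                    # shape: (max_len, d_model)

print(prepare([17, 42, 3]).shape)   # (8, 4)
```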

70 of 101

Input: Positional encodings

Tells the order of the sequence to the model.

Initialized randomly and then learned by the model
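
A minimal sketch (toy dimensions) of learned positional encodings: a second embedding table, indexed by position, is added to the token embeddings, so the same token id produces a different input vector at each position.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model, vocab_size = 8, 4, 100     # toy sizes

token_embeddings = rng.normal(size=(vocab_size, d_model))
# One vector per position, initialized randomly; in a real model these are
# updated by SGD along with every other parameter.
position_embeddings = rng.normal(size=(max_len, d_model))

token_ids = [17, 42, 3]
positions = np.arange(len(token_ids))
X = token_embeddings[token_ids] + position_embeddings[positions]
print(X.shape)   # (3, 4)
```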

71 of 101

72 of 101

Autoregression

Figure credit: Jay Alammar

After each token is generated, that token is added to the sequence of inputs.
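
A minimal sketch of the autoregressive loop, using greedy decoding and a hypothetical next_token_distribution stand-in for a trained decoder:

```python
import numpy as np

def generate(prompt_ids, next_token_distribution, eos_id, max_new_tokens=20):
    """Greedy autoregressive decoding: feed each generated token back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(ids)   # distribution over the vocabulary
        next_id = int(np.argmax(probs))        # greedy choice (no sampling)
        ids.append(next_id)                    # generated token becomes a new input
        if next_id == eos_id:
            break
    return ids

vocab_size, eos_id = 10, 9
def toy_model(ids):
    # Toy stand-in for a trained decoder: always prefers (last id + 1) mod vocab_size
    probs = np.zeros(vocab_size)
    probs[(ids[-1] + 1) % vocab_size] = 1.0
    return probs

print(generate([2, 3], toy_model, eos_id))   # [2, 3, 4, 5, 6, 7, 8, 9]
```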

73 of 101

Transformers

  • Proposes new pieces of the neural architectures
    • Self-attention
    • Positional encodings
  • Motivations:
    • Fewer parameters than feed-forward networks
    • Easily parallelizable (computationally efficient)
    • Can model long-range dependencies

74 of 101

Friday

75 of 101

Katie: Record on Zoom

76 of 101

Announcements

  • Gentle reminder: two project forms due on Sunday
  • Next week: lecture time is flipped classroom (working in groups)
    • Some tips for mini-lectures
    • Check-ins and feedback with every group about project topics and ideas

77 of 101

Topics

  • Wrap-up dialogue systems and transformers (~15 minutes)
  • Synthesis & reflection activity (~15 minutes)
  • Midterm debrief (~20 minutes)

78 of 101

  • Scanned midterms and grades on gradescope

  • Released: Today 12:30pm (after lecture)

  • On some scans, the handwriting was too light. Come see Katie during office hours for a paper copy.

79 of 101

Review: Dialogue system paradigms

Rule-based

Frame-based

End-to-end

Examples: ELIZA (1966), PARRY (1971)

Examples: UW “Sounding Board” (Alexa Prize winner 2017)

Examples: ChatGPT (2022)

80 of 101

Review: Sequence-to-sequence

Machine translation

Dialogue systems (aka, question-answering, prompt-response)

Trained “end to end”

81 of 101

Review: Self-attention intuition

A robot must protect its own existence

82 of 101

Review: Self-attention

Each time step is independent of the other time steps, so they can be computed in parallel

83 of 101

Important b/c different words can relate to each other in many ways simultaneously (e.g. syntactic, semantic, discourse relations)

J&M Fig 10.5

Each “head” has a unique set of key, query, and value weight matrices

84 of 101

Example from Vaswani et al. 2017

Learned self-attention weights for Layer 5 (of 6)

Vaswani et al. 2017, Fig 4

Isolating “its”

85 of 101

Vaswani et al. 2017, Fig 5

Different “heads” learn different relations between tokens

86 of 101

Review: Stacked encoders and decoders

Figure credit: Jay Alammar

87 of 101

Varies by model

Figure credit: Jay Alammar

Only up to current item in sequence

Different models will put masked vs unmasked self-attention in decoders

88 of 101

Supervised Models

Training data, e.g. class labels (y), typically come from humans

Self-Supervised Models

Training data, e.g. class labels (y), comes from raw data (no humans!)
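
A minimal sketch of why no human annotation is needed: the (context, next-token) training pairs fall straight out of raw text (toy word-level tokenization here for readability).

```python
text = "a robot must protect its own existence"
tokens = text.split()

# The "labels" are just the next tokens in the raw text: no human annotation needed.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs[:3]:
    print(context, "->", target)
# ['a'] -> robot
# ['a', 'robot'] -> must
# ['a', 'robot', 'must'] -> protect
```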

89 of 101

Variants on self-supervision for training

Masked language modeling

(bidirectional)

“Causal” language modeling

(unidirectional)

a poor terminology choice by the community

Figure credit: Prakhar Mishra

90 of 101

Training for unidirectional LMs

Teacher forcing = during training, at each time step in decoding we force the system to use the gold target token from the training data as the next input

Negative log probability the model assigns to the next word in the training sequence
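
A minimal PyTorch sketch of the teacher-forcing loss: the gold tokens (shifted by one) serve as both inputs and targets, and cross-entropy is exactly the negative log probability the model assigns to each gold next token. The logits below are random stand-ins for a real decoder’s output.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50, 6
token_ids = torch.randint(vocab_size, (seq_len,))   # one gold training sequence

inputs = token_ids[:-1]    # gold tokens fed in at each step (teacher forcing)
targets = token_ids[1:]    # gold "next word" at each step
logits = torch.randn(len(inputs), vocab_size)       # stand-in for model(inputs)

# Cross-entropy = mean negative log probability assigned to each gold next token
loss = F.cross_entropy(logits, targets)
print(loss.item())
```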

91 of 101

Review: Lots of compute!

Floating point operations (e.g. +, -, *, /)

GPT-3 total train compute = 3.14 x 10^23 FLOPs

Apple M1 Pro 16-Core GPU = 5.3 x 10^12 FLOPS (per second)

Factor of ~100 billion
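
A quick sanity check on the scale gap, using the two numbers on this slide and the (unrealistic) assumption that the laptop GPU sustains its peak throughput for the entire computation:

```python
gpt3_train_flops = 3.14e23          # total floating point operations for training
m1_pro_flops_per_sec = 5.3e12       # Apple M1 Pro 16-core GPU throughput

ratio = gpt3_train_flops / m1_pro_flops_per_sec
print(f"seconds at peak: {ratio:.1e}")                        # ~5.9e10
print(f"years at peak:   {ratio / (3600 * 24 * 365):,.0f}")   # ~1,879 years
```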

92 of 101

Synthesis & reflection activity

93 of 101

Midterm

94 of 101

  • Scanned midterms and grades on gradescope

  • Released: Today 12:30pm (after lecture)

  • On some scans, the handwriting was too light. Come see Katie during office hours for a paper copy.

95 of 101

Indicates to me this test was doable.

4.2: 70% mean (lowest)

96 of 101

Tests

  • Grade
  • Concept understanding
  • Problem solving
  • Performing under time pressure (exists in jobs and/or grad school)

Gentle reminder: Your grade is neither a reflection of your worth nor your ability.

97 of 101

(Optional) Extra Credit Midterm Revisions

  • Extra credit can only help you, not hurt you
  • On Gradescope, submit a pdf with the following questions:
    1. How did you prepare for this test? Knowing what you know now, how could you have improved your preparation?
    2. For each individual question you missed,
      • First pass. What was your initial answer (during test time)?
      • Reflection. Why do you think you got this question wrong during test time? Was it an issue with concept understanding, problem solving, and/or performing under time pressure? For concepts you didn’t understand, what was difficult about the concept?
      • Second pass. What is your revised answer?
  • Due: Friday, April 14 at 3:59pm ET

Encouraged to talk to classmates & Katie

98 of 101

Go over together

  • 4.2
  • 3.6
  • 2.3

99 of 101

Midterm guide

100 of 101

Lecture, Slide 54

101 of 101