Dialogue systems
and Transformers
Williams College
Spring 2023
CSCI 375: Natural Language Processing
Announcements
HW 5 is the last assignment as individuals
Tips:
Groups for final project (and HW 6)
Let me know any potential issues by this Wednesday.
Final Projects
All info on the project tab of the class website
First half of the semester:
Second half of the semester:
Final project learning objectives
Bonus: NLP is changing so fast. This ensures we’re spending the second half of the class on relevant topics.
Final project outcomes
Goal of “research”: Try to find a place no one has been before.
Expansion of human knowledge.
Fill out by Sunday
“Topic” scope
Ideas are not created in a vacuum…
Places to start if you’re looking for an NLP topic
Browse potential topics from ACL workshops
Linguistic, e.g., semantic evaluation
Applications, e.g., NLP for Educational Applications
Machine learning, e.g., representation learning for NLP
“Flash talks” of 4 ideas
AfriSenti-SemEval Shared Task 12
Linguistic focus
Many open NLP problems for languages that are not English!
Opportunity to use your liberal arts education!
Shared tasks
NLP model distillation and “on-device” models
Machine learning focus
Scientific Information Extraction
NLP for Computational Social Science
Computational measure of framing: how topics are discussed can influence the way audiences understand them
Logistics
“Tutorial” style small group meetings
Please let me know by Friday if you have conflicts with May 11.
Grading
Final report graded on a rubric
All deliverables except the final report are graded all-or-nothing on completion.
You’ll receive qualitative feedback to improve your final report.
Dialogue systems
J&M Textbook
This week
Other application chapters could potentially make great project topics!
Dialogue systems
Back to Deep Learning!
Definitions
What has driven recent interest in dialogue systems?
Commercial!
Amazon’s Alexa
Apple’s Siri
Google Assistant
Historical motivations: Operationalizing “intelligence”
Alan Turing (1912-1954)
Turing Test (1950)
Definitions
Review: Joseph Weizenbaum and ELIZA (1966)
1923-2008
Professor at MIT
ELIZA: Rule-based program that resembled a Rogerian psychologist (a type of counseling in which the therapist is nondirective but supportive)
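A minimal ELIZA-style sketch in Python. The patterns and responses below are illustrative assumptions, not Weizenbaum’s actual script; they just show the pattern-match-and-reflect idea:

```python
import re
import random

# Hypothetical ELIZA-style rules: regex pattern -> response templates.
# \1 is filled with the captured text, mimicking Rogerian "reflection".
RULES = [
    (r"I need (.*)",      [r"Why do you need \1?", r"Would it really help you to get \1?"]),
    (r"I am (.*)",        [r"How long have you been \1?", r"Why do you think you are \1?"]),
    (r"(.*) mother(.*)",  [r"Tell me more about your family."]),
]

def eliza_respond(user_input: str) -> str:
    for pattern, responses in RULES:
        match = re.match(pattern, user_input, re.IGNORECASE)
        if match:
            # expand() substitutes \1, \2, ... with the captured groups
            return match.expand(random.choice(responses))
    return "Please go on."          # fallback when no rule matches

print(eliza_respond("I need a vacation"))   # e.g. "Why do you need a vacation?"
```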
PARRY (1971)
First known system to pass the Turing test (in 1972!)
Definitions
You’ll implement this approach in Homework 6!
Frame-based dialogue systems (J&M Section 15.3)
Example frame: travel
Other frames: car or hotel reservations, calling another person
Old system (1977!) but frame-based approach still used in modern applications
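A minimal sketch of frame-based slot filling for a travel frame, in the spirit of HW 6. The slot names, regexes, and questions are illustrative assumptions:

```python
import re

# Hypothetical travel frame: slot name -> regex that extracts a filler.
TRAVEL_FRAME = {
    "origin":      r"from ([A-Z][a-z]+)",
    "destination": r"to ([A-Z][a-z]+)",
    "date":        r"on ([A-Za-z]+ \d{1,2})",
}

# Questions the dialogue manager asks for any slot that is still empty.
SLOT_QUESTIONS = {
    "origin": "Where are you flying from?",
    "destination": "Where would you like to go?",
    "date": "What day would you like to travel?",
}

def update_frame(utterance: str, frame: dict) -> dict:
    """Fill any empty slots whose pattern matches the user's utterance."""
    for slot, pattern in TRAVEL_FRAME.items():
        match = re.search(pattern, utterance)
        if match and frame.get(slot) is None:
            frame[slot] = match.group(1)
    return frame

frame = {slot: None for slot in TRAVEL_FRAME}
frame = update_frame("I want to fly from Boston to Denver", frame)
missing = [s for s, v in frame.items() if v is None]
print(frame)   # {'origin': 'Boston', 'destination': 'Denver', 'date': None}
print(SLOT_QUESTIONS[missing[0]] if missing else "Booking your trip!")
```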
2017 “Alexa Prize”: $2.5 million challenge
Task: Build a conversational agent to converse coherently with humans on popular topics (e.g. sports or politics) for 20 minutes
Tested with real Alexa users
2017! Rapid change in NLP state of the art in ~6 years
UW’s “Sounding Board” team won
Used frame-based approach
(what they call a “hierarchical dialogue manager”)
Frame-based dialogue systems
Other example modules:
Think-pair-share 1
Brainstorm some advantages and disadvantages of modular dialogue systems like these
Weds – Transformers
Announcements
Review: Deep Learning for NLP – modern tech stack
Python and numpy
Pytorch
transformers (Hugging Face)
You implement gradients and computation graph
Modular pieces of a deep learning architecture with gradients automatically calculated
NLP models pre-trained on existing collections of text documents
Review: Dialogue systems
Advantages: Interpretability, Ease of Evaluation
Disadvantages: Low generalization, “top down” design
Review: Dialogue systems
This lecture!
End-to-end
Real-valued vector
Two running applications in this lecture
Machine translation
Dialogue systems (aka, question-answering, prompt-response)
Stacked encoders and decoders
Figure credit: Jay Alammar
Number of decoder blocks distinguishes models
Figure credit: Jay Alammar
Review: Fixed-window language model
e.g. fixed window = 4
Feed-forward deep learning model
Thought experiment
Want hidden layer output to be the same size as the input so we can stack these in our encoder-decoder architecture.
Input layer
(“unravelled” into 1-d vector)
Size: Num. tokens x embedding dimension
Single hidden layer
Weights applied linearly to input + non-linear activation
Think-pair-share 1
In the hidden layer whose output is the same size as the input, what is the size of the weights (number of model parameters that will be trained by SGD)?
Calculate for GPT model (variant):
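A back-of-envelope calculation for the think-pair-share, assuming GPT-2-small-style sizes (context window 1024, embedding dimension 768); the exact numbers depend on which GPT variant you pick:

```python
# Parameter count for the thought experiment: a single dense hidden layer
# whose input and output are both the "unravelled" window
# (num_tokens * embedding_dim values).
num_tokens = 1024        # context window length (assumed)
embedding_dim = 768      # embedding dimension (assumed)

input_size = num_tokens * embedding_dim      # 786,432
weight_params = input_size * input_size      # square weight matrix

print(f"input size:    {input_size:,}")
print(f"weight params: {weight_params:,}")   # ~6.2e11 parameters for ONE layer
```

The point of the exercise: a fully connected layer over the whole unravelled window blows up quadratically, which is part of why we need a different architecture.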
Feed-forward deep learning model
Transformers
No, not these kinds of transformers :)
Generative pretrained transformer (GPT) type of transformer
Citation count is a proxy for what the research community deems important
Transformers
Self-attention intuition
In this example, how do you know what “its” refers to?
“A robot must protect its own existence”
Attention intuition: Machine Translation alignment
Attention high-level goal: Compare a token of interest to other tokens to reveal what is relevant in context.
More white = greater attention score
J&M Fig. 10.1
Attends to each previous element (but not future elements).
Each time step independent of other time steps so can be performed in parallel
Transformers were a response to computational inefficiencies of recurrent neural networks (RNNs)
Figure credit: Chris Olah
Represent prior context with recurrent connections
Won’t go in-depth
Can’t parallelize training. Inefficient
Self-attention board work
Suppose we’ve already trained the model and are now doing inference for y3 (see the sketch below).
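A minimal numpy sketch of the board work: computing y3 by attending over x1..x3. The weights here are random stand-ins for trained parameters, just to make the shapes concrete:

```python
import numpy as np

d = 4                                  # toy model / embedding dimension
rng = np.random.default_rng(0)

# Toy "trained" weights (random here, only to make the shapes concrete).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(3, d))            # x1, x2, x3: inputs seen so far

# Computing y3: compare x3's query against the keys of x1..x3.
q3 = x[2] @ W_q                        # query for position 3
K  = x @ W_k                           # keys for positions 1..3
V  = x @ W_v                           # values for positions 1..3

scores = K @ q3 / np.sqrt(d)                      # scaled dot-product scores
alpha  = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
y3     = alpha @ V                                # weighted sum of the values
print(alpha, y3)
```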
Self-attention: efficient, vectorized version
Q: Why are we setting these to negative infinity?
Why is this important?
J&M Fig 10.3
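A minimal numpy sketch of the efficient, vectorized version with a causal mask. The answer to the question on the slide: after the softmax, the -inf scores become exactly zero attention weight, so no position can attend to the future:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Vectorized self-attention over all positions at once, with a causal
    mask: scores for future positions are set to -inf so that, after the
    softmax, they receive exactly zero attention weight."""
    n, d = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) score matrix
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores[mask] = -np.inf                              # block attention to the future
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V
```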
Important b/c different words can relate to each other in many ways simultaneously (e.g. syntactic, semantic, discourse relations)
J&M Fig 10.5
Each “head” has unique set of key, query and value weight matrices
Stacked many times over
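A sketch of multi-head attention built on plain (unmasked) scaled dot-product attention; the per-head weight triples and the output projection W_o are toy assumptions:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Plain (unmasked) scaled dot-product self-attention for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads, W_o):
    """Each head has its OWN (W_q, W_k, W_v) triple; head outputs are
    concatenated and then mixed by an output projection W_o."""
    head_outputs = [attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_o
```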
Residual connection: Pass info from lower layer to higher layer without going through intermediate layer
Layer normalization
In statistics, “z-scores”
Weights learned by the model
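A minimal sketch of layer normalization and a residual connection; gamma and beta are the weights learned by the model:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to mean 0, std 1 (a "z-score"), then
    rescale and shift with the learned weights gamma and beta."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def transformer_sublayer(x, sublayer, gamma, beta):
    """Residual connection: add the sublayer's input back to its output
    (info passes from the lower layer without going through the
    intermediate layer), then layer-normalize."""
    return layer_norm(x + sublayer(x), gamma, beta)
```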
Stacked encoders and decoders
Figure credit: Jay Alammar
Notes on inputs
Review from week 1: Tokenization for GPT-3, GPT-2… most likely ChatGPT
In practice, the vocabulary is not characters but bytes (recall 8 bits=1 byte).
BPE Algorithm Inputs
Input: Embedding look-up
Figure credit: Jay Alammar
Must truncate and pad
(like HW5)!
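A sketch using the Hugging Face GPT-2 tokenizer to show byte-level BPE plus truncation and padding; max_length=8 is just for illustration, and since GPT-2 has no pad token by default we reuse the EOS token:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token    # GPT-2 has no pad token by default

print(tokenizer.tokenize("A robot must protect its own existence"))
# Byte-level BPE subword tokens, e.g. ['A', 'Ġrobot', 'Ġmust', ...]

batch = tokenizer(
    ["A robot must protect its own existence", "Hi"],
    padding="max_length", truncation=True, max_length=8,
)
print(batch["input_ids"])   # IDs that index rows of the embedding look-up table
```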
Input: Positional encodings
Tells the order of the sequence to the model.
Initialized randomly and then learned by the model
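A sketch of learned positional encodings added to the token embeddings; the sizes are GPT-2-style assumptions and the IDs are toy values:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50257, 1024, 768   # assumed GPT-2-style sizes
token_emb = nn.Embedding(vocab_size, d_model)     # one vector per vocabulary item
pos_emb = nn.Embedding(max_len, d_model)          # one vector per position,
                                                  # randomly initialized, then learned

input_ids = torch.tensor([[32, 9379, 1276]])      # (batch=1, seq_len=3), toy IDs
positions = torch.arange(input_ids.size(1))       # [0, 1, 2]
x = token_emb(input_ids) + pos_emb(positions)     # order information is added in
print(x.shape)                                    # torch.Size([1, 3, 768])
```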
Autoregression
Figure credit: Jay Alammar
After each token is generated, that token is added to the sequence of inputs.
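A sketch of autoregressive (greedy) generation with the Hugging Face GPT-2 model; each generated token is appended to the input before the next step:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("A robot must", return_tensors="pt")["input_ids"]
with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids).logits          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()          # greedy: most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(input_ids[0]))
```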
Transformers
Friday
Katie: Record on Zoom
Announcements
Topics
Review: Dialogue system paradigms
Rule-based
Frame-based
End-to-end
Examples: ELIZA (1966), PARRY (1971)
Examples: UW “Sounding Board” (Alexa Prize winner 2017)
Examples: ChatGPT (2022)
Review: Sequence-to-sequence
Machine translation
Dialogue systems (aka, question-answering, prompt-response)
Trained “end to end”
Review: Self-attention intuition
A robot must protect its own existence
Review: Self-attention
Each time step independent of other time steps so can be performed in parallel
Important b/c different words can relate to each other in many ways simultaneously (e.g. syntactic, semantic, discourse relations)
J&M Fig 10.5
Each “head” has unique set of key, query and value weight matrices
Example from Vaswani et al. 2017
Learned self-attention weights for Layer 5 (of 6)
Vaswani et al. 2017, Fig 4
Isolating “its”
Vaswani et al. 2017, Fig 5
Different “heads” learn different relations between tokens
Review: Stacked encoders and decoders
Figure credit: Jay Alammar
Varies by model
Figure credit: Jay Alammar
Only up to current item in sequence
Different models will put masked vs unmasked self-attention in decoders
Supervised Models
Training data, e.g. class labels (y), typically come from humans
Self-Supervised Models
Training data, e.g. class labels (y), comes from raw data (no humans!)
Variants on self-supervision for training
Masked language modeling
(bidirectional)
“Causal” language modeling
(unidirectional)
poor terminology choice by community
Figure credit: Prakhar Mishra
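A toy illustration contrasting the two objectives on the same sentence (the masked positions are arbitrary):

```python
sentence = ["a", "robot", "must", "protect", "its", "own", "existence"]

# Masked LM (bidirectional): hide some tokens, predict them from BOTH sides.
mlm_input  = ["a", "robot", "[MASK]", "protect", "its", "[MASK]", "existence"]
mlm_target = {2: "must", 5: "own"}

# "Causal" LM (unidirectional): predict each token from the tokens BEFORE it.
clm_input  = sentence[:-1]        # a robot must protect its own
clm_target = sentence[1:]         # robot must protect its own existence
```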
Training for unidirectional LMs
Teacher forcing = during training, at each time step in decoding we force the system to use the gold target token from the training data as the next input
Negative log probability the model assigns to the next word in the training sequence
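A sketch of one teacher-forced training step for a unidirectional LM, using the Hugging Face GPT-2 model; the token IDs are toy values:

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Teacher forcing: at every position the model is fed the GOLD previous
# tokens (not its own predictions) and is scored on the gold next token.
gold = torch.tensor([[32, 9379, 1276, 1805, 663]])   # toy gold token IDs
inputs, targets = gold[:, :-1], gold[:, 1:]          # shift by one position

logits = model(inputs).logits                        # (1, seq_len, vocab_size)
loss = F.cross_entropy(                              # mean negative log probability
    logits.reshape(-1, logits.size(-1)),             # assigned to each gold next token
    targets.reshape(-1),
)
loss.backward()                                      # gradients for SGD
```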
Review: Lots of compute!
Floating point operations (e.g. +, -, *, /)
Apple M1 Pro 16-Core-GPU = 5.3 x 10^12 FLOPS
GPT-3 total train compute =
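A rough back-of-envelope, assuming the commonly cited ~3.14 x 10^23 total training FLOPs for GPT-3 and the M1 Pro figure above:

```python
# How long would GPT-3's training compute take on a single M1 Pro GPU?
gpt3_flops = 3.14e23            # assumed total training compute (FLOPs)
m1_flops_per_sec = 5.3e12       # from the slide above
seconds = gpt3_flops / m1_flops_per_sec
print(seconds / (60 * 60 * 24 * 365))   # roughly 1,900 years
```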
Synthesis & reflection activity
Midterm
Indicates to me this test was doable.
4.2: 70% mean (lowest)
[Diagram: what tests measure beyond the grade — concept understanding, problem solving, and performing under time pressure (which exists in jobs and/or grad school)]
Gentle reminder: Your grade is neither a reflection of your worth nor your ability.
(Optional) Extra Credit Midterm Revisions
Encouraged to talk to classmates & Katie
Go over together
Midterm guide
Lecture, Slide 54