1 of 69

Lecture 2

CS 263:

Advanced NLP

Saadia Gabriel

2 of 69

Announcements

  • PTE update: everyone waitlisted on Monday should be able to enroll; any additional requests will be addressed over the next week, but there is (limited) additional space

  • Please sign up for a presentation slot as soon as possible! If you do not have edit access, email me. Aim to select a paper at least 3 days in advance. If you need suggestions, ask me.

  • If you still don’t have access to Bruin Learn, email me.

3 of 69

Announcements

  • TA Office hours are being finalized:

4 of 69

Announcements

  • Our first guest lecture next Wednesday:

The privacy concerns surrounding LLMs stem from multiple factors. These include how the collection of massive amounts of data for training stochastic parrots often exceeds user expectations, how LLMs present the potential for data leakage or the acquisition of data by unauthorized entities, and concerns over how data, once collected, is used by (un)authorized agents to make decisions that could directly impact users' lives. The nature of LLM-based conversational agents, in particular, presents an ongoing site for privacy risk, as users disclose information during the course of an interaction. In this talk, I will review current practices designed to protect user privacy as it relates to LLM-based chatbots and query some of their shortcomings. I will then detail some of my current research, which investigates what factors may underlie behavior and attitudes related to digital privacy in this novel environment. Specifically, I investigate two important yet under-explored topics in digital privacy research: how users might adjust the amount of information they disclose as a mechanism of privacy protection, and what factors shape mental models/expectations of information flow.

Privacy Risks of LLM-based Chatbot Interaction

Sophie Klitgaard

5 of 69

Last time…

What is language?

How do we form associations between words and concepts?

How do we differentiate between literal meaning and language use in context?

6 of 69

Today…

What is a language model?

How do we form neural networks capable of mimicking and analyzing human language production?

What are some of the challenges in language modeling and how are they addressed by different architectures?

7 of 69

What is a Language Model?

GPT-2 in 2019

A computational tool that can generate this?

Initially appears fluent, but logically inconsistent…

8 of 69

What is a Language Model?

Fundamentally, a statistical LM is a probability distribution that determines what word is likely to follow a subsequence:

9 of 69

What is a Language Model?

An early form of language model is the n-gram model:

But what if an n-gram never appears during estimation?

Maximum Likelihood Estimation (MLE)
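The MLE estimate is just relative counts. A minimal sketch for bigrams (toy corpus; all names are illustrative):

```python
from collections import Counter

def bigram_mle(tokens):
    """MLE bigram estimates: P(w | h) = count(h, w) / count(h)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])
    return {(h, w): c / history[h] for (h, w), c in bigrams.items()}

p = bigram_mle("the cat sat on the mat".split())
# count(the, cat) = 1, count(the) = 2, so P(cat | the) = 0.5
```

An unseen bigram such as ("cat", "on") is simply absent from the table, i.e., it gets zero probability — exactly the sparsity problem smoothing addresses.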

10 of 69

“Curse of dimensionality”

Smoothing was introduced to handle sparsity and effects of test-time distribution shift.

Bengio et al. (2003)

Training sentences

All of English

11 of 69

LM Smoothing

We need to redistribute the probability mass so previously unseen events are not impossible.

(Add-λ) Laplace smoothing

Adding λ to the count in the numerator and λV to the denominator gives us the generalization

How do we select an effective λ?

Small values can lead to overfitting (high variance), large values to overestimation of novel events (high bias)
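A sketch of add-λ smoothing for a single history (toy counts; names illustrative):

```python
from collections import Counter

def add_lambda(counts, V, lam):
    """Add-λ smoothing: P(w | h) = (count(h, w) + λ) / (count(h) + λV)."""
    total = sum(counts.values())
    return lambda w: (counts.get(w, 0) + lam) / (total + lam * V)

# Toy counts of words observed after the history "the"; vocabulary size V = 5
p = add_lambda(Counter({"cat": 3, "dog": 1}), V=5, lam=1.0)
# Seen: P(cat) = (3+1)/(4+5); unseen: P(fish) = (0+1)/(4+5) > 0
```

Note how mass is redistributed: the unseen word "fish" is no longer impossible, at the cost of shaving probability from the seen events.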

12 of 69

LM Smoothing

We need to redistribute the probability mass so previously unseen events are not impossible.

(Add-λ) Laplace smoothing

Linear interpolation

Constraint: λ1 + λ2 + λ3 =1
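Linear interpolation mixes estimates of different orders; a minimal sketch (toy probabilities, illustrative λ values):

```python
def interpolate(p_tri, p_bi, p_uni, l1, l2, l3):
    """Linear interpolation of trigram, bigram, and unigram estimates.
    The constraint λ1 + λ2 + λ3 = 1 keeps the mixture a valid probability."""
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# An unseen trigram (MLE = 0) still gets mass from the lower-order models
p = interpolate(0.0, 0.2, 0.05, l1=0.6, l2=0.3, l3=0.1)  # → 0.065
```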

13 of 69

LM Smoothing

We need to redistribute the probability mass so previously unseen events are not impossible.

Backoff smoothing

What if the unigram doesn’t appear?

Then an "unk" symbol is introduced.

14 of 69

Neural Language Models

2025

1943

The first artificial neuron proposed, “McCulloch-Pitts neuron”

1958

Shallow feedforward neural network

A set of inputs

A set of weights

A bias term

Nonlinearity

Activation value
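The components listed above combine into one line of arithmetic; a sketch of a single artificial neuron (toy inputs, sigmoid chosen as the nonlinearity):

```python
import math

def neuron(x, w, b):
    """A single artificial neuron: weighted inputs plus bias, then a nonlinearity."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum + bias term
    return 1.0 / (1.0 + math.exp(-z))              # sigmoid → activation value

a = neuron(x=[1.0, 0.0], w=[2.0, -1.0], b=-2.0)    # z = 2·1 − 1·0 − 2 = 0, so a = 0.5
```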

15 of 69

Why do we need nonlinear activations?

16 of 69

Types of Activation Functions

Sigmoid

Tanh

Tends to converge faster
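The two functions are closely related; a short sketch making the connection explicit:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # range (0, 1)

def tanh(z):
    return math.tanh(z)                  # range (−1, 1), zero-centered

# tanh is a rescaled, shifted sigmoid: tanh(z) = 2·sigmoid(2z) − 1.
# Its zero-centered outputs are one reason it tends to converge faster in practice.
```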

17 of 69

And…

Gaussian cumulative distribution function

Gaussian CDF

18 of 69

And…

Courtesy of Tatsu Hashimoto

19 of 69

Feedforward Neural LM

Word representations to project inputs into low-dimensional vectors

Concatenate projected vectors to get multi-word contexts

Obtain p(y | x) by performing final linear transformation and softmax:

s = Whᵀh + bh

p = softmax(s)

Non-linear function, e.g.,

h = tanh(Wcᵀc + bc)
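The whole feedforward LM is a few matrix operations; a hedged sketch (all dimensions and names are illustrative, biases initialized to zero):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_ctx, H = 10, 4, 2, 8          # vocab size, embed dim, context words, hidden size

E  = rng.normal(size=(V, d))          # word representations (projection layer)
Wc = rng.normal(size=(n_ctx * d, H)); bc = np.zeros(H)
Wh = rng.normal(size=(H, V));         bh = np.zeros(V)

def ffnn_lm(context_ids):
    c = E[context_ids].reshape(-1)    # concatenate projected context vectors
    h = np.tanh(c @ Wc + bc)          # non-linear hidden layer
    s = h @ Wh + bh                   # final linear transformation (logits)
    e = np.exp(s - s.max())
    return e / e.sum()                # softmax → p(y | x)

p = ffnn_lm([3, 7])                   # a 2-word context of (arbitrary) word ids
```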

20 of 69

How Neural Networks Make Predictions

z1 = xW1 + b1

a1 = σ(z1)

z2 = a1W2 + b2

a2 = softmax(z2)

activation units

z1

x

a1

z2

a2

Forward pass of a deeper neural network:
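The forward pass above can be sketched directly (random weights, illustrative dimensions; σ taken to be the sigmoid):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

def forward(x):
    z1 = x @ W1 + b1
    a1 = 1.0 / (1.0 + np.exp(-z1))    # a1 = σ(z1)
    z2 = a1 @ W2 + b2                 # second layer takes a1, not x
    e = np.exp(z2 - z2.max())
    return e / e.sum()                # a2 = softmax(z2)

y = forward(np.array([0.5, -1.0, 2.0]))
```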

21 of 69

Learning the Parameters

22 of 69

Backpropagation

23 of 69

Backpropagation

24 of 69

Backpropagation

25 of 69

Backpropagation
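As a concrete instance of the chain rule behind backpropagation, here is a minimal sketch for a one-neuron network with squared-error loss (all names illustrative), checked against a numerical gradient:

```python
import math

def forward_backward(w, b, x, y):
    """Loss L = (σ(wx + b) − y)²; returns (L, ∂L/∂w, ∂L/∂b) via the chain rule."""
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))    # forward pass
    loss = (a - y) ** 2
    dL_da = 2 * (a - y)               # ∂L/∂a
    da_dz = a * (1 - a)               # σ'(z) = σ(z)(1 − σ(z))
    dL_dz = dL_da * da_dz             # chain rule: ∂L/∂z
    return loss, dL_dz * x, dL_dz     # ∂L/∂w = ∂L/∂z · x, ∂L/∂b = ∂L/∂z

loss, dw, db = forward_backward(0.5, 0.0, 2.0, 1.0)

# Sanity check against a finite-difference gradient
eps = 1e-6
num_dw = (forward_backward(0.5 + eps, 0.0, 2.0, 1.0)[0] - loss) / eps
assert abs(dw - num_dw) < 1e-4
```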

26 of 69

Update the Parameters

Step size for update
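The gradient-descent update itself is one line; a sketch (toy parameters and gradients):

```python
def sgd_update(params, grads, lr):
    """θ ← θ − η·∇L(θ): step each parameter against its gradient, scaled by the step size η."""
    return [p - lr * g for p, g in zip(params, grads)]

new = sgd_update([1.0, -2.0], [0.5, -0.5], lr=0.1)  # → [0.95, -1.95]
```

Too large a step size can overshoot the minimum; too small a step size makes training slow.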

27 of 69

Neural Language Models

2025

1943

The first artificial neuron proposed, “McCulloch-Pitts neuron”

1958

Shallow feedforward neural network

28 of 69

Neural Language Models

2025

1943

The first artificial neuron proposed, “McCulloch-Pitts neuron”

1997

Long short-term memory network (LSTM)

Gated recurrent network (GRU)

2014

For decades, variants of RNNs were the dominant neural architecture in NLP.

29 of 69

Recurrent Neural Networks (RNNs)

Hidden layer corresponding to a time-step t

Weight matrices

Critically, parameters are reused across time-steps.

Previous output hidden state

Input at time-step t

Predicted next word

Additional weight

Normalize
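One RNN time-step in code, with the same weight matrices reused at every step (random weights, illustrative dimensions, biases omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
d, H, V = 4, 6, 10
Wx = rng.normal(size=(d, H))    # input-to-hidden weights (shared across time-steps)
Wh = rng.normal(size=(H, H))    # hidden-to-hidden weights (shared across time-steps)
Wo = rng.normal(size=(H, V))    # hidden-to-output weights

def rnn_step(x_t, h_prev):
    h_t = np.tanh(x_t @ Wx + h_prev @ Wh)    # new hidden state from input + previous state
    s = h_t @ Wo
    e = np.exp(s - s.max())
    return h_t, e / e.sum()                  # normalize → predicted next-word distribution

h = np.zeros(H)
for x_t in rng.normal(size=(3, d)):          # a 3-step input sequence
    h, p = rnn_step(x_t, h)
```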

30 of 69

Recurrent Neural Networks (RNNs)

Cross-entropy loss for a corpus of size T

How well does the model probability distribution actually represent the data?
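A sketch of the loss, given the probability the model assigns to each gold next word:

```python
import math

def cross_entropy(gold_probs):
    """Average negative log-probability the model assigns to each gold next word."""
    return -sum(math.log(p) for p in gold_probs) / len(gold_probs)

# A model that is uniform over 4 choices: loss = log 4; perplexity = exp(loss) = 4
loss = cross_entropy([0.25, 0.25, 0.25, 0.25])
```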

31 of 69

Long Short-Term Memory (LSTM)

32 of 69

Vanishing Gradients

Vanishing gradients pose a problem to training with long sequences.

33 of 69

Long Short-Term Memory (LSTM)

34 of 69

Core Idea Behind LSTMs

35 of 69

The 3 Gates in a LSTM:

1. Input gate

2. Forget gate

3. Output gate
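Putting the three gates together, one LSTM step looks like this (a sketch: biases omitted, random weights, illustrative dimensions; one common packing of the four transforms into a single matrix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM step; W maps [h_prev; x] to the 4 internal pre-activations."""
    z = np.concatenate([h_prev, x]) @ W           # shared affine transform
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates in (0, 1)
    c = f * c_prev + i * np.tanh(g)               # cell state: forget old, admit new
    h = o * np.tanh(c)                            # hidden state exposed to the next layer
    return h, c

H, d = 3, 2
rng = np.random.default_rng(3)
W = rng.normal(size=(H + d, 4 * H))
h, c = lstm_cell(rng.normal(size=d), np.zeros(H), np.zeros(H), W)
```

The additive cell-state update `f * c_prev + i * tanh(g)` is the core idea: gradients can flow through the sum without repeatedly passing through squashing nonlinearities, which mitigates vanishing gradients.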

36 of 69

Putting Everything Together

37 of 69

LSTMs Summary

38 of 69

Gated Recurrent Units (GRUs)

Cho et al. (2014)

  • An even more efficient variation on the LSTM

  • GRUs use 2 gates instead of 3

Reset gate

Update gate

The reset gate determines how much of ht-1 to forget

The update gate determines how much new information from the input to retain
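A sketch of one GRU step under these two gates (biases omitted, random weights; note that gate conventions vary slightly across papers and libraries):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, Wr, Wz, Wh):
    """One GRU step with a reset gate r and an update gate z."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(hx @ Wr)                     # reset gate: how much of h_prev to forget
    z = sigmoid(hx @ Wz)                     # update gate: how much new information to keep
    h_tilde = np.tanh(np.concatenate([r * h_prev, x]) @ Wh)   # candidate state
    return (1 - z) * h_prev + z * h_tilde    # interpolate old state and candidate

H, d = 3, 2
rng = np.random.default_rng(4)
Wr, Wz, Wh = (rng.normal(size=(H + d, H)) for _ in range(3))
h = gru_cell(rng.normal(size=d), np.zeros(H), Wr, Wz, Wh)
```

With two gates and no separate cell state, a GRU has fewer parameters than an LSTM of the same hidden size.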

39 of 69


The sequence-to-sequence setup:

A sequence x1, x2, … xn goes in

A sequence y1, y2, … ym comes out

Used in speech recognition, machine translation, dialogue, etc…

Encoder-Decoder Architectures

40 of 69

Encoder-Decoder Architectures

Example from Dive into Deep Learning

Encoder calculations

Use final hidden state from encoder to initialize decoder hidden states

Decoder calculations

41 of 69

  • We can define the encoder context vector (our initial decoder hidden state) in a more sophisticated way, for example as a weighted sum of encoder hidden states.

  • Semantically similar sentences have similar encoder context representations:

Encoder-Decoder Architectures

42 of 69

Limitation of RNNs/LSTMs

43 of 69

Enter the Attention Mechanism

(Bahdanau et al., 2015)

44 of 69

Attention Mechanism

Attention scores are calculated as a function of the current decoder hidden state and the encoder state at each input position.
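A sketch of this score-then-weight pattern (random states, illustrative dimensions). The scoring function here is a plain dot product, one common choice; Bahdanau et al. (2015) use an additive learned score instead:

```python
import numpy as np

def attention(dec_h, enc_hs):
    """Score each encoder state against the decoder state, softmax into
    weights, and return the weighted sum of encoder states (context vector)."""
    scores = enc_hs @ dec_h                 # one score per input position
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                   # attention distribution over inputs
    return weights, weights @ enc_hs        # context vector

rng = np.random.default_rng(5)
enc_hs = rng.normal(size=(4, 6))            # 4 input positions, hidden size 6
weights, context = attention(rng.normal(size=6), enc_hs)
```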

45 of 69

Basics of Attention

Slide courtesy of Mohit Iyyer

In practice, we scale attention weights to stabilize gradients during training

46 of 69

Basics of Attention

Slide courtesy of Mohit Iyyer

47 of 69

Basics of Attention

Slide courtesy of Mohit Iyyer

48 of 69

t=2

Basics of Attention

Slide courtesy of Mohit Iyyer

Keep in mind, these attention weights are specific to each decoder timestep.

Attention distribution from t=1

49 of 69

Seq2Seq with Attention

50 of 69

Attention Summary

Slide courtesy of Mohit Iyyer

51 of 69

Is attention all you need to model language?

52 of 69

Transformers

53 of 69

Transformers

54 of 69

Structures of each 

encoder and decoder

55 of 69

The input to the first encoder is word embeddings

The subsequent encoders get the output of the previous encoder

2 linear transformations,

ReLU activation

New terminology!

56 of 69

Self-attention

Self-attention allows the model to look at other positions in the input sequence for clues that can lead to a better encoding for this word.

https://nlp.seas.harvard.edu/2018/04/03/attention.html

This is how we can learn locally contextual embeddings.

57 of 69

Self-attention

Step 1: create three vectors

They are the query, key and value vectors. 

58 of 69

Self-attention

Step 2: calculate the attention

  • Attention: how much focus to place on other parts of the input sentence
  • The first word "queries" each other word, using their keys to decide how much attention to place on each word.

59 of 69

Self-attention

Stabilizes gradients

60 of 69

Step 5: Multiply each value vector by the attention score

  • Focus on the words we want to attend to (high attention weights), and ignore irrelevant words (low attention weights)

Step 6: sum up the weighted value vectors

Self-attention

61 of 69

Self-attention

Packing embeddings into a matrix

  • Each row in the X matrix corresponds to a word in the input sentence
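In matrix form, the steps above collapse into a few lines; a sketch of scaled dot-product self-attention (random weights, illustrative dimensions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """softmax(QKᵀ / √d_k) · V, computed for all positions at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # one query/key/value row per word
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # scaling by √d_k stabilizes gradients
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)  # row i: attention of word i over all positions
    return A @ V                           # weighted sum of value vectors

rng = np.random.default_rng(6)
X = rng.normal(size=(3, 4))                # 3 words, embedding size 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
```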

62 of 69

63 of 69

Self-attention

64 of 69

Self-attention Summary

65 of 69

A few more details…

In RNNs, we captured positional information with recurrence. In the absence of recurrence, we use positional embeddings.

Positional Encoding: t is position, i is dimension index, d is the vector size
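A sketch of the standard sinusoidal encoding (Vaswani et al., 2017), which this slide appears to use, with t, i, and d as defined above:

```python
import numpy as np

def positional_encoding(t, d):
    """PE(t, 2i) = sin(t / 10000^(2i/d)), PE(t, 2i+1) = cos(t / 10000^(2i/d))."""
    i = np.arange(d // 2)
    angles = t / (10000 ** (2 * i / d))   # one frequency per dimension pair
    pe = np.zeros(d)
    pe[0::2] = np.sin(angles)             # even dimensions
    pe[1::2] = np.cos(angles)             # odd dimensions
    return pe

pe0 = positional_encoding(0, 8)           # position 0: sin terms are 0, cos terms are 1
```

Because each dimension pair oscillates at a different frequency, every position receives a distinct vector that can simply be added to the word embedding.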

66 of 69

GPT Architecture

https://jalammar.github.io/illustrated-gpt2/

67 of 69

GPT Technical Details

68 of 69

Large Language Models and Beyond

Image from https://blogs.cfainstitute.org/investor/2023/05/26/chatgpt-and-large-language-models-six-evolutionary-steps/

Many components of LLM development have been scaling up….

69 of 69

Questions?