Lecture 2
CS 263:
Advanced NLP
Saadia Gabriel
Announcements
The privacy concerns surrounding LLMs stem from multiple factors. These include how the collection of massive amounts of data for training stochastic parrots often exceeds user expectations, how LLMs present the potential for data leakage (the acquisition of data by unauthorized entities), as well as concerns over how data, once collected, is used by (un)authorized agents to make decisions that could directly impact users' lives. The nature of LLM-based conversational agents, in particular, presents an ongoing site for privacy risk, as users disclose information during the course of an interaction. In this talk, I will review current practices designed to protect user privacy as it relates to LLM-based chatbots and query some of their shortcomings. I will then detail some of my current research, which investigates what factors may be underlying behavior and attitudes related to digital privacy in this novel environment. Specifically, I investigate two important yet under-explored topics in digital privacy research: how users might adjust the amount of information they disclose as a mechanism of privacy protection and what factors shape mental models/expectations of information flow.
Privacy Risks of LLM-based Chatbot Interaction
Sophie Klitgaard
Last time…
What is language?
How do we form associations between words and concepts?
How do we differentiate between literal meaning and language use in context?
Today…
What is a language model?
How do we form neural networks capable of mimicking and analyzing human language production?
What are some of the challenges in language modeling and how are they addressed by different architectures?
What is a Language Model?
GPT-2 in 2019
A computational tool that can generate this?
Initially appears fluent, but logically inconsistent…
What is a Language Model?
Fundamentally, a statistical LM is a probability distribution that determines which word is likely to follow a given subsequence:
What is a Language Model?
An early form of language model is the n-gram model:
But what if an n-gram never appears during estimation?
Maximum Likelihood Estimation (MLE)
“Curse of dimensionality”
Smoothing was introduced to handle sparsity and effects of test-time distribution shift.
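As a concrete illustration, MLE bigram estimation can be sketched in a few lines of Python (the toy corpus and names here are illustrative, not from the slides):

```python
from collections import Counter

# Hypothetical toy corpus for illustration
corpus = "the cat sat on the mat".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_mle(w, prev):
    # MLE: P(w | prev) = count(prev, w) / count(prev); 0 for unseen events
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / context_counts[prev]
```

Note that p_mle("dog", "the") is exactly 0 here, which is precisely the sparsity problem smoothing addresses.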
Bengio et al. (2003)
Training sentences
All of English
LM Smoothing
We need to redistribute the probability mass so previously unseen events are not impossible.
(Add-λ) Laplace smoothing
Adding λ in the numerator and λV in the denominator gives us the generalization
How do we select an effective λ?
Small values can lead to overfitting (high variance), large values to overestimation of novel events (high bias)
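A minimal sketch of add-λ smoothing under the same bigram setup (the counts and vocabulary size V are illustrative):

```python
# Illustrative counts: two bigrams observed with context "the"
bigram_counts = {("the", "cat"): 1, ("the", "mat"): 1}
context_counts = {"the": 2}

def p_add_lambda(w, prev, lam=1.0, V=5):
    # Add λ in the numerator and λV in the denominator, so unseen
    # events get nonzero probability and the distribution still sums to 1
    return (bigram_counts.get((prev, w), 0) + lam) / (context_counts.get(prev, 0) + lam * V)
```

With λ = 1 this is Laplace smoothing: the unseen bigram ("the", "dog") now gets probability 1/7 instead of 0.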
LM Smoothing
We need to redistribute the probability mass so previously unseen events are not impossible.
(Add-λ) Laplace smoothing
Linear interpolation
Constraint: λ1 + λ2 + λ3 =1
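Linear interpolation can be sketched as a weighted mixture of trigram, bigram, and unigram estimates (the λ values here are illustrative; in practice they are tuned on held-out data):

```python
def p_interp(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # constraint: λ1 + λ2 + λ3 = 1
    return l1 * p_tri + l2 * p_bi + l3 * p_uni
```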
LM Smoothing
We need to redistribute the probability mass so previously unseen events are not impossible.
Backoff smoothing
What if the unigram doesn’t appear?
Then an “unk” symbol is introduced.
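Backoff can be sketched as falling through to lower-order estimates, with "&lt;unk&gt;" as the final fallback (the probabilities are illustrative; a full Katz backoff would also discount and renormalize):

```python
bigram_p = {("the", "cat"): 0.5}          # illustrative probabilities
unigram_p = {"cat": 0.1, "<unk>": 0.01}

def p_backoff(w, prev):
    # Use the bigram if it was seen; otherwise back off to the unigram;
    # otherwise fall back to the probability reserved for "<unk>"
    if (prev, w) in bigram_p:
        return bigram_p[(prev, w)]
    return unigram_p.get(w, unigram_p["<unk>"])
```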
Neural Language Models
2025
1943
The first artificial neuron proposed, “McCulloch-Pitts neuron”
1958
Shallow feedforward neural network
A set of inputs
A set of weights
A bias term
Nonlinearity
Activation value
Why do we need nonlinear activations?
Types of Activation Functions
Sigmoid
Tanh
Tends to converge faster
And…
Gaussian cumulative distribution function
Gaussian CDF
And…
Courtesy of Tatsu Hashimoto
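The activation functions above can be sketched directly; GELU uses the Gaussian CDF Φ(x):

```python
import math

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # GELU(x) = x * Φ(x), where Φ is the standard Gaussian CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# tanh, which tends to converge faster than sigmoid, is math.tanh
```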
Feedforward Neural LM
Word representations to project inputs into low-dimensional vectors
Concatenate projected vectors to get multi-word contexts
Obtain p(y | x) by performing final linear transformation and softmax:
s = Wh^T h + bh
p = softmax(s)
Non-linear function, e.g.,
h = tanh(Wc^T c + bc)
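A minimal NumPy sketch of this feedforward LM (dimensions and random initialization are illustrative): project each context word through an embedding table E, concatenate, apply the nonlinearity, then softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, context = 10, 4, 2                # vocab size, embedding dim, context length
E = rng.normal(size=(V, d))             # word embedding table
Wc = rng.normal(size=(context * d, 8)); bc = np.zeros(8)
Wh = rng.normal(size=(8, V));           bh = np.zeros(V)

def ffnn_lm(word_ids):
    c = np.concatenate([E[i] for i in word_ids])  # project and concatenate
    h = np.tanh(c @ Wc + bc)                      # h = tanh(Wc^T c + bc)
    s = h @ Wh + bh                               # s = Wh^T h + bh
    p = np.exp(s - s.max())
    return p / p.sum()                            # p = softmax(s)
```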
How Neural Networks Make Predictions
z1 = xW1 + b1
a1 = σ(z1)
z2 = a1W2 + b2
a2 = softmax(z2)
activation units
z1
x
a1
z2
a2
Forward pass of a deeper neural network:
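This two-layer forward pass can be sketched as follows (note the second layer consumes a1, the first layer's activation, not the raw input x):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1                  # first linear layer
    a1 = 1.0 / (1.0 + np.exp(-z1))    # a1 = σ(z1), sigmoid activation
    z2 = a1 @ W2 + b2                 # second layer takes a1 as input
    e = np.exp(z2 - z2.max())
    return e / e.sum()                # a2 = softmax(z2)
```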
Learning the Parameters
Backpropagation
Backpropagation
Backpropagation
Backpropagation
Update the Parameters
Step size for update
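The update rule θ ← θ − η∇θL can be sketched as (η is the step size):

```python
def sgd_update(params, grads, lr=0.1):
    # θ ← θ - η * ∇θL, applied elementwise to each parameter
    return [p - lr * g for p, g in zip(params, grads)]
```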
Neural Language Models
2025
1943
The first artificial neuron proposed, “McCulloch-Pitts neuron”
1997
Long short-term memory network (LSTM)
Gated recurrent network (GRU)
2014
For decades, variants of RNNs were the dominant neural architecture in NLP.
Recurrent Neural Networks (RNNs)
Hidden layer corresponding to a time-step t
Weight matrices
Critically, parameters are reused across time-steps.
Previous output hidden state
Input at time-step t
Predicted next word
Additional weight
Normalize
Recurrent Neural Networks (RNNs)
Cross entropy loss for corpus of size T
How well does the model probability distribution actually represent the data?
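One RNN step and the per-token cross-entropy can be sketched as follows (shapes are illustrative; critically, the same Wx, Wh, b are reused at every time-step):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # h_t = tanh(Wx^T x_t + Wh^T h_{t-1} + b); parameters shared across time
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

def token_loss(p, target_id):
    # cross-entropy for one token: -log p(target); average over T for a corpus
    return -np.log(p[target_id])
```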
Long Short-Term Memory (LSTM)
Vanishing Gradients
Vanishing gradients pose a problem to training with long sequences.
Long Short-Term Memory (LSTM)
Core Idea Behind LSTMs
The 3 Gates in a LSTM:
3. Output gate
2. Forget gate
1. Input gate
Putting Everything Together
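The three gates and the cell update can be sketched in one step function (here W and U stack the four gate weight matrices; dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    z = x @ W + h_prev @ U + b
    i, f, o, g = np.split(z, 4)           # input, forget, output gates + candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)       # forget old memory, write new memory
    h = o * np.tanh(c)                    # output gate exposes the cell state
    return h, c
```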
LSTMs Summary
Gated Recurrent Units (GRUs)
Cho et al. (2014)
Reset gate
Update gate
The reset gate determines how much of ht-1 to forget
The update gate determines how much new information from the input to retain
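A GRU step can be sketched with its two gates (weight shapes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(x @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate: how much of h_{t-1} to forget
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde          # blend old state and candidate
```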
The sequence-to-sequence setup:
A sequence x1, x2, … xn goes in
A sequence y1, y2, … ym comes out
Used in speech recognition, machine translation, dialogue, etc…
Encoder-Decoder Architectures
Encoder-Decoder Architectures
Example from Dive into Deep Learning
Encoder calculations
Use final hidden state from encoder to initialize decoder hidden states
Decoder calculations
Encoder-Decoder Architectures
Limitation of RNNs/LSTMs
Enter the Attention Mechanism
(Bahdanau et al., 2015)
Attention Mechanism
Attention scores calculated as a function of the current decoder hidden state and encoder state for a specific input position.
Basics of Attention
Slide courtesy of Mohit Iyyer
In practice, we scale attention weights to stabilize gradients during training
Basics of Attention
Slide courtesy of Mohit Iyyer
Basics of Attention
Slide courtesy of Mohit Iyyer
t=2
Basics of Attention
Slide courtesy of Mohit Iyyer
Keep in mind, these attention weights are specific to each decoder timestep.
Attention distribution from t=1
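One decoder time-step of (scaled) dot-product attention can be sketched as:

```python
import numpy as np

def attention(dec_h, enc_hs):
    # score each encoder state against the current decoder state,
    # scaling by sqrt(dim) to stabilize gradients
    scores = enc_hs @ dec_h / np.sqrt(dec_h.shape[0])
    w = np.exp(scores - scores.max())
    w = w / w.sum()                  # softmax over input positions
    return w, w @ enc_hs             # attention weights and context vector
```

The returned weights are specific to this decoder time-step; the next step recomputes them with a new dec_h.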
Seq2Seq with Attention
Attention Summary
Slide courtesy of Mohit Iyyer
Is attention all you need to model language?
Transformers
Transformers
Structures of each encoder and decoder
The input to the first encoder is word embeddings
The subsequent encoders get the output of the previous encoder
2 linear transformations,
ReLU activation
New terminology!
Self-attention
Self-attention allows the model to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
https://nlp.seas.harvard.edu/2018/04/03/attention.html
This is how we can learn locally contextual embeddings.
Self-attention
Step 1: create three vectors
They are the query, key and value vectors.
Self-attention
Step 2: calculate the attention
Self-attention
Stabilizes gradients
Step 5: Multiply each value vector by the attention score
Step 6: sum up the weighted value vectors
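The steps above can be sketched for all positions at once by packing the embeddings into a matrix X (dimensions are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # step 1: query, key, value vectors
    d_k = Wk.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)            # step 2 + scaling (stabilizes gradients)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)      # softmax over positions
    return w @ V                               # steps 5-6: weighted sum of value vectors
```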
Self-attention
Self-attention
Packing embeddings into matrix
Self-attention
Self-attention Summary
A few more details…
In RNNs, we captured positional information with recurrence. In the absence of recurrence, we use positional embeddings.
Positional Encoding: t is the position, i is the dimension index, d is the vector size
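The sinusoidal encoding can be sketched as (even dimensions get sin, odd dimensions get cos):

```python
import numpy as np

def positional_encoding(T, d):
    # PE[t, 2i] = sin(t / 10000^(2i/d)); PE[t, 2i+1] = cos(t / 10000^(2i/d))
    t = np.arange(T)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = t / np.power(10000.0, i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```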
GPT Architecture
https://jalammar.github.io/illustrated-gpt2/
GPT Technical Details
Large Language Models and Beyond
Image from https://blogs.cfainstitute.org/investor/2023/05/26/chatgpt-and-large-language-models-six-evolutionary-steps/
Many components of LLM development have been scaling up….
Questions?