1 of 11

Programming Practice 01

Given two matrices, write a function to multiply them together:

print(mat_mult([[3, 3], [2, 2], [1, 1]],
               [[2, 3], [3, 2]]))

> [[15, 15], [10, 10], [5, 5]]

NOTE: I thought this was really tricky and it took me a while!

2 of 11

Programming Practice 01 Solution
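
The solution itself did not survive on this slide, so what follows is only one possible sketch, not necessarily the version shown in class. It assumes matrices are stored as lists of rows, as in the practice prompt, and uses the plain triple-loop approach; the variable names and the dimension check are my own additions.

```python
def mat_mult(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    # The number of columns in a must equal the number of rows in b.
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")

    n, m, p = len(a), len(b), len(b[0])
    result = [[0] * p for _ in range(n)]

    for i in range(n):          # each row of a
        for j in range(p):      # each column of b
            for k in range(m):  # walk along the shared dimension
                result[i][j] += a[i][k] * b[k][j]
    return result


print(mat_mult([[3, 3], [2, 2], [1, 1]],
               [[2, 3], [3, 2]]))
# prints [[15, 15], [10, 10], [5, 5]]
```

Each output entry is the dot product of one row of the first matrix with one column of the second, which is why the inner loop runs over the shared dimension.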

3 of 11

CSE 10124: Learning Weights

4 of 11

Proposed Schedule

5 of 11

Where are we?

I lied!

Today we’ll talk about how we learn our weights.

This process is called gradient descent.
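
To make the idea concrete before the details, here is a minimal sketch of gradient descent on a single weight. Everything in it is an illustrative assumption rather than something from the slides: the one-weight model `w * x`, the squared-error loss, the training example (x = 2, y = 6), and the learning rate.

```python
def loss(w, x, y):
    """Squared error between the prediction w * x and the target y."""
    return (w * x - y) ** 2


def gradient(w, x, y):
    """Derivative of the loss with respect to w."""
    return 2 * x * (w * x - y)


w = 0.0              # initial guess for the weight
x, y = 2.0, 6.0      # hypothetical example; the best weight here is 3
learning_rate = 0.1  # hypothetical step size

for step in range(50):
    # Step against the gradient so the loss goes down.
    w -= learning_rate * gradient(w, x, y)

print(w, loss(w, x, y))
```

After the loop, `w` ends up very close to 3 and the loss very close to 0. A real model repeats the same update for every weight at once.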

6 of 11

Training a Model

Magic Box

SLM

Prompt:

“Hello there”

Tokens:

{“hello”: 3, “there”: 7}

Embeddings:

{3: <0.4, 0.3, …, -0.02>

7: <0.27, 0.8, …, 0.06>}

Predicted Output:

<0.19, 0.14, …, 0.08>

Predicted Output (positions 0-9):

<0.19, 0.14, 0.12, 0.10, 0.09, 0.08, 0.07, 0.06, 0.07, 0.08>

Expected Output (positions 0-9):

<0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0>

Cross Entropy Loss = 1.97

Ground Truth:

Hello there general
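
The loss value on this slide can be reproduced with a few lines of Python. This is a sketch under one assumption: that the slide's number was computed with the natural logarithm, which is what makes the result come out to 1.97. The function name `cross_entropy` is my own.

```python
import math


def cross_entropy(expected, predicted):
    """Cross-entropy between an expected and a predicted distribution (natural log)."""
    # Terms where the expected probability is 0 contribute nothing, so skip them.
    return -sum(p * math.log(q) for p, q in zip(expected, predicted) if p > 0)


predicted = [0.19, 0.14, 0.12, 0.10, 0.09, 0.08, 0.07, 0.06, 0.07, 0.08]
expected = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

loss = cross_entropy(expected, predicted)
print(round(loss, 2))  # prints 1.97
```

Because the expected output is 1.0 at a single position, the loss depends only on the probability the model gave to that position (0.14 here).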

7 of 11

Deep Learning: Loss Functions

A loss function allows us to numerically measure how far off our model’s predicted values are from what we expected them to be, or what we know the right answer is.

Cross-Entropy: The loss function used to train LLMs is called “Cross-Entropy Loss”, and it’s one of the most common loss functions. It is heavily based on the principle of “entropy” from Information Theory: it measures how surprising our answer is relative to our ground truth, and it builds on the entropy formula: -∑ p(x) log₂ p(x)

Mostly outside the scope of the class, but it’s really not that complicated.
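
For anyone curious, the relationship can be written out as a short worked equation. The symbols are my own choice: p is the expected (ground-truth) distribution, q is the model's predicted distribution, and c is the index of the correct token. The numerical step assumes the natural logarithm, which is what reproduces the 1.97 from the earlier slide; with base 2 the same example would give roughly 2.84.

```latex
% Entropy of a distribution p
H(p) = -\sum_{x} p(x) \log_2 p(x)

% Cross-entropy of predicted q relative to expected p
H(p, q) = -\sum_{x} p(x) \log q(x)

% When p is 1 at the correct index c and 0 elsewhere, only one term survives
H(p, q) = -\log q(c)

% Earlier slide's example: q(c) = 0.14, natural log
H(p, q) = -\ln(0.14) \approx 1.97
```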

8 of 11

Linear Transformations

Input Vector: X

transformed (multiplied) by W

Output Vector: Ŷ
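
As a rough illustration of what "multiplied by W" means, the sketch below applies a weight matrix to an input vector. The numbers in `W` and `X` are made up for the example, and the helper name `linear_transform` is hypothetical.

```python
def linear_transform(w, x):
    """Return y_hat = W x, where w is a list of rows and x is a flat list."""
    # Each output entry is the dot product of one row of W with the input vector.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


# Hypothetical 2 x 3 weight matrix: maps a 3-element input to a 2-element output.
W = [[1, 0, 2],
     [0, 1, -1]]
X = [3, 4, 5]

y_hat = linear_transform(W, X)
print(y_hat)  # prints [13, -1]
```

The shape of W decides the size of the output: one row per output value, one column per input value.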

9 of 11

Additional Resources

10 of 11

Writing MNIST “Equations”

11 of 11

Gradient Descent