1 of 110

Machine Learning Foundations

Calc II: Partial Derivatives & Integrals

Using Gradients in

Python to Enable Algorithms to

Learn from Data

Jon Krohn, Ph.D.

jonkrohn.com/talks

github.com/jonkrohn/ML-foundations

2 of 110

Machine Learning Foundations

Calc II: Partial Derivatives & Integrals

Slides: jonkrohn.com/talks

Code: github.com/jonkrohn/ML-foundations

Stay in Touch:

jonkrohn.com to sign up for email newsletter

linkedin.com/in/jonkrohn

jonkrohn.com/youtube

twitter.com/JonKrohnLearns

3 of 110

The Pomodoro Technique

Rounds of:

  • 25 minutes of work
  • with 5-minute breaks

Questions best handled at breaks, so save questions until then.

When people ask questions that have already been answered, do me a favor and let them know, politely providing the response if appropriate.

Except during breaks, I recommend giving this lecture your undivided attention: the topics are not discrete, so later material builds on earlier material.

4 of 110

POLL

What is your level of familiarity with Calculus?

  • Little to no exposure
  • Some understanding of the theory
  • Deep understanding of the theory
  • Deep understanding of the theory and experience applying calculus operations (e.g., differentiation) with code

5 of 110

POLL

What is your level of familiarity with Machine Learning?

  • Little to no exposure, or exposure to theory only
  • Experience applying machine learning with code
  • Experience applying machine learning with code and some understanding of the underlying theory
  • Experience applying machine learning with code and strong understanding of the underlying theory

6 of 110

ML Foundations Series

Calculus II builds upon and is foundational for:

  1. Intro to Linear Algebra
  2. Linear Algebra II: Matrix Operations
  3. Calculus I: Limits & Derivatives
  4. Calculus II: Partial Derivatives & Integrals
  5. Probability & Information Theory
  6. Intro to Statistics
  7. Algorithms & Data Structures
  8. Optimization

7 of 110

Calc II: Partial Derivatives & Integrals

  1. Review of Introductory Calculus
  2. Machine Learning Gradients
  3. Integrals

8 of 110

Calc II: Partial Derivatives & Integrals

  • Review of Introductory Calculus
  • Machine Learning Gradients
  • Integrals

9 of 110

  • The Delta Method
  • Differentiation with Rules
  • AutoDiff: Automatic Differentiation

Segment 1: Review of Introductory Calc

10 of 110

What Calculus Is

  • Mathematical study of continuous change
  • Two branches:
    1. Differential calculus: expanded on in Calc II
    2. Integral calculus: a focus of this Calc II subject

11 of 110

What Calculus Is

  • Mathematical study of continuous change
  • Two branches:
    • Differential calculus: expanded on in Calc II
    • Integral calculus: a focus of this Calc II subject

12 of 110

What Differential Calculus Is

  • Study of rates of change
  • Consider a vehicle traveling some distance d over time t:

13 of 110

What Calculus Is

  • Mathematical study of continuous change
  • Two branches:
    • Differential calculus: expanded on in Calc II
    • Integral calculus: a focus of this Calc II subject

14 of 110

What Integral Calculus Is

  • Study of areas under curves
  • Facilitates the “opposite” of differential calculus:

15 of 110

The Delta Method
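A minimal pure-Python sketch of the delta method (illustrative only; f(x) = x² is an assumed example, not necessarily the slide's): approximate f′(x) with a difference quotient whose Δx shrinks toward zero.

```python
def delta_method(f, x, delta=1e-6):
    """Approximate f'(x) with the difference quotient (f(x + delta) - f(x)) / delta."""
    return (f(x + delta) - f(x)) / delta

f = lambda x: x ** 2         # analytically, f'(x) = 2x
print(delta_method(f, 2.0))  # approaches f'(2) = 4 as delta shrinks
```

Shrinking `delta` further drives the approximation toward the true derivative, which is exactly the limit the delta method formalizes.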

16 of 110

Derivative Notation

17 of 110

Derivative of a Constant

Assuming c is constant: dc/dx = 0

Intuition: A constant has no variation, so its slope is zero everywhere, e.g.:

18 of 110

The Power Rule

19 of 110

The Constant Product Rule

20 of 110

The Sum Rule

21 of 110

The Chain Rule

  • Many applications within ML
    • Critical for backpropagation algo used to train neural nets
  • Based on nested functions:
    • Let’s say y = (5x + 25)³
    • We can let u = 5x + 25
    • In that case, y = u³
    • y is a function of u, and u is a function of x
  • Chain rule is easy way to find derivative of nested function:
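The nested function above can be checked numerically in a few lines of Python (a sketch, not the course notebook): the chain rule gives dy/dx = dy/du · du/dx = 3u² · 5 = 15(5x + 25)².

```python
def y(x):
    u = 5 * x + 25  # inner function u(x)
    return u ** 3   # outer function y(u)

def dy_dx(x):
    # chain rule: dy/du * du/dx = 3u^2 * 5
    return 15 * (5 * x + 25) ** 2

x, delta = 2.0, 1e-6
numeric = (y(x + delta) - y(x)) / delta  # delta-method approximation
print(dy_dx(x), numeric)                 # the two values agree closely
```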

22 of 110

The Chain Rule

23 of 110

Power Rule on a Function Chain

24 of 110

Fitting a Line with Machine Learning

  • Line equation y = mx + b as directed acyclic graph (DAG)
  • Nodes are input, output, parameters, or operations
  • Directed edges (“arrows”) are tensors (N.B.: non-operation nodes can be tensors too)
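A hedged PyTorch sketch of this DAG (assuming PyTorch is available, as in the course's regression-in-pytorch.ipynb demo; the values are arbitrary): autograd records the operation nodes applied to the parameter nodes.

```python
import torch

# Parameters are leaf nodes of the graph; requires_grad makes autograd
# record every operation node applied to them.
m = torch.tensor(0.9, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

x = torch.tensor(7.0)  # input node
y = m * x + b          # "*" and "+" are operation nodes; y is the output node

y.backward()           # traverse the DAG backward to obtain gradients
print(m.grad, b.grad)  # dy/dm = x, dy/db = 1
```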

25 of 110

Machine Learning

26 of 110

Machine Learning

Step 3: Partial Differentiation (the primary focus of Calc II)

Step 4: Descend gradient of cost C w.r.t. parameters m and b

Hands-on code demo: regression-in-pytorch.ipynb

gradient of C w.r.t. p = 0

Image © 2020 Pearson

27 of 110

Calc II: Partial Derivatives & Integrals

  • Review of Introductory Calculus
  • Machine Learning Gradients
  • Integrals

28 of 110

  • Partial Derivatives of Multivariate Functions
  • The Partial-Derivative Chain Rule
  • Quadratic Cost
  • Gradients
  • Gradient Descent
  • Backpropagation
  • Higher-Order Partial Derivatives

Segment 2: ML Gradients

29 of 110

Multivariate Functions

Even in a simple regression such as y = mx + b:

y is a function of multiple parameters

— in this case, m and b.

Therefore, we can’t calculate the full derivative dy/dm or dy/db.

30 of 110

Partial Derivatives

Enable the calculation of derivatives of multivariate equations.

Consider the equation z = x² - y²

Hands-on demo: geogebra.org/3d

31 of 110

The partial derivative of z with respect to x is obtained by considering y to be a constant: ∂z/∂x = 2x

32 of 110

The slope of z along the x axis is twice the x axis value.

Hands-on code demo

33 of 110

Partial Derivatives

Reconsider z = x² - y²

from the perspective of z w.r.t y

Hands-on demo: geogebra.org/3d

34 of 110

The partial derivative of z with respect to y is obtained by considering x to be a constant: ∂z/∂y = -2y

35 of 110

The slope of z along the y axis is twice the y axis value

...and is inverted.

Hands-on code demo
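Both partial derivatives of z = x² - y² can be confirmed numerically in plain Python (an illustrative sketch with arbitrary values, not the course notebook): hold one variable constant and apply a difference quotient to the other.

```python
def z(x, y):
    return x ** 2 - y ** 2

def partial(f, args, idx, delta=1e-6):
    """Difference quotient along argument idx, holding the others constant."""
    bumped = list(args)
    bumped[idx] += delta
    return (f(*bumped) - f(*args)) / delta

x, y = 3.0, 2.0
print(partial(z, (x, y), 0))  # dz/dx ≈  2x
print(partial(z, (x, y), 1))  # dz/dy ≈ -2y
```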

36 of 110

Solutions

37 of 110

Solutions

38 of 110

Solutions

39 of 110

Partial Derivatives

Hands-on code demo

40 of 110

Partial Derivatives

Hands-on code demo

41 of 110

Exercises

Find all the partial derivatives of the following functions:

  1. z = y³ + 5xy
  2. The surface area of a cylinder is described by a = 2πr² + 2πrh.
  3. The volume of a square prism with a cube cut out of its center is described by v = x²y - z³.
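After working these by hand, pencil-and-paper answers can be sanity-checked numerically; a hedged sketch for exercise 1 (note: it reveals that exercise's solution):

```python
def z(x, y):
    return y ** 3 + 5 * x * y

# analytic partials for exercise 1
def dz_dx(x, y):
    return 5 * y

def dz_dy(x, y):
    return 3 * y ** 2 + 5 * x

x, y, d = 2.0, 3.0, 1e-6
numeric_x = (z(x + d, y) - z(x, y)) / d
numeric_y = (z(x, y + d) - z(x, y)) / d
print(numeric_x, dz_dx(x, y))  # both near 5y
print(numeric_y, dz_dy(x, y))  # both near 3y^2 + 5x
```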

42 of 110

Solutions

43 of 110

Solutions

44 of 110

Solutions

45 of 110

Partial Derivative Notation

46 of 110

The Chain Rule

Let’s say:

Recall that the chain rule for full derivatives would be:

With univariate functions, the partial derivative is identical:

47 of 110

The Chain Rule

With a multivariate function, the partial derivative is more interesting:

48 of 110

The Chain Rule

With multiple multivariate functions, it gets really interesting:

49 of 110

The Chain Rule

Generalizing completely:

50 of 110

Exercises

Find all the partial derivatives of y, where:

51 of 110

Solutions

52 of 110

Solutions

53 of 110

Solutions

54 of 110

Recalling Machine Learning

55 of 110

Recalling Machine Learning

Step 3: Automatic differentiation

Step 4: Descend gradient of cost C w.r.t. parameters m and b

Hands-on code demo: single-point-regression-gradient NB

gradient of C w.r.t. p = 0

Image © 2020 Pearson
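A pure-Python sketch of the single-point gradient (hypothetical values; the real demo lives in the single-point-regression-gradient notebook): quadratic cost C = (ŷ - y)², with the chain rule supplying ∂C/∂m and ∂C/∂b.

```python
x, y = 7.0, 4.0  # one (hypothetical) training point
m, b = 0.9, 0.1  # current parameter estimates

y_hat = m * x + b     # forward pass: prediction
C = (y_hat - y) ** 2  # quadratic cost at this point

# chain rule: dC/dm = dC/dy_hat * dy_hat/dm, and likewise for b
dC_dyhat = 2 * (y_hat - y)
dC_dm = dC_dyhat * x  # dy_hat/dm = x
dC_db = dC_dyhat * 1  # dy_hat/db = 1

print(dC_dm, dC_db)
```

Stepping m and b against these gradients (scaled by a learning rate) is exactly the descent of Step 4.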

56 of 110

Quadratic Cost w.r.t. Predicted y

57 of 110

Predicted y w.r.t. Model Parameters

58 of 110

Quadratic Cost w.r.t. Model Parameters

Hands-on code demo

59 of 110

Recalling Machine Learning

60 of 110

Recalling Machine Learning

Step 3: Determine gradient of cost C w.r.t. parameters m and b

Step 4: Descend gradient

Image © 2020 Pearson

gradient of C w.r.t. p = 0

61 of 110

∇C: the Gradient of Cost

Image © 2020 Pearson

m

b

Hands-on code demo: batch-regression-gradient NB

62 of 110

MSE w.r.t. Predicted y

63 of 110

MSE w.r.t. Model Parameters

Hands-on code demo

64 of 110

Regression Line after 1000 Epochs

65 of 110

Backpropagation

Chain rule of partial derivatives of cost w.r.t. model parameters extends to deep neural networks, which may have 1000s of layers:
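Backpropagation in miniature (a pure-Python sketch with a two-layer linear "network" and hypothetical values, not a real deep net): the chain rule is applied one layer at a time, from the cost back toward the input.

```python
x, y = 2.0, 10.0    # input and target
w1, w2 = 1.5, 3.0   # one weight per "layer"

# forward pass
h = w1 * x            # layer 1 output
y_hat = w2 * h        # layer 2 output
C = (y_hat - y) ** 2  # quadratic cost

# backward pass: chain rule, layer by layer
dC_dyhat = 2 * (y_hat - y)  # cost w.r.t. prediction
dC_dw2 = dC_dyhat * h       # layer 2 parameter
dC_dh = dC_dyhat * w2       # propagate back to layer 1's output
dC_dw1 = dC_dh * x          # layer 1 parameter

print(dC_dw1, dC_dw2)
```

With thousands of layers, the same backward sweep repeats: each layer multiplies the incoming gradient by its local derivative.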

66 of 110

Higher-Order Derivatives

67 of 110

Higher-Order Partial Derivatives

In ML, used to accelerate gradient descent (covered in the Optimization subject).

Consider the following first-order partial derivatives...

68 of 110

Higher-Order Partial Derivatives

69 of 110

Higher-Order Partial Derivatives

70 of 110

Higher-Order Partial Derivative Notation

71 of 110

Exercise

Find all the second-order partial derivatives of z = x³ + 2xy.
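Once attempted by hand, the answers can be checked numerically (a hedged sketch: nested central differences approximate the second-order partials).

```python
def z(x, y):
    return x ** 3 + 2 * x * y

def d1(g, x, y, var, h=1e-4):
    """Central-difference first partial of g with respect to var ('x' or 'y')."""
    if var == "x":
        return (g(x + h, y) - g(x - h, y)) / (2 * h)
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

def d2(f, x, y, outer, inner, h=1e-4):
    """Second-order partial: differentiate w.r.t. inner, then w.r.t. outer."""
    return d1(lambda a, b: d1(f, a, b, inner, h), x, y, outer, h)

x, y = 2.0, 3.0
print(d2(z, x, y, "x", "x"))  # d2z/dx2  ≈ 6x
print(d2(z, x, y, "x", "y"))  # d2z/dxdy ≈ 2
print(d2(z, x, y, "y", "y"))  # d2z/dy2  ≈ 0
```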

72 of 110

Solution

73 of 110

Calc II: Partial Derivatives & Integrals

  • Review of Introductory Calculus
  • Gradients Applied to Machine Learning
  • Integrals

74 of 110

  • Binary Classification
  • The Confusion Matrix
  • The Receiver-Operating Characteristic (ROC) Curve
  • Calculating Integrals Manually
  • Numeric Integration with Python
  • Finding the Area Under the ROC Curve
  • Resources for Further Study of Calculus

Segment 3: Integrals

75 of 110

Supervised Learning

  • Have x and y
  • Goal: learn function that uses x to approximate y
  • Examples:
    • Regression
      • Clinical measure of forgetfulness
      • Sales of a product
      • Future value of an asset
    • Classification
      • Multinomial
        • Handwritten digits: 10 classes
        • ImageNet: 21k classes
      • Binomial
        • Movie-review sentiment: positive vs negative
        • Photos of fast food: Hot dog vs not hot dog

76 of 110

Accuracy at a Single Threshold

  • Doesn’t reflect model quality at other points in output distribution
  • If y = 1: a prediction of 0.49 is 100% wrong, while a prediction of 0.51 is 100% correct
    • A prediction of 0.51 is considered as correct as a prediction of 0.99
  • Solution: ROC AUC metric
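ROC AUC previews the integral calculus to come: the metric is literally an area under a curve. A hedged sketch with hypothetical (FPR, TPR) points, using the trapezoid rule:

```python
# Hypothetical ROC points (FPR, TPR), sorted by FPR from (0, 0) to (1, 1)
roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

def roc_auc(points):
    """Area under the ROC curve via the trapezoid rule."""
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2  # slice width * average height
    return auc

print(roc_auc(roc_points))  # 1.0 would be a perfect classifier; 0.5 is chance
```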

77 of 110

The Confusion Matrix

78 of 110

Four Hot Dog Predictions

Image © 2020 Pearson

79 of 110

Receiver-Operating Characteristic

80 of 110

The ROC Curve

Image © 2020 Pearson

81 of 110

The ROC Curve

82 of 110

Integral Calculus

  • Study of areas under curves
  • Facilitates the inverse of differential calculus:

  • Also finds areas more generally, volumes, and central points

83 of 110

Integral Calculus Applications in ML

Find area under the curve:

  • Receiver operating characteristic (Calc II)
  • Probability theory’s “expectation” of a random variable is widely used in machine learning, incl. deep learning (Prob. & Info. Thy.)

Image © 2020 Pearson

84 of 110

dx Slice Width

dx indicates that the slice width (Δx) is approaching zero
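The shrinking-slice idea can be sketched directly in Python (illustrative, using f(x) = x² so the exact answer ∫₀¹ x² dx = 1/3 is known): as the number of slices n grows, Δx = (b - a)/n shrinks and the Riemann sum approaches the integral.

```python
def riemann_sum(f, a, b, n):
    """Left-endpoint Riemann sum with n slices of width dx = (b - a) / n."""
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(n))

f = lambda x: x ** 2
for n in (10, 100, 100_000):
    print(n, riemann_sum(f, 0.0, 1.0, n))  # approaches 1/3 as dx -> 0
```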

85 of 110

Integral Notation

86 of 110

The Power Rule

87 of 110

The Constant Multiple Rule

88 of 110

The Sum Rule

89 of 110

Exercises

90 of 110

Solutions

91 of 110

Definite Integrals

92 of 110

93 of 110

94 of 110

95 of 110

96 of 110

97 of 110

98 of 110

99 of 110

Hands-on code demo

100 of 110

Exercise

Evaluate the following expression using both pencil and Python:

101 of 110

Solution

Hands-on code demo

102 of 110

Area Under the ROC Curve

Hands-on code demo

Image © 2020 Pearson

103 of 110

Resources for Further Study

  • General reference textbook: Michael Spivak’s Calculus
  • Differential Calculus:
    • Ch. 6 of Deisenroth et al. (2020) Mathematics for ML
    • 3Blue1Brown on YouTube
  • Integral Calculus:
    • The same resources as above
    • Appendix 18.5 of Zhang et al.’s (2019) Dive into Deep Learning
  • Next steps in the ML Foundations series:
    • Probability & Information Theory
    • Intro to Statistics
    • Optimization

104 of 110

Next Subject: Probability & Info. Thy.

Apply calculus to ascertain how much meaningful signal is present in data.

Learn the probability theory-based foundations of stats and ML.

105 of 110

POLL with Multiple Answers Possible

What follow-up topics interest you most?

  • Linear Algebra
  • More Calculus
  • Probability / Statistics
  • Computer Science (e.g., algorithms, data structures)
  • Machine Learning Basics
  • Advanced Machine Learning, incl. Deep Learning
  • Something Else

106 of 110

Stay in Touch

jonkrohn.com to sign up for email newsletter

linkedin.com/in/jonkrohn

youtube.com/c/JonKrohnLearns

twitter.com/JonKrohnLearns

107 of 110

108 of 110

PLACEHOLDER FOR:

5-Minute Timer

109 of 110

PLACEHOLDER FOR:

10-Minute Timer

110 of 110

PLACEHOLDER FOR:

15-Minute Timer