1 of 110

Machine Learning Foundations

Calc II: Partial Derivatives & Integrals

Using Gradients in

Python to Enable Algorithms to

Learn from Data

Jon Krohn, Ph.D.

jonkrohn.com/talks

github.com/jonkrohn/ML-foundations

2 of 110

Machine Learning Foundations

Calc II: Partial Derivatives & Integrals

Slides: jonkrohn.com/talks

Code: github.com/jonkrohn/ML-foundations

Stay in Touch:

jonkrohn.com to sign up for email newsletter

linkedin.com/in/jonkrohn

jonkrohn.com/youtube

twitter.com/JonKrohnLearns

3 of 110

The Pomodoro Technique

Rounds of:

  • 25 minutes of work
  • with 5-minute breaks

Questions best handled at breaks, so save questions until then.

When people ask questions that have already been answered, do me a favor and let them know, politely providing the response if appropriate.

Except during breaks, I recommend giving this lecture your undivided attention: the topics are not discrete, so later material builds on earlier material.

4 of 110

POLL

What is your level of familiarity with Calculus?

  • Little to no exposure
  • Some understanding of the theory
  • Deep understanding of the theory
  • Deep understanding of the theory and experience applying calculus operations (e.g., differentiation) with code

5 of 110

POLL

What is your level of familiarity with Machine Learning?

  • Little to no exposure, or exposure to theory only
  • Experience applying machine learning with code
  • Experience applying machine learning with code and some understanding of the underlying theory
  • Experience applying machine learning with code and strong understanding of the underlying theory

6 of 110

ML Foundations Series

Calculus II builds upon and is foundational for:

  1. Intro to Linear Algebra
  2. Linear Algebra II: Matrix Operations
  3. Calculus I: Limits & Derivatives
  4. Calculus II: Partial Derivatives & Integrals
  5. Probability & Information Theory
  6. Intro to Statistics
  7. Algorithms & Data Structures
  8. Optimization

7 of 110

Calc II: Partial Derivatives & Integrals

  1. Review of Introductory Calculus
  2. Machine Learning Gradients
  3. Integrals

8 of 110

Calc II: Partial Derivatives & Integrals

  • Review of Introductory Calculus
  • Machine Learning Gradients
  • Integrals

9 of 110

  • The Delta Method
  • Differentiation with Rules
  • AutoDiff: Automatic Differentiation

Segment 1: Review of Introductory Calc

10 of 110

What Calculus Is

  • Mathematical study of continuous change
  • Two branches:
    1. Differential calculus: expanded on in Calc II
    2. Integral calculus: a focus of this Calc II subject

11 of 110

What Calculus Is

  • Mathematical study of continuous change
  • Two branches:
    • Differential calculus: expanded on in Calc II
    • Integral calculus: a focus of this Calc II subject

12 of 110

What Differential Calculus Is

  • Study of rates of change
  • Consider a vehicle traveling some distance d over time t:

13 of 110

What Calculus Is

  • Mathematical study of continuous change
  • Two branches:
    • Differential calculus: expanded on in Calc II
    • Integral calculus: a focus of this Calc II subject

14 of 110

What Integral Calculus Is

  • Study of areas under curves
  • Facilitates the “opposite” of differential calculus:

15 of 110

The Delta Method
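A minimal pure-Python sketch of the delta method (illustrative only; f(x) = x² is an assumed example, not necessarily the slide's): approximate f′(x) with a difference quotient whose Δx shrinks toward zero.

```python
def delta_method(f, x, delta=1e-6):
    """Approximate f'(x) with the difference quotient (f(x + delta) - f(x)) / delta."""
    return (f(x + delta) - f(x)) / delta

f = lambda x: x ** 2         # analytically, f'(x) = 2x
print(delta_method(f, 2.0))  # approaches f'(2) = 4 as delta shrinks
```

Shrinking `delta` further drives the approximation toward the true derivative, which is exactly the limit the delta method formalizes.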

16 of 110

Derivative Notation

17 of 110

Derivative of a Constant

Assuming c is constant: dc/dx = 0

Intuition: A constant has no variation, so its slope is zero everywhere, e.g.:

18 of 110

The Power Rule

19 of 110

The Constant Product Rule

20 of 110

The Sum Rule

21 of 110

The Chain Rule

  • Many applications within ML
    • Critical for backpropagation algo used to train neural nets
  • Based on nested functions:
    • Let’s say y = (5x + 25)³
    • We can let u = 5x + 25
    • In that case, y = u³
    • y is a function of u, and u is a function of x
  • Chain rule is easy way to find derivative of nested function:
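The nested function above can be checked numerically in a few lines of Python (a sketch, not the course notebook): the chain rule gives dy/dx = dy/du · du/dx = 3u² · 5 = 15(5x + 25)².

```python
def y(x):
    u = 5 * x + 25  # inner function u(x)
    return u ** 3   # outer function y(u)

def dy_dx(x):
    # chain rule: dy/du * du/dx = 3u^2 * 5
    return 15 * (5 * x + 25) ** 2

x, delta = 2.0, 1e-6
numeric = (y(x + delta) - y(x)) / delta  # delta-method approximation
print(dy_dx(x), numeric)                 # the two values agree closely
```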

22 of 110

The Chain Rule

23 of 110

Power Rule on a Function Chain

24 of 110

Fitting a Line with Machine Learning

  • Line equation y = mx + b as directed acyclic graph (DAG)
  • Nodes are input, output, parameters, or operations
  • Directed edges (“arrows”) are tensors (N.B.: non-operation nodes can be tensors too)
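A hedged PyTorch sketch of this DAG (assuming PyTorch is available, as in the course's regression-in-pytorch.ipynb demo; the values are arbitrary): autograd records the operation nodes applied to the parameter nodes.

```python
import torch

# Parameters are leaf nodes of the graph; requires_grad makes autograd
# record every operation node applied to them.
m = torch.tensor(0.9, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

x = torch.tensor(7.0)  # input node
y = m * x + b          # "*" and "+" are operation nodes; y is the output node

y.backward()           # traverse the DAG backward to obtain gradients
print(m.grad, b.grad)  # dy/dm = x, dy/db = 1
```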

25 of 110

Machine Learning

26 of 110

Machine Learning

Step 3: Partial Differentiation (the primary focus of Calc II)

Step 4: Descend gradient of cost C w.r.t. parameters m and b

Hands-on code demo: regression-in-pytorch.ipynb

gradient of C w.r.t. p = 0

Image © 2020 Pearson

27 of 110

Calc II: Partial Derivatives & Integrals

  • Review of Introductory Calculus
  • Machine Learning Gradients
  • Integrals

28 of 110

  • Partial Derivatives of Multivariate Functions
  • The Partial-Derivative Chain Rule
  • Quadratic Cost
  • Gradients
  • Gradient Descent
  • Backpropagation
  • Higher-Order Partial Derivatives

Segment 2: ML Gradients

29 of 110

Multivariate Functions

Even in a simple regression such as y = mx + b:

y is a function of multiple parameters

— in this case, m and b.

Therefore, we can’t calculate the full derivative dy/dm or dy/db.

30 of 110

Partial Derivatives

Enable the calculation of derivatives of multivariate equations.

Consider the equation z = x² - y²

Hands-on demo: geogebra.org/3d

31 of 110

The partial derivative of z with respect to x is obtained by considering y to be a constant: ∂z/∂x = 2x

32 of 110

The slope of z along the x axis is twice the x axis value.

Hands-on code demo

33 of 110

Partial Derivatives

Reconsider z = x² - y²

from the perspective of z w.r.t y

Hands-on demo: geogebra.org/3d

34 of 110

The partial derivative of z with respect to y is obtained by considering x to be a constant: ∂z/∂y = -2y

35 of 110

The slope of z along the y axis is twice the y axis value

...and is inverted.

Hands-on code demo
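Both partial derivatives of z = x² - y² can be confirmed numerically in plain Python (an illustrative sketch with arbitrary values, not the course notebook): hold one variable constant and apply a difference quotient to the other.

```python
def z(x, y):
    return x ** 2 - y ** 2

def partial(f, args, idx, delta=1e-6):
    """Difference quotient along argument idx, holding the others constant."""
    bumped = list(args)
    bumped[idx] += delta
    return (f(*bumped) - f(*args)) / delta

x, y = 3.0, 2.0
print(partial(z, (x, y), 0))  # dz/dx ≈  2x
print(partial(z, (x, y), 1))  # dz/dy ≈ -2y
```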

36 of 110

Solutions

37 of 110

Solutions

38 of 110

Solutions

39 of 110

Partial Derivatives

Hands-on code demo

40 of 110

Partial Derivatives

Hands-on code demo

41 of 110

Exercises

Find all the partial derivatives of the following functions:

  1. z = y³ + 5xy
  2. The surface area of a cylinder is described by a = 2πr² + 2πrh.
  3. The volume of a square prism with a cube cut out of its center is described by v = x²y - z³.
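After working these by hand, pencil-and-paper answers can be sanity-checked numerically; a hedged sketch for exercise 1 (note: it reveals that exercise's solution):

```python
def z(x, y):
    return y ** 3 + 5 * x * y

# analytic partials for exercise 1
def dz_dx(x, y):
    return 5 * y

def dz_dy(x, y):
    return 3 * y ** 2 + 5 * x

x, y, d = 2.0, 3.0, 1e-6
numeric_x = (z(x + d, y) - z(x, y)) / d
numeric_y = (z(x, y + d) - z(x, y)) / d
print(numeric_x, dz_dx(x, y))  # both near 5y
print(numeric_y, dz_dy(x, y))  # both near 3y^2 + 5x
```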

42 of 110

Solutions

43 of 110

Solutions

44 of 110

Solutions

45 of 110

Partial Derivative Notation

46 of 110

The Chain Rule

Let’s say:

Recall that the chain rule for full derivatives would be:

With univariate functions, the partial derivative is identical:

47 of 110

The Chain Rule

With a multivariate function, the partial derivative is more interesting:

48 of 110

The Chain Rule

With multiple multivariate functions, it gets really interesting:

49 of 110

The Chain Rule

Generalizing completely:

50 of 110

Exercises

Find all the partial derivatives of y, where:

51 of 110

Solutions

52 of 110

Solutions

53 of 110

Solutions

54 of 110

Recalling Machine Learning

55 of 110

Recalling Machine Learning

Step 3: Automatic differentiation

Step 4: Descend gradient of cost C w.r.t. parameters m and b

Hands-on code demo: single-point-regression-gradient NB

gradient of C w.r.t. p = 0

Image © 2020 Pearson
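A pure-Python sketch of the single-point gradient (hypothetical values; the real demo lives in the single-point-regression-gradient notebook): quadratic cost C = (ŷ - y)², with the chain rule supplying ∂C/∂m and ∂C/∂b.

```python
x, y = 7.0, 4.0  # one (hypothetical) training point
m, b = 0.9, 0.1  # current parameter estimates

y_hat = m * x + b     # forward pass: prediction
C = (y_hat - y) ** 2  # quadratic cost at this point

# chain rule: dC/dm = dC/dy_hat * dy_hat/dm, and likewise for b
dC_dyhat = 2 * (y_hat - y)
dC_dm = dC_dyhat * x  # dy_hat/dm = x
dC_db = dC_dyhat * 1  # dy_hat/db = 1

print(dC_dm, dC_db)
```

Stepping m and b against these gradients (scaled by a learning rate) is exactly the descent of Step 4.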

56 of 110

Quadratic Cost w.r.t. Predicted y

57 of 110

Predicted y w.r.t. Model Parameters

58 of 110

Quadratic Cost w.r.t. Model Parameters

Hands-on code demo

59 of 110

Recalling Machine Learning

60 of 110

Recalling Machine Learning

Step 3: Determine gradient of cost C w.r.t. parameters m and b

Step 4: Descend gradient

Image © 2020 Pearson

gradient of C w.r.t. p = 0

61 of 110

∇C: the Gradient of Cost

Image © 2020 Pearson

m

b

Hands-on code demo: batch-regression-gradient NB

62 of 110

MSE w.r.t. Predicted y

63 of 110

MSE w.r.t. Model Parameters

Hands-on code demo

64 of 110

Regression Line after 1000 Epochs

65 of 110

Backpropagation

Chain rule of partial derivatives of cost w.r.t. model parameters extends to deep neural networks, which may have 1000s of layers:
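Backpropagation in miniature (a pure-Python sketch with a two-layer linear "network" and hypothetical values, not a real deep net): the chain rule is applied one layer at a time, from the cost back toward the input.

```python
x, y = 2.0, 10.0    # input and target
w1, w2 = 1.5, 3.0   # one weight per "layer"

# forward pass
h = w1 * x            # layer 1 output
y_hat = w2 * h        # layer 2 output
C = (y_hat - y) ** 2  # quadratic cost

# backward pass: chain rule, layer by layer
dC_dyhat = 2 * (y_hat - y)  # cost w.r.t. prediction
dC_dw2 = dC_dyhat * h       # layer 2 parameter
dC_dh = dC_dyhat * w2       # propagate back to layer 1's output
dC_dw1 = dC_dh * x          # layer 1 parameter

print(dC_dw1, dC_dw2)
```

With thousands of layers, the same backward sweep repeats: each layer multiplies the incoming gradient by its local derivative.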

66 of 110

Higher-Order Derivatives

67 of 110

Higher-Order Partial Derivatives

In ML, used to accelerate gradient descent (covered in the Optimization subject).

Consider the following first-order partial derivatives...

68 of 110

Higher-Order Partial Derivatives

69 of 110

Higher-Order Partial Derivatives

70 of 110

Higher-Order Partial Derivative Notation

71 of 110

Exercise

Find all the second-order partial derivatives of z = x³ + 2xy.
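Once attempted by hand, the answers can be checked numerically (a hedged sketch: nested central differences approximate the second-order partials).

```python
def z(x, y):
    return x ** 3 + 2 * x * y

def d1(g, x, y, var, h=1e-4):
    """Central-difference first partial of g with respect to var ('x' or 'y')."""
    if var == "x":
        return (g(x + h, y) - g(x - h, y)) / (2 * h)
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

def d2(f, x, y, outer, inner, h=1e-4):
    """Second-order partial: differentiate w.r.t. inner, then w.r.t. outer."""
    return d1(lambda a, b: d1(f, a, b, inner, h), x, y, outer, h)

x, y = 2.0, 3.0
print(d2(z, x, y, "x", "x"))  # d2z/dx2  ≈ 6x
print(d2(z, x, y, "x", "y"))  # d2z/dxdy ≈ 2
print(d2(z, x, y, "y", "y"))  # d2z/dy2  ≈ 0
```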

72 of 110

Solution

73 of 110

Calc II: Partial Derivatives & Integrals

  • Review of Introductory Calculus
  • Gradients Applied to Machine Learning
  • Integrals

74 of 110

  • Binary Classification
  • The Confusion Matrix
  • The Receiver-Operating Characteristic (ROC) Curve
  • Calculating Integrals Manually
  • Numeric Integration with Python
  • Finding the Area Under the ROC Curve
  • Resources for Further Study of Calculus

Segment 3: Integrals

75 of 110

Supervised Learning

  • Have x and y
  • Goal: learn function that uses x to approximate y
  • Examples:
    • Regression
      • Clinical measure of forgetfulness
      • Sales of a product
      • Future value of an asset
    • Classification
      • Multinomial
        • Handwritten digits: 10 classes
        • ImageNet: 21k classes
      • Binomial
        • Movie-review sentiment: positive vs negative
        • Photos of fast food: Hot dog vs not hot dog

76 of 110

Accuracy at a Single Threshold

  • Doesn’t reflect model quality at other points in output distribution
  • If y = 1: a prediction of 0.49 is 100% wrong, while a prediction of 0.51 is 100% correct
    • A prediction of 0.51 is considered as correct as a prediction of 0.99
  • Solution: ROC AUC metric
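ROC AUC previews the integral calculus to come: the metric is literally an area under a curve. A hedged sketch with hypothetical (FPR, TPR) points, using the trapezoid rule:

```python
# Hypothetical ROC points (FPR, TPR), sorted by FPR from (0, 0) to (1, 1)
roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

def roc_auc(points):
    """Area under the ROC curve via the trapezoid rule."""
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2  # slice width * average height
    return auc

print(roc_auc(roc_points))  # 1.0 would be a perfect classifier; 0.5 is chance
```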

77 of 110

The Confusion Matrix

78 of 110

Four Hot Dog Predictions

Image © 2020 Pearson

79 of 110

Receiver-Operating Characteristic

80 of 110

The ROC Curve

Image © 2020 Pearson

81 of 110

The ROC Curve

82 of 110

Integral Calculus

  • Study of areas under curves
  • Facilitates the inverse of differential calculus:

  • Also finds areas more generally, volumes, and central points

83 of 110

Integral Calculus Applications in ML

Find area under the curve:

  • Receiver operating characteristic (Calc II)
  • Probability theory’s “expectation” of a random variable is widely used in machine learning, incl. deep learning (Prob. & Info. Thy.)

Image © 2020 Pearson

84 of 110

dx Slice Width

dx indicates that the slice width (Δx) is approaching zero
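The shrinking-slice idea can be sketched directly in Python (illustrative, using f(x) = x² so the exact answer ∫₀¹ x² dx = 1/3 is known): as the number of slices n grows, Δx = (b - a)/n shrinks and the Riemann sum approaches the integral.

```python
def riemann_sum(f, a, b, n):
    """Left-endpoint Riemann sum with n slices of width dx = (b - a) / n."""
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(n))

f = lambda x: x ** 2
for n in (10, 100, 100_000):
    print(n, riemann_sum(f, 0.0, 1.0, n))  # approaches 1/3 as dx -> 0
```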

85 of 110

Integral Notation

86 of 110

The Power Rule

87 of 110

The Constant Multiple Rule

88 of 110

The Sum Rule

89 of 110

Exercises

90 of 110

Solutions

91 of 110

Definite Integrals

92 of 110

93 of 110

94 of 110

95 of 110

96 of 110

97 of 110

98 of 110

99 of 110

Hands-on code demo

100 of 110

Exercise

Evaluate the following expression using both pencil and Python:

101 of 110

Solution

Hands-on code demo

102 of 110

Area Under the ROC Curve

Hands-on code demo

Image © 2020 Pearson

103 of 110

Resources for Further Study

  • General reference textbook: Michael Spivak’s Calculus
  • Differential Calculus:
    • Ch. 6 of Deisenroth et al. (2020) Mathematics for ML
    • 3Blue1Brown on YouTube
  • Integral Calculus:
    • The same resources as above
    • Appendix 18.5 of Zhang et al.’s (2019) Dive into Deep Learning
  • Next steps in the ML Foundations series:
    • Probability & Information Theory
    • Intro to Statistics
    • Optimization

104 of 110

Next Subject: Probability & Info. Thy.

Apply calculus to ascertain how much meaningful signal is present in data.

Learn the probability theory-based foundations of stats and ML.

105 of 110

POLL with Multiple Answers Possible

What follow-up topics interest you most?

  • Linear Algebra
  • More Calculus
  • Probability / Statistics
  • Computer Science (e.g., algorithms, data structures)
  • Machine Learning Basics
  • Advanced Machine Learning, incl. Deep Learning
  • Something Else

106 of 110

Stay in Touch

jonkrohn.com to sign up for email newsletter

linkedin.com/in/jonkrohn

youtube.com/c/JonKrohnLearns

twitter.com/JonKrohnLearns

107 of 110

108 of 110

PLACEHOLDER FOR:

5-Minute Timer

109 of 110

PLACEHOLDER FOR:

10-Minute Timer

110 of 110

PLACEHOLDER FOR:

15-Minute Timer