Recitation 1: Intro, Projections

Shauli Ravfogel

Partially based on Prof. Yoav Goldberg’s slides

Administration

  • Shauli Ravfogel
  • Email: shauli321@gmail.com
  • Course website: https://gal-chechik-biu.github.io/Neural_Networks_Grad/
  • ~5 exercises + a project
  • Midterm exam

Office hours: contact me by email

What is learning?

  • Making predictions
  • Based on past observations
  • Finding patterns in data
  • Learning from data
  • Generalization

Types of learning

  • Supervised
    • Regression
    • Classification
    • Structured
  • Unsupervised
    • Semi-supervised
  • Reinforcement

Machine learning

Machine learning - formulation

The perceptron model
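
As a rough illustration (not the course's reference code): the standard perceptron predicts with the sign of w·x + b, where the weight vector w and bias b are assumed parameter names.

    # Minimal perceptron prediction rule: y_hat = sign(w . x + b)
    def perceptron_predict(w, b, x):
        score = sum(wi * xi for wi, xi in zip(w, x))  # dot product w . x
        return 1 if score + b >= 0 else -1

    # Example: a 2D point classified by the hyperplane 2*x1 + 3*x2 - 6 = 0
    print(perceptron_predict([2.0, 3.0], -6.0, [2.0, 1.0]))  # 2*2 + 3*1 - 6 = 1 > 0, so prints 1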

Dot product

  • w·x is a dot product: w·x = w1x1 + w2x2 + … + wnxn
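
A quick numeric sketch of this definition in plain Python (the numbers are illustrative):

    # Dot product: sum of elementwise products
    w = [1.0, 2.0, 3.0]
    x = [4.0, 5.0, 6.0]
    dot = sum(wi * xi for wi, xi in zip(w, x))
    print(dot)  # 1*4 + 2*5 + 3*6 = 32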

Planes and hyperplanes

  • A hyperplane in n-dimensional space is defined by a1x1 + a2x2 + … + anxn = b
  • The vector a = (a1, …, an) is the normal (the vector perpendicular) to the plane.
  • Proof:
    • Let (x1, …, xn), (x’1, …, x’n) be two points on the plane
    • a1x1 + a2x2 + … + anxn = b
    • a1x’1 + a2x’2 + … + anx’n = b
    • Subtracting the two equations: (a1, …, an) · (x1−x’1, …, xn−x’n) = 0
    • (x1−x’1, …, xn−x’n) is an arbitrary vector lying in the plane, so the normal is perpendicular to the whole plane
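
A small numeric check of this argument (an illustrative sketch, with an arbitrary plane in 2D):

    # Plane in 2D: 2*x1 + 3*x2 = 6, so the normal is (2, 3)
    a = (2.0, 3.0)
    p = (3.0, 0.0)   # on the plane: 2*3 + 3*0 = 6
    q = (0.0, 2.0)   # on the plane: 2*0 + 3*2 = 6
    diff = (p[0] - q[0], p[1] - q[1])               # a vector lying in the plane
    print(sum(ai * di for ai, di in zip(a, diff)))  # 2*3 + 3*(-2) = 0: normal is perpendicular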

Projection on a vector

  • It is often useful to calculate the component of a vector v in the direction of another vector u
  • This is called a projection of v onto u
  • Notation: Proj_u(v)
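
A minimal sketch using the standard projection formula, Proj_u(v) = ((v·u) / (u·u)) u (the slides' own derivation may use different notation):

    def proj(v, u):
        # Projection of v onto u: ((v . u) / (u . u)) * u
        dot_vu = sum(vi * ui for vi, ui in zip(v, u))
        dot_uu = sum(ui * ui for ui in u)
        scale = dot_vu / dot_uu
        return [scale * ui for ui in u]

    print(proj([3.0, 4.0], [1.0, 0.0]))  # component of (3, 4) along the x-axis: [3.0, 0.0]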

Distance from a plane (margin)
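
One way to make this concrete (a sketch using the standard point-to-hyperplane distance, with the plane written as a·x = b as above): project the vector from a point on the plane to the query point onto the normal, which gives distance = |a·p − b| / ‖a‖.

    import math

    def distance_to_plane(a, b, p):
        # |a . p - b| / ||a||, for the hyperplane a . x = b
        dot_ap = sum(ai * pi for ai, pi in zip(a, p))
        norm_a = math.sqrt(sum(ai * ai for ai in a))
        return abs(dot_ap - b) / norm_a

    # Plane 3*x1 + 4*x2 = 5; the point (3, 4) has distance |9 + 16 - 5| / 5 = 4
    print(distance_to_plane([3.0, 4.0], 5.0, [3.0, 4.0]))  # 4.0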

Learning as optimization

  • We want to find a “good” f_θ(x)
  • This requires a measure of the performance of f_θ (a loss function); see the example below
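
For example (a sketch; this anticipates the squared-error loss used later in these slides), the average squared error over a dataset is one such measure:

    def mean_squared_error(predict, data):
        # Average of (prediction - target)^2 over (x, y) pairs
        return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

    # Toy example with a linear predictor f(x) = 2 * x
    data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]
    print(mean_squared_error(lambda x: 2.0 * x, data))  # ~0.0067 (small: f fits the data well)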

  • In general, this optimization problem is hard to solve exactly
  • Instead, we iteratively change the parameters in the direction that decreases the loss
  • We will focus on gradient-based optimization

(Whiteboard)

Derivative

  • For a single variable: the slope of the tangent line

Gradient

  • A vector pointing in the direction of steepest ascent
  • Thus the negative gradient points in the direction of steepest descent
  • Locally, f changes the most in the direction of the gradient
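
A small numeric illustration of these claims, for the arbitrary example f(x, y) = x² + y², whose gradient is (2x, 2y):

    def f(x, y):
        return x ** 2 + y ** 2

    def grad_f(x, y):
        return (2 * x, 2 * y)  # analytic gradient of x^2 + y^2

    x, y = 3.0, 4.0
    gx, gy = grad_f(x, y)
    step = 0.01
    print(f(x, y))                          # 25.0
    print(f(x - step * gx, y - step * gy))  # 24.01 < 25: a small step against the gradient lowers f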

Example: Linear Regression
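
A worked version of the gradient this example relies on, for a single training pair (x, y) and the squared loss that appears on the gradient-descent slide below (a sketch; the slides' own derivation may use different notation):

    L(w; x, y) = (wx - y)^2, \qquad \frac{\partial L}{\partial w} = 2\,(wx - y)\,x

so a single gradient-descent step is w ← w − η · 2(wx − y)x for a learning rate η.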

Vector notation
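
In vector notation (a sketch, assuming a data matrix X with one example per row and a target vector y), the squared loss over all examples can be written compactly as

    L(w) = \lVert Xw - y \rVert^2 = \sum_i (w \cdot x_i - y_i)^2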

Gradient descent

  • We have a model with parameters θ = θ1, … , θn
  • We want to find the “best” θ with respect to some loss function L(θ)
    • e.g. L(θ=w; x, y) = (wx − y)²
  • Gradient descent (see the sketch after this list):
    • Initialize θ randomly
    • Repeatedly change θ in the direction that decreases L
    • This direction is the negative of the gradient
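
A minimal sketch of this procedure for the one-dimensional squared loss above (toy data and hyperparameters are illustrative, not the course's reference implementation):

    # Gradient descent for L(w) = mean of (w*x - y)^2 over a toy dataset
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # roughly y = 2x
    w = 0.0        # arbitrary initialization
    lr = 0.05      # learning rate

    for step in range(200):
        # gradient of the mean squared loss w.r.t. w: mean of 2*(w*x - y)*x
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w = w - lr * grad   # move against the gradient

    print(w)  # ~2.04, the least-squares slope for this data

If the learning rate is too large the iterates can overshoot and diverge, so η is typically kept small.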

Algorithm outline

  1. Initialize values θ = θ1, … , θn
  2. While not converged:
    1. Calculate the gradient
    2. For each θk in θ: update θk ← θk − η · ∂L/∂θk (η is the learning rate)

Newton method

  • An iterative method for finding the roots (zeros) of a function F: starting from an initial guess, repeatedly replace the current guess xn with xn − F(xn) / F'(xn)
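
A minimal sketch of the iteration (the example function and starting point are arbitrary; here it computes √2 as a root of F(x) = x² − 2):

    def newton_root(F, dF, x, steps=10):
        # Repeatedly follow the tangent line down to its zero: x <- x - F(x) / dF(x)
        for _ in range(steps):
            x = x - F(x) / dF(x)
        return x

    # Root of F(x) = x^2 - 2 starting from x = 1, i.e. sqrt(2)
    print(newton_root(lambda x: x ** 2 - 2.0, lambda x: 2.0 * x, 1.0))  # ~1.4142135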

Newton method for optimization

  • We saw that Newton’s method lets us find a root x* of a function F, i.e. a point where F(x*) = 0
  • Since extremum points of F correspond to roots of the gradient, we can apply Newton’s method to the gradient in order to optimize F.
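
Concretely, in one dimension this means running the root-finding iteration on F′ instead of F, giving the update

    x_{n+1} = x_n - \frac{F'(x_n)}{F''(x_n)}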

  • Geometric interpretation: We find the maximum/minimum of the parabola that locally approximates F:

Formulation

We know the function value at a point xn and want to approximate f(x) = f(xn + Δx)
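
One standard way to complete this approximation (a sketch, assuming the second-order Taylor expansion suggested by the parabola interpretation above):

    f(x_n + \Delta x) \approx f(x_n) + f'(x_n)\,\Delta x + \tfrac{1}{2}\, f''(x_n)\,\Delta x^2

Setting the derivative of the right-hand side with respect to Δx to zero gives Δx = −f'(xn) / f''(xn), which is exactly the Newton update above.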

The Hessian
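
In the multivariate case (a sketch of the standard definitions; the slide's own notation may differ), the Hessian collects all second partial derivatives, and the Newton step replaces division by f'' with multiplication by the inverse Hessian:

    H_{ij} = \frac{\partial^2 f}{\partial \theta_i \, \partial \theta_j}, \qquad \theta_{n+1} = \theta_n - H^{-1}\, \nabla f(\theta_n)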