Recitation 1: Intro, Projections

Shauli Ravfogel

Partially based on Prof. Yoav Goldberg’s slides

Administration

  • Shauli Ravfogel
  • Email: shauli321@gmail.com
  • Course website: https://gal-chechik-biu.github.io/Neural_Networks_Grad/
  • ~5 exercises + a project
  • Midterm exam

Office hours: contact me by email

What is learning?

  • Making predictions
  • Based on past observations
  • Finding patterns in data
  • Learning from data
  • Generalization

Types of learning

  • Supervised
    • Regression
    • Classification
    • Structured
  • Unsupervised
    • Semi-supervised
  • Reinforcement

Machine learning

Machine learning - formulation

The perceptron model
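
As a rough illustration (not the course's reference code): the standard perceptron predicts with the sign of w·x + b, where the weight vector w and bias b are assumed parameter names.

    # Minimal perceptron prediction rule: y_hat = sign(w . x + b)
    def perceptron_predict(w, b, x):
        score = sum(wi * xi for wi, xi in zip(w, x))  # dot product w . x
        return 1 if score + b >= 0 else -1

    # Example: a 2D point classified by the hyperplane 2*x1 + 3*x2 - 6 = 0
    print(perceptron_predict([2.0, 3.0], -6.0, [2.0, 1.0]))  # 2*2 + 3*1 - 6 = 1 > 0, so prints 1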

Dot product

  • w·x is a dot product: w·x = w1x1 + w2x2 + … + wnxn
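
A quick numeric sketch of this definition in plain Python (the numbers are illustrative):

    # Dot product: sum of elementwise products
    w = [1.0, 2.0, 3.0]
    x = [4.0, 5.0, 6.0]
    dot = sum(wi * xi for wi, xi in zip(w, x))
    print(dot)  # 1*4 + 2*5 + 3*6 = 32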

Planes and hyperplanes

  • A hyperplane in n-dimensional space is defined by a1x1 + a2x2 + … + anxn = b
  • The vector a = (a1, …, an) is the normal (the vector perpendicular) to the plane.
  • Proof:
    • Let (x1, …, xn), (x’1, …, x’n) be two points on the plane
    • a1x1 + a2x2 + … + anxn = b
    • a1x’1 + a2x’2 + … + anx’n = b
    • Subtracting the two equations: (a1, …, an) · (x1−x’1, …, xn−x’n) = 0
    • (x1−x’1, …, xn−x’n) is an arbitrary vector lying in the plane, so the normal is perpendicular to the whole plane
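
A small numeric check of this argument (an illustrative sketch, with an arbitrary plane in 2D):

    # Plane in 2D: 2*x1 + 3*x2 = 6, so the normal is (2, 3)
    a = (2.0, 3.0)
    p = (3.0, 0.0)   # on the plane: 2*3 + 3*0 = 6
    q = (0.0, 2.0)   # on the plane: 2*0 + 3*2 = 6
    diff = (p[0] - q[0], p[1] - q[1])               # a vector lying in the plane
    print(sum(ai * di for ai, di in zip(a, diff)))  # 2*3 + 3*(-2) = 0: normal is perpendicular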

Projection on a vector

  • It is often useful to calculate the component of a vector v in the direction of another vector u
  • This is called a projection of v onto u
  • Notation: Proj_u(v)
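
A minimal sketch using the standard projection formula, Proj_u(v) = ((v·u) / (u·u)) u (the slides' own derivation may use different notation):

    def proj(v, u):
        # Projection of v onto u: ((v . u) / (u . u)) * u
        dot_vu = sum(vi * ui for vi, ui in zip(v, u))
        dot_uu = sum(ui * ui for ui in u)
        scale = dot_vu / dot_uu
        return [scale * ui for ui in u]

    print(proj([3.0, 4.0], [1.0, 0.0]))  # component of (3, 4) along the x-axis: [3.0, 0.0]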

Distance from a plane (margin)
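
One way to make this concrete (a sketch using the standard point-to-hyperplane distance, with the plane written as a·x = b as above): project the vector from a point on the plane to the query point onto the normal, which gives distance = |a·p − b| / ‖a‖.

    import math

    def distance_to_plane(a, b, p):
        # |a . p - b| / ||a||, for the hyperplane a . x = b
        dot_ap = sum(ai * pi for ai, pi in zip(a, p))
        norm_a = math.sqrt(sum(ai * ai for ai in a))
        return abs(dot_ap - b) / norm_a

    # Plane 3*x1 + 4*x2 = 5; the point (3, 4) has distance |9 + 16 - 5| / 5 = 4
    print(distance_to_plane([3.0, 4.0], 5.0, [3.0, 4.0]))  # 4.0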

Learning as optimization

  • We want to find a “good” f_θ(x)
  • This requires a measure of the performance of f_θ (a loss function); see the example below
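
For example (a sketch; this anticipates the squared-error loss used later in these slides), the average squared error over a dataset is one such measure:

    def mean_squared_error(predict, data):
        # Average of (prediction - target)^2 over (x, y) pairs
        return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

    # Toy example with a linear predictor f(x) = 2 * x
    data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]
    print(mean_squared_error(lambda x: 2.0 * x, data))  # ~0.0067 (small: f fits the data well)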

  • In general, this optimization problem is hard to solve exactly
  • Instead, we iteratively change the parameters in the direction that decreases the loss
  • We will focus on gradient-based optimization

(Whiteboard)

Derivative

  • For a single variable: the slope of the tangent line

Gradient

  • A vector pointing in the direction of steepest ascent
  • Thus the negative gradient points in the direction of steepest descent
  • Locally, f changes the most in the direction of the gradient
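
A small numeric illustration of these claims, for the arbitrary example f(x, y) = x² + y², whose gradient is (2x, 2y):

    def f(x, y):
        return x ** 2 + y ** 2

    def grad_f(x, y):
        return (2 * x, 2 * y)  # analytic gradient of x^2 + y^2

    x, y = 3.0, 4.0
    gx, gy = grad_f(x, y)
    step = 0.01
    print(f(x, y))                          # 25.0
    print(f(x - step * gx, y - step * gy))  # 24.01 < 25: a small step against the gradient lowers f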

Example: Linear Regression
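
A worked version of the gradient this example relies on, for a single training pair (x, y) and the squared loss that appears on the gradient-descent slide below (a sketch; the slides' own derivation may use different notation):

    L(w; x, y) = (wx - y)^2, \qquad \frac{\partial L}{\partial w} = 2\,(wx - y)\,x

so a single gradient-descent step is w ← w − η · 2(wx − y)x for a learning rate η.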

Vector notation
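
In vector notation (a sketch, assuming a data matrix X with one example per row and a target vector y), the squared loss over all examples can be written compactly as

    L(w) = \lVert Xw - y \rVert^2 = \sum_i (w \cdot x_i - y_i)^2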

Gradient descent

  • We have a model with parameters θ = θ1, … , θn
  • We want to find the “best” θ with respect to some loss function L(θ)
    • e.g. L(θ=w; x, y) = (wx − y)²
  • Gradient descent (see the sketch after this list):
    • Initialize θ randomly
    • Repeatedly change θ in the direction that decreases L
    • This direction is the negative of the gradient
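
A minimal sketch of this procedure for the one-dimensional squared loss above (toy data and hyperparameters are illustrative, not the course's reference implementation):

    # Gradient descent for L(w) = mean of (w*x - y)^2 over a toy dataset
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # roughly y = 2x
    w = 0.0        # arbitrary initialization
    lr = 0.05      # learning rate

    for step in range(200):
        # gradient of the mean squared loss w.r.t. w: mean of 2*(w*x - y)*x
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w = w - lr * grad   # move against the gradient

    print(w)  # ~2.04, the least-squares slope for this data

If the learning rate is too large the iterates can overshoot and diverge, so η is typically kept small.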

Algorithm outline

  1. Initialize values θ = θ1, … , θn
  2. While not converged:
    1. Calculate the gradient
    2. For each θk in θ: update θk ← θk − η · ∂L/∂θk (η is the learning rate)

Newton method

  • An iterative method for finding the roots (zeros) of a function F: starting from an initial guess, repeatedly replace the current guess xn with xn − F(xn) / F'(xn)
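
A minimal sketch of the iteration (the example function and starting point are arbitrary; here it computes √2 as a root of F(x) = x² − 2):

    def newton_root(F, dF, x, steps=10):
        # Repeatedly follow the tangent line down to its zero: x <- x - F(x) / dF(x)
        for _ in range(steps):
            x = x - F(x) / dF(x)
        return x

    # Root of F(x) = x^2 - 2 starting from x = 1, i.e. sqrt(2)
    print(newton_root(lambda x: x ** 2 - 2.0, lambda x: 2.0 * x, 1.0))  # ~1.4142135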

Newton method for optimization

  • We saw that Newton’s method lets us find a root x* of a function F, i.e. a point where F(x*) = 0
  • Since extremum points of F correspond to roots of the gradient, we can apply Newton’s method to the gradient in order to optimize F.
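
Concretely, in one dimension this means running the root-finding iteration on F′ instead of F, giving the update

    x_{n+1} = x_n - \frac{F'(x_n)}{F''(x_n)}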

  • Geometric interpretation: We find the maximum/minimum of the parabola that locally approximates F:

Formulation

We know the function value at a point xn and want to approximate f(x) = f(xn + Δx)
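
One standard way to complete this approximation (a sketch, assuming the second-order Taylor expansion suggested by the parabola interpretation above):

    f(x_n + \Delta x) \approx f(x_n) + f'(x_n)\,\Delta x + \tfrac{1}{2}\, f''(x_n)\,\Delta x^2

Setting the derivative of the right-hand side with respect to Δx to zero gives Δx = −f'(xn) / f''(xn), which is exactly the Newton update above.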

The Hessian
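
In the multivariate case (a sketch of the standard definitions; the slide's own notation may differ), the Hessian collects all second partial derivatives, and the Newton step replaces division by f'' with multiplication by the inverse Hessian:

    H_{ij} = \frac{\partial^2 f}{\partial \theta_i \, \partial \theta_j}, \qquad \theta_{n+1} = \theta_n - H^{-1}\, \nabla f(\theta_n)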