Neural networks course

Recitation 4: SVMs, interpretability

Motivation

For linearly separable data, the perceptron converges to an optimal solution on the training set

Not necessarily optimal with regard to generalization (test set performance)

Idea: to generalize well, find the separating hyperplane that maximizes the distance to the closest training points (the margin)

Formulation

Scaling w doesn’t change the distance, so we can scale w such that y_i (w · x_i) = 1 for the i that minimizes it, i.e. the training point closest to the plane
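
The scaling argument can be checked numerically. The following is a minimal sketch, assuming a bias-free hyperplane w · x = 0 as in the formula above; the toy data and the names `functional_margins`, `geometric_margin` and `w_canonical` are hypothetical and not from the slides.

```python
import math

# Hypothetical toy data (not from the slides): 2-D points with labels +1 / -1.
X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def functional_margins(w):
    """y_i * (w . x_i) for every training point (no bias term assumed)."""
    return [yi * dot(w, xi) for xi, yi in zip(X, y)]

def geometric_margin(w):
    """Distance from the hyperplane w . x = 0 to the closest training point."""
    return min(functional_margins(w)) / math.sqrt(dot(w, w))

w = (1.0, 1.0)
w_scaled = tuple(10.0 * wi for wi in w)

# Scaling w leaves the distance to the closest point unchanged.
print(geometric_margin(w), geometric_margin(w_scaled))

# Rescale w so that the closest point satisfies y_i * (w . x_i) = 1.
m = min(functional_margins(w))
w_canonical = tuple(wi / m for wi in w)
print(min(functional_margins(w_canonical)))
```

With this normalization the distance to the closest point is 1 / ||w||, so maximizing the margin amounts to minimizing ||w||.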

Dual representation

We have a quadratic minimization problem with linear constraints
This can be solved efficiently with Lagrange multipliers

Dual problem

Dual function

Lagrangian

Dual representation for SVMs
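
The equations of this slide did not survive in the text. As a hedged reconstruction, assuming the bias-free hard-margin problem (minimize ½‖w‖² subject to y_i (w · x_i) ≥ 1), the standard derivation looks as follows; the slide's own notation may differ.

```latex
\begin{aligned}
\text{Primal:}\quad & \min_{w}\ \tfrac{1}{2}\lVert w\rVert^2
  \quad \text{s.t.}\quad y_i\,(w\cdot x_i) \ge 1 \ \ \forall i \\
\text{Lagrangian:}\quad & L(w,\alpha) = \tfrac{1}{2}\lVert w\rVert^2
  - \sum_i \alpha_i \bigl(y_i\,(w\cdot x_i) - 1\bigr), \qquad \alpha_i \ge 0 \\
\text{Stationarity:}\quad & \nabla_w L = 0 \ \Rightarrow\ w = \sum_i \alpha_i y_i x_i \\
\text{Dual function:}\quad & g(\alpha) = \min_w L(w,\alpha)
  = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\,(x_i\cdot x_j) \\
\text{Dual problem:}\quad & \max_{\alpha \ge 0}\ g(\alpha)
\end{aligned}
```

If the formulation includes a bias term b, the dual gains the extra constraint Σ α_i y_i = 0. Note that the training data enters g only through the dot products x_i · x_j.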

We will not show the solution to the maximization of g. It turns out, however, that all α_i are zero except for the ones belonging to the closest points: the support vectors.
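
The sparsity of the α_i can be observed on a small example. This is only a sketch: it assumes the bias-free hard-margin dual (maximize Σ α_i − ½‖Σ α_i y_i x_i‖² subject to α_i ≥ 0) and solves it with plain dual coordinate ascent, which is a swapped-in technique and not necessarily the solver the course has in mind. The toy data and the function name `solve_dual` are hypothetical.

```python
# Hypothetical toy data (not from the slides): 2-D points with labels +1 / -1.
X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def solve_dual(X, y, epochs=2000):
    """Dual coordinate ascent for the bias-free hard-margin SVM (assumed setup).

    Maximizes g(alpha) = sum_i alpha_i - 0.5 * ||sum_i alpha_i y_i x_i||^2
    subject to alpha_i >= 0, updating one alpha_i at a time.
    """
    alpha = [0.0] * len(X)
    w = [0.0] * len(X[0])  # w = sum_i alpha_i y_i x_i, kept in sync with alpha
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            grad = 1.0 - yi * dot(w, xi)  # partial derivative of g w.r.t. alpha_i
            # Exact maximization along coordinate i, clipped at alpha_i >= 0.
            new_alpha = max(0.0, alpha[i] + grad / dot(xi, xi))
            step = new_alpha - alpha[i]
            w = [wj + step * yi * xj for wj, xj in zip(w, xi)]
            alpha[i] = new_alpha
    return alpha, w

alpha, w = solve_dual(X, y)
support = [i for i, a in enumerate(alpha) if a > 1e-8]
print(support)  # indices of the points with nonzero alpha_i
```

On this data the two negative points are the ones closest to the separating plane, and they are the only ones that end up with a nonzero α_i.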

Classification: weight all training examples
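
A minimal sketch of classification in the dual form, assuming the bias-free decision rule sign(Σ α_i y_i (x_i · x)). The toy data and the α values are hypothetical; the α values are the hand-derived dual solution for this particular data set, not numbers from the slides.

```python
# Hypothetical toy data and multipliers (not from the slides).
X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
alpha = [0.0, 0.0, 1.0 / 9.0, 1.0 / 9.0]  # nonzero only for the support vectors

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_value(x):
    """Weighted sum over all training examples: sum_i alpha_i y_i (x_i . x)."""
    return sum(a * yi * dot(xi, x) for a, xi, yi in zip(alpha, X, y))

def predict(x):
    """Predicted label in {+1, -1} for a new example x (no bias term assumed)."""
    return 1 if decision_value(x) >= 0 else -1

print(predict((1.0, 1.0)), predict((-3.0, 0.0)))
```

Every training example appears in the sum, but only those with α_i > 0 contribute, so in practice only the support vectors need to be stored.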

Non-linearly separable data

Non-linear data

The dual representation

Non-linearly separable data

Prediction requires calculating the dot product between the example x and every training example x_i!

The kernel trick computes the dot product in the feature space implicitly

The kernel trick

We can often calculate the product φ(x) · φ(y) without having to calculate φ(x) and φ(y)
Consider the following feature map: φ((a, b)) = (a, b, a² + b²)

Naive calculation:

φ((a, b)) · φ((c, d)) = (a, b, a² + b²) · (c, d, c² + d²) = ac + bd + (a² + b²)(c² + d²) = ac + bd + a²c² + a²d² + b²c² + b²d²

Kernel trick: directly use K((a, b), (c, d)) = ac + bd + (a² + b²)(c² + d²), which needs only the original 2-D coordinates and never builds φ
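
The equality between the naive calculation and the kernel can be checked directly. A small sketch for the feature map φ((a, b)) = (a, b, a² + b²); the function names `phi` and `kernel` are illustrative choices.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(p):
    """Explicit feature map (a, b) -> (a, b, a^2 + b^2)."""
    a, b = p
    return (a, b, a * a + b * b)

def kernel(p, q):
    """Same value as phi(p) . phi(q), computed from the 2-D inputs only."""
    return dot(p, q) + dot(p, p) * dot(q, q)

p, q = (1, 2), (3, 4)
naive = dot(phi(p), phi(q))  # builds both 3-D vectors first
trick = kernel(p, q)         # never builds phi
print(naive, trick)
```

For a feature map this small the saving is negligible; the point is that the same pattern works when the feature space is very high-dimensional.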

Polynomial Kernel

Kernels

Theorem: if K is symmetric and PSD (positive semi-definite), then it represents a valid kernel (i.e. it corresponds to a dot product in some feature space)
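
PSD here means that for any finite set of points the Gram matrix G with G_ij = K(x_i, x_j) satisfies zᵀGz ≥ 0 for every vector z. The sketch below spot-checks this numerically for the kernel K(p, q) = p · q + ‖p‖²‖q‖²; the sample points are hypothetical, and random testing of this kind can only fail to find a violation, it does not prove that a kernel is valid.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kernel(p, q):
    """K(p, q) = p . q + |p|^2 |q|^2 (dot product under phi((a, b)) = (a, b, a^2 + b^2))."""
    return dot(p, q) + dot(p, p) * dot(q, q)

def gram_matrix(points, k):
    return [[k(p, q) for q in points] for p in points]

def quadratic_form(G, z):
    """z^T G z."""
    n = len(z)
    return sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n))

# Hypothetical sample points (not from the slides).
points = [(1.0, 2.0), (-1.0, 0.5), (3.0, -1.0), (0.0, 0.0), (2.0, 2.0)]
G = gram_matrix(points, kernel)

random.seed(0)
values = [
    quadratic_form(G, [random.uniform(-1.0, 1.0) for _ in points])
    for _ in range(1000)
]
smallest = min(values)  # should not be meaningfully below zero

# Counter-example: K(p, q) = -(p . q) is symmetric but not PSD.
G_bad = gram_matrix(points, lambda p, q: -dot(p, q))
bad_value = quadratic_form(G_bad, [1.0, 0.0, 0.0, 0.0, 0.0])  # equals -|x_0|^2
```

A full check would compute the eigenvalues of G and confirm that none is negative; that is left out here to keep the example dependency-free.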