1 of 26

Support Vector Machine

2 of 26

Delta Analytics builds technical capacity around the world.

This course content is being actively developed by Delta Analytics, a 501(c)(3) Bay Area nonprofit that aims to empower communities to leverage their data for good.

Please reach out with any questions or feedback to inquiry@deltanalytics.org.

Find out more about our mission here.

3 of 26

Support vector machine (SVM)

4 of 26

Module checklist

  • Task
    • Hyperplane (or decision boundary)
    • Kernel trick
  • Learning Methodology
    • Quadratic optimization
  • Performance Measure
    • Confusion matrix
    • Gamma

5 of 26

Reminder: sentiment analysis uses structured data to classify emotional polarity of text

(Diagram: unstructured data (text, voice, facial images or videos) is interpretable to a human but not to a computer program. A data representation step, the basis of natural language processing and the focus of modules 8 and 9, converts it into structured data that is interpretable to both a human and a computer program. Sentiment analysis algorithms then perform supervised classification, producing an output that classifies the sentiment of the text, voice recording, or facial expression. Sentiment analysis is the focus of this module.)

6 of 26

We will learn how to use support vector machines to predict text sentiment

Support vector machines are the focus of this module

7 of 26

We will use our now familiar framework to introduce linear support vector machines

Task

What is the problem we want our model to solve?

Performance Measure

Quantitative measure we use to evaluate the model’s performance.

Learning Methodology

ML algorithms can be supervised or unsupervised. This determines the learning methodology.

Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning

8 of 26

SVMs are classification models that draw boundaries to separate data into classes

Task

How might sentiment analysis apply to the graph above?

What is the problem we want to solve?

9 of 26

Areas divided by the SVM boundaries define the predictions for sentiment

Task

Note: an SVM may misclassify some of the points. Boundaries are rarely perfect

For instance, red data points represent text with predicted negative sentiment...

…while blue data points are positive sentiment

What is the problem we want to solve?

10 of 26

Task

Feature engineering & selection

Graph                        Sentiment analysis equivalent

Data points (red, green)     Document or text to be classified
                             (red = classified as negative, green = classified as positive)

Axis: x1                     Frequency of word x1 in the text

Axis: x2                     Frequency of word x2 in the text

Axis: x3                     Frequency of word x3 in the text

...                          Each word adds another dimension (difficult to visualize)

In reality, a graph representing sentiment analysis would be across more than two dimensions
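To make this concrete, here is a minimal sketch (assuming Python with scikit-learn; the two example documents are hypothetical) of how raw text becomes the word-frequency features described above, with one dimension per distinct word:

```python
# Turn raw text into word-frequency features: one column (dimension) per word.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "happy happy day today",   # hypothetical document with positive words
    "sad and gloomy morning",  # hypothetical document with negative words
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # rows = documents, columns = word counts

print(vectorizer.get_feature_names_out())  # the words, i.e. the axes x1, x2, x3, ...
print(X.toarray())                         # frequency of each word in each document
```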

11 of 26

Task

Feature engineering & selection

Sample data set:

Sad   Happy   Ok   ...   Sentiment
 3      0      1   ...       1
 1      1      5   ...       0
 2      5      1   ...      -1
 4      0      2   ...       1
 6      0      1   ...       1

  • What is the response variable in the table to the left?
  • What do the number values of the response variables represent?
  • Can you give an example of a sentence that would fit for row 1 of the table?
  • How could we graph row 1 on a three-dimensional graph in order to visualize SVM (assuming there are only 3 features)?

Check for understanding
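To experiment with the last two questions, here is a minimal sketch (assuming Python with pandas and matplotlib, and using the sample values from the table above) that treats the first three word-frequency columns as the axes of a three-dimensional graph:

```python
# Plot the sample data set in 3D: one axis per feature, colored by sentiment.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "sad":       [3, 1, 2, 4, 6],
    "happy":     [0, 1, 5, 0, 0],
    "ok":        [1, 5, 1, 2, 1],
    "sentiment": [1, 0, -1, 1, 1],  # the response variable
})

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # requires matplotlib >= 3.2
ax.scatter(df["sad"], df["happy"], df["ok"], c=df["sentiment"], cmap="coolwarm")
ax.set_xlabel("sad"); ax.set_ylabel("happy"); ax.set_zlabel("ok")
plt.show()  # row 1 appears at the point (3, 0, 1)
```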

12 of 26

Task

Defining f(x)

Like decision trees, an SVM has data points represented by values across various features, plus a classification outcome

Our f(x) is the optimal “hyperplane” dividing the class outcomes

In the example to the left, the solid hyperplane is a better boundary than the dotted line

Two examples of decision boundaries

The line dividing the data is called a decision boundary or hyperplane
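As a minimal sketch (assuming Python with scikit-learn, and hypothetical toy data), fitting a linear SVM recovers exactly this hyperplane, in the form w · x + b = 0:

```python
# Fit a linear SVM and read off the learned hyperplane w . x + b = 0.
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated 2-D clusters.
X = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(f"decision boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```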

13 of 26

Is our f(x) correct for this problem?

Task

So far, our example SVM data sets could be split with a linear hyperplane

Such a data set is called linearly separable

However, what about data sets that look more like the example on the left?

Non-linearly separable data:

14 of 26

Is our f(x) correct for this problem?

Task

Theoretically, if we find the right combination of features, we can map the points into a space where they are linearly separable

Doing this mapping explicitly is computationally expensive.

The “kernel trick” is a mathematical shortcut that SVMs use to create non-linear boundaries

To create a non-linear decision boundary, we can use this shortcut, called a kernel

Can you sketch what the non-linearly separable data above might look like once mapped into a space where it becomes linearly separable?
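Here is a minimal sketch (assuming Python with scikit-learn) of the kernel trick in action: concentric-circle data has no linear separator, but an RBF kernel separates it without ever computing the higher-dimensional mapping explicitly:

```python
# Compare a linear kernel and an RBF kernel on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # poor: no straight line separates circles
print("rbf kernel accuracy:", rbf.score(X, y))        # near perfect on this data
```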

15 of 26

Learning Methodology

Earlier we mentioned that the solid hyperplane or decision boundary is a better boundary than the dotted line. Why?

Intuitively, the solid line more clearly divides the data.

When we add the red dot for class A, the dotted line would have incorrectly classified it as class B.

How does our ML model learn?

Not all hyperplanes are created equal. Some are better than others

16 of 26

Learning Methodology

What are these red lines we’ve drawn onto the example graph we know?

They are the lines parallel to the hyperplane or decision boundary that touch a point in each class

The point the lines are touching are called the support vectors

How does our ML model learn?

Which red line (dotted or solid) looks like the better hyperplane to you?
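In scikit-learn, the fitted model exposes these points directly; a minimal sketch (using the same hypothetical toy data as before):

```python
# After fitting, the support vectors (the points the margin lines touch)
# are stored on the model; they alone determine the hyperplane.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]])  # hypothetical toy data
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)  # the nearest points from each class
```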

17 of 26

Learning Methodology

What is our loss function?

The solid line is optimal.

Notice how much more distance there is between the two solid red lines than between the two dotted red lines. This distance is called the margin, and the SVM learns by choosing the hyperplane that maximizes it.

18 of 26

Learning Methodology

What is our loss function?

If you’re curious, mathematically this can be expressed as a quadratic optimization problem:

Maximize the margin, the distance between the two parallel lines through the support vectors

Such that all data points are classified correctly as -1 or 1

Note: the formulation shown after this list is a simplified version

In reality, SVMs allow for:

  • Some data points to be classified incorrectly
  • More than two classes (not just -1 and 1)
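For reference, the standard hard-margin formulation (the simplified version referred to above) can be written as below; since the margin width works out to 2/‖w‖, maximizing the margin is the same as minimizing ‖w‖²:

```latex
% Hard-margin SVM as a quadratic optimization problem:
% minimize \|w\|^2 (equivalently, maximize the margin 2/\|w\|)
% subject to every point lying on the correct side of the margin.
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_{i}\,(w \cdot x_{i} + b) \ge 1 \quad \text{for all } i, \qquad y_{i} \in \{-1, +1\}
```

The soft-margin version used in practice adds a slack variable ξᵢ ≥ 0 for each point and a penalty term C·Σᵢξᵢ to the objective, which is what allows some points to be misclassified.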

19 of 26

Performance

Quantitative measure we use to evaluate the model’s performance.

As with all models, we have a trade-off between bias and variance when considering performance

(Figure: example decision boundaries, one showing high bias and one showing high variance.)

20 of 26

Performance

Measures of performance

We will introduce a commonly used tool for measuring the performance of classification models: the confusion matrix

                    Predicted positive               Predicted negative
Actual positive     True positive (accurate)         False negative (Type 2 error)
Actual negative     False positive (Type 1 error)    True negative (accurate)

21 of 26

Performance

Measures of performance

Key terms

  • Recall: probability of true positives given the actual class = TP / (TP + FN)
  • Precision: probability of true positives given the predicted class = TP / (TP + FP)

(Confusion matrix as shown on the previous slide.)
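A minimal sketch (assuming Python with scikit-learn; the labels are hypothetical) of computing all three quantities:

```python
# Build a confusion matrix and compute recall and precision from it.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_actual    = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical actual classes (1 = positive)
y_predicted = [1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical predicted classes

print(confusion_matrix(y_actual, y_predicted))  # rows = actual class, columns = predicted class
print("recall:   ", recall_score(y_actual, y_predicted))     # TP / (TP + FN)
print("precision:", precision_score(y_actual, y_predicted))  # TP / (TP + FP)
```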

22 of 26

You can adjust an SVM model with gamma, a sensitivity input. Gamma controls how far the influence of a single training data point reaches: the higher the gamma, the shorter that reach, so the decision boundary bends tightly around individual training points

Performance

Flexibility of the model

(Figure: decision boundaries with low gamma vs. high gamma.)

How does gamma relate to bias and variance?

23 of 26

It turns out high gamma can lead to overfitting, or high variance.

Performance

Flexibility of the model

(Figure: low gamma yields high bias; high gamma yields high variance.)
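A minimal sketch (assuming Python with scikit-learn; the data set and gamma values are illustrative) of this trade-off: watch the gap between training and test accuracy grow as gamma rises.

```python
# Sweep gamma on noisy circle data: very low gamma tends to underfit (high bias),
# very high gamma tends to overfit (high variance).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1, 100]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma:<6} train={clf.score(X_train, y_train):.2f} "
          f"test={clf.score(X_test, y_test):.2f}")
```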

24 of 26

Model cheat sheet

Pros

  • Can be used with sparse or imbalanced data
  • Finds the optimal decision boundary to separate data points (compared to other classifiers like perceptrons)
  • Does not take up a lot of memory to store
  • Computationally easy to add features due to the kernel trick

Cons

  • Data should be linearly separable for “hard-margin” SVMs
  • Additional assumptions are used when data is not linearly separable
  • Provides deterministic classification (no probabilistic estimate)

25 of 26

Theory resources we recommend:

Support vector machines

26 of 26

Congrats! You finished module X!

Find out more about Delta’s machine learning for good mission here.