Support Vector Machine
Delta Analytics builds technical capacity around the world.
This course content is being actively developed by Delta Analytics, a 501(c)3 Bay Area nonprofit that aims to empower communities to leverage their data for good.
Please reach out with any questions or feedback to inquiry@deltanalytics.org.
Find out more about our mission here.
Support vector machine (SVM)
Module checklist
Reminder: sentiment analysis represents text as structured data in order to classify its emotional polarity
Sentiment analysis, the basis of natural language processing, is the focus of modules 8 and 9.
Unstructured data (text, voice recordings, facial images or videos) is interpretable to a human but not to a computer program.
Structured data is interpretable to both a human and a computer program.
Data representation converts unstructured inputs, such as text, into structured data.
Sentiment analysis algorithms then apply supervised classification, producing output that classifies the sentiment of text, a voice recording, or a facial expression.
Text sentiment is the focus of this module.
We will learn how to use support vector machines to predict text sentiment
Support vector machines are the focus of this module
We will use our now familiar framework to introduce linear support vector machines
Task
What is the problem we want our model to solve?
Performance Measure
Quantitative measure we use to evaluate the model’s performance.
Learning Methodology
ML algorithms can be supervised or unsupervised. This determines the learning methodology.
Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning
SVMs are classification algorithms that draw boundaries between classes of data
Task
How might sentiment analysis apply to the graph above?
What is the problem we want to solve?
Areas divided by the SVM boundaries define the predictions for sentiment
Task
Note: an SVM may misclassify some of the points. Boundaries are rarely perfect.
For instance, red data points represent text with predicted negative sentiment...
...while blue data points represent text with predicted positive sentiment.
What is the problem we want to solve?
Task
Feature engineering & selection
| Graph | Sentiment analysis equivalent |
| --- | --- |
| Data points | Document or text to be classified |
| Axis: x1 | Frequency of word x1 in the text |
| Axis: x2 | Frequency of word x2 in the text |
| Axis: x3 | Frequency of word x3 in the text |
| ... | Each word adds another dimension (difficult to visualize) |
In reality, a graph representing sentiment analysis would span far more than two dimensions
Task
Feature engineering & selection
| Sad | Happy | Ok | ... | Sentiment |
| --- | --- | --- | --- | --- |
| 3 | 0 | 1 | | 1 |
| 1 | 1 | 5 | | 0 |
| 2 | 5 | 1 | | -1 |
| 4 | 0 | 2 | | 1 |
| 6 | 0 | 1 | | 1 |
Sample data set
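To make this concrete, here is a minimal sketch of building such a word-frequency table from raw text. It assumes scikit-learn is available; the example documents are made up for illustration.

```python
# A minimal sketch, assuming scikit-learn; the documents are made up.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "sad sad sad ok",   # a hypothetical mostly-negative document
    "happy happy ok",   # a hypothetical mostly-positive document
]

# Each row is a document, each column a word, each cell a word count,
# just like the sample data set above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # e.g. ['happy' 'ok' 'sad']
print(X.toarray())                         # the word-frequency matrix
```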
Check for understanding
Task
Defining f(x)
Like decision trees, an SVM has data points represented by values across various features and a classification outcome
Our f(x) is the optimal “hyperplane” dividing the class outcomes
In the example to the left, the solid hyperplane is a better boundary than the dotted line
Two examples of decision boundaries
The line dividing the data is called a decision boundary or hyperplane
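As a sketch of what the learned f(x) looks like in practice (assuming scikit-learn; the tiny word-count data set below is made up), a linear SVM exposes its hyperplane through the fitted coefficients:

```python
# A sketch, assuming scikit-learn; the word-count data is made up.
# For a linear SVM, f(x) = sign(w . x + b), where the hyperplane
# w . x + b = 0 is the learned decision boundary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 0], [4, 1], [6, 0], [0, 4], [1, 5], [0, 3]])  # word counts
y = np.array([-1, -1, -1, 1, 1, 1])                             # sentiment labels

model = SVC(kernel="linear").fit(X, y)
w, b = model.coef_[0], model.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print(model.predict([[5, 1]]))  # which side of the hyperplane a new point falls on
```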
Is our f(x) correct for this problem?
Task
So far, the data points in our support vector machine examples could be split with a linear hyperplane
A data set like this is called linearly separable
However, what about data sets that look more like the example on the left?
Non-linearly separable data:
Is our f(x) correct for this problem?
Task
Theoretically, if we find the right combinations of features, we can find a mapping of the points in a space that is linearly separable
This is computationally expensive.
The “kernel trick” is a mathematical shortcut that SVMs take to create non-linear boundaries
Instead of explicitly computing the feature mapping, we use a kernel function to create a non-linear decision boundary
Can someone draw what the above image looks like in a non-linearly separable dimension?
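To see the kernel trick at work, here is a sketch (assuming scikit-learn) comparing a linear and an RBF kernel on a data set that is not linearly separable:

```python
# A sketch, assuming scikit-learn: one class sits inside a ring of the
# other, so no straight line can separate them.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf"]:
    model = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", model.score(X_test, y_test))

# The RBF kernel should score far higher: the kernel computes similarities
# as if the points were mapped to a higher-dimensional space, without ever
# building that mapping explicitly.
```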
Learning Methodology
Earlier we mentioned that the solid hyperplane or decision boundary is a better boundary than the dotted line. Why?
Intuitively, the solid line more clearly divides the data.
When we add the red dot for class A, the dotted line would have incorrectly classified it as class B.
How does our ML model learn?
Not all hyperplanes are created equal. Some are better than others
Learning Methodology
What are the red lines we’ve drawn onto the example graph?
They are the lines parallel to the hyperplane, or decision boundary, that touch a point in each class
The points the lines touch are called the support vectors
How does our ML model learn?
Which red line (dotted or solid) looks like the better hyperplane to you?
Learning Methodology
What is our loss function?
The solid line is optimal.
Notice that there is much more distance between the two solid red lines than between the two dotted red lines
Learning Methodology
What is our loss function?
If you’re curious, mathematically this can be expressed as a quadratic optimization problem:
Minimize ½‖w‖², such that all data points are classified correctly as -1 or 1, i.e. yᵢ(w·xᵢ + b) ≥ 1 for every point i
Minimizing ‖w‖ maximizes the margin, whose width is 2/‖w‖
Note: this is a simplified version
In reality, SVMs allow for a “soft margin”: some points may violate the margin or be misclassified, at a penalty controlled by a parameter C
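Here is a sketch (assuming scikit-learn and a made-up toy data set) showing that only the margin-touching points, the support vectors, determine the learned boundary:

```python
# A sketch, assuming scikit-learn. The parameter C controls the soft
# margin: large C penalizes margin violations heavily, small C
# tolerates more of them in exchange for a wider margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])  # made-up points
y = np.array([-1, -1, -1, 1, 1, 1])

model = SVC(kernel="linear", C=1.0).fit(X, y)
print(model.support_vectors_)
# Only these points define the hyperplane; any other point can move
# (without crossing the margin) and f(x) stays the same.
```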
Performance
Quantitative measure we use to evaluate the model’s performance.
As with all models, we have a trade-off between bias and variance when considering performance
High bias: the model is too simple and underfits the data
High variance: the model is too complex and overfits the training data
Performance
Measures of performance
We will introduce a commonly used tool for measuring the performance of classification models, called a confusion matrix
Rows show the actual class; columns show the predicted class.

| Actual \ Predicted | Positive | Negative |
| --- | --- | --- |
| Positive | True positive (Accurate) | False negative (Type 2 error) |
| Negative | False positive (Type 1 error) | True negative (Accurate) |
Key terms
Recall: probability of true positives given actual class = TP / (TP + FN)
Precision: probability of true positives given predicted class = TP / (TP + FP)
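A sketch of computing the confusion matrix, precision, and recall with scikit-learn (the label arrays below are made up):

```python
# A sketch, assuming scikit-learn; y_true and y_pred are made-up labels.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # actual classes
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # predicted classes

# Rows are the actual class, columns the predicted class.
print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
```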
You can adjust your SVM model with gamma, a sensitivity parameter. The higher the gamma, the shorter the reach of a single training data point’s influence on the model predictions, so the decision boundary bends around individual points
Performance
Flexibility of the model
Low gamma vs. high gamma decision boundaries
How does gamma relate to bias and variance?
It turns out high gamma can lead to overfitting, or high variance.
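A sketch (assuming scikit-learn) that makes this visible: as gamma grows, training accuracy climbs while test accuracy eventually falls, the signature of overfitting:

```python
# A sketch, assuming scikit-learn: sweeping gamma on an RBF-kernel SVM.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1, 100]:
    model = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print("gamma=%g  train=%.2f  test=%.2f"
          % (gamma, model.score(X_train, y_train), model.score(X_test, y_test)))

# High gamma memorizes the training set (high variance); very low gamma
# underfits (high bias). Useful values sit in between.
```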
Performance
Flexibility of the model
High bias and low gamma
High variance and high gamma
Model cheat sheet
Pros
Cons
Theory resources we recommend:
Support vector machines
Congrats! You finished module X!
Find out more about Delta’s machine learning for good mission here.