Saba Khalilnaji

GaTech ML Notes: Supervised Learning

Supervised Learning - Search a hypothesis space for a function that will map the training inputs to their labels while minimizing error and avoiding overfitting.

Classification - Given discrete or continuous input return discrete output

Decision Trees

ID3 - Pick attribute with highest gain and build tree top down. Gain is measured in change in randomness (or entropy)

C4.5 - Extension to ID3 that includes pruning for overfitting/noisy data, modeling missing attributes,addresses weaknesses of gain which may use an unrelated attribute that perfectly classifies the dataset, and handles attributes with cost

Thresholded Regression

Perceptron Unit

Will find the weights if the data is linearly separable. Uses a firing threshold that is incorporated into the weights. Single units can do AND, NAND, OR and NOR
Loop

Thresholded Sigmoid Unit - See below, same as non thresholded sigmoid unit except we pick some probability threshold as the cutoff for the classification.

Ensemble Learning - Combine multiple weak hypothesis to form a stronger one.

AdaBoost - Generate weak learners with various distributions over the training set and combine them to form a stronger one:

Generate a weak learner with lowest weighted error

Update the distribution with

Final

Support Vector Machines -

Kernel functions map features to higher dimensional spaces that may allow us to more easily separate the data. Ex: 1 gaussian kernel function for every training point would be a relative distance measure to that point. Note: Kernels can be used in logistic regression as well but computational complexity hinders use.
SVMs minimize hinge loss (which adds a margin) while also maximizing the size of the margin. Margin is

note

This is a convex minimization problem and that has efficient solutions. Hinge loss is not differentiable but has a sub-gradient. Smaller C means less overfit.

Regression - Given a discrete or continuous input return a continuous output

Decision Trees - Instead of entropy we minimize an error function at each node and chose to split on an attribute with a value that minimizes that error in the two child nodes until we converge to a solution. Note there are an infinite number of values we can split on, however we only consider the finite set of values that are present in the training set.

Squared Error - If the error function is squared error, then it is minimized when all the leaf nodes in tree contain the average output of the elements.

Linear Regression

Solve Normal Equations (Linear Least Squares)

Matrix inversion means it will not scale as features grow
Does not work on order 2+ polynomials due to multiple optima and non-convex error functions

Gradient Descent

Iteratively update weights until convergence:
Can be used with any differentiable error function
If error function is mean squared error, then is convex and we will find global minimum. Update will be:

Polynomial Regression

Gradient Descent - same general idea as linear gradient descent

Subject to local optima
If error function is mean squared error, then update will be

note in the linear case above

Advanced Methods: Momentum, Line-Search, Second Order Derivatives

Logistic Regression (Single Sigmoid Unit) - Statistical classification, we are outputting .

Gradient Descent

Can be used with any sigmoid-like function and polynomial
Minimize log-likelihood:

note if is linear

Advanced Methods: Conjugate Gradient, BFGS, L-BFGS

Neural Network - Layered sigmoid units with linear inputs allow us to model all continuous functions with 2 layers and arbitrary functions with 3. Error is non-convex and thus subject to local optima. Neural networks scale better with more features than regression.

Can use any error function (gradient descent/backprop require differential function) i.e. squared error, cross-entropy, etc..
Backpropagation Algorithm (Stochastic Gradient Descent) with squared error

Covervenge/Local Minimum- Update momentum instead of position, stochastic gradient descent, simulated annnealing, line search (gradient is direction) and second order derivatives
Design - Inputs are normalized, with 1-3 hidden layers, a single output or 1-of-n for relative confidence, must decide learning rate/connections/cycles. Can also dynamically add more nodes/layers

Bayesian Learning - Bayesian reasoning provides a probabilistic approach to inference.

Through bayesian analysis we discover that assuming our hypothesis is a function plus some zero-mean gaussian noise, then is the hypothesis that minimizes the squared error. Furthermore, if we assume our function is simply non-deterministic, then is the hypothesis that minimizes the cross-entropy.

Brute Force - Assume . For every hypothesis calculate which is just 1 if the hypothesis correctly labels all training data. Requires significant computation. Note this method only returns the most probable hypothesis when, in most cases, we are interested in the most probable classification. Thus we can use the Bayes Optimal Classifier, which is a weighted of the classifications of all hypothesis weighted by their probabilities.

Bayes Nets - Represents join probability distributions for a set of variables. The joint probability for a set a variable depends only itself given the parents. Exact inference is exponential in the number of variables, see unsupervised learning.

Given a structure we can approximate the values of the conditional probability table by maximizing . One such method is gradient ascent following
Given no structure we can approximate the structure and probabilities with approximate or randomized searches.

Naive Bayes - Assumes that all features are independent given the classification, although this assumption is often “naive”, it generally works well in practice. We calculate where n is the number of features. is estimated by counting the frequency of each label, similarly is estimated by counting the frequency given the data.

Instance Based Learning - Store training examples and compare queries against them

Lazy Learning- generate a model when you have to answer a query. Slow query times with no training time.

k-Nearest Neighbor -find the k-nearest neighbors in euclidian space and return an average or voted output. Works well with noisy data given large training sets but is especially susceptible to the curse of dimensionality because it assumes all features are equally important. Cross-validation can help determine weights of attributes.
Locally Weighted Regression - perform a regression of the nearest neighbors to predict the output
Case-Based Reasoning - use more symbolic descriptions to compare instances (and thus calculate distances between other instances)

Eager Learning-commit to a model from the training data

Radial Basis Function Network - A set of kernel functions (commonly gaussian) placed in the Euclidian space of the training data (can be random, near clusters, or on data point). Weights for each function are trained minimize error. Can be thought of as a weighted sum of many local approximations. Training time is faster than backpropagation.

Ensemble Learning

Boosting (AdaBoost) - See above. Modification for logistic regression:

Overfitting - A hypothesis is overfit if there exists an alternative hypothesis that has higher error on the training data but lower error on the entire distribution of instances.

⅔ ⅓ Rule: Overfitting can be avoided by using ⅓ of the training data as a validation set. Train on the training data until the error on the validation set starts to rise.
Given Limited Data - M instances can be broken down into K buckets. Train for all K buckets and use Kth bucket as validation. Find the iterations where validation error begins to rise. Average for all k, then use when training on all m.
Regularization - Similar hypotheses generalize better. Smaller coefficients on the weights means less complex hypothesis. Thus we can add a penalty to the cost function to help prevent over training, e.g:

Learning Theory

PAC Model - Applies to algorithms that learn target concepts from some concept class using i.i.d. training examples with some confidence and error polynomial in the size of the instances, and the size of the target concept

Finite Hypothesis can learning the target concept with samples.
Infinite Hypothesis use VC dimension to model the complexity of the hypothesis space. VC(H) is the largest subset of instance that can be shattered by H (think of it as the parameters of H). Target concept can be learned with . Note: A hypothesis is PAC learnable IFF the VC(H) is finite.

Mistake Bound Model - Analyze the number of training examples a learner will misclassify it exactly learns the target concept. The best worst-case algorithm will make mistakes such that

Randomized Optimization - Methods for solving optimization problems that do not require a gradient i.e. non-continuous or non-differentiable functions. Can be used in supervised learning to solve the weights in a neural network, a regression model, or even find a structure for a bayesian network. All of these algorithms are subject to local minimum.

Hill Climbing - Given a fitness function, start at a random position, check immediate neighbors and set point to the larger one (flip a coin if both equally larger). Repeat this process until neither neight has a greater fitness. Repeat the random restarts as needed.
Simulated Annealing - Starts at random point, samples a neighbor, and probabilistically jumps to that neighbor. Compared to hill climbing, has a trade-off between exploitation (climbing) and exploration (probabilistic jumping). The probability is a function of the “temperature” i.e. As the temperature drops, we are more inclined to exploit.
Genetic Algorithms - A randomized, parallel, hill-climbing search. Start with a random population, evaluate their fitness, resample in proportion to their fitness (low variance resampling, tournament selection, etc). Cross-over two probabilistically selected individual (perhaps by fitness) to create “offspring”. Select a subset of the population for mutation and repeat until convergence
MIMIC - Does well with structure