Saba Khalilnaji

GaTech ML Notes: Supervised Learning

Supervised Learning - Search a hypothesis space for a function that will map the training inputs to their labels while minimizing error and avoiding overfitting.

  1. Classification - Given discrete or continuous input return discrete output
  1. Decision Trees
  1. ID3 - Pick attribute with highest gain and build tree top down. Gain is measured in change in randomness (or entropy)

  1. C4.5 - Extension to ID3 that includes pruning for overfitting/noisy data, modeling missing attributes,addresses weaknesses of gain which may use an unrelated attribute that perfectly classifies the dataset, and handles attributes with cost
  1. Thresholded Regression
  1. Perceptron Unit
  1. Thresholded Sigmoid Unit - See below, same as non thresholded sigmoid unit except we pick some probability threshold as the cutoff for the classification.
  1. Ensemble Learning - Combine multiple weak hypothesis to form a stronger one.
  1. AdaBoost - Generate weak learners  with various distributions over the training set and combine them to form a stronger one:

Generate a weak learner with lowest weighted error

 Update the distribution  with


  1. Support Vector Machines -


  1. Regression - Given a discrete or continuous input return a continuous output
  1. Decision Trees - Instead of entropy we minimize an error function at each node and chose to split on an attribute with a value that minimizes that error in the two child nodes until we converge to a solution. Note there are an infinite number of values we can split on, however we only consider the finite set of values that are present in the training set.
  1. Squared Error - If the error function is squared error, then it is minimized when all the leaf nodes in tree contain the average output of the elements.
  1. Linear Regression
  1. Solve Normal Equations (Linear Least Squares)

  1. Gradient Descent

  1. Polynomial Regression 
  1. Gradient Descent - same general idea as linear gradient descent

note in the linear case above

  1. Advanced Methods: Momentum, Line-Search, Second Order Derivatives
  1. Logistic Regression (Single Sigmoid Unit) - Statistical classification, we are outputting .
  1. Gradient Descent

        note if  is linear

  1. Advanced Methods: Conjugate Gradient, BFGS, L-BFGS
  1. Neural Network - Layered sigmoid units with linear inputs allow us to model all continuous functions with 2 layers and arbitrary functions with 3. Error is non-convex and thus subject to local optima. Neural networks scale better with more features than regression.

  1. Bayesian Learning - Bayesian reasoning provides a probabilistic approach to inference.

  Through bayesian analysis we discover that assuming our hypothesis is a function plus some zero-mean gaussian noise, then  is the hypothesis that minimizes the squared error. Furthermore, if we assume our function is simply non-deterministic, then is the hypothesis that minimizes the cross-entropy.

  1. Brute Force - Assume . For every hypothesis calculate  which is just 1 if the hypothesis correctly labels all training data. Requires significant computation. Note this method only returns the most probable hypothesis when, in most cases, we are interested in the most probable classification. Thus we can use the Bayes Optimal Classifier, which is a weighted of the classifications of all hypothesis weighted by their probabilities.


  1. Bayes Nets - Represents join probability distributions for a set of variables. The joint probability for a set a variable depends only itself given the parents. Exact inference is exponential in the number of variables, see unsupervised learning.
  1. Naive Bayes - Assumes that all features are independent given the classification, although this assumption is often “naive”, it generally works well in practice. We calculate  where n is the number of features. is estimated by counting the frequency of each label, similarly  is estimated by counting the frequency given the data.
  1. Instance Based Learning - Store training examples and compare queries against them
  1. Lazy Learning- generate a model when you have to answer a query. Slow query times with no training time.
  1. Eager Learning-commit to a model from the training data
  1. Ensemble Learning
  1. Boosting (AdaBoost) - See above. Modification for logistic regression:


  1. Overfitting - A hypothesis is overfit if there exists an alternative hypothesis that has higher error on the training data but lower error on the entire distribution of instances.
  1. ⅔ ⅓ Rule: Overfitting can be avoided by using ⅓ of the training data as a validation set. Train on the training data until the error on the validation set starts to rise.
  2. Given Limited Data - M instances can be broken down into K buckets. Train for all K buckets and use Kth bucket as validation. Find the iterations where validation error begins to rise. Average for all k, then use when training on all m.
  3. Regularization  - Similar hypotheses generalize better. Smaller coefficients on the weights means less complex hypothesis. Thus we can add a penalty to the cost function to help prevent over training, e.g:

  1. Learning Theory
  1. PAC Model - Applies to algorithms that learn target concepts from some concept class using i.i.d. training examples with some confidence and error  polynomial in the size of the instances, and the size of the target concept
  1. Finite Hypothesis can learning the target concept with samples.
  2. Infinite Hypothesis use VC dimension to model the complexity of the hypothesis space. VC(H) is the largest subset of instance that can be shattered by H (think of it as the parameters of H). Target concept can be learned with . Note: A hypothesis is PAC learnable IFF the VC(H) is finite.
  1. Mistake Bound Model - Analyze the number of training examples a learner will misclassify it exactly learns the target concept. The best worst-case algorithm will make  mistakes such that
  1. Randomized Optimization - Methods for solving optimization problems that do not require a gradient i.e. non-continuous or non-differentiable functions. Can be used in supervised learning to solve the weights in a neural network, a regression model, or even find a structure for a bayesian network. All of these algorithms are subject to local minimum.
  1. Hill Climbing - Given a fitness function, start at a random position, check immediate neighbors and set point to the larger one (flip a coin if both equally larger). Repeat this process until neither neight has a greater fitness. Repeat the random restarts as needed.
  2. Simulated Annealing - Starts at random point, samples a neighbor, and probabilistically jumps to that neighbor. Compared to hill climbing, has a trade-off between exploitation (climbing) and exploration (probabilistic jumping). The probability is a function of the “temperature” i.e.  As the temperature drops, we are more inclined to exploit.
  3. Genetic Algorithms - A randomized, parallel, hill-climbing search. Start with a random population, evaluate their fitness, resample in proportion to their fitness (low variance resampling, tournament selection, etc). Cross-over two probabilistically selected individual (perhaps by fitness) to create “offspring”. Select a subset of the population for mutation and repeat until convergence
  4. MIMIC - Does well with structure