CS7641 Machine Learning - Midterm Notes

Difference between Classification and Regression

Decision Trees Other Considerations

Decision Trees Other Considerations Regression

Regression and Function Approximation

Artificial Neural Networks Quiz

How Powerful is a Perceptron Unit

Comparison of Learning Rules Quiz

Instance Based Learning Before

Quiz: Won’t You Compute My Neighbors

Ensemble Learning and Boosting

Ensemble Learning Simple Rules

Quiz: Ensemble Learning Outputs

Still More Support Vector Machines

Quiz: Boosting Tends to Overfit

Quiz: Computational Learning Theory

Quiz: Resources in Machine Learning

Quiz: Teaching Via 20 Questions

Teacher With Constrained Queries

Quiz: Reconstructing Hypothesis

Learner With Constrained Queries

Quiz: Which Hypothesis Spaces Are Infinite

Sampling from the Joint Distribution

Recovering the Joint Distribution

Properties of Simulated Annealing

Finding Dependency Trees Three

Midterm Summaries - All “What We’ve Learned” Slides

- Two types of supervised learning that we typically think about are classification and regression.
- Classification is simply the process of taking some kind of input, let's call it x, and mapping it to some discrete label: usually, for what we're talking about, something like true or false.
- Regression is more about continuous-valued functions. So, given a bunch of points, I want to take a new point and map it to some real value. We may pretend these points are examples of a line, and then given a new point, I can read the output off the line.
- For the purposes of the sort of things we're going to be worried about, you can really think of the difference between classification and regression as the difference between mapping from some input to a small number of discrete values, which might represent concepts.
- And regression is mapping from some input space to some real number, out of a potentially infinite set of real numbers.
- The difference between a classification task or a regression task is not about the input, it's about the output.
- If the output is continuous, then it's regression; if the output is from a small discrete set, then it is a classification task.

I want to define a couple of terms.

- Instances, or another way of thinking about them, are inputs. Those are vectors of values, of attributes, that define whatever your input space is. So they can be pictures and all the pixels that make up pictures, like we've been doing so far. They can be credit score examples, like how much money I make. So whatever your input value is, whatever it is you're using to describe the input, whether it's pixels or discrete values, those are your instances: the set of things that you're looking at.
- And then what we're trying to find is some kind of function, and that is the concept that we care about. This function maps inputs to outputs: it takes the instances and maps them, in this case, to some kind of output such as true or false. These are the categories of things that we're worried about. The concept is the function we care about that's going to map pictures, for example, to true or false. Intuitively, a concept is an idea that describes a set of things. So we can talk about maleness or femaleness; we can talk about short and tall; we can talk about college students and grad students. The concept is the notion of what creditworthiness is, what a male is, what a college student is. A concept is, therefore, by definition a mapping between objects in a world and membership in a set, which makes it a function.
- So with that notion of a concept as functions or as mappings from objects to membership in a set we have a particular target concept. The only difference between a target concept and the general notion of concept is that the target concept is the thing we're trying to find. It's the actual answer. So, a function that determines whether something is a car or not, or male or not, is the target concept.
- Hypothesis class is the set of all concepts that you're willing to entertain. So it's all the functions I'm willing to think about. It could be all possible functions, and that's a perfectly reasonable hypothesis class: you can think of the hypothesis class as simply being all possible functions in the world. On the other hand, even restricting ourselves just to classification learning, we already know we're limiting ourselves to a subset of all possible functions in the world, because all possible functions in the world includes things like x squared, and that's regression, and we've already said we're not doing regression. So already the hypothesis class is all the functions we care about, and maybe that's all classification functions.
- A training set is a set of all of our inputs, like pictures of people, paired with a label, which is the correct output. If you get bunches and bunches of examples of input and output pairs, that's a training set. And that's what's going to be the basis for you figuring out what is the correct concept or function.
- A candidate is just simply a concept that you think might be the target concept.
- Given that you have a bunch of examples, and you have a particular candidate concept, how do you know whether you are right or wrong? How do you know whether it's a good candidate or not? That's where the testing set comes in. A testing set looks just like a training set. So here, in our training set, we'll have pictures and whether someone turns out to be creditworthy or not. And I will take your candidate concept and determine whether it does a good job or not by looking at the testing set.
- An important point is that the training set and the testing set should not be the same. If you learn from your training set and I test you only on your training set, then that's considered cheating in the machine learning world, because then you haven't really shown the ability to generalize. So typically we want the testing set to include lots of examples that you don't see in your training set. That is proof that you're able to generalize.

- So, that gets us to decision trees.
- So, the first thing that we want to do is separate out what a decision tree is from the algorithm you might use to build the decision tree.
- So we have these nodes, which represent attributes, and we have edges, which represent values.
- So it's as if I'm making a bunch of decisions by asking a series of questions.

- XOR, or n-way parity in general, needs a decision tree that is exponential in the number of attributes, and is also known as hard.
- Whereas OR, at least in terms of the space that you need, is a relatively easy one.
- OR is linear in the number of attributes.
- We have another name for exponential, and that is evil. Evil, evil, evil. And it's evil because it's a very difficult problem: there is no clever way to pick the right attributes in order to give you an answer; you have to look at every single thing. That's what makes this kind of problem difficult. So, just as a general point, we hope that in our machine learning problems we're looking at things that are more like OR ("any") than things that are more like parity, because otherwise we're going to need to ask a lot of questions in order to answer the parity questions.
- So 2 to the n grows very fast. We already called that evil.
- 2 to the 2 to the n is a double exponential and it's super evil. It grows very, very, very, very, very fast.
- This means we have to have some clever way to search among them. And that gets us back to our notion of an algorithm that's actually going to very smartly go through and pick out which decision tree to use. Because if we aren't very smart about it, and we don't eliminate whole sets of decision trees along the way, then we're going to have to look at billions upon billions upon billions of possible decision trees.

- Here's the ID3 algorithm.

- You're simply going to keep looping until you've solved the problem.
- At each step, you're going to pick the best attribute, and we're going to define what we mean by best. There are a couple of different ways we might define best.
- And then, given the best attribute, the one that splits the data the way that we want, assign it as the decision attribute for the node.
- And then for each value that the attribute A can take on, create a descendant of the node.
- Sort the training examples to those leaves based upon exactly what values they take on.
- If you've perfectly classified your training set, then you stop. Otherwise, you iterate over each of those leaves, picking the best attribute in turn for the training examples that were sorted into that leaf, and you keep doing that, building up the tree until you're done.
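The loop above can be sketched as a recursive function. This is a minimal illustration, assuming discrete attributes stored as dicts and discrete labels; it has none of the refinements discussed later (no pruning, no handling of attribute values unseen in training):

```python
from collections import Counter
import math

def entropy(labels):
    # H(S) = -sum_v p(v) * log2(p(v)), over the label values in S
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attr):
    # Information gain: entropy(S) minus expected entropy after splitting on attr
    n = len(labels)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attrs):
    # Leaf: all labels agree, or no attributes left (return the majority label)
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, labels, a))
    tree = {'attr': best, 'branches': {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree['branches'][value] = id3([examples[i] for i in idx],
                                      [labels[i] for i in idx],
                                      attrs - {best})
    return tree

# Tiny example: learn AND of two boolean attributes
data = [{'a': 0, 'b': 0}, {'a': 0, 'b': 1}, {'a': 1, 'b': 0}, {'a': 1, 'b': 1}]
labels = [0, 0, 0, 1]
tree = id3(data, labels, {'a', 'b'})
```

Each recursive call picks the attribute with the highest information gain for the examples that reached that node, exactly as in the steps above.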

- So that's the ID3 algorithm.
- The key bit that we have to expand upon in this case, is exactly what it means to have a best attribute.
- There are lots of possibilities that you can come up with. The one that is most common is what's called information gain.
- Information gain is simply a mathematical way to capture the amount of information that one gains by picking a particular attribute. But what it really measures is the reduction in the randomness over the labels of a set of data, based upon knowing the value of a particular attribute.
- The formula's simply this: Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) * Entropy(S_v)
- The information gain over S and A, where S is the collection of training examples that you're looking at and A is a particular attribute, is simply defined as the entropy, with respect to the labels, of the set of training examples you have (S), minus the expected or average entropy over each subset S_v of examples that take on a particular value v of the attribute.
- Entropy is exactly a measure of randomness. So if I have a coin, it can be heads or tails, and I don't know anything about the coin except that it's probably fair. If I were to flip the coin, what's the probability that it would end up heads or tails? M: A half. C: It's a half, exactly; if it's a fair coin, it's a half. Which means that I have no basis, going into flipping the coin, to guess either way whether it's heads or tails. And so that has a lot of entropy. In fact, it has exactly what's called one bit of entropy. On the other hand, let's imagine that I have a coin that has heads on both sides. Then, before I even flip the coin, I already know what the outcome's going to be: it's going to come up heads. So what's the probability of it coming up heads? M: It's one. C: So that actually has no information, no randomness, no entropy whatsoever, and has zero bits of entropy.

- The formula for entropy, using the same notation that we're using for information gain, is the sum, over all the possible values you might see, of the probability of seeing that value times the log of that probability, times minus one: Entropy(S) = -Σ_v p(v) log2 p(v). There are further details in the randomized optimization lecture, where entropy's going to matter a lot.
- So the goal is to maximize the information gain, and the attribute that does so is the best attribute.
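The two coin examples can be checked numerically; a minimal sketch:

```python
import math

def entropy(probs):
    # H = -sum_v p(v) * log2(p(v)); values with p = 0 contribute nothing
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit of entropy
print(entropy([1.0]))       # two-headed coin: 0.0 bits
```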

- There are two kinds of biases we worry about when we think about algorithms that are searching through space.
- One is what's called a restriction bias, which is nothing more than the hypothesis set that you actually care about. So in this case, with decision trees, the hypothesis set is all possible decision trees. That means we're not considering continuous functions like y = 2x, or non-boolean functions of other types. We're only considering decision trees, and all that they can represent, and nothing else. Instead of looking at the uncountably infinite number of functions that are out there, we're only going to consider those that can be represented by a decision tree over (in all the cases we've seen so far) discrete-valued attributes.
- The other is called preference bias, which tells us what sort of hypotheses from this hypothesis set we prefer, and that is really at the heart of inductive bias.
- Which decision trees would ID3 prefer?

- It's going to be more likely to produce a tree that has good splits near the top than a tree that has bad splits at the top, even if the two trees can represent the same function. Given two decision trees that are both correct, that both represent the function we might care about, it would prefer the one with the better splits near the top.
- It prefers ones that model the data better to ones that model the data worse: it prefers correct ones to incorrect ones. Given a tree that has very good splits at the top but produces the wrong answer, it will not take that one over one that doesn't have as good splits at the top but does give you the correct answer.
- Those are really the two main things that make up the inductive bias for ID3. Although, when you put those two together, in particular when you look at the first one, there's sort of a third one that comes out as well, which is that the ID3 algorithm tends to prefer shorter trees to longer trees. That preference for shorter trees actually comes naturally from the fact that you're doing good splits at the top: because you take trees that actually separate the data well by labels, you tend to come to the answer faster than you would if you didn't do that.

- When do we really stop?
- So the answer in the algorithm is: when everything is classified correctly. But what if we have noise in our data? Then our algorithm goes into an infinite loop. So we could instead stop when we've run out of attributes, although that doesn't help us in the case where we have continuous attributes and might ask an infinite number of questions. So we probably need a slightly better criterion.
- If it's possible for us to have a little bit of noise in the data, an error here or there, then we want to have some way to handle that possibility.
- When you get really good at classifying your training data, but it doesn't help you to generalize, we have a name for that: overfitting.
- Now, the way you overfit in a decision tree is basically by having a tree that's too big and too complicated; it violates Occam's razor.
- You could take out a validation set. You build a decision tree, and you test it on the validation set and you pick whichever one has the lowest error in the validation set, that's one way to avoid it. There is another way you can do it that's more efficient.
- You do the same idea validation, except that you hold out a set and every time you decide whether to expand the tree or not, you check to see how this would do so far in the validation set. And if the error is low enough, then you stop expanding the tree. That's one way of doing it.
- You could do pruning. Go ahead and build the tree as if you didn't have to worry about overfitting, and once you have the full tree built, go to the leaves of the tree and ask: what if I collapse these leaves back up into the tree? What error does that create on my validation set? If the error is too big, then you don't do it; if it's very small, then you go ahead and do it. And that should help you with overfitting.
- There's a whole bunch of different ways you might prune. But pruning, itself, is one way of dealing with overfitting, and giving you a smaller tree. And it's a very simple addition to the standard ID3 algorithm.

- Another consideration is the problem of regression.
- So far we've only been doing classification where the outputs are discrete, but what if we were trying to solve something that looked more like x squared or two x plus 17 or some other continuous function. In other words, a regression problem. How would we have to adapt decision trees to do that?
- What you really have now is a question about splitting. What's the splitting criteria?
- There's also an issue of what you do in the leaves. You could do some sort of more standard fitting algorithm. So report the average or do some kind of a linear fit. There’s any number of things you can do.
- There's also the question of errors and how you would report an output. If you don't have a clear answer, where everything is labeled true or everything is labeled false, how do you pick? Something like an average would work there. If we're trying to get as many right answers as we can, then you probably want to do a vote in the leaves.

- In supervised learning we can take examples of inputs and outputs, and based on that we are going to be able to take a new input and predict the corresponding output for that input.
- We are going to be talking about mapping continuous inputs to outputs.

- I'm going to draw a graph and on this axis will be the parent height. And on this axis will be the average child height.
- If we plot these against each other, let me put the mean up here; let's say that this is the mean height for the population. We imagine that parents of average height will have children of average height. But parents that are really tall, like that hypothetical person from before, will have children that are taller than average, but not as tall as themselves. And similarly, people of smaller stature will have children that are also short, but not quite as short, again closer to the mean.
- It turns out that you have this very nice linear relationship between these quantities, and there's an important aspect to it: the slope of this line is less than one; it's about two thirds.

- If the slope of this line were one, it would mean that everybody's children would have the height of their parents.
- But the slope is less than one, as it turns out to be in real populations.

- The fact that this slope is less than one is what makes it regression to the mean.
- Regression now refers not to this idea of collapsing back towards the mean, but to the idea of using a functional form to approximate a bunch of data points.

- One of the things that's very helpful about regression is that in many ways it's very simple to visualize and very simple to think about. Many of the topics in machine learning that are really important to understand, and that are sometimes difficult concepts, come up here in a fairly easy-to-understand way.
- This graph that I put up here, is supposed to tell us a little bit about housing prices.
- Let's imagine that we're off to buy a house, and what we notice is that there are lots of different houses on the market, in lots of different sizes.
- The square footage of a house can vary. In this case the houses that I visited are between about 1,000 and 10,000 square feet. As you get bigger houses, the prices tend to go up too: the price of a house tends to rise with its size.
- What I've done here is I've plotted as a little x say a set of nine houses that I've observed.
- We start off over here with a house that's 1,000 square feet and costs $1,000, and we end up with a house that's 10,000 square feet and costs about $6,000.
- Now we want to answer a question: if we find a house on the market that's about 5,000 square feet, what do you think a fair price for it would be?
- Apparently about $5,000. But there was no corresponding point for that, so you had to interpolate based on the points that were there; you had to kind of imagine what might be happening at the 5,000 square foot mark.
- What we're going to do in this case is actually try to find a function that fits this.
- What would be the best linear function that captures the relationship between the size and the cost? The green line turns out to be the one that minimizes the squared error, the squared deviation, between these x points and the corresponding positions on the green line.

- So it finds a way of balancing all those different errors against each other and that's the best line we've got.
- Now in this particular case, this line predicts something more like $4,000. It doesn't really look like a very good fit. But it does at least capture the fact that cost increases with size.

- Define best fit: the one that has the least squared error. The error is the sum of the squared distances between these x points and the green line that we fit. We can use calculus to do this.
- Imagine that what we're trying to do is that we've got a bunch of data points, and we're trying to find the best constant function. The best function that has the form, the value of the function for any given X is always the same constant, C.
- The error is going to be the sum over all of the data points. The square difference between that constant we chose and what the actual y value is.
- There are many different error functions that could work (e.g., absolute error, squared error, various kinds of squashed errors).

- It turns out that this one is particularly well behaved because this error function is smooth as a function of the constant c, we can use calculus to actually find the minimum error value.

- Using the chain rule, if you want to find how the error function's output changes as a function of the input c, take the derivative of the sum: E(c) = Σ_i (c - y_i)^2, so dE/dc = Σ_i 2(c - y_i).
- This gives us a nice, smooth function saying what the error is as a function of c. If we want to find the minimum, we set it equal to zero.
- Now we just need to solve this equation for c: Σ_i 2(c - y_i) = 0, so n*c = Σ_i y_i, which gives c = (1/n) Σ_i y_i.
- This results in the mean. The best constant is the average of all your y's, the mean.
- So in this case, we just have to average the y's together, and that gets us the thing that minimizes the squared error. Squared error is a really nice thing because it tends to bring things like the mean back into the picture. It's really very convenient, and it generalizes to higher order functions.
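That the best constant really is the mean can be checked numerically; a tiny sketch with made-up y values:

```python
# Check numerically that the constant minimizing the sum of squared errors
# is the mean of the y's.
ys = [2.0, 3.0, 5.0, 10.0]

def sse(c, ys):
    # Sum of squared differences between the constant c and each y value
    return sum((c - y) ** 2 for y in ys)

mean = sum(ys) / len(ys)  # 5.0
assert sse(mean, ys) <= sse(mean + 0.1, ys)
assert sse(mean, ys) <= sse(mean - 0.1, ys)
```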

- The idea is that we're going to try to find a way of predicting the value at various points along this curve.
- One thing we could do is find the best line. But we also talked now about finding the best constant. Turns out these all belong to a family of functions that we could fit. Which are functions of this form.
- We've got x is our input and what we're going to do is we're going to take some constant and add that to some scaled version of x times some scaled version of x squared plus some scaled version of x cubed, all the way up to some order k.
- And we've talked about k equals zero, the constant function. And k equals one, the line. But there's also k equals two, parabola.
- It's going up and it's kind of flattening out and maybe we could imagine that it starts coming down again? At least, over the course of these points, it doesn't come down again but at least it sort of flattened out.
- We've got the best line now, the best constant function which is just the average. We have the best line with some slope to it. That's the green one. We have now the best parabola, kind of gets tucked in with all those other points.
- If the only thing we care about is minimizing the sum of squared error, the parabola has less squared error.
- There's more degrees of freedom so at the worst we could have just fit the parabola as a line. Right, we can always just set any of these coefficients to zero.
- So if the best fit to this really was a line then the parabola that we see here wouldn't have any curve to it.
- We have gone from order zero to order one to order two. How about order nine? With nine points, order eight is the highest order that buys us anything, and it's the best we can do here. And look what it did: it hit every single point dead on in the center. It used all the degrees of freedom it had to reduce the error to essentially zero.

- One could argue that this is a really good idea. Though, if you look at what happens around 9,000, there's some craziness: to get this high-order curve to hit that particular point, it sent the curve soaring down and then back up again.

- To show that as we have more degrees of freedom we fit the data better, look at the amount of error for the best fit at each of these orders of k.
- What you see when we actually plot the squared error, the function we're trying to minimize, is that as we go from order zero to order one, order two, order three, order four, order five, all the way to eight, by eight there is no error left, because it nailed every single point. So that seems kind of good, but it doesn't feel quite right: the curve we were looking at looked a little bit crazy.

- We can arrange all these constraints, all these equations, into matrix form, if you're familiar with linear algebra.
- So we arrange all the x values into a matrix X, and then we have these other guys, the coefficients, which we'll call w. Obviously w stands for coefficient. And we want Xw to equal the vector of y's.
- We basically just need to solve this equation for the w's. We can't exactly solve it, because Xw is not going to exactly equal y, but we can solve it in a least squares sense.
- Let’s just step through the steps for how we're going to solve for w.

- We start from Xw ≈ y. Premultiply both sides by the transpose of X: X^T X w = X^T y.

- X^T X is going to have a nice inverse, so now we can premultiply by that inverse: (X^T X)^-1 X^T X w = (X^T X)^-1 X^T y.

- Now, conveniently, since this has a nice inverse, the inverses cancel each other: w = (X^T X)^-1 X^T y.

- That w is the weights we're looking for; it gives us exactly the coefficients that we need.
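As an illustration of w = (X^T X)^-1 X^T y, here's a least-squares line fit worked out in plain Python for the degree-one case, where the 2x2 inverse can be written by hand (the data points are made up):

```python
# Least-squares line fit via the normal equations, w = (X^T X)^-1 X^T y,
# for a degree-one polynomial: each row of X is [1, x_i].
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]  # made-up data, roughly y = 2x + 1

n = len(xs)
sx = sum(xs)                       # entries of X^T X and X^T y
sxx = sum(x * x for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

det = n * sxx - sx * sx            # determinant of the 2x2 matrix X^T X
w0 = (sxx * sy - sx * sxy) / det   # intercept
w1 = (n * sxy - sx * sy) / det     # slope
print(w0, w1)  # roughly 1.15 and 1.94
```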

- Part of the reason we can't just solve these kinds of problems exactly, by solving a system of linear equations and being done with it, has to do with the presence of errors.
- The training data that we are given has errors in it.
- And it's not that we're actually modeling a function, but the thing that we're seeing is the function plus some error term on each piece of data.

- Imagine that we've got our data, this is our training set. Ultimately what we're trying to do is find a way of predicting values and then testing them.
- We do some kind of regression and we might want to fit this to a line. And the line is good, it kind of captures what's going on and if we apply this to the testing set, maybe it's going to do a pretty good job.
- But, if we are feeling kind of obsessive compulsive about it we might say well in this particular case we didn't actually track all the ups and downs of the data.
- We might want to use a higher order polynomial to fit this better. If we can fit this with a higher order polynomial maybe it'll hit all these points much better.
- So now we have this other shape and it's making weird predictions in certain places.
- What we really would like to do is we'd like to use a model that's complex enough to actually model the structure that's in the data that we're training on, but not so complex that it's matching so directly that it doesn't really work well on the test set.
- Unfortunately, we don't really have the test set to play with, because using it would be too much like teaching to the test.
- We need to actually learn the true structure that is going to need to be generalized.
- We could take some of the training data and pretend it's a test set. That wouldn't be cheating, because it's not really the test set. We hold out some of it as a kind of make-pretend test set, a trial test set, which we're going to call the cross validation set. It's going to be a stand-in for the actual test data.
- This cross validation set is going to be really helpful in figuring out what to do.
- We're going to take our training data, and we're going to split it into what are called folds. We're going to train on the first three folds, and use the fourth one to see how we did.
- Train on the second third and fourth fold and check on the first one.
- And we're going to try all these different combinations, leaving out each fold in turn as a kind of fake test set.
- And then average these errors. The goodness of fit. Average them all together, to see how well we've done.
- And, the model class, the degree of the polynomial in this case that does the best job, the lowest error, is the one that we're going to go with.
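The fold procedure above can be sketched in a few lines; here the "model" is just the best constant (the mean), purely for illustration:

```python
def k_fold_cv(ys, k, fit, error):
    # Split ys into k folds; train on k-1 of them, score the held-out fold,
    # and average the held-out errors.
    fold = len(ys) // k
    errs = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        train, test = ys[:lo] + ys[hi:], ys[lo:hi]
        errs.append(error(fit(train), test))
    return sum(errs) / k

fit = lambda ys: sum(ys) / len(ys)                             # best constant = mean
error = lambda c, ys: sum((c - y) ** 2 for y in ys) / len(ys)  # mean squared error

ys = [1.0, 2.0, 3.0, 4.0]
print(k_fold_cv(ys, 4, fit, error))
```

In practice `fit` would fit a polynomial of some degree, and you would run this once per degree, choosing the degree with the lowest averaged error.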

- What happens with the training error:

- As we increase the degree of the polynomial from constant to linear to quadratic, all the way up to (in this case) order six, the training error is always falling.
- As you go up, you have more ability to fit the data, closer and closer and closer because, each of these models is nested inside the other.
- We can always go back: if the order-zero constant fits best and I give you six degrees of freedom, you can still fit the constant by setting the extra coefficients to zero.

- Now let's use this idea of cross validation: train on the rest of the data, predict on the held-out chunk, repeat that for all the different chunks, and average the results together.
- So I actually did that. And this is what I got with the cross validation error.
- We have this red plot that is constantly falling, and the blue plot, the cross validation error, which starts out a little bit higher than the red plot. The red plot has lower error because we're actually training to minimize error on the training set.
- The parts we aren't looking at are the parts you're more likely to have some error on. It makes sense that you'd have a little bit more error on data you haven't seen.
- Let's look at what happens as we start to increase the degree: we've got the ability to fit this data better and better, and in fact, down at orders three and four, the two are actually pretty close in terms of their ability to fit these examples.
- Then what's really interesting is what happens as we give the model the ability to fit the data closer and closer. By the time we get up to an order-six polynomial, even though the error on the training set is really low, the cross validation error is really high. And this is beautiful: this inverted U is exactly what you tend to see in these kinds of cases. The error decreases as you have more power, and then it starts to increase as you use too much of that power.
- As it gets more and more and more power it tends to overfit the training data at the expense of future generalization.
- If you don't give enough degrees of freedom, you don't give a model class that's powerful enough and you underfit the data. You won't be able to model what's actually going on and there'll be a lot of error.
- But if you give yourself too much you can overfit the data. You can actually start to model the error and it generalizes very poorly to unseen examples.
- And somewhere in between is kind of the goldilocks zone. Where we're not underfitting and we're not overfitting. We're fitting just right. And that's the point that we really want to find. We want to find the model that fits the data without overfitting, and not underfitting.

- Up to this point I've been talking about regression in the context of a scalar, continuous input: basically this x variable. But the truth of the matter is we could have vector inputs as well.
- If you look at the housing example, like we said earlier, there are a bunch of features that we weren't keeping track of. So we could have added some of those.
- If you think about lines, we can just generalize to planes and hyperplanes.
- This notion of polynomial function generalizes very nicely.

- In the field of artificial neural networks we have kind of a cartoonish version of the neuron and networks of neurons and we actually put them together to compute various things.
- One of the nice things about the way that they're set up is that they can be tuned or changed so that they fire under different conditions and therefore compute different things.
- They can be trained through a learning process.
- We're going to have inputs, think of them as firing rates or the strength of inputs. X1, X2, and X3 in this case.
- Those are multiplied by weights, w1, w2, w3 correspondingly.
- The weights turn up the gain or the sensitivity of the neuron, to each of the inputs respectively.
- Then we're going to sum them up. We're going to sum over all the inputs.
- The strength of the input times the weight, and that's going to be the activation.
- Then determine if that is greater than or equal to the firing threshold. If it is then we're going to say the output is one and if it's not, we're going to say the output is zero.
- This is a neural net unit called a Perceptron.

- Here the input is 1, 0, -1.5 for the three different inputs in this case.
- The corresponding weights are ½, ⅗, and 1.
- The threshold, let's say, is 0; the unit should fire if the weighted sum is greater than or equal to 0.

- Multiply x1 times w1 so that gives us a 1/2
- Multiply 0 times 3/5 which would get a 0
- Multiply -1.5 times 1 which will give us - 3/2.
- -1.5 plus a 1/2, so it should be negative one.

- That's the activation. We check whether the activation is at or above our threshold theta, which in this case is 0, and it is not.
- So the output should be 0.
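The quiz computation above can be sketched as a tiny perceptron function (the numbers are the ones from the quiz):

```python
# A single perceptron unit: weighted sum of inputs, compared to a threshold.
def perceptron(x, w, theta):
    activation = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if activation >= theta else 0

x = [1, 0, -1.5]       # the three inputs from the quiz
w = [0.5, 0.6, 1.0]    # weights 1/2, 3/5, and 1
print(perceptron(x, w, theta=0))  # activation = 0.5 + 0 - 1.5 = -1, so output 0
```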

- Notice that the relationships are all linear here.
- So solving this linear inequality gets us a picture like this.
- So this perceptron computes a kind of half plane
- The half of the plane that's above this line gives us the 1 answers, and below that line gives us the 0 answers. Perceptrons are always going to compute lines.
- Perceptron is a linear function, and it computes hyperplanes.

- We'd really like a system that, given examples, finds weights that map the inputs to the outputs.
- We're going to actually look at two different rules that have been developed for doing exactly that, to figuring out what the weights ought to be from training examples.
- One is called the Perceptron Rule, and the other is called gradient descent or the Delta Rule.

- The perceptron rule is going to make use of the thresholded outputs.
- Gradient descent is going to use unthresholded values.

- We've got a training set, which is a bunch of examples of x. These are vectors.
- We have y’s, which are zeros and ones. These are the output that we want to hit.
- What we want to do is set the weights so that we capture this same data set. And we're going to do that by, modifying the weights over time.
- So one of the things that we're going to do here is give a learning rule for the weights w, and not give a learning rule for theta, but we do need to learn the theta.

- There's a very convenient trick for actually learning them by just treating it as another kind of weight.
- So if you think about the way that the thresholding function works: we're taking a linear combination of the w's and x's, then we're comparing it to theta.
- But if you think about just subtracting theta from both sides, then, in some sense theta just becomes another one of the weights, and we're just comparing to zero.
- So what I did here was take the actual data, the x's, and I added what is sometimes called a bias unit.
- So basically the input is one always to that. And the weight corresponding to it is going to correspond to negative theta ultimately.
- This just simplifies things so that the threshold can be treated the same as the weights.

- So from now on, we don't have to worry about the threshold. It just gets folded into the weights, and all our comparisons are going to be just to zero instead of theta.
- We're going to iterate over this training set, grabbing an x, which includes the bias piece, and the y, where y is our target and X is our input.
- We're going to change weight i, the weight corresponding to the ith unit, by some amount: the weight change.
- We need to define what that weight change is. We're going to take the target, the thing that we want the output to be, and compare it to what the network with the current weights actually spits out.
- So we compute this, this y hat. This approximate output y. By again summing up the inputs according to the weights and comparing it to zero.
- That gets us a zero one value.
- So we're now comparing that to what the actual value is.

- If they are both zeros then this y minus y hat is zero and it means the output should have been zero and the output of our current network really was zero, so that's good.
- If they are both ones, it means the output was supposed to be one and our network outputted one, and the difference between them is going to be zero.
- But in this other case, y minus y hat, if the output was supposed to be zero, but we said one, our network says one, then we get a negative one.
- If the output was supposed to be one and we said zero, then we get a positive one.

- Those are the four cases for what's happening here.
- We're going to take that value, multiply it by the current input to that unit i, scale it down by something that is going to be called the learning rate, and use that as the weight update.
- If the output is already correct either both on or both off. Then there's going to be no change to the weights.
- If our output is wrong, i.e. we are giving a one when we should have been giving a zero, that means the total here is too large, and so we need to make it smaller.

- Whichever inputs xi correspond to very large values, we're going to move those weights very far in a negative direction.
- We're taking this negative one times that value times this little learning rate.

- The other case is if the output was supposed to be one but we're outputting a zero; that means our total is too small.

- And what this rule says is increase the weights essentially to try to make the sum bigger.

- Now, we don't want to overdo it, and that's what this learning rate is about.
- Learning rate basically says we'll figure out the direction that we want to move things and just take a little step in that direction.
- We'll keep repeating over all of the input output pairs. We'll have a chance to get into really building things up, but we're going to do it a little bit at a time so we don't overshoot.
- If there is such a half plane that separates the positive from the negative examples, then we say that the data set is linearly separable. That there is a way of separating the positives and negatives with a line.
- What's cool about the perceptron rule is that if we have data that is linearly separable, the perceptron rule will find it. It only needs a finite number of iterations: it will actually find a line, and it will stop after finding a set of weights that do the trick.
- That happens if the data set is in fact linearly separable and that's pretty cool.
- If the data is linearly separable, then the algorithm works, so the algorithm only needs to be run when the data is linearly separable. But it's generally not that easy to tell when your data is linearly separable. Here we have it in two dimensions; if it's in 50 dimensions, knowing whether or not there is a setting of those parameters that makes it linearly separable is not so clear.
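Here is a minimal sketch of the perceptron rule with the bias folded in as an extra weight whose input is always 1. The data set and learning rate are made-up examples; AND is linearly separable, so the rule converges after finitely many passes:

```python
# Perceptron with the threshold folded into w[0] (bias input is always 1).
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def perceptron_rule(data, eta=0.1, epochs=100):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        changed = False
        for x, y in data:
            delta = eta * (y - predict(w, x))  # 0, +eta, or -eta
            if delta != 0:
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:  # every example correct: converged, stop
            return w
    return w

# Linearly separable example: the AND function. First component is the bias input.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = perceptron_rule(data)
print([predict(w, x) for x, _ in data])  # [0, 0, 0, 1], matching AND
```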

- We are going to need a learning algorithm that is more robust to non-linear separability
- Gradient descent is going to give us an algorithm for doing exactly that.
- What we did before was:

- Did a summation over all the different input features of the activation on that input feature times the weight, w, for that input feature.
- Sum all those up and get an activation.
- And then we have our estimated output as whether or not that activation is greater than or equal to zero.

- Let's imagine that the output is not thresholded when we're doing the training, and what we're going to do instead is try to figure out the weight so that the non thresholded value is as close to the target as we can.
- This actually kind of brings us back to the regression story.
- We can define an error metric on the weight vector w.
- And the form of that is going to be E(w) = ½ Σ_d (y_d − a_d)²: one half times the sum over all the data in the dataset of what the target was supposed to be for that particular example, minus what the activation actually was, the activation being the dot product between the weights and the input, a_d = w · x_d. We're going to square that error, and we want to try to minimize it.

- Why ½ of that? It turns out that in terms of minimizing the error, this is just a constant and it doesn't matter.
- So why do we stick a half in there? We'll get back to that.

- Just like in the regression case we're going to fall back to calculus, calculus is going to tell us how we can push around these weights, to try to push this error down. We want to know how changing the weight changes the error, and we want to push the weight in the direction that causes the error to go down.

- Take the partial derivative of this error metric with respect to each of the individual weights, so that we'll know for each weight which way we should push it a little bit to move in the direction of the gradient. That's the partial derivative with respect to weight wi, of exactly this error measure.
- So to take this partial derivative we just use the chain rule as we always do. Take the power, move it to the front, keep this thing, and then take the derivative of this thing.
- This answers the question of why we put a half in there: because down the line, it's going to be really convenient that the two and the half cancel out. It's just going to mean that our partial derivative looks simpler, even though our error measure looked a little bit more complicated.
- What we're left with is the sum over all these data points of what was inside this quantity here times the derivative of that, and here I expanded the a to be, the definition of the a.
- Now, we need to take the partial derivative with respect to weight wi of this sum that involves a bunch of the w’s in it.
- So, for the terms that don't match wi, that derivative is going to be zero, because changing the weight won't have any impact on them. The only place where changing this weight has any impact is at xi. So that's what we end up carrying down: the summation disappears, and all that's left is just the one term that matches the weight that we care about.

- So this is what we're left with: the derivative of the error with respect to any weight wi is exactly the sum, over the data, of the difference between the activation and the target output, times the input value xi on that unit.
- That looks almost exactly like the rule that we use with the perceptrons before.
- This is now just a derivative, but let's actually write down what our weight update is going to be because we're going to take a little step in the direction of this derivative and it's going to involve a learning rate.

Here are the update rules as they end up being:

- The gradient descent rule says what we want to do is move the weights in the negative direction of the gradient.

- If we negate the expression that we had before and take a little step in that direction we get exactly this expression.
- Multiply the input on that weight times the target minus the activation.
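The gradient descent (delta) rule above can be sketched in plain Python; the data set below, fitting y = 2x − 1 with the bias folded in, is hypothetical:

```python
# Delta rule: unthresholded activation, batch step down the error gradient.
def activation(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def gradient_step(w, data, eta):
    # Accumulate (y - a) * x_i, the negative of dE/dw_i, then step that way.
    grad = [0.0] * len(w)
    for x, y in data:
        err = y - activation(w, x)
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return [wi + eta * gi for wi, gi in zip(w, grad)]

# Hypothetical data from y = 2*x1 - 1; the bias is folded in as x0 = 1.
data = [([1, 0.0], -1.0), ([1, 1.0], 1.0), ([1, 2.0], 3.0)]
w = [0.0, 0.0]
for _ in range(500):
    w = gradient_step(w, data, eta=0.05)
print(w)  # approaches [-1.0, 2.0]
```

Note the convergence is only in the limit: each step just moves a little in the direction that reduces the squared error.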

- In the perceptron case, we take that same activation, threshold it, determine whether it's positive or negative, and put in a zero or a one; that's what y hat (ŷ) is.

- So really it's the same thing except in one case we have done the thresholding and in the other case we have not done the thresholding.
- But we end up with two different algorithms with two different behaviors.

- The perceptron rule has this nice guarantee of finite convergence, which is a really good thing, but only in the case where we have linear separability.
- Whereas the gradient descent rule is good because it's more robust to data sets that are not linearly separable, but it's only going to converge in the limit, and only to a local optimum.

- Once we see these two things next to each other, it raises the question: why don't we just use gradient descent on an error metric that's defined in terms of ŷ instead of the activation?

- ŷ is the thing that we really want to match the output.

- We don't really want the activation to match the output. There's no need for that.
- So why don't we do gradient descent on ŷ?
- There could be many reasons but the main reason is it's not differentiable. It's a discontinuous function. There's no way to take the derivative at the point where it's discontinuous.
- This activation thing. The change from activation to ŷ has this big step function jump in it, right at zero.
- So once the activation goes positive, actually at zero. It jumps up to one. And before that, it's not. So the derivative is basically zero, and then not differentiable, and then zero again.
- So really, the zero's not giving us any direction to push, in terms of how to fix the weights. And the undefined part doesn't really give us any information either. So this algorithm doesn't really work, if you try to take the derivative through this discontinuous function.
- What if we made this, more differentiable? What is it that makes this so un-differentiable? It's this really pointy spot.
- You could imagine a function that was like this, but then, instead of the pointy spot, it's smoothed out a bit. Kind of a softer version of a threshold, which isn't exactly a threshold, but at least it's differentiable.
- So that would force the algorithm to put its money where its mouth is.
- If that really is the reason that the problem is non-differentiable, fine, we'll make it differentiable.

- Let's take a look at a function called the sigmoid.
- We're going to define the sigmoid using the letter sigma (σ), and it's going to be applied to the activation just like we were doing before. But instead of thresholding it at zero, what it's instead going to do is compute this function: σ(a) = 1 / (1 + e^(−a)).

- It ought to be clear that as the activation gets less and less, we'd want it to go to zero, and in fact it does.

- As a goes to negative infinity, −a goes to infinity, and e^(−a) is something really, really big. So it's 1 over (1 plus something really big), which is almost zero.

- The sigmoid function goes to zero as the activation goes to negative infinity, just like the threshold did.

- As the activation gets really, really large, we're talking about e to the minus something really large, which is like e to the negative infinity, which is almost zero; so 1 over (1 plus 0) is essentially one.

- Reference slide: at minus five and below it's essentially at zero, then it makes this kind of gradual, sigmoid s-shaped curve, and it comes back up to the top and is basically at one by the time it gets to five.
- Instead of just an abrupt transition at zero, we have this gradual transition between negative five and five. Now it's differentiable, and now we can take derivatives.
- Not only is this function differentiable, but the derivative itself has a very beautiful form. It turns out if you take the derivative of this sigma function, it can be written as the function itself times one minus the function itself.

- This is just really elegant and simple. If you have the sigma function in your code, there's nothing special that you need for the derivative. You could just compute it this way.
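A sketch of the sigmoid and its elegant derivative form, checked against a numerical derivative:

```python
import math

def sigmoid(a):
    # sigma(a) = 1 / (1 + e^(-a))
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_deriv(a):
    # The elegant form: sigma'(a) = sigma(a) * (1 - sigma(a)).
    s = sigmoid(a)
    return s * (1.0 - s)

# Sanity check against a central-difference numerical derivative.
for a in (-2.0, 0.0, 3.0):
    h = 1e-6
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
    print(a, sigmoid_deriv(a), numeric)
```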

- Now the idea is that using exactly these kind of sigmoid units, we can construct a chain of relationships between the input layer, which are the different components of x, with the output y.
- The way this is going to happen is there's other layers of units in between and each one is computing the weighted sum, sigmoided, of the layer before it.
- These other layers of units are often referred to as hidden layers, because you can only see the inputs and the outputs; this other stuff is less constrained, or indirectly constrained.
- Each of these units takes the weights, multiplied by the things coming into it, put it through the sigmoid and that's your activation, that's your output.
- What's cool about this is, in the case where all these are sigmoid units this mapping from input to output is differentiable in terms of the weights, and by saying the whole thing is differentiable, what I'm saying is that we can figure out for any given weight in the network how moving it up or down a little bit is going to change the mapping from inputs to outputs.
- So we can move all those weights in the direction of producing something more like the output that we want. Even though there's all these sort of crazy non linearities in between.
- And so, this leads to an idea called backpropagation, which at its heart is a computationally beneficial organization of the chain rule.

- Compute the derivatives with respect to all the different weights in the network, all in one convenient way, that has this lovely interpretation of having information flowing from the inputs to the outputs.
- Then error information flowing back from the outputs towards the inputs
- That tells you how to compute all the derivatives
- Then, therefore how to make all the weight updates to make the network produce something more like what you wanted it to produce.

- This is where learning is actually taking place, this backpropagation is referring to the fact that the errors are flowing backwards. Sometimes it is even called error backpropagation.
- If you replace the sigmoid units with some other function and that function is also differentiable, then we can still do this basic kind of trick that says we can compute derivatives, and therefore we can move weights around to try to get the network to produce what we want it to produce.
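A minimal backpropagation sketch for a one-input, one-hidden-unit, one-output network of sigmoid units (the weights and inputs are made-up numbers). Backprop here is literally the chain rule applied layer by layer, and the result is checked against numerical differentiation:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# One input -> one sigmoid hidden unit -> one sigmoid output unit.
def forward(w1, w2, x):
    h = sigmoid(w1 * x)
    out = sigmoid(w2 * h)
    return h, out

# E = 1/2 (y - out)^2; gradients via the chain rule, errors flowing backwards.
def backprop_grads(w1, w2, x, y):
    h, out = forward(w1, w2, x)
    d_out = (out - y) * out * (1 - out)   # error signal at the output unit
    dE_dw2 = d_out * h
    d_h = d_out * w2 * h * (1 - h)        # error flowing back to the hidden unit
    dE_dw1 = d_h * x
    return dE_dw1, dE_dw2

# Hypothetical numbers; compare backprop to a numerical derivative.
w1, w2, x, y = 0.4, -0.7, 1.5, 1.0

def error(w1, w2):
    _, out = forward(w1, w2, x)
    return 0.5 * (y - out) ** 2

h_ = 1e-6
num1 = (error(w1 + h_, w2) - error(w1 - h_, w2)) / (2 * h_)
g1, g2 = backprop_grads(w1, w2, x, y)
print(g1, num1)  # these agree closely
```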
- Because this is really gradient descent rather than the perceptron rule (we're not doing the hard thresholding), we don't have guarantees of convergence in finite time.
- In fact, the error function can have many local optima. What we mean by that is this idea that we're trying to set the weights so that the error is low, but you can get into situations where none of the weights can change without making the error worse. You'd like to think we're done, we've made the error as low as we can make it, but in fact we could just be stuck in a local optimum: there's a much better way of setting the weights, it's just that we have to change more than one weight at a time to get there.
- So if we think about the sigmoid and the error function that we picked, the error function was sum of squared errors, so that looks like a parabola in some high dimensional space, but once we start combining them with others like this over and over again then we have an error space where there may be lots of places that look low but only look low if you're standing there but globally would not be the lowest point.
- So you can get these situations in just the one unit version where the error function as you said is this nice little parabola and you can move down the gradient and when you get down to the bottom you're done.
- But when we start throwing these networks of units together, we can get an error surface that, in its cartoon form, looks crazy like the slide, and you could easily get yourself stuck at a point where you're not at the global minimum. You're at some local optimum.

- One of the things that goes wrong, when you try to actually run gradient descent on a complex network with a lot of data is that you can get stuck in these local minima. I'm trying to find a set of weights for the neural network that tries to minimize error on the training set.
- So, gradient descent is one way to do it, and it can get stuck, but there's other kinds of advanced optimization methods that become very appropriate here. In fact, there's a lot of people in machine learning who think of optimization and learning as kind of being the same thing.
- What you're really trying to do in any kind of learning problem is solve this high order, very difficult optimization problem to figure out what the learned representation needs to be.
- There's things like using momentum terms in the gradient

- As we're doing gradient descent we don't want to get stuck on this ball here, we want to kind of pass all the way through it to get to this ball, so maybe we need to just continue in the direction we've been going.
- So, think of it as a kind of physical analogy: instead of just going to the bottom of this hill and getting stuck, it can bounce out, pop over, and come to what might be a lower minimum later.

- There's a lot of work in using higher order derivatives to better optimize things instead of just thinking about the way that individual weights change the error function to look at combinations of weights.

- Hamiltonians and whatnot.

- There's various ideas for randomized optimization that can be applied to make things more robust.
- And sometimes it's worth thinking we don't really want to just minimize the error on the training set, we may actually want to have some kind of penalty for using a structure that's too complex.

- Something similar also happened in the decision tree section. We had an issue with decision trees where if we let the tree grow too much to explain every little quirk in the data, you'd overfit.
- We came up with a lot of ways of dealing with that, like pruning. Not going too far deeply into the tree. You can either do that by filling out the tree and then backing up so you only have a little bit of small error or by stopping once you've reached some sort of threshold as you grow the tree out.
- That's really the same as giving some kind of penalty for complexity.
- Complexity in the tree setting has to do with the size of the tree, in regression it had to do with the order of the polynomial.

- For neural networks, there are two things you can do:

- Add more and more nodes
- Add more and more layers.

- The more nodes that we put into a network:

- The more complicated the mapping becomes from input to output
- The more local minima we get
- The more we have the ability to actually model the noise, which brings up exactly the same overfitting issues.

- Another thing that can make a complex network is the numbers, the weights, are very large.

- So same number of weights, same number of nodes, same number of layers, but larger numbers often leads to more complex networks and the possibility of overfitting.
- Sometimes we want to penalize a network not just by giving it fewer nodes or layers but also by keeping the numbers in a reasonable range.

- What are neural networks more or less appropriate for, what is the restriction bias and the inductive bias of this class of classifiers, and regression algorithms?
- Restriction bias tells you something about the representational power of whatever data structure it is that you're using and it tells you the set of hypotheses that you're willing to consider.

- If there's a great deal of restriction, then there's lots of different kinds of models that we're not even considering.

- We started out with a simple perceptron unit, and that we decided was linear. So we were only considering planes.
- Then we move to networks, so that we could do things like XOR, and that allowed us to do more.
- Then we started sticking Sigmoids and other arbitrary functions into nodes so that we could represent more and more. If you let weights get big and we have lots of layers and lots of nodes they can be really complex.
- Therefore we are actually not doing much of a restriction at all.
- Boolean functions: we can represent any Boolean function if we give ourselves a complex enough network with enough units. We can basically map all the different sub-components of any Boolean expression to threshold-like units and build a circuit that can compute whatever Boolean function we want.
- What about continuous functions? A continuous function is one where, as the input changes the output changes somewhat smoothly, there's no jumps in the function. There's no discontinuities.

- If we've got a continuous function that we're trying to model with a neural network, as long as it's connected, it has no discontinuous jumps to any place in the space, we can do this with just a single hidden layer, as long as we have enough hidden units, as long as there's enough units in that layer.
- One way to think about that is, if we have enough hidden units, each hidden unit can worry about one little patch of the function that it needs to model.
- The patches get set at the hidden layer and at the output layer they get stitched together. And if you just have that one layer you can make any function as long as it's continuous.

- If it's an arbitrary function, we can still represent that in our neural network.

- Any mapping from inputs to outputs we can represent, even if it's discontinuous, just by adding one more hidden layer, so two total hidden layers.
- And that gives us the ability to not just stitch these patches at their seams, but also to have big jumps between the patches.

- So in fact, neural networks are not very restrictive in terms of their bias as long as you have a sufficiently complex network structure, so maybe multiple hidden layers and multiple units.
- This is worrisome because it means that we're almost certainly going to overfit. We're going to have arbitrarily complicated neural networks and we can represent anything we want to. Including all of the noise that's represented in our training set.
- It is the case though, that when we train neural networks, we typically give them some bounded number of hidden units and we give them some bounded number of layers.

- It's not like any fixed network can actually capture any arbitrary function, any fixed network can only capture whatever it can capture, which is a smaller set.
- So going to neural nets in general doesn't have much restriction, but any given network architecture actually does have a bit more restriction.

- Another thing we can do with overfitting is what we've done the other times we've had to deal with overfitting.

- Use ideas like, cross validation to decide:

- How many hidden layers to use.
- How many nodes to put in each layer.
- When to stop training because the weights have gotten too large.

- It's probably worth pointing out that this is a different property from the other classes of supervised learning algorithms we've looked at so far.

- In a decision tree, you build up the decision tree and you may have overfit.
- In regression, you solve the regression problem, and again that may have overfit.

- What's interesting about neural network training is that it's this iterative process: you start it running, and as it's running, the error is actually going down and down.
- So, in this standard kind of graph, we get the error on the training set dropping as we increase iterations.

- It's doing a better and better job of modeling the training data.

- But, in classic style, if you look at the error in some kind of held-out test set, or maybe in a cross validation set, you see the error starting out kind of high and maybe dropping along with this, and at some point it actually turns around and goes the other way.
- So here, even though we're not changing the network structure itself, we're just continuing to improve our fit, we actually get this pattern that we've seen before, that the cross validation error can turn around and at this low point, you might want to just stop training your network there.
- The more you train it, possibly the worse you'll do.
- It's reflecting this idea that the complexity of the network is not just in the nodes and the layers, but also in the magnitude of the weights.
- Typically what happens at this turnaround point is that some weights are actually getting larger and larger and larger.
- Just wanted to highlight that difference between neural net function approximation and what we see in some of the other algorithms.

- Preference bias tells you something about the algorithm that you are using to learn that tells you, given two representations, why I would prefer one over the other.
- Think back what we talked about with decision trees, we preferred trees where nodes near the top had high information gain, we preferred correct trees, we preferred trees that were shorter to ones that were longer unnecessarily and so on and so forth.
- We haven't actually chosen an algorithm. We talked about how derivatives work and how backpropagation works, but we haven't talked about how we start. You tell me how to update the weights, but how do I start out with the weights?
- You can't run this algorithm without initializing the weights to something. We did talk about how you update the weights, but they don't just start undefined, and you can't update something that's undefined. We have to set the initial weights to something.
- A pretty typical thing for people to do is small random values.

- Random values because we have no particular reason to pick one set of values over another, so you start somewhere in the space. It probably helps us avoid local minima. There's also the issue that if we run the algorithm multiple times and get stuck, we'd like it not to get stuck in exactly the same place again. So randomness gives some variability, which is helpful in avoiding local minima.
- Small values because if the weights get really big, that can sometimes lead to overfitting, because it lets you represent arbitrarily complex functions.

- If we start out with small random values, that means we are starting out with low complexity. So that means we prefer simpler explanations to more complex explanations, correct answers to incorrect answers, and so on and so forth.
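A sketch of the "small random values" initialization; the 0.05 scale below is an arbitrary illustrative choice, not a prescribed constant:

```python
import random

# Initialize a layer's weights to small random values, e.g. uniform in [-scale, scale].
def init_weights(n_in, n_out, scale=0.05, rng=random):
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

random.seed(42)
w = init_weights(3, 2)
# Every entry is small, so the initial network computes a simple, near-linear function.
print(w)
```

Running the algorithm again with a different seed starts the search somewhere else, which is the variability the notes mention for escaping a particular local minimum.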
- So neural networks implement a kind of bias that says we prefer correct over incorrect but all things being equal, the simpler explanation is preferred.
- This is reminiscent of the principle known as Occam's razor, which is often stated as: entities should not be multiplied unnecessarily.

- It's necessary if you're getting better explanatory power, you're fitting your data better.
- Unnecessarily would mean we're not doing any better at fitting the data. If we're not doing any better at fitting the data, then we should not multiply entities.
- Multiply here means make more complex. So don't make something more complex unless you're getting better error, or, if two things have similar error, choose the simpler one, use the one that's less complex.
- It has been shown that, if you mathematize this and use it in the context of supervised learning, we get better generalization error with simpler hypotheses.

Before:

- We were given a bunch of training data,
- And we would then learn some function.
- For example, if we have a bunch of points in a plane, we might learn a line to represent them
- And what was happening here is we take all this data, we come up with some function that represents the data, and then we would throw the data away effectively
- And then in the future when we get some data point, let's call it x, we would pass it through this function whatever it is, and that would be how we would determine answers going forward.
- So I want to propose an alternative, and the alternative is basically going to not do this.

What I'm proposing:

- Take all the training data we had and put it in a database.
- Next time we get some new x to look at just look it up in the database.
- Done.
- None of this fancy-schmancy learning. None of this producing an appropriate function like wx+b. None of that fancy stuff anymore. We just stick the data into the database and look it up when we need it. Period.
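The "just put it in a database" learner is a few lines of Python; the class name here is made up:

```python
# Instance-based "learning" as pure storage: a dict from x to y.
class LookupLearner:
    def __init__(self):
        self.db = {}

    def train(self, data):
        # Fast: no optimization at all, just record every (x, y) pair.
        for x, y in data:
            self.db[x] = y

    def query(self, x):
        if x not in self.db:
            # No generalization: unseen points get "no such point found".
            raise KeyError("no such point found")
        return self.db[x]

learner = LookupLearner()
learner.train([(1, 10), (2, 20), (3, 30)])
print(learner.query(2))  # 20, exactly what was stored
# learner.query(2.5) would raise KeyError: it won't guess at unseen points.
```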

Advantages:

- It doesn't forget, so it actually is very reliable.
- It's very dependable: if you put in an x, y pair and you ask for that x, you're going to get that y back, instead of some kind of crackpot, smoothed version of it.
- Another thing is that there's none of this wasted time doing learning. It just takes the data and it very rapidly just puts it in the database. So it's fast.
- It's simple.

Issues:

- In particular, the way that you wrote it, if I give you one of the other points in between, it will return "no such point found." Which means it's really quite conservative. It's not willing to go out on a limb and say, well, I haven't seen this before, but it ought to be around here; instead it just says, I don't know.
- So the down side of remembering is, no generalization.
- A similar issue is that it reminds me of the issues that we saw with regard to overfitting. It models the noise exactly; it models exactly what it was given. So it's going to be very sensitive to noise. It can overfit in a couple of ways:

- By believing the data too much that is literally believing all of it
- What do you do if the same x shows up multiple times, but each with a different y?

- Here's some data, it's a graph and you see here's a y axis and here's an x axis, and each of these points, represents a house, on this map, which I'm, I'm cleverly, using in the background.
- You'll notice, that each of the dots is colored. I'm going to say that red represents, really expensive houses, blue represents, moderately expensive houses, and green represents, really inexpensive houses.
- Using the technique that we talked about, use ML and the data to determine the cost of the black dots
- Using nearest neighbors you can figure this out, except for the house in the middle, which doesn't really have any very close neighbors on the map.
- So, this whole nearest neighbor thing doesn't quite work, in this case when you got a bunch of neighbors that are saying different things.
- We need to look at a bigger context - look at more of my neighbors than just the closest one.
- Also, we've been talking about distance sort of implicitly. But this notion of distance is actually quite important.

- Maybe distance is straight-line distance, as the crow flies, driving distance, etc.

- We have to be very careful what we mean by distance
- So, we went from just picking our nearest neighbor to picking our nearest neighbors. What's a good number of neighbors to stick to? We could use as many nearest neighbors as is appropriate. Or maybe we should just make it a free parameter and call it K: K nearest neighbors, and we'll pick our K nearest.
- So we have an algorithm, k nearest neighbors, which takes K nearest neighbors as a way of deciding how you're going to label some query point here.
- And we've identified two parameters to the algorithm so far.

- K - the number of neighbors we're going to use.
- And some notion of distance.

- We're using distance here in a kind of overloaded sense, because this is something on a map. But really distance is a stand-in for similarity.
- In other settings, things like the number of bedrooms you have, whether you're on one side of the highway or the other, the school district you're in, things like that, are other things you might add as features or dimensions when you talk about similarity or distance.
- We have a general algorithm now and I think it does a pretty good job of addressing the points you brought up.
- We no longer have to worry about overfitting as much, at least it seems that way to me. And we have a way of being a little bit more robust about not having an exact data point in the database.

- Here is pseudocode for our K-NN algorithm. And I'm sort of writing it as like, a function:

- So, you're going to be given some training data D
- You're given some kind of distance metric or similarity function. This is important because this represents the domain knowledge
- You get some number of neighbors that you care about, which also represents domain knowledge. It tells you something about how many neighbors you think you should have.
- And then are given some particular new query point and I want to output some kind of answer, some label, some value.
- So the K-NN algorithm is remarkably simple given these things. You simply find a set of nearest neighbors such that they are the K closest to your query point.

What to return depends on the task: one case is classification and one is regression.

- Classification:

- A reasonable thing to do would be to look at the y's associated with the points in NN, and simply vote: whichever label is most frequent among the closest points wins. Essentially take a vote of the y's that are part of the neighborhood set and take the plurality, whichever one occurs the most (the mode).
- If there are ties among the outputs, then you're just going to have to pick one, and there are lots of ways you might do that. You might break the tie in favor of the label that is most commonly represented in the data overall, or just randomly pick each time, or any number of ways you might imagine.

- Regression

- In the regression case our y's are numbers. If you have to return one, a standard thing to do would be to take the average, or the mean.
- Simply take the mean of the y's, and you don't have to worry about a tie.

- If you have more than k points that are closest because of ties in distance, just take all of them: the smallest set of size greater than or equal to k.
- It's a very simple algorithm, but some of that's because a lot of decisions are being left up to the designer.

- The distance metric, the number k, how you're going to break ties, and exactly how you choose to implement voting or the mean/average operation shown here.
- You could put a bunch of different things here in the hands of the designer, and you could end up with a completely different answer.

- One thing that you might do, as an example, is rather than doing a simple vote by counting, you could do a vote that is weighted by how far away you are. So we could have a weighted vote. That might help us with ties.
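The algorithm above can be sketched in Python. This is a minimal, illustrative version (the function names and the arbitrary tie-breaking are my own choices, not from the lecture); it includes the distance-tie rule and both the voting and averaging outputs:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance: square the per-dimension differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute per-dimension differences.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn(data, query, k, distance=euclidean, mode="classification"):
    """data: list of (x, y) pairs, x a tuple of features.
    distance and k are the two parameters carrying domain knowledge."""
    ranked = sorted(data, key=lambda pair: distance(pair[0], query))
    # Per the notes: if there are ties at the k-th distance, take all of
    # them, giving the smallest neighborhood of size >= k.
    cutoff = distance(ranked[k - 1][0], query)
    neighbors = [y for x, y in ranked if distance(x, query) <= cutoff]
    if mode == "classification":
        # Plurality vote of the neighbors' labels (ties broken arbitrarily).
        return Counter(neighbors).most_common(1)[0][0]
    # Regression: mean of the neighbors' y values.
    return sum(neighbors) / len(neighbors)
```

Swapping `euclidean` for `manhattan`, or changing `k`, can change the answer entirely, which is exactly the designer-dependence the notes describe.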

- They say that KNN is a lazy learner, versus something like linear regression, which is an eager learner.
- Linear regression is eager. It wants to learn right away and it does.
- Nearest neighbor algorithms are lazy. They want to put off learning until they absolutely have to and so we refer to this class as lazy and this class as eager.
- If we never query it then the lazy learner definitely comes out ahead.

- There's several lessons here. And one lesson I don't want you to take away.
- So here's the lesson: I actually had a real function here.

- There was no noise.
- It was fairly well represented.
- The proper answer was 18 and basically none of these are right.
- But the first thing I want you to notice is you get completely different answers, depending upon exactly whether you do one versus three, whether you do Euclidean versus Manhattan.
- And that's because these things make assumptions about your domain that might not be particularly relevant.
- This suggests that maybe this thing doesn't do very well, that KNN doesn't do very well because none of these are close to 18. That seems a little sad.

- However, KNN tends to work really, really well, especially given its simplicity.
- It just doesn't in this particular case, and there's really a reason for that.

- It has to do with the fundamental assumptions, the bias, of KNN; this example happened to violate some of that bias.
- So I think it's worth taking a moment to think about what the preference bias is for KNN and to see if that can lead us to understanding why we didn't get anything close to 18 here.

- Preference bias is our notion of why we would prefer one hypothesis over another, all other things being equal. And what that really means is, it's the thing that encompasses our belief about what makes a good hypothesis.

- So in some of the previous examples that we used it was things like shorter trees, smoother functions, simpler functions, those sorts of things were the ways that we expressed our preferences over various hypothesis.
- KNN is no exception. It also has a preference bias built in, as does every algorithm of any note.

- There are three that are indicative of this bias, and they're all somewhat related to one another.
- The first one is a notion of locality.

- There's this idea that near points are similar to one another.
- The whole thing we are using to generalize from one thing to another is this notion of nearness.
- Exactly how this notion of nearness works out is embedded in whatever distance function we happen to be given. There's further bias that might come out, based upon exactly the way we implement distances.
- So, in the example we just did, euclidian distance is making a different assumption about what nearness or similarity is, compared to Manhattan distance, for example.
- So locality, however it's expressed in the distance function (that is, similarity), is built into KNN: we believe that near points are similar, kind of by definition.

- That leads actually to the second preference bias which is this notion of smoothness.

- By choosing to average and by choosing to look at points that are similar to one another, we are expecting functions to behave smoothly.
- In the 2D case it's kind of easy to see, you have these points and you're basically saying, these two points should somehow be related to one another more than this point and this point.
- And that sort of assumes kind of smoothly changing behavior as you move from one neighborhood to another.

- There is another assumption which is a bit more subtle

- For at least the distance functions we've looked at before, the Euclidian distance and the Manhattan distance, they all kind of looked at each of the dimensions and subtracted them and squared them or didn't or took their absolute value and added them all together.
- What that means is we were treating, at least in those cases, that all the features mattered, and they mattered equally.
- Trying to think about what that might mean: it's definitely the case that when you look for similar examples in the database, you want to care more about the first dimension, because a little bit of a difference there gets squared, which can lead to a very large difference in the corresponding y value.
- Whereas in the second dimension it's not quite as crucial. If you're off a little bit more, then you're off a little bit more; it's just a linear relationship. It does seem like that first dimension needs to be a lot more important than the second one when you're doing the matching.

- The notion of relevance turns out to be very important and highlights a weakness of KNN.
- This brings up a theorem, a fundamental result of machine learning, that is particularly relevant to KNN but actually relevant everywhere.

- As the number of features, or equivalently dimensions, grows, that is as we add more and more features, the amount of data that we need to generalize accurately also grows exponentially.
- This is a problem of course, because exponential means bad in computer science land; when things are exponential they're effectively untenable. You just can't win. It means that we need more and more data as we add features and dimensions.
- Now as a machine learning person this is a real problem because what you want to do, or what your instinct tells you to do is, we've got this problem, we've got a bunch of data, we're not sure what's important. So why don't we just keep adding more and more features.
- We've got all these sensors and we'll just add this little bit and this little bit, and we'll keep track of GPS location and we'll see the time of the day and we'll just keep adding stuff and then we'll figure out which ones are important.
- But the curse of dimensionality says that every time you add another one of these features, you add another dimension to your input space, you're going to need exponentially more data as you add those features, in order to be able to generalize accurately.
- This is a very serious problem and it captures some of the difficulties in KNN.
- If you have a distance function or a similarity function, that assumes that everything is relevant, or equally relevant, or important, and some of them aren't, you're going to have to see a lot of data before you can figure that out, sort of before it washes itself away.

- Let's look at this little line segment. Say I've got ten little points that I could put down on this line segment and I want them all to represent some part of this line.
- Let's pretend I did the right thing here and I have them kind of uniformly distributed across the line segment. So that means each one of these points is sort of owning, an equal size sub segment of this segment, each owning 10%
- Let's say I move from a line segment now to a two dimensional space. So a little square segment and I've taken my little ten x's, and I put them down here before. How can I make it so that each of the x's I put down represents the same amount of distance as the x’s in the line segment? Fill up the square with x’s.
- That'll be 100 x’s. So each one now holds a hundredth of the space, and I went from needing ten points to 100 points in order to cover the same amount of space.
- What happens if I now move into three dimensions? Now I need 1,000 points. In this case we're talking about data points in a nearest neighbor method, and boy, that does seem like a big growth, from ten to 100 to 1,000. The growth is exponential.
- If we went into four dimensions then we would need 10,000 points. And in six dimensions, we would need 1,000,000 points. And so on and so forth. So it's something like 10^D, where D is the number of dimensions.
- This isn't just an issue for KNN, this is true in general. Don't think about this now as nearest neighbors in the sense of KNN. But think of it as points that are representing or covering the space. And if you want to represent the same sort of hyper-volume of space as you add dimensions, you're going to have to get exponentially more points in order to do that. And coverage is necessary to do learning.
- Really problematic because it's very natural to just keep throwing dimensions into a machine learning problem. Like it's having trouble learning. Let me give it a few more dimensions to give it hints. But really what you're doing is just giving it a larger and larger volume to fill.
- And it can't fill it unless you give it more and more data. So you're better off giving more data than you are giving more dimensions.
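The line/square/cube argument above is easy to check directly (a toy calculation, assuming 10 points per axis as in the example):

```python
def points_needed(points_per_axis, dims):
    # Covering a hypercube at a fixed resolution needs
    # points_per_axis ** dims samples: exponential in the dimension.
    return points_per_axis ** dims

# 1-D line segment: 10 points; 2-D square: 100; 3-D cube: 1,000; ...
coverage = [points_needed(10, d) for d in range(1, 7)]
# 10, 100, 1000, 10000, 100000, 1000000
```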

- The other stuff that comes up in KNN mainly comes up in these assumptions we make about parameters to the algorithm.
- The one we talked about probably the most is our distance measure
- We looked at Euclidean and we looked at Manhattan. We even looked at weighted versions of those.
- Your choice of distance function really matters. If you pick the wrong kind of distance function, you're just going to get very poor behavior.
- It's probably worth pointing out that this notion of weighted distance is one way to deal with the curse of dimensionality. You can weigh different dimensions differently.
- Also, instead of just taking a weighted average, what about using the distance metric to pick out some of the points, and then doing a different regression on that subset of points? Replace this whole notion of average with a more regression-y thing.

- This actually has a name: it's called locally weighted regression, and it actually works pretty well. In place of the sort of averaging function, you can do just about anything you want. You could throw in a decision tree, you could throw in a neural network, you could throw in lines and do linear regression. You can do almost anything that you can imagine doing.
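A minimal sketch of locally weighted regression, assuming scalar inputs, a Gaussian kernel, and a line as the local model (these are my own choices; the lecture leaves the local learner open):

```python
import math

def locally_weighted_predict(data, x_query, tau=1.0):
    """Predict at one query point: weight each training (x, y) pair by
    its closeness to the query, then fit y = m*x + b by weighted least
    squares. tau controls how fast influence falls off with distance."""
    w = [math.exp(-((x - x_query) ** 2) / (2 * tau ** 2)) for x, _ in data]
    # Closed-form weighted least squares for a line.
    sw = sum(w)
    sx = sum(wi * x for wi, (x, _) in zip(w, data))
    sy = sum(wi * y for wi, (_, y) in zip(w, data))
    sxx = sum(wi * x * x for wi, (x, _) in zip(w, data))
    sxy = sum(wi * x * y for wi, (x, y) in zip(w, data))
    denom = sw * sxx - sx * sx
    m = (sw * sxy - sx * sy) / denom
    b = (sy - m * sx) / sw
    return m * x_query + b
```

The same skeleton works with any local learner in place of the line fit, which is the point the notes are making.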

- I want to start this out by going through a little exercise with you: spam email.
- Normally we think of this as a classification task, where we're going to take some email and we're going to decide if it's spam or not. Given what we've talked about so far, we would be thinking about using a decision tree, neural networks, or KNN.
- I want to propose an alternative which is going to get us to ensemble learn: I don't want you to try to think of some complicated rule that you might come up with that would capture spam email. Instead, I want you to come up with some simple rules that are indicative of spam email.
- You're going to get some email message and you want some computer program to figure out automatically for you if something is a piece of spam or it isn't. And I want you to help write a set of simple rules that'll help indicate that something is spam.

- If it comes from my spouse, it's probably not spam.
- The length of the message. some spam is very, very short, like just the URL.
- It's just an image.
- Lots of misspelled words, a blacklist of words.

- There are tons of these rules that can be thought of. We could come up with a bunch of them.
- Something they all kind of have in common: all of them are sort of right, they're useful, but no one of them is going to be very good at telling us whether a message is spam on its own. Any one is evidence, but it's not enough to decide whether something is spam or not.
- In decision trees, there's a similar problem going on there. We can think of each of the nodes in a decision tree as being a very simple rule and the decision tree tells us how to combine them. So, we need to figure out how to do that here and that is the fundamental notion of ensemble learning.
- You could also do something similar with other ML methods, like neural networks, where each of these rules becomes a feature and we're just trying to learn ways to combine them all together. The difference in that case is that typically with a neural network we've already built the network itself and the nodes, and we're trying to learn the weights, whereas with something like a decision tree you're building up rules as you go along. And typically with ensemble learning you're building up a bunch of rules and combining them together until you've got something that's good enough.

- First you take a bunch of simple rules, all of which kind of make sense and that you can see as sort of helping, but on their own, individually, don’t give you a good answer.
- Then you magically combine them in some way to create a more complex rule, that in fact, works really well.
- Ensemble learning algorithms have a sort of basic form to them that can be described in just one or two lines.
- The basic form of an ensemble learning algorithm:

- Learn over a subset of the data, and that generates some kind of a rule.

- Then you learn over another subset of the data and that generates a different rule. Rinse and repeat for a third rule, a fourth rule, a fifth rule, and so on and so forth.

- Eventually you take all of those rules and you combine them into one of these complex rules.

- In the email case I might look at a small subset of email that I know is already spam and discover that the word “manly” shows up in all of them and therefore pick that up as a rule. That's going to be good at that subset of mail, but not necessarily be good at the other subset of mail.
- I can do the same thing and discover that a lot of the spam mails are in fact short or a lot of them are just images or just URLs and so on and so forth.
- That's how I learn these rules -- by looking at different subsets. Which is why you end up with rules that are very good at a small subset of the data, but aren't necessarily good at a large subset of the data.
- And then after you've collected these rules, you combine them in some way
- Think of this as any other classification learning problem that you would have where you're trying to come up with some way to distinguish between the positives and the negatives.
- We look at only a subset for each rule because if we look at all of the data then it's going to be hard to come up with these simple rules. This will be discussed more later when we talk about overfitting
- This is Ensemble Learning: You learn over a subset of the data over and over again, picking up new rules, and then you combine them and you're done.

- Here's kind of the dumbest (or simplest) thing you can imagine doing, and it turns out to work out pretty well.

- We're going to pick subsets uniformly, just to be specific about it.
- Uniformly randomly choose among some of the data, and say that's the data I'm going to look at, and then apply some learning algorithm to it.

- Okay, so just: pick a subset of the data, apply a learner to it, get some hypothesis out, and get some rule out, and now I'm going to combine them.
- Since we’re being simple, try doing something simple for combining. So, a very simple way of combining, would be to average them. We'll simply take the mean. Each one of them learned over a random subset of the data. You have no reason to believe that one's better than the other.
- There are a couple of ways this idea could go wrong, e.g., a bad random subset:

- You've got N data points. The learner that you're going to use over your subsets is a 0-order polynomial. The way you're going to combine the output of the learners is by averaging them.
- Your subsets are going to be constructed in the following way: You uniformly randomly picked them and you ended up with N disjoint subsets, and each one has a single point in it that happens to be one of the data points.
- If you look on the left of the slide you’ve got a graph of some data points, each being a subset
- When you do your ensemble learning, you learn all these different rules and then you combine them. What is the output going to be? What does the ensemble output? I want a description and if the answer’s a number, that's a perfectly fine description. But I'll give you a hint, it's a short description.
- So, the ensemble outputs the average or mean. A constant. Which happens to be the mean of the data. We’ve seen this before when we are outputting very simple hypotheses. This is what happens if you do an unweighted average with k-NN where k equals n.
- Therefore, we should probably do something a little smarter than this then.
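The degenerate case above can be checked in a couple of lines: with N single-point subsets and a 0-order (constant) learner, averaging the learners just recovers the mean of the data. A quick sketch:

```python
def zero_order_ensemble(ys):
    # Each single-point subset trains a 0-order polynomial (a constant),
    # which can only equal that point's own y; averaging the N constants
    # recovers exactly the mean of the data.
    constants = list(ys)          # one "learner" output per data point
    return sum(constants) / len(constants)
```

So `zero_order_ensemble([3.0, 5.0, 10.0])` is just the mean, 6.0, no matter how the points are arranged, which is why the ensemble outputs a constant.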

- Looking at the same housing data that we've looked at a couple of times before. You'll notice that I marked one of them as green
- I'm going to take the housing data you've got, I'm going to do some ensemble learning on it. And I'm going to hold out the green data point. So of the nine data points, you're only going to learn on 8 of them. And I'm going to add that green data point as my test example and see how it does. This is a cross-validation. Or you could just say, I just put my training set and my test set on the same slide.
- The first thing I'm going to do is pick a random subset of these points. I'm going to pick five points randomly and I'm going to do that five times, so I'm going to have five subsets of five examples. I'm going to choose these randomly, and I'm going to choose them with replacement, so we’re not going to end up in the situation we ended up in just a couple of minutes ago where we never got to see the same data point twice.
- So five subsets of five examples, and then I'm going to learn a 3rd-order polynomial. And I'm going to take those 3rd-order polynomials, I’m just going to learn on that subset, and I'm going to combine them by averaging.
- Here's what you get: I'm showing you a plot over those same points, with the five different 3rd-order polynomials. As you can see they're kind of similar. But some of them sort of veer off a little bit because they're looking at different data points. One of them is actually very hard to see. It veers off because just, purely randomly, it never got to see the two final points. But they all seem to be pretty much in agreement between points three and four. There's a lot of consistency there.
- So the question now becomes: how good is the average of these compared to something we might have learned over the entire data set?
- What you're looking at now is the red line: the average of all of five of those 3rd-order polynomials.
- The blue line is the 4th-order polynomial that we learned when we did this with simple regression
- You actually see they’re pretty close.
- It turns out that on this data set (and I did this many, many, many times just to see what would happen with many different random subsets):

- Typically the blue line always does better on the training set (the red points) than the red line does.
- The red line almost always does better on the green point on the test set or the validation set.

- It learns an average of 3rd-order polynomials, which is itself a 3rd-order polynomial. It does better by doing this than just learning a 3rd-order polynomial directly.
- The danger is often overfitting, overfitting is like the scary possibility. Mixing the data up in this way and focusing on different subsets of it somehow manages to find the important structure as opposed to getting misled by any of the individual data points.
- It's basically the same kind of argument you make for cross-validation. In practice, this particular technique of ensemble learning does quite well in getting rid of overfitting.
- This particular version, where you take a random subset and you combine by the mean, is called bagging.
- You'll notice it's not what I said we were going to talk about during today's discussion. I said we were going to talk about boosting. So we're talking about bagging but we're going to talk about boosting. The reason I wanted to talk about bagging is because it's really the simplest thing you can think of and it actually works remarkably well. But there are a couple of things that are wrong with it, or a couple of things you might imagine you might do better that might address some of the issues and we're going to see all of those when we talk about boosting right now.
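A minimal bagging sketch, with a pluggable learner (here a least-squares line rather than the lecture's 3rd-order polynomial, just to keep the fit short): sample subsets uniformly with replacement, train on each, and average the predictions.

```python
import random

def bag(data, n_rounds, subset_size, learn):
    """Bagging: draw subsets uniformly at random WITH replacement, train
    a learner on each subset, and predict with the mean of the learners."""
    models = [learn([random.choice(data) for _ in range(subset_size)])
              for _ in range(n_rounds)]
    return lambda x: sum(m(x) for m in models) / len(models)

def fit_line(subset):
    # Ordinary least squares for y = m*x + b on one subset of (x, y) pairs.
    n = len(subset)
    sx = sum(x for x, _ in subset)
    sy = sum(y for _, y in subset)
    sxx = sum(x * x for x, _ in subset)
    sxy = sum(x * y for x, y in subset)
    denom = n * sxx - sx * sx
    # If every x in the subset is identical, fall back to a constant.
    m = (n * sxy - sx * sy) / denom if denom else 0.0
    b = (sy - m * sx) / n
    return lambda x: m * x + b
```

Any learner with the same `subset -> model` shape can be dropped in for `fit_line`, which mirrors how the lecture's 3rd-order polynomials were used.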

- Let's go back and look at our two questions we were trying to answer.

- We've answered the first one -- learn over a subset of data and define a rule -- by choosing that subset uniformly randomly and applying some learning algorithm.
- We answered the second question -- how do you combine all of those rules of thumb -- by saying, you simply average them. And that gave us bagging.

- There’s an alternative to the first question (we’ll leave open the second one for a moment) that's going to get us to what we're supposed to be talking about today, which is boosting.
- Rather than choosing uniformly randomly over the data, we should try to take advantage of what we are learning as we go along, and instead of focusing just kind of randomly, we should pick the examples that we are not good at.
- We should pick a subset based upon whether the examples in that subset are hard. Certainly when I'm trying to learn a new skill, I'll spend most of my energy on the stuff that I’m kind of on the edge of being able to do, not the stuff that I've already mastered.
- So that answers the first question. We're going to look at the hardest examples (I'm going to define for you exactly what hard means). I'm going to have to introduce at least one technical definition, but I want to make certain you have that.
- And the second one, the combining, well that's a difficult and sort of complicated thing, but at a high level, I can explain it pretty easily by saying we are going to still stick with the mean.

- We are going to do a weighted mean because the basic idea is to avoid the sort of situations that we came across when we looked at the data before, where taking an average over a bunch of points that are spread out just gives you an average or a constant that doesn't give you a lot of information about the space.
- So we're going to weight it by something, and it's going to turn out the way we choose to weight it will be very important.

- The whole goal of what we're going to add for boosting here is we're going to expand on this notion of hardest examples and weighted mean.
- How have we been defining error so far? Usually we take the squared difference between the correct labels and what's produced by our classifier or regression algorithm. That is how we've been using error when we're thinking about regression error.
- How about a notion of accuracy: how good we are at classifying examples? Sticking with classification for a moment, that would be the same as squared error, except that it doesn't really need to be squared. That is to say, if the outputs are zeroes and ones, the squared error is just whether or not there's a mismatch. So it could just be the total number of wrong answers.
- What we've been doing so far is counting mismatches. We might define an error rate or an error percentage as the total number of mismatches over the total number of examples. But implicit in that is the idea that every single example is equally important. That's not always the case.
- Learning only happens if your training set has the same distribution as your future testing set. And if it doesn't, then all bets are off and it's very difficult to talk about induction or learning. That notion of distribution is implicit in everything that we've been doing so far, and we haven't really been taking it into account when we've been talking about error.
- So here's another definition of error.

- The subscript D stands for distribution. We don't know how new examples are being drawn, but however they're being drawn, they're being drawn from some distribution D.
- h is our old friend, the hypothesis. That's the specific hypothesis that our learner has output. That's what we think is the true concept.
- c is whatever the true underlying concept is.
- I'm going to define error as Error_D(h) = Pr_{x~D}[h(x) ≠ c(x)]: the probability, given the underlying distribution D, that h will disagree with the true concept c on some particular instance x.

- Let me give you a specific example. I'm going to draw four possible values of x. And when I say I'm going to draw four possible values of x, I mean I'm just going to put four dots on the screen.
- Then I'm going to tell you that this particular learner outputs a hypothesis, a potential function, that ends up getting the first one and the third one right, but gets the second and the fourth one wrong.
- Half of them are right and half of them are wrong. So the number of mismatches is 2 out of 4, or a half. However, that assumes you're equally likely to see all four of the examples, which may not in fact be the case.
- Here's another example of error for you. What if each of the points is likely to be seen in different proportions. So you're going to see the first one half the time. You're going to see the second one 1/20th of the time. You're also going to see the fourth one 1/ 20th of the time and the third one, 4/10ths of the time, what is the error rate now?
- Taking those probabilities into consideration, the mismatch rate is still a half, but the expected error is 1/20 + 1/20 = 1/10: 90% correct, 10% error.
- What's important to see here is that even though you may get many examples wrong, in some sense some examples are more important than others because some are very rare. And if you think of error, or the sort of mistakes that you're making, not as the number of distinct mistakes you can make, but rather the amount of time you will be wrong, or the amount of time you'll make a mistake, then you can begin to see that it's important to think about the underlying distribution of examples that you see.
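The worked example above can be written as a few lines of Python (the function name is my own): error under a distribution is the probability mass on the mismatched points, not the mismatch count.

```python
def weighted_error(probs, mismatch):
    # Error under a distribution: add up the probability of each example
    # the hypothesis gets wrong, instead of counting mismatches.
    return sum(p for p, wrong in zip(probs, mismatch) if wrong)

# Four points seen with probability 1/2, 1/20, 4/10, 1/20;
# the hypothesis is wrong on the 2nd and 4th.
probs = [0.5, 0.05, 0.4, 0.05]
mismatch = [False, True, False, True]
# Raw mismatch rate: 2 of 4 = 0.5.  Distribution-weighted error: 0.1.
```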
- The notion of error turns out to be very important for boosting because, in the end, boosting is going to use this trick of distributions in order to define what hardest is. Since we are going to have learning algorithms that do a pretty good job of learning on a bunch of examples, we're going to pass along to them a distribution over the examples, which is another way of saying, which examples are important to learn versus which examples are not as important to learn.
- That's where the hardest notion is going to come in. So, every time we see a bunch of examples, we're going to try to make the harder ones more important to get right than the ones that we already know how to solve.
- Weak learner: a learning algorithm (which is what we mean by a learner here) that, no matter what the distribution is over your data, will do better than chance when it tries to learn labels on that data.

- No matter what the distribution over the data is, you're always going to have an error rate that's less than 1/2.
- What that means, sort of as a formalism, is written down here below:

- For all distributions D, that is to say no matter what the distribution is, your learning algorithm will have an expected error (the probability that it will disagree with the true actual concept if you draw a single sample) that is less than or equal to 1/2 - ε: ∀D, Pr_D[h(x) ≠ c(x)] ≤ 1/2 - ε
- Epsilon (ε) is a term that you end up seeing a lot in mathematical proofs, particularly ones involving machine learning.
- Epsilon just means a really, really small number somewhere between a little bigger than 0 and certainly much smaller than 1.
- Here what this means technically is that you're bounded away from 1/2. Another way of thinking about that is you always get some information from the learner. The learner's always able to learn something. Chance would be the case where your probability is 1/2 and you actually learn nothing at all which kind of ties us back into the notion of information gain way back when with decision trees.

- Here the entire hypothesis space consists only of these three hypotheses and the entire instance space consists entirely of only four examples
- I have an X in a square if that particular hypothesis does not get the correct label for that particular instance, and I have a green check mark if that particular hypothesis does in fact get the right label for that example.
- I want you to come up with a distribution over the 4 different examples, such that a learning algorithm that has to choose between one of those hypotheses will in fact be able to find one that does better than chance (by having an expected error less than 1/2).
- Then if you can do that, I want you to see if you can find a distribution, which might not exist, such that if you have that distribution over the four examples, a learning algorithm that only looked at H1, H2 and H3 would not be able to return one of them that has an expected error less than 1/2.
- You always need to have some distribution over your examples to really know what your expected error is.
- All zeros means no such distribution. So if you put in all zeros you're saying no such distribution exists. But otherwise it should add up to one down each of the columns.
- Good distribution: put equal weight on X1, X2, X3, and X4. Then H1 gets three out of four correct, that's 3/4 accuracy, which is better than chance. So fill in the good boxes with quarters all the way down.
- Evil distribution: looking at X1, if you put all the weight on X1, H1 does very badly, 100% error, and H2 is 100% error, but H3 is 0% error. So putting all the weight on X1 is no good. And if you look at X2, X3, and X4, they all have the property that there's always a hypothesis that gets them right. So I started to think, well, maybe there isn't an evil distribution. And then I kind of lucked into putting 1/2 on both the first and the second example because I figured that that ought to work, and then I realized that's an evil distribution, because if you choose H1, H2, or H3, they all have exactly 50% error on that distribution.
- So what this tells us is, since there is a distribution for which none of these hypotheses will be better than chance, there is no weak learner for this hypothesis space, on this instance set.
- If we had more hypotheses and more examples and we had the X's and the Y's in the right places, then there'd be lots of ways to get weak learners for all the distributions because you'd have more choices to choose from. What made this one particularly hard is that you only had three hypotheses and not all of them were particularly good.
- Technically if you have a whole lot of hypotheses that are bad at everything, you're going to have a very hard time with a weak learner. And if you have a whole bunch of hypotheses that are good at almost everything, then it's pretty easy to have a weak learner.
- The lesson you should take away from this is: a weak learner is actually a pretty strong condition. You're going to have to have a lot of hypotheses that are going to have to do well on lots of different examples, or otherwise, you're not always going to be able to find one that does well no matter what the distribution is. So it's actually a fairly strong, and important condition.
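The quiz above can be verified mechanically. The slide's full check-mark table isn't reproduced in these notes, so the matrix below is one hypothetical assignment consistent with the discussion (H1 and H2 are wrong on X1 while H3 is right; H3 is wrong on X2; the best hypothesis gets 3/4 correct under uniform weights).

```python
# correct[h][i] = True if hypothesis h labels example Xi correctly
# (an assumed table consistent with the lecture's description).
correct = {
    "H1": [False, True, True, True],
    "H2": [False, True, True, True],
    "H3": [True, False, True, True],
}

def error(dist, corr):
    # Expected error: total probability mass on mislabeled examples.
    return sum(p for p, c in zip(dist, corr) if not c)

good = [0.25, 0.25, 0.25, 0.25]   # uniform distribution
evil = [0.5, 0.5, 0.0, 0.0]       # half the mass on X1, half on X2

best_good = min(error(good, c) for c in correct.values())
errors_evil = [error(evil, c) for c in correct.values()]
print(best_good)      # 0.25 -> a hypothesis beats chance under the good distribution
print(errors_evil)    # every hypothesis sits at exactly 0.5 -> no weak learner
```

Because one distribution pins every hypothesis at exactly chance, no weak learner exists for this hypothesis space on this instance set, matching the conclusion above.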

- We're given a training set made up of a bunch of (x_i, y_i) pairs. x_i is your input and y_i is your output. For reasons that'll be clear in a moment, all of your labels are either -1 or 1, where -1 means not in the class and 1 means you're in the class. So this is a binary classification task.

Given training data {(x_i, y_i)}; y_i ∈ {-1, +1}

- Then loop at every time step, let's call it t, from the first time step 1 to some big time in the future, just call it T, and not worry about where it comes from right now.

For t = 1 to T

- Construct a distribution, call that D_t; this is your distribution over your examples at some particular time t.

Construct D_t

- Given that distribution, find a weak classifier. Your weak learner should output some hypothesis. Let's call that h_t, the hypothesis that gets produced at that time step, and that hypothesis should have some small error. Let's call that error ε_t (epsilon sub t), because it's a small number, such that it does well on the training set given the particular distribution, which is to say, the probability of it being wrong (disagreeing with the training label) is small with respect to the underlying distribution. To be clear here, it doesn't have to be tiny. It could actually be almost 1/2, but not bigger than 1/2.

find weak classifier h_t(x) with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]

- You're going to do that for a whole bunch of time steps, constantly finding hypotheses at each time step with small error, constantly making new distributions, and then eventually, you're going to output some final hypothesis. I haven't told you yet how you're going to get the final hypothesis, but that's the high level bit. Look at your training data, construct distributions, find a weak classifier with low error, keep doing that until you have a bunch of them, and then combine them somehow into some final hypothesis.
- Where do we get this distribution and where do we get this final hypothesis? Next slide...

Let's start with the distribution.

- Let's start with the base case, and that is the distribution at the beginning of time, D_1.

- This distribution is going to be over each of the examples, and those examples are indexed over i. I'm simply going to set that distribution to be uniform. We'll say there are n examples, and that every single example happens 1/n of the time: D_1(i) = 1/n, a uniform distribution. We have no reason to believe, for the purposes of this algorithm, that any given example is better, more important, or harder than any other example. I know nothing. See if you can learn over all of the examples.

- So I start out with a uniform distribution because that's what you usually do when you don't know anything. At every time step t, I'm going to construct a new distribution: D_{t+1}(i) = D_t(i) · e^(-α_t · y_i · h_t(x_i)) / Z_t

- We know that D is our distribution and it's some number, where, over all the examples, it adds up to 1. It's a stand-in for how important a particular example is -- how often we expect to see it. And that's the trick that we're using with distributions.
- I'm going to take the old distribution for a particular example and I'm going to either make it bigger or smaller, based upon how well the current hypothesis does on that particular example. The trick:

- We know that h_t always returns a value of either -1 or +1 because that's how we defined our training set. So, h_t(x_i) is going to return -1 or +1 for a particular x_i.
- y_i, which is the label with respect to that example, is also always going to be +1 or -1
- α_t is a constant (we'll get into it later, just think of it as a number right now). It matters eventually. But right now, that number is always positive. It's a learning rate kind of thing

- y_i · h_t(x_i) is going to be 1 if they agree, and -1 if they disagree. So, that means when they agree, that whole product will be positive, which means you'll be raising e to some negative number. When they disagree, that product will be negative, which means you'll be raising e to some positive number.

- If y_i and h_t(x_i) agree, that means they're both negative or they're both positive, so when we multiply them together, we get 1. 1 times α_t, a positive number, is going to be positive. We're negating that, so it's negative. e to a negative power is something between zero and one, so that's going to scale it down. So, it looks like it could depend on the normalization.
- The Z_t term is, in fact, whatever normalization constant you need at time t in order to make it all work out to be a distribution (sum to 1).
- If everything is correct, then after normalizing, nothing is going to change. But if some of them are correct and some of them are incorrect, the ones that are correct are going to decrease, and the ones that are incorrect are going to increase.
- So the answer is that it depends. It depends on what else is going on. Decrease is what typically happens though.
- A similar question is what happens when they disagree and at least one other example agrees? Then that should increase.
- It's going to put more weight on the ones that it's getting wrong. The ones that it's getting wrong must be the ones that are harder. Or at least that's the underlying idea. And the ones that it’s getting right, it puts less weight on those. The loop goes around again and it tries to make a new classifier.
- The ones that it’s getting wrong are getting more and more weight, but we are guaranteed, or at least we've assumed, that we have a weak learner that will always do better than chance on any distribution, it means that you'll always be able to output some learner that can get some of the ones that you were getting wrong, right.

- That ties together what constructing D does for you, and connects it to the hardest examples.
- That gets us to a nice little trick where we can talk about how we actually output our final hypothesis.
- So, the way you construct your final hypothesis is basically by doing a weighted average, and the weight is going to be based upon this α_t.
- So the final hypothesis is just the sgn function of the weighted sum of all of the rules of thumb, all of the weak classifiers that you've been picking up over all of these time steps, where they're weighted by the α_t's: H_final(x) = sgn(Σ_t α_t · h_t(x))

- where α_t = (1/2) ln((1 - ε_t) / ε_t). That is to say, α_t is a measure of how well you're doing with respect to the underlying error.

- So, you get more weight if you do well, and if you do less well, you get less weight.
- Use natural logs because you're using exponentials and that's always a cute thing to do
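The behavior of that weight formula is easy to see numerically. The sketch below just evaluates α_t = (1/2) ln((1 - ε)/ε) for a few error rates; the 0.3 case matches the α ≈ 0.42 used in the worked example later.

```python
import math

# alpha as a function of the weak learner's error: small error -> big
# vote, error near 1/2 -> almost no vote.
def alpha(eps):
    return 0.5 * math.log((1 - eps) / eps)

for eps in [0.49, 0.3, 0.21, 0.14, 0.05]:
    print(eps, round(alpha(eps), 2))
```

As ε approaches 1/2 the vote shrinks toward zero (a near-chance learner tells you almost nothing), and as ε approaches 0 the vote blows up.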

- So what does this look like to you? Well, it’s a weighted average based on how well you're doing or how well each of the individual hypotheses are doing and then you pass it through a thresholding function where

- If it’s below zero, you say negative
- If it’s above zero, you say positive
- If it’s zero, you just throw up your hands and return zero.

- In other words, you return the sign of the number. So you are throwing away information there (we get into this in the next lesson; it's going to turn out that this little bit of information is actually pretty important).
- This is boosting. There's really nothing else to it. You have a very simple algorithm which can be written down in a couple of lines. The hardest parts are constructing the distribution and then simply bringing everything together.
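The couple-of-lines claim holds up. Below is a minimal sketch of the whole algorithm on made-up 1-D data, with decision stumps (single thresholds) as the weak learner; the data, the stump search, and the choice of three rounds are all illustrative, not from the slides.

```python
import math

# Minimal AdaBoost sketch: 1-D inputs, threshold stumps as weak learners.
X = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]   # made-up inputs
y = [+1, +1, -1, -1, +1, +1, -1, -1]           # made-up labels
n = len(X)

def stumps():
    # All threshold rules "positive below thr" and "positive above thr".
    for thr in [x + 0.5 for x in X]:
        for sign in (+1, -1):
            yield lambda x, t=thr, s=sign: s if x < t else -s

def boost(T):
    D = [1.0 / n] * n                          # D_1: uniform
    ensemble = []
    for _ in range(T):
        # Weak learner: the stump with the smallest D-weighted error.
        h, eps = min(
            ((h, sum(D[i] for i in range(n) if h(X[i]) != y[i]))
             for h in stumps()),
            key=lambda he: he[1])
        eps = max(eps, 1e-10)                  # guard against log of infinity
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Reweight: wrong examples up, right examples down, then normalize.
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    total = sum(a * h(x) for a, h in ensemble)
    return 1 if total > 0 else (-1 if total < 0 else 0)

ens = boost(T=3)
print([predict(ens, x) for x in X])   # reproduces the training labels
```

No single threshold can fit this alternating pattern, but three weighted stumps together classify every training point correctly, mirroring the three-step picture in the next section.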

- I’ve got three little boxes on the screen. Now, they're the same boxes. I've drawn them up here beforehand because I'm going to solve this problem in only three steps.
- So just pay attention to the first box for now. You have a bunch of data points; red pluses (+) and green minuses (-) with the appropriate labels and they all live in this part of the plane. We want to figure out how to be able to correctly classify these examples.
- We have to specify what our hypothesis space is. So the hypothesis space is the set of axis-aligned semi-planes. For the purpose of this example this means I'm going to draw a line, either horizontal or vertical, and say that everything on one side of that line is positive and everything on the other side of that line is negative.
- They’re axis aligned because it's only horizontal and vertical, and they're semi-planes because the positive part of it is only in part of the plane.
- I'm going to just walk through what boosting would/might end up doing with this particular example given that you have a learner that always chooses between axis-aligned semi-planes.
- Step 1. All of the examples look the same because we have no particular reason to say any are more important than the other or any are easier or harder than the other. And that's just the algorithm we had before. We run through and we ask our learner to return some hypothesis that does well in classifying the examples. It turns out that though there are many, and in fact, there are an infinite number of possible hypotheses you could return, one that works really well is one that looks like a vertical line that separates the first two data points from the rest.
- What I'm saying here is that everything to the left of this line is going to be positive and everything to the right is going to be negative. It gets correct (correctly labeled positive) the two plusses to the left. It gets correct all of the minuses as well. But it gets wrong the three plusses on the right side.
- So I'm just going to ask you to trust me here, but it turns out that the specific error ε_1 here is 0.3, and if you stick that into the formula for α, you end up with α_1 ≈ 0.42.
- Step 2. The ones that it got right should get less weight and the ones that it got wrong should get more weight
- This time the learner outputs a hypothesis by putting a vertical line just to the right of the three plusses, because it's got to get those right, saying that everything to the left is, in fact, positive. It gets right the three you were doing poorly on in step 1, and it still picks up the other two plusses, which it was getting right. But it gets wrong these three minuses, which aren't worth so much.
- So the error ε_2 of this step, by the way, turns out to be 0.21, and the α_2 at this time step turns out to be 0.65.
- Step 3: The ones that it got wrong should get pushed up in weight: the green minuses in the middle patch should become more prominent. The three plusses on the right should become less prominent than they were in step 2 (α_2 is bigger than α_1, so it has a bigger effect on bringing them down), but they'll still be more prominent than the other ones that haven't been bumped, which get smaller again; they're really going to disappear.
- Step 3: separate the 3 plusses on the right

- If you look at these three hypotheses and their weights, you end up with something kind of interesting. This third hypothesis turns out to have a very low error of 0.14 (the errors keep decreasing), and it has a much higher α_3 of 0.92. Now if you look at these weights and you add them up, you end up with a cute little combination. So, let me draw that for you. If you take each of the three hypotheses that we produced and you weight them accordingly, you end up with the bottom figure.
- Even though we were only using axis-aligned semi-planes for all the weak learners, at the end of the day, it actually kind of bent the line around and captured the positive and negative examples perfectly.
- If you look at some particular hypothesis class H, then because you're doing weighted averages over hypotheses drawn from that class, the class of weighted combinations is at least as complicated as H itself, and often more complicated. So you're able to be more expressive, even though you're using simple hypotheses, because you're combining them in some way.
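The three-step example can be sketched numerically. The point coordinates and the exact thresholds below are made up; the vote weights are the lecture's 0.42, 0.65, and 0.92, and each weak hypothesis is an axis-aligned semi-plane as described.

```python
# Hypothetical coordinates: two plusses on the left, three on the right,
# three minuses in the middle patch.
plusses = [(1, 5), (1, 3), (6, 6), (6, 4), (6, 2)]
minuses = [(3, 6), (3, 4), (3, 2)]

h1 = lambda p: +1 if p[0] < 2 else -1   # step 1: left of the first two plusses
h2 = lambda p: +1 if p[0] < 7 else -1   # step 2: right of the last three plusses
h3 = lambda p: +1 if p[0] > 5 else -1   # step 3: positive side flipped to the right

def final(p):
    # Weighted vote of the three weak hypotheses, then take the sign.
    total = 0.42 * h1(p) + 0.65 * h2(p) + 0.92 * h3(p)
    return 1 if total > 0 else -1

print(all(final(p) == +1 for p in plusses))   # True
print(all(final(p) == -1 for p in minuses))   # True
```

No single vertical line separates these points, yet the weighted vote of three vertical lines classifies all of them correctly, which is the "bent line" effect described above.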

This slide is more of a conversation, left most of it as is...

- Boosting basically says, if I have some examples that I haven't been able to classify well, I'm going to re-rate all my examples so that the ones I don't do well on become increasingly important. That's what boosting does.
- That's what this whole business of constructing the distribution is all about. It's all about re-weighting based on difficulty and hardness.
- We know that we have the notion of a weak learner. That no matter what happens for whatever distribution, we're always going to be able to find some hypothesis that does well.
- So, if I'm trying to understand why boosting in the end, why the final hypothesis that I get at the end, is going to do well, I can try to get a feeling for that by asking, well, under what circumstances would it not do well? So, if it doesn't do well, then that means there has to be a bunch of examples that it's getting wrong.
- So how many things could it not get right? How many things could it misclassify? How many things could it get incorrect? That that number has to be small. There cannot be a lot of examples that it gets wrong.
- So, here's my reasoning, let's imagine I had a number of examples at the end of this whole process. I've done it T times. I've gone through this many times and I have some number of examples that I'm getting wrong. If I were getting those examples wrong, then I was getting them wrong in the last time step, right? And, since I have a distribution and I re-normalize, and it has to be the case that at least half of the time, I am correct, the number of things I'm getting wrong has to be getting smaller over time. Because let's imagine that I was at a stage where I had a whole bunch of them wrong. Well, then I would naturally renormalize them with a distribution so that all of those things are important. But if they were all important, the ones that I was getting wrong, the next time I run a learner, I am going to have to get at least half of them right, more than half of them are right.
- Why can't it just be the case that the previous ones which were getting right start to get more wrong as we shift our energy towards the errors?
- What goes on is that you get this exponentially aggressive weighting over examples. And you're driving down the number of things you get wrong sort of exponentially quickly, over time. That's why boosting works so well and works so fast.
- I don't get why that's causing us to get fewer things wrong over time. In the example that you worked through, the errors kept going down and the alphas kept going up. Is that necessarily the case?
- Well, what would be the circumstances under which it isn't the case? How would you ever go back and forth between examples? Over time, every new hypothesis gets a vote based upon how well it does on the last difficult distribution. So even if the ones that you were getting right, you start to get wrong, if you get them increasingly wrong, that error's going to go down and you're going to get less of a vote.
- I feel like the error should be going up, because we're asking it harder and harder questions as we go.
- Even though we're asking it harder and harder questions, it's forced to be able to do well on those hard questions. It's forced to, because it's a weak learner. That's why having a weak learner is such a powerful thing.
- But why couldn't we on iteration 17 have something where the weak learner works right at the edge of its abilities and it just comes back with something that's 1/2 minus epsilon.
- That's fine. But it has to always be able to do that. If it's 1/2 minus epsilon, the things it's getting wrong will have to go back down again.
- Why would the error go down each iteration?
- Well, it doesn't have to, but it shouldn't be getting bigger.
- Why shouldn't it be getting bigger?
- So, imagine the case that you're getting. You are working at the edge of your abilities. You get half of them right roughly and half of them wrong. The ones you got wrong would become more important, so the next time around you're going to get those right versus the other ones. So you could cycle back and forth I suppose, in the worst case, but then you're just going to be sitting around, always having a little bit more information. So your error will not get worse, you'll just have different ones that are able to do well on different parts of the space. Right? Because you're always forced to do better than chance. So.
- Yeah but that's not the same as saying that we're forced to get better and better each iteration. I don't see that, that property just falling out.
- Well, I don't see it falling out either, but then I haven't read the proof in like seven, eight, nine years.
- So we generate a new distribution. What is the previous classification error on this distribution? I mean, if it were the case that we always return the best classifier then I could imagine trying to use that but…
- Well we, well we don't, we don't require that.
- Yeah, I mean, it's just finding one that's 1/2 minus epsilon.
- Right, so let's, let's see if we can take the simple case, we got three examples, right, and you're bouncing back and forth and you want to construct something so that you always do well on two of them. And then poorly on one, kind of a thing, and that you keep bouncing back and forth. So let's imagine that you have one-third, one-third, one-third, and your first thing gets the first two right and the last one wrong. So you have an error of a third. And you make that last one more likely and the other two less likely. Suitably normalized, right?
- Yep.
- So now, your next one, you want to somehow bounce back and have it decide that it can miss, so let’s say you missed the third one. So you, you get the third one right. You get the second one right but you get the first one wrong. What's going to happen? Well, three is going to go down. You'll have less than a third error because you had to get one of the ones you were getting right wrong, you had to get the one you were getting wrong right. So your error is going to be, at least in the example I just gave, less than a third. So, if your error is less than a third, then the weighting goes up more. And so, the one that you just got wrong doesn't go back to where it was before. It becomes even more important than it was when you had a uniform distribution. So the next time around, you have to get that one right, but it's not enough to break 1/2. So you're going to have to get something else right as well, and the one in the middle that you were getting right isn't enough. So you'll have to get number three right as well.
- Interesting.
- Right? And so, it's really hard to cycle back and forth between different examples, because you're exponentially weighting how important they are. Which means, you're always going to have to pick up something along the way. Because the ones that you coincidentally got right two times in a row become so unimportant that it doesn't help you to get those right, whereas the ones that you've gotten wrong, in the past, you've got to, on these cycles, pick up some of them in order to get you over 1/2.
- Mmm
- And so, it is very difficult for you to cycle back and forth.
- Interesting.
- And that kind of makes sense, right? If you think about it in kind of an information gain sense, because what's going on there is you're basically saying you must pick up information all the time.
- Hm. You are kind of non-linearly using that information in some way. So that kind of works. It makes some sense to me, but I think that in the end what has to happen is there must be just a few examples in a kind of weighted sense that you're getting wrong. And so if I'm right, that as you move through each of these cycles, you're weighting in such a way that you have to be picking up things you've gotten wrong in the past. So in other words, it's not enough to say, only the things that are hard in the last set are the ones that I have to do better. You must also be picking up some of the things that you've gotten wrong earlier more than you were getting them right because there's just not enough information in the one's that you're getting right all the time, because by the time you get that far along, the weight on them is near zero and they don't matter.
- Interesting.
- And then if you say, well, Charles, I could cycle back by always getting those wrong, yes, but then if you're getting those wrong, they're going to pull up and you're going to have to start getting those right too. And so, over time, you've gotta not just pick out things that do better than a half but things that do well on a lot of the data. Because there's no way for all of the possible distributions for you to do better than chance otherwise.

- All lines fit the sample data but only the middle line is the best.
- Since this is just a sample from the population, we don't know what's going to happen really close to the lines closer to pluses/minuses once you add test data.
- Otherwise you’re believing the training data that you've got too much. You've decided that all of these boundaries are fine because of the specific set of training data that we've got. And while you want to believe the data, because that's all you've got, you don't want to believe the data too much.
- That's overfitting. So, the problem with those two lines, or one way to think about the problem with those two lines, is that these lines are more likely to overfit because they're believing the data too much.
- This line, or the line it's intended to represent anyway, is the line that is consistent with the data while committing least to it.
- Overfitting up to this point, we were generally talking about overfitting as being something where there's a great deal of model complexity in some sense.
- And it doesn't seem like those lines that are closer to the pluses or closer to the minuses are inherently more complex, they're still lines.
- It's a more literal interpretation of the words over and fit: you have decided to fit the data, and you believe it too much. What you really want to do is commit the least that you can to the data while still being consistent with it. Finding the line of least commitment in a linearly separable set of data is the basis behind support vector machines.

- I'm going to try to encapsulate, what we just talked about, the line of least commitment, by drawing another line.
- If we think of this top gray line here as sort of the line that gets as close to the plus points as possible, without crossing over them and mis-classifying them, and we think of the bottom gray line as the one that gets as close as possible to the minus signs without crossing over them and misclassifying them, and then the middle line is sort of in the happy medium.
- What you really want is that, somehow, this distance between these lines is as big as possible. You want to have a line that leaves as much space as possible from the boundaries.
- Here even though we're going to be drawing with lines, we really want to deal with the general case, where we're talking about hyperplanes. Generally, when we write about hyperplanes, we describe them as some output, let's just call it y = Wᵀx + b. Here, because of what we're trying to do with classification:

- The output y is going to be some value that indicates whether you're in the positive class or you're in the negative class.
- W represents the parameters for our plane, along with b, which is what moves it out of the origin.

- y here is going to be our classification label. Whenever we're talking about using a linear separator, what we've been talking about is that you are taking some new point, projecting it onto the line, and then looking at the value that comes out from projecting it.
- In this case we want positive values to mean yes, you are part of the class, and negative values to mean that you aren't a part of the class.
- This is now what our linear classifiers actually look like, even in multiple dimensions with hyperplanes.
- Let's take that and and push it to the next level. Let's figure out exactly what we would expect the output of our hyperplane to be in this example that I've drawn on the screen here. So, we know we want to find this orange line in the middle, which has the property that it is your decision value. It tells you whether you are in the positive class or negative class, but also, it has the property of being as far away from the data as possible while still being consistent with it.
- If you're on the decision boundary for this particular line that's where it's kind of not sure if it's positive or negative, so that should be zero.
- One question we can ask ourselves then, if we look at these other lines is, what's the equation for the other gray lines that are right at our positive or negative examples?
- Just like we did with boosting, let's say that our labels are always going to be from the set {-1, +1}. We know that our labels are -1 and +1, so we're going to take advantage of that fact by saying that the line that brushes up against the positive example should output +1 on the very first point that it encounters. That way the things that are kind of past the line are going to be +1 and the things before the line in kind of that demilitarized zone are going to be between zero and +1.
- The top gray line would be Wᵀx + b = 1.
- The bottom gray line would be Wᵀx + b = -1.
- This helps us in a very a simple way. We know that we want the boundary condition line, the one that is actually our decision boundary, to be as far as possible from both the positive and negative examples, so that would mean then that the distance between the two gray lines, which are parallel to that line, needs to also be maximized.
- We want this vector here to have the maximum length that we can have. Okay, so, how are we going to figure out how long that particular line is? So, here is a simple idea. Well, the lines are parallel to one another. We can pick points on that line to define that particular distance there. So, just because it's really easy to do the math, I'm going to choose a point here and a point here. Those points have the property, that, if I draw the line between them, you get a line that is perpendicular to the two gray lines. And I don't know what their respective x values are, so I'm just going to call, them x1 and x2, and that is going to define the two points that I have. The vector that is defined by their difference is in fact going to have the length that tells you how far apart those two lines are, which, in turn, because of the way that we've constructed them, tells you how far apart you're boundary decision line is from the data and we want that to be maximized because then we made the least commitment to the data.
- The equation for our positive line is Wᵀx1 + b = 1. And all I've done there is substitute some point x1 -- I don't have to know what x1 is, it's going to turn out -- that puts me in some particular place on that line. And similarly, I can do the same thing for my negative line and get Wᵀx2 + b = -1.
- Now, we want the distance between these two hyperplanes (or lines in this example) to be maximized. In order to figure out what that means, we need to know exactly what that line is. So it's the difference between the two.
- So, we can just use our favorite trick when we're doing systems of linear equations and just subtract the two lines. We basically have two equations and two unknowns. And, we simply subtract them from one another so that we can get a single equation that happens to represent the distance between them. So, if I subtract the two from one another, what do I get? (quiz next)

- I've got these two equations of two different hyperplanes, though they're parallel to one another because they have the same parameters. That is to say, I have two equations and two unknowns. I want to subtract them from one another and what we want you to do is we want you to solve for the line that is described by their difference.
- The output that I want you to figure out here is exactly what the distance between those two planes is.
- The distance between the two lines, I feel, is just the norm of x1 minus x2. But the difference between these equations is going to be: Wᵀ(x1 - x2) = 2

- The distance between x1 and x2, I still feel like it’s just that norm of x1 minus x2. But I want you to tell it to me in terms of W because the only things we have to play with here are W and b. That's what defines our line and I want to find the right line, so I'd like to know something about the relationship between W and the distance between x1 and x2.
- If I could divide by W, that would be helpful, because then x1 minus x2 would be 2 over W. But you can't do that, because W is a vector, and you can't really divide by a vector. We want to move W over from one side to the other. We could start doing all kinds of tricks with inverses, with the inverse of a vector, but actually the easiest way of doing it is getting rid of W on one side, and the easiest way to do that is to divide both sides by the length of W. So, rather than dividing both sides by W, we divide them by the length of W.
- So W divided by the length of W is a normalized version of W: something that points in the same direction as W but sits on the unit sphere (in fact, a hypersphere). So we do that, and that's effectively like giving you a vector of length 1 because, like you said, it sits on the unit sphere. And so now we're actually talking about the difference between the vector x1 and the vector x2 projected onto that unit vector, and that's equal to: (w / ||w||)^T (x1 - x2) = 2 / ||w||.
- We've taken x1 minus x2 and projected it onto W. So it's like the length of x1 minus x2, but in the W direction.
- W actually represents a vector that's perpendicular to the line. And since we chose x1 and x2, their distance or the difference between them would in fact be perpendicular to the line. What we've just done is projected the difference between those two vectors onto something that is also perpendicular to the line. And so what that ends up giving us is, in fact, its length.
- The thing on the left, not just where the braces are, actually turns out to be the distance between the two hyperplanes. Let's give that a letter and call it m. We're saying that m equals 2 over the norm of W (m = 2 / ||w||).
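As a quick sanity check on that derivation, here is a minimal numpy sketch, with an arbitrary, hypothetical choice of w and b, that picks a point on each of the two parallel lines and confirms the distance between them equals 2 over the norm of W:

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen arbitrarily for illustration.
w = np.array([3.0, 4.0])   # so ||w|| = 5
b = -2.0

# Pick the point on each line that lies along the direction of w:
# x1 satisfies w.x1 + b = +1, and x2 satisfies w.x2 + b = -1.
x1 = (1 - b) * w / np.dot(w, w)
x2 = (-1 - b) * w / np.dot(w, w)

# The distance between the two lines matches 2 / ||w||.
margin = np.linalg.norm(x1 - x2)
print(margin)                 # ~0.4
print(2 / np.linalg.norm(w))  # 0.4
```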
- So, if we want to maximize that the only thing that we have to play with is W and that is made larger and larger as W gets smaller and smaller, in other words, pushing it toward the origin.
- So set Ws to all zeroes, and we should be golden. Except if we push all the Ws to zero, we might not be able to correctly classify our points but what this does tell us is that we have a way of thinking about the distance of this vector and where the decision boundary ought to be.
- We want to find the parameters of the hyperplane such that we maximize this distance over here represented by this equation while still being consistent with the data, which makes sense because that's actually what we said in the first place.
- By the way, this thing has a name, and it's the reason I chose the letter m: it's called the margin. What all of this exercise tells you is that your goal is to find a decision boundary that maximizes the margin, subject to the constraint that you actually want to classify everything correctly, and that is represented by that constraint term.
- Now somehow, it feels like having gone through all this we ought to be able to use it for something and turn it into some other problem we might be able to solve so that we can actually find the best line. And it turns out we can do that.
- So, it turns out that this whole notion of thinking about finding the optimal decision boundary is the same as finding a line that maximizes the margin.
- And we can take what we've just learned, where we've decided the goal is to maximize 2 / ||w||, and turn it into a problem that we can solve directly.

- What we want to do, somehow, is maximize a particular equation, namely 2 / ||w||. As a reminder, W holds the parameters of our hyperplane. So, somehow, we want to maximize that equation subject to the constraint that we still classify everything correctly.
- Okay, so we want to maximize 2 / ||w|| while classifying everything correctly. While "classifying everything correctly" is not a very mathematically satisfying expression, it turns out we can turn it into one, and let me show you how. Here's a simple equation: "while classifying everything correctly" turns out to be the same as y_i (w^T x_i + b) >= 1 for all i.

- Well, what we really want is that the linear classifier, w^T x_i + b, is greater than or equal to 1 for the positive examples and less than or equal to -1 for the negative examples. But you cleverly multiply it by the label y_i on the left-hand side, which does exactly that. If y_i is 1, it leaves it untouched; and if y_i is -1, it requires the classifier output to be less than or equal to minus 1. That's very clever.
- It turns out that trying to solve this particular problem, maximizing 2 / ||w|| while satisfying that constraint, is a little painful to do. But we can solve an equivalent problem, which turns out to be much easier, and that is this: rather than trying to maximize 2 / ||w||, we can instead try to minimize (1/2) ||w||^2.
- The point that maximizes one will minimize the other because we took the reciprocal. As long as we're talking about positive things. And since these are lengths, they'll be positive. Taking the reciprocal changes the direction of what the answer is. And the squaring makes it monotone. It doesn't change the ordering of things.
- This is easier because when you have an optimization problem of this form, something like minimizing a W squared, subject to a bunch of constraints, that is called a quadratic programming problem. And people know how to solve quadratic programming problems in relatively straightforward ways.
- Now, what else is nice about that is a couple of things. One is, it turns out that these always have a solution, and in fact, always have a unique solution. Now, I am not going to tell you how to solve quadratic programming problems because I don't know how to do it other than to call it up in MATLAB. But there's a whole set of classes out there, where they teach you how to do quadratic programming. The important thing is that we have defined a specific optimization problem and that there are known techniques that come from linear algebra that tell us how to solve them. And we can just plug and play and go.
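To see the plug-and-play point concretely, here is a small sketch on tiny hypothetical data using scikit-learn's SVC, which solves this quadratic program internally; a very large C approximates the hard-margin problem described here:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny hypothetical linearly separable data set.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# SVC solves the underlying quadratic program for us; a very large C
# approximates the hard-margin formulation from the lecture.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                   # roughly w = [0.5, 0.5], b = -2
print(2 / np.linalg.norm(w))  # the achieved margin, 2 / ||w||
```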
- In particular, it turns out that we can transform this particular quadratic programming problem into a different quadratic programming problem, actually, truthfully, into the normal form for a quadratic programming problem, that has the following form.

max over alpha: W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j
subject to: alpha_i >= 0 for all i, and sum_i alpha_i y_i = 0

- Here's what this equation tells you:

- We have basically started out by trying to maximize the margin (that's the same thing as trying to maximize 2 / ||w||) subject to a particular set of constraints, which are how we codify that we want to classify every data point in the training set correctly.
- We've argued that that's equivalent to minimizing (1/2) ||w||^2, subject to the same constraints.
- And then notice that you can convert that into a quadratic programming problem, which we know how to solve.
- And it turns out that quadratic programming problem has a very particular form.
- Rather than try to minimize (1/2) ||w||^2, we can try to maximize another function that has a different set of parameters, which I'll call alpha: one alpha_i per data point.
- The equation has the following form: it's the sum, over all of the data points indexed by i, of this new set of parameters alpha, minus 1/2 times, for every pair of examples, the product of their alphas, their labels, and their dot products. That is, W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j, subject to a different set of constraints, namely that all of the alphas are non-negative (alpha_i >= 0) and that the sum of the products of the alphas and the labels that go along with them equals zero (sum_i alpha_i y_i = 0).

- Instead of explaining how we got from minimizing (1/2) ||w||^2 to this quadratic equation: go read a quadratic programming book.
- What I really need you to believe, though, mainly because I'm asserting it, is that these are equivalent. So if you buy up to the point that we are trying to maximize the margin, then you just have to take a leap of faith here that, if we instead maximize this other equation, it turns out that we are solving the same problem. And that we know how to do it using quadratic programming. Or other people know how to do it and they've written code for us.
- All right, so trust me on this. This is what it is that we want to solve. What's really interesting is what this equation actually tells us about what we're trying to do.

- We've done a little bit of moving stuff around, and kept the same equation as before. Remember, our goal is to use quadratic programming to maximize this equation.
- It turns out that once you find the alphas that maximize this equation, you can actually recover w, which was the whole point of this exercise in the first place: w = sum_i alpha_i y_i x_i.
- Once you know w, it's easy to recover b. You just take a support vector x, stick it into w^T x + b, knowing the result equals +1, and then, poof, you can solve for b.
- So you can recover w directly from this and you can recover b from it in sort of an obvious way. But there are some other properties that are a little bit more interesting for you.
- I want you to pay attention to two things. One I am just going to have to tell you, and the other I want you to think about.
- Here's the one that I'm going to tell you. It turns out that each of those alphas are mostly zero, usually. So if w is the sum of the data points times their labels times alpha, and if the alpha is zero, then the corresponding data point isn't really going to come into play in the definition of W at all. So a bunch of the data just don't really factor into W.
- So basically, some of the vectors matter for finding the solution to this, and some do not. Each of those data points is a vector, and it turns out you can find all of the support you need for finding the optimal w using just a few of those vectors: the ones with non-zero alphas. So you've basically built a machine that only needs a few support vectors.
- So the data points for which the corresponding alpha is non-zero are the support vectors, those are the ones that provide all the support for W. So knowing that W is the sum over a lot of these different data points, and their labels, and the corresponding alphas, and that most of those are zeroes, that implies, that only a few of the X's matter.
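A sketch of that recovery on the same kind of hypothetical toy data: scikit-learn's dual_coef_ attribute stores alpha_i times y_i for the support vectors only, so summing over just those few points reproduces w:

```python
import numpy as np
from sklearn.svm import SVC

# Same hypothetical toy data as before.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only,
# so w = sum_i alpha_i y_i x_i uses just those few points.
w = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w, clf.coef_[0]))  # True
print(clf.support_)                  # indices of the non-zero alphas
```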

- Thinking about everything that we've said, and what it means to build an optimal decision boundary by maximizing a margin, I want you to point to one of the positive examples that almost certainly is going to have a zero alpha, and one of the minus examples that almost certainly is going to have a zero alpha. That is, points that are not support vectors and, in some sense, don't matter. Okay, go. Answer:
- So I guess it really does make some intuitive sense that the line is really, really nailed down by the points close to it. And the points that are far away from it, really don't have any influence. So I would put zero alphas on the lower left hand minus and then one of the upper pluses.
- For the points that are far away from the decision boundary, which aren't used to define the contours of that boundary, it doesn't matter whether they're plus or minus.
- It's like KNN, except that you've already done the work of figuring out which points actually matter, so you don't have to keep all of them; you can throw some of them away. It doesn't just take the nearest ones: it does this complicated quadratic program to figure out which ones are actually going to contribute.
- It's just another way of thinking about instance-based learning, except that rather than being completely lazy, you put some energy into figuring out which points you could actually stand to throw away.
- Basically, the alphas say whether to pay attention to a data point or not. But if you look carefully at this equation, the only place where the x's come into play with one another is in the term x_i^T x_j. Generally speaking, given a couple of vectors, xi transpose xj is the dot product, which is the projection of one of those onto the other.
- This dot product represents the length of the projection. If the x's are, well if they are orthogonal to each other than it's going to be zero. But if they kind of point in the same direction, it's going to be a large value, and if they point in opposite directions it's going to be a negative value. So it indicates how much they're pointing in the same direction. So it could be a measure of their similarity.
- This is the kind of a notion of similarity. So if you look at this equation, what it basically says. Find all pairs of points. Figure out which ones matter for, for defining your decision boundary. And then think about how they relate to one another in terms of their output labels. With respect to how similar they are to one another.

- Given a linearly separable graph as seen above, what happens if I add one more point, the minus point in red?
- In some sense the margin is now negative, because there's going to be no way of slicing this up so that all the negatives are on one side and all the positives on the other. It's not linearly separable. This can be solved using SVM (was used as HW assignment)
- The other graph in the slide looks like there's a ring around the whole thing. You can draw lines all day long, and it's just not going to slice things up. So now we have to come up with some clever way of managing to make this work so here's the little trick we're going to do: I am going to change the data points without changing the data points.
- Take a data point q. It has two components, Q1 and Q2, and I am going to produce from those two components a triple. So I am going to put that point into three dimensions now. The dimensions are going to look like this.

- So Q is a two-dimensional point, with components Q1 and Q2. Take the first component and square it, take the second component and square it, and then take root 2 times the product of the two as the third dimension: phi(q) = (q1^2, q2^2, sqrt(2) * q1 * q2).
- Let me point out something for you. One is I haven't actually added any new information, in the sense that I'm still only using Q1 and Q2. Yeah, I threw a constant in there, and I'm multiplying by one another, but at the end of the day, I haven't really done much. It's not like I've thrown in some boolean variable that gives you some extra information. All I've done is taken Q1 and Q2 and multiplied them together, or against one another.
- I did this because it's going to turn out to provide a cute little trick. And in order to see that cute little trick, we need to return to our quadratic programming problem. You'll recall that I asked you to talk about Xi and Xj and what they look like in this equation, and what I think we agreed to is that we can think about Xi transpose Xj as capturing some notion of similarity. So it turns out, if we buy that idea of similarity, that what matters most in solving this quadratic problem, and ultimately solving our optimization problem, is being able to repeatedly compute these Xi transpose Xj dot products.
- So, let's ask the question: if I have this new function, phi, what would happen if I took two data points and computed the dot product between them after passing each through phi?

- Let's imagine we have two points. I'm going to call them X and Y, just so I can confuse you with notation. And they are both two dimensional points. So they're in a plane. And they have components X1 and X2 and components Y1 and Y2.
- Rather than computing the X transpose Y, their dot product, I want to compute the dot product of those two points, but passed through this function phi.
- So, x is really (x1, x2) and y is really (y1, y2), and phi of x is now this crazy triple: (x1^2, x2^2, sqrt(2) * x1 * x2). That's the vector we get for phi of x.
- And then the y vector gets transformed to the same thing, except for with y’s, y1 squared, y2 squared, root 2 y1 y2.
- So, then, the dot product is just the products of the corresponding components, summed up: x1^2 y1^2 + x2^2 y2^2 + 2 x1 x2 y1 y2.
- We can factor this: (x1 y1 + x2 y2), the whole thing squared.
- And an even simpler way of writing that: it's x transpose y on the inside, squared on the outside, (x^T y)^2.
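You can verify that identity numerically. This little sketch, with arbitrary example points, checks that the 3D dot product of the lifted vectors equals the squared 2D dot product:

```python
import numpy as np

def phi(v):
    # The lecture's 2D -> 3D lift: (v1, v2) -> (v1^2, v2^2, sqrt(2)*v1*v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])  # arbitrary example points
y = np.array([3.0, 4.0])

lifted = np.dot(phi(x), phi(y))  # dot product after lifting to 3D
kernel = np.dot(x, y) ** 2       # (x^T y)^2, computed entirely in 2D
print(lifted, kernel)            # both 121.0
```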
- There was method to the madness when I created the phi function.
- This is basically a particular form of the equation for a circle Which means that we've gone from thinking about the linear relationship between Xi and Xj or your data points and we've now turned it into something that looks a lot more like a circle.
- If you believe me in the beginning where we notice that Xi transpose Xj is really about similarity. It's really about why it is you would say two points are close to one another or far apart from one another. By coming into this transformation over here, where we basically represented the equation for a circle, we have now replaced our notion of similarity from being this very simple projection to being the notion of similarity is whether you fall in or out of a circle. So more sort of about the distance as opposed to what direction you're pointing.
- Both of them are fine because both of them represent some notion of similarity, some notion of distance in some space. In the original case, we're talking about two points that are lined up together. And over here together with this particular equation we represented whether they're inside the radius of a circle or outside the radius of a circle. Now this particular example assumes that the circle is centered at the origin and so on and so forth, but the idea I want you to see is that we could transform all of our data so that we separated points from within one circle to points that are outside of the circle. And then, if we do that, projecting from two dimensions here into three dimensions, we've basically taken all of the pluses and moved them up and all of the minuses and moved them back, and now we can separate them with the hyperplane.
- And this works without knowing which ones are the pluses and which are the minuses: the minuses are the ones that were closer to the origin, so they get raised up less. I can basically take my data and transform it into a higher dimensional space where suddenly I'm able to separate it linearly.
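Here's a small numpy sketch of that lift on hypothetical ring data (minuses sampled inside the unit circle, pluses outside), showing the classes become separable by a plane after the transformation:

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical ring data: 50 minuses inside the unit circle, 50 pluses outside.
r = np.concatenate([rng.uniform(0.0, 0.8, 50), rng.uniform(1.2, 2.0, 50)])
theta = rng.uniform(0.0, 2 * np.pi, 100)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([-np.ones(50), np.ones(50)])

# Lift each point (q1, q2) to (q1^2, q2^2, sqrt(2)*q1*q2).
Z = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1]])

# In the lifted space, z1 + z2 is the squared radius, so the plane
# z1 + z2 = 1 separates the two classes linearly.
pred = np.where(Z[:, 0] + Z[:, 1] > 1.0, 1, -1)
print((pred == y).all())  # True
```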
- That's very cute, but I chose this particular form for a reason. Can you guess why? There are lots of different ways we could have fit the circle pattern. I chose this particular form because not only does it fit the circle pattern, but it doesn't require that I do this particular transformation. Rather than taking all of my data points and projecting them up into three dimensions directly, I can instead still simply compute the dot product and now I take that answer and I square it.
- In this formulation of the quadratic program that you have there in terms of capital W if you write code to do that, each time in the code you want to compute Xi transpose times Xj, if you just squared it right before you actually used it, it would be as if you projected it into this third dimension and found a plane.
- This is known as the kernel trick.
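As a sketch of the trick in practice, on the same kind of hypothetical ring data: scikit-learn's polynomial kernel with degree 2, gamma 1, and coef0 of 0 computes exactly the squared dot product from the lecture, so the ring becomes separable without ever constructing the 3D points:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# Hypothetical ring data: minuses inside the unit circle, pluses outside.
r = np.concatenate([rng.uniform(0.0, 0.8, 50), rng.uniform(1.2, 2.0, 50)])
theta = rng.uniform(0.0, 2 * np.pi, 100)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([-np.ones(50), np.ones(50)])

# With degree=2, gamma=1, coef0=0, the polynomial kernel is exactly
# (x^T y)^2, the squared dot product; no explicit 3D lift is needed.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=1e6).fit(X, y)
print(clf.score(X, y))  # 1.0 on this separable ring
```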
- So, again if we really push on this notion of similarity. What we're really saying is we care about maximizing some function that depends highly upon how different data points are alike, or how they are different. And simply by writing it this particular way, all we're saying is, you know what, we think the inner product is how we should define similarity. But instead, we could use a different function altogether, phi or more nicely represented as x transpose y squared, and say, that's our notion of similarity. And we can substitute it accordingly.
- We never used phi. We're able to avoid all of that by coming up with a clever representation of similarity. That just so happened to represent something, or could represent something in a higher dimensional space.
- You can't just use anything, but in practice it turns out you can use almost anything. Also, it turns out that for any function you use, there is some transformation into some higher dimensional space that is equivalent. Now, it may turn out that you need an infinite number of dimensions to represent it, but there is some way of transforming your points into a higher dimensional space that happens to represent this kernel, or whatever kernel you choose to use.
- The kernel is the function itself. In fact, let me clean up this screen a little bit. And, and see if we can make this a little bit more precise and easier to understand. (next slide)

- Let's look at this xi transpose xj (circled in green). I've just replaced it with a function, which I'm going to call a kernel. Which takes xi and xj as parameters, and will return some number.
- As we talked about before, we think of the xi transpose xj, as some notion of similarity, and so this kernel function is our representation, still, of similarity. Another way of thinking about that, by the way, is that this is the mechanism by which we inject domain knowledge into the support vector machine learning algorithm.
- You can create these kernels and these kernels have arbitrary relationships to one another. So, what you're really doing is, projecting into some higher dimensional space, where, in that higher dimensional space, your points are in fact, linearly separable.
- Also, because you're using this kernel function to represent your domain knowledge, you don't actually have to do the computation of transforming the points into this higher dimensional space. I mean, in fact if you think about it, with the last kernel that we used, computationally, there was actually no more work to be done. Before we were doing x transpose y, and now we're still doing x transpose y, except we're then squaring it. So that's just a constant bit more work. And the other kernel we talked about was just X transpose Y by itself. That's a kernel too.
- We can actually write a general form of both of these. A very typical kernel is the polynomial kernel: K(x, y) = (x^T y + c)^p, where you have x transpose y plus some constant c, raised to some power p. As you can see, both of those earlier kernels are, in fact, just special cases of this. That should look familiar from when we were doing polynomial regression. Now, rather than doing polynomial regression the way we were thinking about it before, we use a polynomial kernel, and that allows us to represent polynomial functions.
- There are lots of other kernels you can come up with; here are just a couple. I'll leave them to you to think about, and there are tons of them. Here's one that I happen to like, a radial basis kernel: K(x, y) = e^(-||x - y||^2 / (2 sigma^2)). So if x and y are really close to each other, then it's e to the minus zero over something, which is like e to the zero, which is like one. So the similarity is like one if they're on top of each other. If they're very far apart, then their distance is something very big, and e to the minus something very big is very close to zero. So it does have that kind of property, like the sigmoid, where it transitions between zero and one, but it's not exactly the same shape.
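A minimal sketch of that radial-basis behavior (the bandwidth sigma here is an arbitrary, hypothetical choice):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    # Radial basis kernel: similarity decays with squared distance.
    # sigma is a hypothetical bandwidth choice.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

a = np.array([0.0, 0.0])
print(rbf(a, a))                      # identical points -> 1.0
print(rbf(a, np.array([10.0, 0.0])))  # far apart -> essentially 0
```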
- Actually, if you wanted to get something that looked like a sigmoid, here's one: K(x, y) = tanh(alpha * x^T y + theta), where this alpha is different from the other alphas, but I couldn't think of a different Greek letter. And this function gives you something that looks a lot more like a sigmoid.
- There's been a lot of research over the years on what makes a good kernel function. The most important thing here is that it really captures your domain knowledge. It really captures your notion of similarity. You might notice that since it's just an arbitrary function that returns a number, it means that X and Y or the different data points you have, don't have to actually be points in a numerical space. They could be discrete variables. They could describe whether you're male or female. As long as you have some notion of similarity to play around with, that you can define, that returns a number, then it doesn't matter. It will always work.
- So you can do things like strings or graphs or images. How are two strings similar? Maybe they're similar if their edit distance is small: the number of transformations you have to make in order to turn one string into the other. If there are few of those, they're very similar; if there are a lot of those, they're very dissimilar.
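As a toy sketch of that idea, here is a standard dynamic-programming edit distance. Turning this into a valid kernel would still require checking the technical conditions on kernel functions, but it illustrates a notion of similarity over strings:

```python
def edit_distance(s, t):
    # Classic dynamic-programming Levenshtein distance: the number of
    # insertions, deletions, and substitutions to turn s into t.
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete from s
                                     dp[j - 1] + 1,      # insert into s
                                     prev + (cs != ct))  # substitute
    return dp[-1]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("abc", "abc"))         # 0
```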
- While it's not clear whether there are any bad kernel functions, it is the case that, in order for all the math to go through, there is a specific technical requirement of a kernel function. It has a name: the Mercer condition. The Mercer condition is a very technical thing (we'll talk about it again a little bit in the homework assignment), but for your intuition in the meantime, it basically means the kernel acts like a distance, or like a similarity. It's not an arbitrary thing that fails to relate the various points together. Being positive semidefinite, which is what the condition requires, in this context means it's a well-behaved distance function.

- It appears that boosting does not always overfit; it doesn't seem to overfit in the ways that we would normally expect it to. In particular, we'd see an error line on training, and what we'd expect to see is a testing line that would hew pretty closely to it and then start to get bad.

- But what actually happens is that instead, this little bit at the end, where you'd expect to see overfitting, seems to just keep doing well. In fact, getting better and better and better.

- And I promised you an explanation for why that was. You don't have this problem with overfitting at least not in the typical way as you keep applying it over and over again like you do with something like neural networks. It really boils down to noticing that we've been ignoring some information.
- What we normally keep track of is error. So error on say a training set is just the probability that you're going to come up with an incorrect answer or come up with an answer that disagrees with your training set, and that's a very natural thing to think about and it makes a lot of sense.
- But there's also something else that is actually captured inside of boosting and captured by a lot of learning algorithms we haven't been taking advantage of, and that's the notion of confidence. So confidence is not just whether you got it right or wrong. It's how strongly you believe in a particular answer that you’ve given.
- A lot of the algorithms we've talked about indirectly have something like that. In a nearest-neighbor method, if you are doing five nearest neighbors and all five of the neighbors agree, that seems different from the case where three vote one way and two vote the other.
- If you think of that in terms of regression then you could say something like the variance between them is sort of a stand in for confidence. Low variance means everyone agrees, high variance means there's some major disagreement.
- So what does that mean in the boosting case? As you recall, the final output of the boosted classifier is given by a very simple formula.

- h of x is equal to the sign of the sum, over all of the weak hypotheses that you've gotten, of alpha times h: H(x) = sign(sum_t alpha_t h_t(x)).
- So it's the weighted average of all of the hypotheses. If it's positive, you produce a plus one; if it's negative, you produce a minus one; and if it's exactly zero, you don't know what to do, so you just produce zero. Just throw up your hands.

- So I'm going to make a tiny change to this formula, just for the purpose of explanation, that doesn't change the fundamental answer. I'm going to take exactly this equation as it is, and I'm going to divide it by the sum of the weights that we use: H(x) = sign(sum_t alpha_t h_t(x) / sum_t alpha_t).

- The alpha is always set to be the natural log of something; specifically, alpha_t = (1/2) ln((1 - epsilon_t) / epsilon_t), where epsilon_t is the weak hypothesis's weighted error. These alphas are applied to hypotheses, whereas the alphas in the SVM setting were applied to data points. Unfortunately, in machine learning, people invent things separately and reuse notation. Alpha's an easy Greek character to draw, so people use it all the time.
- But here, remember, alpha's the measure of how good a particular weak hypothesis was, and since it has to do better than chance, it works out that it will always be greater than zero.
- This normalization factor, this denominator, doesn't depend on x, the input; it's just a constant with respect to x, so it won't actually change the answer. It really is the same answer as we had before, just a different way of writing it. What it ends up doing is normalizing the output, so it turns out that this value is always going to be between minus one and plus one. Otherwise, it doesn't change anything about what we've been doing for boosting.
- This makes it easier for me to draw what I want to draw next. So, we know that the output of this little bit inside the sign function is always going to be between minus one and plus one. Let's imagine that I take some particular data point x and I pass it through this function, I'm going to get some value between minus one and plus one. And let's just say for the sake of the argument, it ends up here.

- It's a positive example and it's near plus one. This would be something that the algorithm is getting correct. It's not just getting it correct, but it is very confident in its correctness because it gave it a very high value. By contrast there could have been another positive that ends up around here.
- So it gets it correct but it doesn't have a lot of confidence so to speak in its correct answer because it's very near to zero. So that's the difference between error and confidence.
- So now imagine there's lots of little points like this.
- And if you're doing well, you would expect that very often you're going to be correct. And so you end up shoving all the positives over here to the right, and all the negatives over here to the left. And it would be really nice if you were sort of confident in all of them.
- So now I want you to imagine that we've been going through these training examples and we've gotten very, very good training error. In fact, let's imagine we have no training error at all, so we label everything correctly. Then the picture would look just a little bit different: we're going to have all the pluses on one side, and all the minuses on the other.
- But we keep on training, we keep adding more and more weak learners into the mix.
- What ends up happening in practice is that boosting keeps putting more distribution weight on the hard examples, and the hard examples are going to be the ones very near the boundary. So as you add more and more of these weak learners, what seems to happen in practice is that the pluses near the boundary and the minuses near the boundary just start moving farther and farther away from the boundary. This minus starts drifting until it's all the way over here, that minus starts drifting until it's all the way over there, and the same happens for the pluses.
- As you keep going, what ends up happening is that your error stays the same. It doesn't change at all, however your confidence keeps going up and up.
- Which has the effect, if you'll look at this little drawing over here of moving the pluses all around over here, so they're all in a bunch, and the minuses are on the other side.
- There's a big gap between the leftmost plus and the rightmost minus, which, in the context of this lecture, reminds us of a margin. Basically, what ends up happening is that as you add more and more weak learners, the boosting algorithm becomes more and more confident in its answers, which it's getting correct, and therefore effectively ends up creating a bigger and bigger margin.
- Large margins tend to minimize overfitting. So, counter intuitively, as we create more and more of these hypotheses, which you would think would make something more and more complicated, it turns out that you end up with something smoother, less likely to overfit and ultimately, less complicated. So the reason boosting tends to do well and tends to avoid overfitting even as you add more and more learners is that you're increasing the margin.
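A rough illustration of that story, on hypothetical toy data with depth-1 decision stumps as the weak learners, using scikit-learn's AdaBoost. scikit-learn's decision_function divides by the sum of the alphas, so each score acts as a normalized confidence, and we can watch it grow as weak learners are added:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Hypothetical toy data: a diagonal boundary that single stumps can't nail.
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Boost depth-1 decision stumps.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=0).fit(X, y)

# Training accuracy plateaus while the average normalized confidence
# of the ensemble tends to keep climbing with more weak learners.
for stage, scores in enumerate(clf.staged_decision_function(X), start=1):
    if stage in (1, 10, 50):
        acc = (np.sign(scores) == y).mean()
        print(stage, round(acc, 3), round(np.abs(scores).mean(), 3))
```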
- So, there you go, do you think, then, that boosting never overfits?

- So we just tried to argue that boosting has this nice habit of not always overfitting. But of course, something can always overfit, because otherwise we'd just do boosting and be done, and then neither of us would have jobs.
- So here are five possibilities.
- All right, Michael. What's the answer?
- The last one, boosting tends to overfit if boosting trains too long. You just told me a story about that not being true. So I'm going to eliminate that one from consideration.
- Boosting tends to overfit if it's a nonlinear problem. I don't see why the problem being linear or nonlinear has anything to do with overfitting.
- A whole lot of data is the opposite of what tends to cause overfitting. If there's lots of data, then you'd think it would actually do a pretty reasonable job: there's a lot to fit, a lot going on there, so it's unlikely to overfit. In fact, if a whole lot of data included all of the data, and you actually could get zero training error over it, then you know you have zero generalization error, because it'll work on the testing data as well; it's in there.
- Weak learner uses artificial neural network with many layers and nodes. So I'm guessing that you wanted me to think about that being something that, on its own, is prone to overfitting, because it's got a lot of parameters. If we fit a neural net, and then we fit another neural net, and another, and we're combining all the outputs together in the correct, weighted way, it's not obvious to me that that should be a good thing to do. I'm not sure it would overfit, but it seems like it sure could.
- Let me look at the first one. Weak learner chooses the weakest output. Well, boosting is supposed to work as long as we have a weak learner, and it doesn't matter if it chooses the weakest or the strongest. All that matters is that it does significantly better than chance, that is, error below one half.
- So the only one of these choices that is likely to be true is the second one.
- So let me give you an example of when that would be correct. Let's imagine I have a big, powerful neural network that could represent any arbitrary function: it's got lots of layers and lots of nodes. So boosting calls it, and it perfectly fits the training data, but of course overfits. So then it returns with no error, which means all of the examples will have equal weight. And when you go through the loop again, you will just call the same learner, which will use the same neural network, and will return the same neural network. So every time you call the learner, you'll get zero training error, but you will just get the same neural network over and over and over again. And a weighted sum of the same function is just that function.
- So if the weak learner overfits, boosting will overfit. And not only will it overfit, it'll be stuck in a loop, repeating the same overfit hypothesis.
- So that's the sort of situation where you can imagine boosting overfitting. If the underlying learners all overfit and you can never get them to stop overfitting, then there's really not much you can do.
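The argument above has a tiny numeric core worth checking: if the weak learner gets zero weighted error, AdaBoost's reweighting rule multiplies every weight by the same factor, so after normalization the distribution is unchanged, and the same overfit hypothesis gets returned forever. Here is a small sketch (my own illustration, not the lecture's code) of that step.

```python
import math

def reweight(weights, correct, alpha):
    """One AdaBoost reweighting step, w_i <- w_i * exp(-alpha * y_i * h(x_i)),
    followed by normalization. `correct[i]` is True iff point i was classified
    correctly by the weak hypothesis."""
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.1, 0.2, 0.3, 0.4]
# The overfit learner gets everything right, so every weight is multiplied
# by the same factor exp(-alpha) and normalization undoes it completely.
updated = reweight(weights, correct=[True, True, True, True], alpha=1.5)
print(updated)  # identical to the starting distribution
```

Since the distribution never changes, each subsequent call to the learner sees exactly the same problem and returns exactly the same hypothesis.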

- I have this feeling that it's necessary to always make sure you know what problem you're solving, before you start proposing algorithms for solving it. And we haven't really nailed down what exactly that problem is. And that makes things hard. It makes it hard to know, for example, whether or not one algorithm's better than another.
- So one of the things we've talked about so far is algorithms for learning classifiers. So imagine that each of these three boxes is the output of a classifier in a two-dimensional space. It's a two-dimensional input space; we've run a learner, and when it spits out a classifier, this is how it classifies the regions of the space. So I use blue lines to separate the square, the two-dimensional space, into regions, and then I labeled each region with a minus or a plus, so you can tell what the classifier was actually doing.
- So I would choose for the second one support vector machines, and the reason why is because there seems to be a single line that's in the middle between where the pluses and minuses are, so it looks like it's trying to maximize the margin. Though those aren't training points; those are labels on the regions. The support vector machine is going to find a linear separator, and so the classifier that comes out of it is going to separate any kind of region into pluses and minuses, with some line separating them. That's also true of perceptrons, and it's also true of certain kinds of simple neural nets. So all of those seem like they would be reasonable answers for the middle one.
- I'm going to say the third one is a nearest neighbors method: one nearest neighbor. I don't think we got a chance to draw one of these diagrams when we were talking about nearest neighbors, but this is called a Voronoi diagram, because what's happening is it's breaking up the plane into the regions that are closest to some particular points. So you give it a set of points, and it breaks up the plane into the things closest to each of them. In this particular case we probably had three points: two were labeled plus and one was labeled minus, and so it breaks the space up into those sections that way.
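The Voronoi picture can be reproduced with a few lines of code. Here's a quick sketch (my own toy example, not from the lecture) of why a 1-nearest-neighbor classifier carves the plane into cells: every query location gets the label of whichever training point is closest.

```python
# Three labeled training points, as in the picture: two pluses and a minus.
train = [((1.0, 1.0), '+'), ((4.0, 1.0), '+'), ((2.5, 4.0), '-')]

def nn_label(q):
    """Label a query point with the class of its nearest training point
    (squared Euclidean distance; no need for the square root when comparing)."""
    return min(train,
               key=lambda p: (p[0][0] - q[0]) ** 2 + (p[0][1] - q[1]) ** 2)[1]

# Points near each training point inherit its label, which is exactly
# what carves the plane into the Voronoi cells in the drawing.
print(nn_label((0.9, 1.1)), nn_label((4.2, 0.8)), nn_label((2.5, 3.9)))  # + + -
```

Sweeping `nn_label` over a grid and coloring by label would render the Voronoi diagram itself; the cell boundaries are the sets of points equidistant from two training points.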
- I'm going to say the first one is decision trees, because decision trees split things: in a two-dimensional space, the top node splits on one dimension, and here it looks like that's this split. Then on the left side that's a minus; on the right side there's another split. On the left side of that it's a plus; on the right side there's another split into pluses and minuses. You get these interesting nested rectangles. They also can be very pretty.

- Computational learning theory really gets us a formal way of addressing three really important questions.

- One is: what's a learning problem? Let's define very carefully what it is that we want a learning algorithm to do.
- If we can do that, we can actually show that specific algorithms either work or don't work, with regard to the definition of the problem. And maybe we can even come up with algorithms that solve those problems better. So that's kind of on the upper-bound side.
- And then on the lower-bound side we can also show, for example, that some problems are just fundamentally hard. So you define a particular learning problem and you discover: wait, the algorithms that I'm thinking of don't seem to work. You might actually be able to show that no algorithms in some particular class are ever going to be able to solve it, because that problem is not solvable by algorithms in that class. So those problems are fundamentally hard.

- So, answering these kinds of questions requires that you be fairly careful about defining things and use mathematical reasoning to determine what's going on. So we're going to focus mostly on that, and talk about some algorithms that are not necessarily practical. You wouldn't necessarily want to use them, but they do help illuminate what the fundamental learning questions are and why certain algorithms are effective or ineffective.

- We just justified this in the same way that a computing theoretician might justify theory. In fact, the kinds of analyses and tools that are used for analyzing learning questions are the same kinds of tools and mechanisms that are used for analyzing algorithms in computing.

- Often in theory of computation, we analyze algorithms in terms of their use of resources, usually time and space.
- When we talk about an algorithm running in say n log n time, we're saying something about the time that it takes.
- If you say that it uses n^2 space, we're talking about the amount of memory that it takes up as a function of the growth of the inputs.
- We're trying to select among algorithms. We want algorithms that use their resources well. What do you think would be some reasonable resources to try to manage in a learning algorithm?
- Two of them are time and space. After all, at the end of the day, it's still an algorithm. We need to be able to analyze algorithms in terms of time and space.
- The only thing that matters in machine learning, or the most important thing, is data. So I would think that another resource that we care about is the data, and in particular the set of training samples that we have. We want to know: can we learn well with a small number of samples? In particular, if our learning algorithm works great in terms of time and space, but in order to run it you actually have to give it examples of every possible input, then that's not going to be a very useful algorithm. So the fewer samples it can use, the more it's generalizing effectively, and the better it is at learning.

- Inductive learning is learning from examples. It's the kind of learning problem that we've been looking at the whole time, but we haven't been very precise about all the various quantities that we want to be able to talk about. There are a number of properties that we need to be thinking about when we go and define what an inductive learning problem is and measure what an inductive learning algorithm does.
- What's the probability that the training is actually going to work? Whatever it is that it's trying to produce, it may be that in some cases, because the data is really noisy or we just got a bad set of data, it might not actually work. So we generally talk about a quantity like 1 minus delta as the probability of success. Delta here is the probability of failure, and this is just 1 minus that.
- There's also issues like the number of examples to train on.
- There's also something that we really haven't talked about yet, but you could imagine that the complexity of the hypothesis class might matter. A hypothesis class is complex if it contains very complex hypotheses. If you have a hypothesis class that can't represent much, then it will be hard for you to learn anything complicated. The downside to having a hypothesis class that is very, very complex is that it's much easier to overfit, so getting something that actually works well might be challenging: you might need a lot more data to nail down what you're really talking about. So it's a bit of a double-edged sword.
- It may be easy to learn if you don't have to learn very well. So the accuracy to which the target concept is approximated, often written as epsilon, is another thing that's going to be important in understanding the complexity of a learning algorithm.
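These quantities, delta, epsilon, |H|, and the number of samples, get tied together by a standard PAC-style sample-complexity bound for finite hypothesis classes, which the course develops later. As a sketch (the numbers below are purely illustrative): with m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples, any hypothesis consistent with the training data has true error at most epsilon, with probability at least 1 - delta.

```python
import math

def pac_samples(h_size, epsilon, delta):
    """Samples sufficient so that, with probability >= 1 - delta, any
    consistent hypothesis from a class of size h_size has error <= epsilon."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: conjunctions of literals over k = 10 variables, so |H| = 3 ** 10.
print(pac_samples(3 ** 10, epsilon=0.1, delta=0.05))  # 140
```

Note how the bound depends only logarithmically on |H| and 1/delta, but linearly on 1/epsilon: demanding higher accuracy is much more expensive than demanding higher confidence.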
- So those are the main complexity-related quantities that I wanted to talk about. There are also some choices as to how the learning problem is actually framed: the manner in which training examples are presented, and the manner in which training examples are selected for presentation. We're going to talk about both of these. When I talk about the manner in which training examples are presented, there are two that I think are really important to look at.

- Batch: that's mostly what we've looked at so far. There's a training set that's fixed and handed over to the algorithm in a big bolus. A big group. A big batch.
- Or it could also be presented online, so one at a time. So we say to the training algorithm, or the learning algorithm, here's an example. And then it has to predict what the label is. And then the algorithm can say, oh here's what the right label is. Let's try again. Here's another example. And it can go back and forth like that.

- Let's talk about the manner in which training examples are selected in the next slide.

- It matters how we select training examples when a learner is trying to learn. Let's at least articulate various ways that training examples could be selected. And then for each one, we might end up with a different answer as to how much training data is going to be needed, if the training examples are selected in that particular way.
- So, you keep using the word learner, and when there's a learner there's often a teacher. So I'm trying to think about the different ways that learners and teachers can interact, drawing on my experience as a professor. Here's one: sometimes the learner asks questions of the teacher. In this case, the learner would be selecting the training examples. The learner would say, here's an input x, would you please tell me c(x)?
- Then there's another case. So the learner asks questions of the teacher. But sometimes the teacher just goes ahead and gives x, c(x) pairs. And let's say it actually is trying to help the learner.
- It could be the learner asking, to try to get data. It could be the teacher asking, to try to lead the learner to something good. Neither sounds like what we've been looking at so far, where somehow the x's and the c(x)'s, the training examples, just come out of nowhere. They come from some underlying distribution. It's just nature; it's coming at us from some process that we don't know.
- So in some sense there's three individuals that could be asking for examples. There's the teacher, there's the learner, and there's the world around them.

- Think about two different cases. One where the teacher is going to select which questions to ask: the teacher's going to say to the student, here's the question you should ask me next, to try to figure out which person it is as quickly as possible. And then next we'll look at what happens if the learner's the one asking those questions. In some sense, it doesn't seem like it should make a difference, because in either case the learner is going to ask the question and the teacher is going to answer truthfully. Just in one case the learner has to come up with the questions, and in the other case the teacher has to come up with the questions.
- The teacher actually knows the answer. How many questions are necessary for a smart learner to figure out the right person, assuming that the teacher is the one who gets to feed the learner good questions?
- You could've done this in one question, with a sufficiently helpful teacher. You can do this in one.
- That seems like cheating, but it's a very different problem when the question has to come from the learner's side.

- If the learner is choosing which questions to ask, the learner is in a distinctly more difficult situation, because it doesn't actually know what the right answer is.
- There's a couple of ways to think about it, but let me just go top down. So here's what I'm thinking. I think the same principle applies, where you sort of want to ask the question that gives you the maximum amount of information.
- So, when we ask a question, x, it's going to either have a yes answer. Or a no answer. And let's say that at any given point in time, we have n possible people that we're trying to reason about. And then after we choose x, after we choose the question, if the answer is yes then we're going to have, I don't know, some other number of people.
- So I always want to ask the question that's going to give me maximum information. I'm the learner. I don't know everything like the teacher does, but what I do know are all of my hypotheses. Alright. So I know how each hypothesis responds to each of the questions. So, in general, it seems to me that I should try to eliminate as many hypotheses as I can, okay? That makes sense?
- As a learner, I don't know the answer, so finding questions that whittle it down to one if the answer is yes aren't very helpful, because the answer might be no. Let's see: under the assumption that the target hypothesis was chosen from H, say, uniformly at random, what's the expected number of hypotheses we can eliminate with a question of this form? Well, it's binary. So if you assume that the hypothesis space covers everything, then the best you can do is eliminate about half of them, on average. Since all possibilities are there, the best you can hope for is about a half.
- Okay, so you want to pick questions that roughly split things in half. And if you can do that every single time, then every single time, you will split the set in half. So that'll take you a logarithm amount of time.
- So the number of times you can divide a number in half before you get down to one is exactly the log base two. So if we start with |H| hypotheses, it will take us about log base two of |H| questions to whittle it down to a single hypothesis. This is a nice small number, but it's a lot larger than one, which is what the helpful teacher achieved. Still, it's really good: you're not considering all the hypotheses one by one, you're cleverly whittling things down so that you only need the log of that.
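The halving argument above can be sketched in a few lines (my own toy framing, not the lecture's code): stand in for hypotheses with integers, and let each query ask which half of the surviving version space the target lies in. Pinning down one of N hypotheses then takes exactly ceil(log2 N) queries.

```python
import math

def queries_to_identify(n_hypotheses, target):
    """Count queries needed when every query splits the candidates in half."""
    candidates = list(range(n_hypotheses))
    queries = 0
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        queries += 1
        # The truthful answer tells us which half survives.
        candidates = half if target in half else candidates[len(candidates) // 2:]
    return queries

print(queries_to_identify(16, target=11))  # 4, which is log2(16)
# For conjunctions over k = 5 variables, |H| = 3^5 = 243 hypotheses:
print(math.ceil(math.log2(3 ** 5)))        # 8 queries by this argument
```

The catch, developed in the sections that follow, is that this assumes you can always pose a query that splits the version space in half; with constrained queries that may be impossible.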

- When I gave the example of a teacher asking questions to lead the learner to a particular hypothesis, I allowed the teacher to ask any possible question in the universe of possible questions. The teacher could construct a question that specifically had a yes answer for the target hypothesis and a no answer everywhere else. And that simply doesn't happen in any realistic learning setting. So we need to think more about what happens when the teacher is constrained with respect to the kinds of queries it can suggest. So I'm going to set up a classic computational learning theory hypothesis class, and we'll go through some examples of what happens when the teacher's constrained in this way. Let's imagine that our input space is k-bit inputs, x1 through xk, and the hypothesis class is conjunctions of literals: variables or their negations.
- Here's a concrete example of a hypothesis in this set: x1 AND x3 AND NOT x5. We're going to connect together some of these variables, not necessarily all of them, and some of them can be negated. And all the connectives have to be ANDs.
- So let's say our input is this: x1 is 0, x2 is 1, x3 is 0, x4 is 1, x5 is 1. x2 and x4 don't matter; I just have to look at x1, x3, and x5. In the quiz, the bottom output is the only 1.
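This hypothesis class is easy to represent and evaluate in code. Here is a small sketch (my own encoding, not the lecture's): map each used variable's 0-based index to the value it must take, 1 for a positive literal and 0 for a negated one.

```python
# x1 AND x3 AND NOT x5, with 0-based indices: x1 -> 0, x3 -> 2, x5 -> 4.
hypothesis = {0: 1, 2: 1, 4: 0}

def evaluate(h, x):
    """True iff the input bit-vector x satisfies every literal in h.
    Variables absent from h are irrelevant and ignored."""
    return all(x[i] == v for i, v in h.items())

print(evaluate(hypothesis, [0, 1, 0, 1, 1]))  # False: x1 is 0, not 1
print(evaluate(hypothesis, [1, 1, 1, 0, 0]))  # True: x1=1, x3=1, x5=0
```

Since each of the k variables is independently positive, negated, or absent, this encoding also makes the size of the class visible: 3^k distinct hypotheses.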

- All right, so here's a table of examples with the input pattern and the actual output of the hypothesis we're looking for. But I'm not going to tell you the hypothesis; you have to figure it out. The way you're going to do that is by figuring out, for each variable, does it appear in the conjunction in its positive form, in its negative form, or is it not there at all? For example, if you want to say the hypothesis is x1 AND NOT x5, you would mark x1 positive, x5 negated, and the others absent. And just to be clear, Charles, I've picked this particular set of examples so that you, Charles Isbell, PhD, would have the best chance of figuring out the hypothesis with the smallest number of examples.
- I appreciate that.
- Go ahead.
- So, what, how do you start with this now, Charles?
- Okay, so the first thing I'm going to do is, I'm going to look at the first two examples. And the reason I'm doing that is because I know they both generate true. And so I'm going to look for variables that are inconsistent. So if I look at X1, for example, it's a one in the first case, and a zero in the second case. So it can't be the case that it's required in order for this to be true. And the same would be true for x3. So I'm going to say that neither x1 nor x3 matter. By contrast, x2, x4, and x5 all have the same values in both of those cases.
- So we don't know much about them quite yet.
- No
- That seems very well reasoned. So we know that x1 can't be part of this formula, and x3 can't be part of this formula. So let's just imagine that they're not there, since they don't really give us any information anymore.
- Beautiful.
- Alright, So what's left?
- So what's left is to make certain that x2, x4, and x5 are necessary, and in particular necessary with the values that they have. So I guess all I can really do is see if there's anything else to eliminate. If I were just looking at the first two, I would think that the answer was NOT x2 AND x4 AND NOT x5.
- Alright. so, hang on, not X2, X4, not X5.
- Right. So, that's what I currently think it is based upon what I just saw.
- And that would be, that's consistent with the first two examples.
- Right. And so now I want to make certain it's consistent with the next three examples. This is the easiest way for me to think about this, anyway. So let's see: NOT x2, x4, NOT x5, so that should be false, which it is. They're all false. So let's see, NOT x2, but x4... yep, that should be false, which it is. And then I do the same thing for the last one. But wait, why isn't that the answer? That can't be the answer.
- It is the answer, you got it.
- Huh I got it right.
- So the thing to notice is that in these first two examples we have x2 is false, x4 is true, and x5 is false, and that's enough to make the conjunction true. But flipping any one of those bits is enough to make it false. So what I showed in the remaining examples is that just by flipping this x2 to a 1, leaving everything else the same, we lose it. Similarly, if we flip x4 to 0 and leave everything else the same, we lose it. Similarly, if we flip x5 to 1 and leave everything else the same, we lose it. So that means that each of these is necessary to make the conjunction. They're all actually in there.
- That's just what I was thinking. So, in other words, you gave me positive examples to eliminate the variables that weren't necessary, and then you gave me negative examples to validate that each of the variables I had left was necessary, because getting rid of any one of them gave me the wrong answer.
- Exactly. Let's even write down those two steps. So the first thing was: show what's irrelevant. And how many queries might we have needed to show that?
- Well, one per variable.
- Well, actually we only need two, because what I did is: all the relevant ones I kept the same, and all the irrelevant ones I flipped from one value to the other. I just have to show you that the output is still one even though they have two different values.
- Oh, no, no. When I said one per variable, k of them, it was because I didn't know. What if all of them were irrelevant?
- Then it would still be two, because then I could just show you the all-zeroes and the all-ones.
- You're right. You're right. You just need, oh that's right. That's exactly right.
- Alright. And then I have to show you that each of the remaining variables is relevant, by flipping it and showing you that the answer is zero. And how many questions did I need for that?
- Three.
- Yeah, three in this case cause there were three variables that were used in the formula. What's the most it could be?
- Well k, cause all of them could be relevant.
- Yeah, three in this case, because there were three variables used in the formula. It's kind of interesting that the total number of hypotheses here is 3^k, because you can see right off this table that each variable is either positive, absent, or negated. But the number of questions that a smart teacher had to ask was more like k + 2.
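The constrained teacher's two-step strategy can be written out end to end. Here is a sketch (using my own encoding from earlier: a hypothesis maps each relevant 0-based variable index to its required bit, which is not notation from the lecture). Two positive examples expose the irrelevant variables, and one negative example per relevant variable proves that variable is needed, for r + 2 <= k + 2 examples total.

```python
def teaching_set(h, k):
    """Teacher: build the k+2-style teaching sequence for conjunction h."""
    base = [h.get(i, 0) for i in range(k)]          # a positive example
    flipped = [h[i] if i in h else 1 - base[i]      # flip every irrelevant bit:
               for i in range(k)]                   # still positive
    negatives = []
    for i in h:                                     # flip one relevant bit each:
        bad = list(base)                            # now negative
        bad[i] = 1 - bad[i]
        negatives.append((bad, 0))
    return [(base, 1), (flipped, 1)] + negatives

def learn(examples, k):
    """Learner: variables that agree across both positives must be relevant,
    because the teacher flipped every irrelevant bit between them."""
    (p1, _), (p2, _) = examples[0], examples[1]
    return {i: p1[i] for i in range(k) if p1[i] == p2[i]}

target = {0: 1, 2: 1, 4: 0}                         # x1 AND x3 AND NOT x5
examples = teaching_set(target, k=5)
print(len(examples))                                # 5 == 3 relevant vars + 2
print(learn(examples, k=5) == target)               # True
```

The negatives aren't needed by this particular learner, but they are what certifies to any consistent learner that each surviving variable really is in the formula.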
- Huh.
- Which is pretty powerful.
- Right, so the smart teacher can help me do this with a number of queries linear in k. So what if I didn't have the teacher who could give me these examples, and I always had to ask?
- That's a good question, let's let's do that.

- Alright, so you asked what happens when the learner is now the one asking. The learner doesn't have the advantage that the teacher had, of knowing what the actual answer was and therefore being able to show specifically what's irrelevant and what's relevant. So what could the learner do? Again, remember that there are 3^k possible hypotheses, and if it could use the 20-questions trick, it could do this in log base 2 of 3^k, which is the same as k times log base 2 of 3. That's worse than what we had; it's larger than k. But can we actually do that?
- I'm going to say yes.
- I don't think we can, so can you help me figure out how that would go?
- Oh, I was just going to assert it, then hope you would tell me. So, how would we do that? Well, the trick we did before is, we tried to find a specific question we could ask, such that we would eliminate half the hypotheses.
- Indeed. But it's not clear how you could even ask such a question. So, just to do this as a thought exercise, I have a hypothesis in mind.
- Okay.
- And you can ask me anything you want, and I will tell you true or false. But you're going to have a very painful time finding it.
- Yeah, but that's just because I'm human. Okay, so I need to find a question where, of all the possible 3^k hypotheses, I eliminate a third of them, which is just going to be hard for me to do by hand, because I could write a program to do this.
- I'm not sure you could. I think, at the moment, there's... well, because I didn't choose my hypothesis at random. I chose a specific hypothesis. Though I guess I could have chosen at random from a subset, and you would have still had a hard time finding it. But just as an exercise: give me an x1 through x5, and I'll tell you what the output is.
- Okay, 00001. Or actually, you know what? All zeros.
- Okay, all zeroes, the output is zero.
- Oh, that's what I should, that's not what I should have done. I should have. No, no.
- That's okay, I won't count that one.
- [LAUGH] Can I just give you like, maybe 3 to the k of them and you'll not count any of them until I get it right?
- Well, that's the problem, right? Not 3 to the k, but if you make 2^k guesses, you'll be okay. But then you'll also have looked at all possible inputs, so that's not really that interesting. In particular, with the example I'm thinking of, you're going to have to guess almost that many just to get a positive example. So almost everything that you throw in gives almost no information, because a no answer doesn't really tell you very much.
- Yeah that's what I was thinking. Well, what I was thinking was I need to find one where the answer is yes.
- Exactly, and I made it so that it's going to take you exponential time just to find one. Once you've found that one, then you're, then you're home free but it's going to take you, you know, you essentially have to enumerate all possibilities before you find one.
- Okay, 0 0 0 0 1, okay? 0 0 0
- There's only one pattern that gives a one.
- Right. Exactly. And because every single one of them is relevant, I'm going to have to keep looking.
- Two of them are negated. This is the only pattern that gives you a one. Now once you have found that and you know that that's the only one, now it's easy. You can just read off the equation. So, what's the equation?
- x1 AND NOT x2 AND x3 AND x4 AND NOT x5.
- And that's the equation. And as a learner, you're not going to be able to find that, right? Because it's just a needle in a haystack until you hit it.
- Yeah, so it's going to take me exponential time. But remember, we're not worried about time; we're worried about sample complexity. So remember the cheat that we have here: I know all the hypotheses and what they say.
- It doesn't help you.
- Yeah it does, because the hypothesis... well, no, that's not true. I'm thinking the wrong thing. I'm sorry, I'm cheating. You're right, I'm cheating. I'm acting as if we have the examples you had before.
- So this constrainedness is really very frustrating, right? Because the question that you really want to ask, you can't ask. You'd like to be able to ask a question that splits the hypothesis class in half, but unfortunately almost all of your questions give very little information; each just knocks out a couple of the possible hypotheses. And so it ends up taking on the order of 2^k samples (not time, but samples) before you can get a handle on what the hypothesis is. So it is harder for the learner, too.
- Right, so when the learner does it, you have no reason to believe one hypothesis over another; you've got all of them. And so in order to figure it out, it kind of has to be that way, because otherwise it would still be linear. So this is bothering me, because if what you said is true, then why does 20 questions work? Why do I ever get log base 2?
- Right. So we'd like to be able to ask questions. So here, let's play this game now: you think of a formula, and I'm going to...
- Oh, wait, I know: the answer is that 20 questions is still the optimal thing to do, given that you know nothing. That log base 2 is kind of an expected answer; sometimes you'll do much worse, and sometimes you'll do better.
- No, in this particular case, if I could ask you more general questions, I could do this with a number of questions linear in k. The questions I'd like to ask are things like: is x1 in the formula, yes or no? Is x1 positive in the formula? Is x1 negated in the formula? I could just fill in these boxes by asking the right questions.
- Right.
- But, but those questions are not in our constrained set. And it's the constrained set that matters here. And our constrained set is, in this particular example just really harsh.
- So, and there's no way to approximate that, right? I can't say: the first question I want to ask is, is x1 positive, negative, or absent? If I looked at all the hypotheses I could do that, but now it's very hard to do, because there's no direct way to ask that question. The only way to ask it is to try, and I have to try all possible exponential cases to know.
- Yeah, 'because we're constrained to only ask queries that are data points, right? So give me the label for this data point. And that's not really the same as is the hypothesis you're thinking of having this particular property.
- But as soon as I get a one, I know something.
- As soon as you get a one, you're in a much happier place. So, in fact, if we had conjunctions of literals without negations,
- Mm-hm.
- We'd be in a much better situation, because then your first question can be all ones: 1 1 1 1 1. The answer has to be one, or the formula's empty. So then you're basically off and running. But the fact that there can be negations in there means that most queries give you useless information.
- So, Michael, okay, you've depressed me. You've basically said this is really hard to learn, because I think we've convinced ourselves, at least you've convinced me, that until I get a one, a positive result, I can't really know anything. And eventually I will get one if I can just do an exponential number of samples, but then my sample complexity is exponential, and I'm sad. So what you're basically saying is: sample complexity makes me a bad person, and there's really nothing I can do to learn anything or get anything good out of my learning process.
- That seems like a very sad way of saying it.

- So maybe this will make you feel better, Charles. When all else fails, change the problem. Let's say that instead of trying to learn the way we were describing before, we're going to change the rules. We're going to use a learning formalism that's sometimes referred to as mistake bounds. Here's how things work in mistake bounds: the learner is sitting around, and input arrives, and then the learner gets to guess an answer for that input. Maybe the learner chose the input, or maybe it came from a helpful teacher, or maybe a malicious teacher; it turns out it's not going to matter. The input's going to show up, and the learner's going to guess the answer. So it doesn't have to guess the hypothesis and get that right; it just has to get the output correct for this input. If the learner's wrong, then we're going to charge it a point and tell it that it was wrong, the mistake count goes up by one, and we repeat. This is going to run forever, and what we're going to do is bound the total number of mistakes made over the infinite run. We're going to keep playing this game forever, and we want to say the total number of mistakes will never be larger than a certain amount.
- So this is giving the learner some new powers, right? The learner now is guessing answers. And all we have to guarantee is that if its guess is wrong, it better learn a heck of a lot from that. Hmm.
- Otherwise, even if it doesn't know much, if it guesses right, that's fine. It doesn't have to learn.
- Okay, I see. So in the case before, so long as I had been guessing false, I would have been okay even if I didn't know what the hypothesis was. And then the moment I got a one, a true, I could actually learn something. Then I should learn something and try to do better guesses from that point on. Okay.
- Outstanding, yes, exactly so. So let's turn that into an algorithm. Here's an algorithm that's really simple and actually can learn very effectively for these mistake bound problems. It works like this: it starts off in this weird state where it imagines that every variable is present in the formula, in both its positive and negated form. Right, which is kind of weird. What that would mean is the formula is x1 and not x1, x2 and not x2, x3 and not x3, x4 and not x4, x5 and not x5. So any input that it gets, it's always going to produce the same answer, right? What is that answer?
- False. False, right, so it's going to keep saying false for a long time, and each time it's right, it's not getting charged any mistakes. It's just that at some point, it's going to get an input where the correct answer is true; it's going to say false, and it's going to have made a mistake. So let's say here's that example: X1 is true, X3 and X4 are true, and the other two are false. The learner said false, but the answer was actually true. So if the answer to this one is true, what do we know?
- We know that one zero... wait, we know that 01001, which is the opposite of what you're saying, could not be part of the formula. So we could remove that from our formula.
- Okay. That's one way to think of it. I wouldn't have quite thought of it that way.
- But am I right? Yeah, if this is what you meant. so...
- [LAUGH]
- Uh, the first variable: not-x1 cannot be in the formula, because if it were, there's no way this input would've been able to produce true.
- Right.
- So we can erase not-x1. We can erase x2 in its positive form, because of the second bit. The third bit says that x3, if it's in there, can't be negated. We don't know if it's in there, but we know it can't be negated. X4 is the same. And x5, if it's in there...
- Right, you get rid of the positive. Yeah, that's what I meant. That's why I wrote down the opposite of everything that was up there.
- Yeah, and that is what we're left with. Okay, it's not exactly what the algorithm says, but it produced the right answer in this case. Alright, so now what is it going to do? It's going to continue to say no unless it sees this particular bit pattern. This is the only thing it will ever predict true on, and it will always be right when it does, because we know that's the correct answer for that input. So it's going to guess no everywhere else. Let's say it gets something wrong again: say it gets one zero one one one wrong. In this particular case, it's going to guess no, and we say, I'm sorry, the answer is yes. Alright, so now what does it know from that?
- Well, I'm just reading your algorithm now. And I'm just going to do what number three says.
- Alright, that's a good idea. It says if we're wrong, which we are in this case, set all the positive variables that were zero to absent; there are none of those. And set all the negated variables that were one to absent. So that's x5: x5 is there in its negated form, but it's actually a one in the input, so we're going to turn it to absent. Alright.
- Mm-hm.
- So that's the same thing you did when we were looking at the problem before. You said if you have two inputs where the output is true for both, any bits that differ between those two patterns must not be part of the formula.
- Right.
- Alright, so now we know definitely that x5 is not in the formula, and that's actually correct; at no point in the future do we have to revisit that. For the other cases we're not quite so sure: they could be in there or not. But each time we get one wrong, we're going to move something from negated (or positive) to absent, and when we do that, it's always going to be correct. So at most we can move k things from negated or positive to absent.
- Oh! So even if we may see an exponential number of examples, we will never make more than k plus one mistakes.
- Perfect.
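The algorithm Michael describes can be sketched in a few lines of Python. This is a hypothetical illustration, not code from the course: it tracks which literals might still appear in the conjunction, predicts true only when every surviving literal is satisfied, and updates only on mistakes.

```python
def mistake_bound_learner(examples):
    """Learn a conjunction of (possibly negated) literals in the
    mistake-bound setting. `examples` is a list of (bits, label) pairs,
    where bits is a tuple of 0/1 values and label is True/False."""
    k = len(examples[0][0])
    # Start by assuming every variable appears both positively and negated,
    # so the initial hypothesis predicts False on every input.
    positive = set(range(k))   # indices believed to appear un-negated
    negated = set(range(k))    # indices believed to appear negated
    mistakes = 0
    for bits, label in examples:
        # Predict True only if the input satisfies every remaining literal.
        guess = (all(bits[i] == 1 for i in positive) and
                 all(bits[i] == 0 for i in negated))
        if guess != label:
            mistakes += 1
            # A mistake here can only be a false negative. Per the update
            # rule: drop positive literals that were 0 in the input, and
            # negated literals that were 1.
            positive -= {i for i in range(k) if bits[i] == 0}
            negated -= {i for i in range(k) if bits[i] == 1}
    return positive, negated, mistakes
```

Running this on every input for a target like "x1 and not x2" over three bits, the learner recovers the formula while making far fewer than 2^k mistakes, in line with the k+1 bound.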
- And so that's exactly what I wanted to do before, right? If I'm a good teacher in this case, and I knew that you started out assuming that everything was false,
- That is, that all variables were there in both their positive and negated form, then I can just give you one example that is true, and that would let you eliminate half of the formula right away. Then I could keep giving you examples that are true but with only one variable different each time, and eventually you would learn it. So the total number of samples you would need would also be k plus 1, if I know how you're starting out as a learner. Does that make sense?
- If we charge things by mistakes.
- No, even if we don't charge things by mistakes. If I'm a teacher who's trying to give you the best examples, and I know that as the learner you're starting out with a formula that is x1 and not x1, x2 and not x2, like you said before, then I can just give you a first example that is true. That'll eliminate half of those literals. Then I only give you true examples from that point on, where I change only one of the variables that are left, and you'll know that you can make it absent to get rid of it. So just as you can only make k plus 1 mistakes, I could give you exactly the right k plus 1 examples, if I know how you're starting out as a learner. If I don't know that, then I have to do the k plus 2 that you showed before.

- Alright, so, remember, Charles, we were talking about three different kinds of ways of choosing the inputs to the learner. Okay? So what were they again? We just looked at two of them.
- So: the learner chooses examples; the teacher, hopefully a nice one, chooses examples; and then there was the case where the examples are given to us by nature. And I guess there was a fourth one, which is that a mean teacher gives them to us, but I don't think that's all that interesting. I tend to think of nature as a mean teacher.
- Well, not only that, but at least in the mistake bound setting that we just looked at, the learner was robust against any choice of where the inputs were coming from. So with a mean teacher it would've done just as well.
- That's a fine point.
- It doesn't matter when it makes its mistakes, it's only going to make a fixed number of them no matter what.
- Okay.
- So we've kind of dealt with three of these, but we haven't really talked about the nature-chooses case yet. That's in some ways the most interesting and relevant, and in other ways the most complicated, because you have to really take into consideration the space of possible distributions. I think now, though, we're in a pretty good situation in terms of being able to define some important terms, and we're going to use those terms to get a handle on this question of what happens when nature chooses.
- Okay, that sounds reasonable.
- So computational complexity, we talked about: how much computational effort is going to be needed for a learner to converge to the answer.
- Sample complexity, in the case of a batch (that is to say, we have a training set), is how large that training set needs to be for the learner to be able to create successful hypotheses. And that's in the batch setting.
- In the online setting, we have this notion of a mistake bound. How many misclassifications can the learner make over an infinite run?
- Mind if I ask you a question? You said something I thought was pretty interesting. For computational complexity, you said how much computational effort is needed for a learner to converge to the right answer. Is that a requirement when you talk about computational complexity? Or is it just that you need to know how much computational effort is needed for a learner to converge to something?
- Well, so if it's converging to something and we don't care what it's converging to, then it's really easy to have an algorithm with low computational complexity, right? It's just like return garbage. It runs in constant time.
- Mm hm.
- So, yeah, it's in the context of actually solving the problem, that computational complexity is most interesting.
- Well, I was going to say, what if the set of hypotheses that I'm willing to entertain doesn't include the true concept?
- Good. Right. So in some sense, maybe in that case what we're trying to find is the best hypothesis in the hypothesis class. But we can still ask about computational complexity in that case. I was saying "successful hypothesis" here; by that I meant whatever it is that the problem demands that we return, and if it's a hypothesis class that's very limited, we might just return the best thing in that class.
- Okay.
- But the important thing here is that we're not going to talk about computational [LAUGH] complexity, for the most part. We're going to focus on sample complexity for the time being. That's really the relevant concept when we're talking about the idea of nature choosing.
- I believe you

- Alright, we're going to do a couple more definitions. This notion of a version space turns out to be really important in understanding how to analyze these algorithms. So imagine we've got some hypothesis space, capital H, and a true hypothesis c in H that we're trying to learn; this is also sometimes called a concept. And we're trying to learn from a training set S, which is a subset of the possible inputs, and consists of those examples along with the true class c(x) for all of those x's. At any given time a learner might have some candidate hypothesis, little h in big H. A learner who produces a candidate hypothesis h such that c(x) equals h(x) for all x in the training set is called a consistent learner. Right, so what would be another way of describing what a consistent learner is?
- A consistent learner can actually learn the hypothesis, or the true concept.
- Well it produces, right, I mean of all the possible things it could return, it returns the one that matches the data that it's seen so far.
- Right, so it's consistent with the data. That makes sense.
- Yeah. And the version space is essentially the set of all the hypotheses that are consistent with the data. So we'll say the version space for a given set of data S is the set of hypotheses in the hypothesis set that are consistent with respect to the samples they're given.
- Okay that makes sense.

- Alright, here's the quiz. Just to practice this concept of what a version space is, let's go through an example. Here's the actual target concept c: it's a mapping from two input bits to an output bit. The two inputs, X1 and X2, take on all four possible combinations of values, and the outputs are zero, one, one, zero; which is to say, the XOR function. Okay, so that's what we're trying to learn.
- Okay.
- But the training data that we have only has a couple of examples in it. One training example says if the inputs are zero and zero, the output is zero. Another says if the inputs are one and zero, the output is one. Ultimately, we're going to ask questions like: what happens if the input is one and one? What's the output? Now, since we know that it's XOR, we know that the answer is zero, but that's kind of using information unfairly. So here's what we're going to do. Here's a hypothesis set; these are a set of functions. The output could be: just copy X1, negate X1, copy X2, negate X2, ignore the inputs and return true, ignore the inputs and return false, take the OR of the inputs, take the AND of the inputs, take the XOR of the inputs, or return whether or not the inputs are equal to each other. Some of these are in the version space for this training set and some of them are not. So, can you check off which ones are in the version space?
- I think I can.
- Awesome. Let's do it. Answer
- All right Charles, what do you think?
- Okay, so being in the version space just means that you're consistent with the data that you see, right?
- Good. Mm-hm.
- Okay, so we should be able to go through this very quickly. So X1: just copy X1 over. Well, if we look at the training data, in the first case X1 is zero and the output is zero. Second case, X1 is one and the output is one. So that is, in fact, consistent with always copying X1.
- So that is in the version space, right?
- Right. And without even having to think about it, since x1 is in the version space, doing the opposite of x1 can't possibly be in the version space.
- Agreed.
- So let's skip that. Looking at x2, we can see that in the first case x2 is 0 and c(x) is 0; so that's consistent so far. But then on the next row, x2 is 0 and you get the opposite: you get 1, the complement of 0. So that definitely is inconsistent with just copying over X2, so you can't have X2, and by very similar reasoning you can't have the opposite of X2.
- Agreed.
- Okay, so "true" is clearly not consistent, because we got a 0 once and a 1 once.
- Mm-hm.
- And by the same argument, "false" is not consistent, because we got a zero once and a one once. Now, OR. Let's see, OR.
- In case it's not clear (I probably could have been clearer about this), here zeroes and falses are the same thing, and ones and trues are the same thing.
- Right. Just like in C.
- [LAUGH]
- Okay. So I just assume everything you do is written in C, Michael, so, it just works that way. In fact there's a C right up there at the top of the, at the top of the slide.
- Yeah so I feel like I should do this.
- Now I understand what's going on.
- Right, now it's in C.
- Much better. Okay, so, OR. OR means if either one of them is true, then you say yes, and... huh! That's actually consistent with OR.
- Yep.
- Hm. But it is not consistent with AND, because one and zero would be zero, not one. Next, XOR: well, I already know it's consistent with XOR, because I happen to know that that's the target concept.
- Yeah.
- And "equal" would be not consistent. Though interestingly, "not equal" would be consistent.
- Yes, because not equivalent is XOR.
- Oh, yeah, that's right.
- Alright, so that's it: X1, OR, XOR.
- Excellent. Cool.
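The quiz can be checked mechanically. Below is a hypothetical sketch (not course code) that enumerates the quiz's candidate hypotheses and keeps the ones consistent with the two training examples:

```python
def version_space(hypotheses, samples):
    """Subset of hypotheses consistent with every labeled sample.
    hypotheses: dict mapping a name to a function of the input tuple."""
    return {name for name, h in hypotheses.items()
            if all(h(x) == label for x, label in samples)}

# Candidate hypotheses from the quiz (inputs are pairs of bits).
H = {
    "x1":     lambda x: x[0],
    "not_x1": lambda x: 1 - x[0],
    "x2":     lambda x: x[1],
    "not_x2": lambda x: 1 - x[1],
    "true":   lambda x: 1,
    "false":  lambda x: 0,
    "or":     lambda x: x[0] | x[1],
    "and":    lambda x: x[0] & x[1],
    "xor":    lambda x: x[0] ^ x[1],
    "equal":  lambda x: int(x[0] == x[1]),
}

# Training set from the quiz: (0, 0) -> 0 and (1, 0) -> 1.
S = [((0, 0), 0), ((1, 0), 1)]
```

With those two samples, `version_space(H, S)` keeps exactly copy-X1, OR, and XOR, matching the walkthrough above.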

- Alright, the next topic we're going to get into is to nail down a concept called PAC learning. To do that, we're going to delve into what the error of a hypothesis is. There are two kinds of errors: the training error and the true error. The training error is on the training set: what fraction of those examples are misclassified by some given hypothesis h?
- Okay.
- Now H could be different from the target concept. The target concept ought to have a training error of zero. But some other hypothesis h might have some error. Okay?
- Sure, that makes sense.
- Okay, so the true error is the fraction of examples that would be misclassified on a sample drawn from D, essentially in the infinite limit. So: what is the probability that a sample drawn from D would be misclassified by some hypothesis h?
- Okay.
- And we can actually write that mathematically. The error with respect to some distribution D of some hypothesis h is the probability that, if we draw the input x from that distribution, we get a mismatch between the true label, according to the concept, and the hypothesis that we're currently evaluating.
- Right, so this sort of captures the notion that it's okay for you to misclassify examples you will never ever ever see.
- Yes, that's right. We're only being penalized in proportion to the probability of seeing that thing and getting it wrong. So for examples that we never see, that's fine. For examples that we see really, really rarely, we get only a tiny little bit of contribution to the overall error.
- Okay, I can follow that.
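When the distribution D is known over a small input space, the true error can be computed exactly as a weighted sum. A hypothetical sketch (the distributions and hypotheses here are made up for illustration):

```python
def true_error(hypothesis, concept, distribution):
    """error_D(h): the probability, under D, that h(x) disagrees with
    the true concept c(x). Sums the mass of the misclassified inputs."""
    return sum(p for x, p in distribution.items()
               if hypothesis(x) != concept(x))

# Example: the concept is XOR over two bits, the hypothesis is OR.
# They disagree only on the input (1, 1).
xor = lambda x: x[0] ^ x[1]
or_ = lambda x: x[0] | x[1]
uniform = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```

Under the uniform distribution the error is 0.25; under a distribution that never produces (1, 1), the same mistake costs nothing, which is the point Charles makes above.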

- Alright, with just a couple more definitions we'll be able to nail down what PAC learning is. Let's consider a concept class C; that is to say, the set from which the concept we're trying to learn comes. L is our learner. H is the hypothesis space, the set of mappings that the learner is going to consider, and n is the size of that space. D is the distribution of inputs, like we looked at before. Then we've got the Greek letters epsilon and delta. Epsilon is our error goal: we would like the error in the hypothesis that we produce to be no bigger than epsilon. It could be smaller, that'd be great. It could be zero, that'd be awesome. But it can't be bigger than epsilon. Still, we could get really unlucky and not meet our error goal, and so delta is what allows us to set an uncertainty goal, or certainty goal: with probability one minus delta, the algorithm has to work.
- And by work, you mean it has to be able to produce a true error less than or equal to epsilon.
- Exactly.
- So why can't we always force epsilon to be zero, and, you know, delta to be zero?
- Right, to be absolutely sure that we get zero error. Part of it is because we are sampling training examples from this distribution D. As long as there's probability mass on more than one example, there's always a chance that we draw a finite sample that only ever gives us one example over and over again.
- That's fair.
- So we just have to be really careful about that. If we're that unlucky, which is very unlikely to happen, our error is going to be very high. But that's okay, because it won't happen very often.
- Okay, sure, sure. So basically you can't force either epsilon or delta to be zero under all circumstances. You have to allow some wiggle room.
- Exactly, it gives us the wiggle room we need. That's actually where this name PAC comes from. It stands for probably approximately correct.
- Oh I see.
- So we would like to be correct, but then we're just going to keep adding weasel words until we can actually achieve it. We would like to be correct, but we can't be exactly correct, so let's be approximately correct. But we can't be approximately correct all the time, so let's only probably be approximately correct.
- So using your words here, you could have called that "one minus delta, epsilon correct".
- Yes, right, that's exactly right. That's the role the Greek letters are playing, and "correct" is this: error sub D of h equals zero.
- Yeah, so "one minus delta, epsilon, error sub D of h equals zero" doesn't roll off the tongue quite as well as probably approximately correct.
- I would agree with that.
- Okay fair enough.

- Alright, now we can actually dive in and give a definition for PAC-learnable. A concept class C is PAC-learnable by some learning algorithm L, using its own representation of hypotheses H, if and only if that learner will, with probability at least one minus delta, output from its set of hypotheses a hypothesis that has error less than or equal to epsilon; so it's very accurate. And it needs to be the case that the time that it takes to do this, and the number of samples that it needs to draw from this distribution D, is relatively small: in fact, bounded by a polynomial in one over epsilon, one over delta, and the size of the hypothesis space n.
- Okay, so in other words, you're saying something is PAC-learnable if you can learn it to get low error, at least with some high confidence, in time and samples that are polynomial in all the parameters.
- Exactly. And you can see here that this epsilon and delta are actually giving us a lot of wiggle room. If you really want perfect error, or perfect certainty, then these quantities go to infinity: you'd need to look at all the possible data.
- Mm.
- So, yeah this is really going to only give us partial guarantees.
- Okay. Sure. Okay, I think I understand that.
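The definition can be written compactly; this is a sketch of the standard formulation in the notation used above:

```latex
C \text{ is PAC-learnable by } L \text{ using } H \iff
L \text{ outputs } h \in H \text{ such that }
\Pr\!\left[\, \mathrm{error}_D(h) \le \epsilon \,\right] \ge 1 - \delta,
```

with the running time and the number of samples drawn from $D$ both bounded by a polynomial in $1/\epsilon$, $1/\delta$, and $n = |H|$.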

- Let's do a little quiz. Here is our concept class and our hypothesis class, which are the same in this case: the set of functions h sub i that, for an input x, return the i-th bit of that input.
- Mm-hm.
- Alright. So if we're talking about k-bit inputs, then there's going to be a hypothesis in this set for each of those k bits, and that hypothesis returns the corresponding bit. We're given a training set with examples like that, and then we need to produce outputs that are consistent with it. So what I'd like you to figure out is: do you think there is an algorithm L such that C is PAC-learnable by L using H? If you can give me a learning algorithm that would do a good job, and would, with high probability, be able to figure this out, then you should go for it.
- Okay. Answer
- Okay, so what'd you think? Was that enough information to kind of chew on it a little bit?
- Maybe. [LAUGH] Let's see. I have, I guess, k hypotheses to choose from. I know that basically one of them always returns the right output. I don't get to choose what the examples are; they're drawn from some distribution.
- Right.
- And I want to know whether I'm going to need to see a number of examples that is polynomial in the error that I'm interested in and the certainty that I'm interested in. And I've got to be able to come up with a learning algorithm that will do that.
- Exactly. So what do you think?
- I want to say the answer is yes.
- So how would you argue that though? Do you have a particular algorithm in mind?
- Well, I actually did: I was going to keep all the hypotheses that are consistent with the data that I've seen.
- Okay, alright and what is that called?
- The version space.
- Right. Okay.
- So I keep track of that, and then, when I stop getting samples, I have to pick one of those. So I'm just going to pick one of them uniformly.
- Okay, well, that seems good so far. You could pick one uniformly. So what makes you think that that's actually going to be able to return the right answer, or a close enough answer, with not that much data? How do we actually bound the number of samples we need so that when we pick uniformly, we actually get a good answer?
- Well, it's a little hard. My intuition is that, given that we don't know what the concept is, only the class that it comes from, if I do anything other than choose uniformly, then I can be very unlucky. Choosing uniformly basically means I don't know anything else to do. So there's sort of no better algorithm than the one we just came up with: find all the hypotheses that are consistent, and since you have no reason to choose one over the other, because you don't have any more data, you should just close your eyes and pick one. If you do anything else, then in the absence of some specific domain knowledge that tells you to do otherwise, you can basically end up being very unlucky.
- So I agree with you, and this is a good algorithm. But what we lack at the moment is an argument as to why the number of samples it needs isn't exponential. Right? Because there are exponentially many, 2 to the k, different possible inputs. If we had to see all of them to be able to guess right, that would be too many.
- Mm hm.
- So we really want it to be polynomial in k, not exponential in k. So I'm going to say we don't have enough background yet to answer this definitively. Yes is the right answer, but we're going to need to dive in a little more deeply and develop some more concepts to be able to argue why that's the right answer.

- So this is going to be a key concept for being able to develop an answer to questions like that, and it's the notion of epsilon exhaustion. Which sounds kind of tiring.
- I know I'm epsilon exhausted right now.
- Yeah, me too. So here's the definition. A version space derived from a particular sample is considered epsilon exhausted if and only if all the hypotheses in that version space have low error, meaning error at most epsilon. If we can get to that state, then your algorithm is going to work: your algorithm says at that point, choose any of the hypotheses in the version space, and you're going to be fine; you're going to have low error.
- Sure.
- If you don't reach that state, your algorithm is going to be in trouble, because you're choosing one of the hypotheses uniformly at random, and it has a chance of being wrong. It could be a fairly high chance: say there are only two hypotheses left in the version space, one with high error and one with low error. We really have to make sure that the only things left in the version space have low error.
- Okay, that makes sense, that makes sense.
- Alright, so we're going to have to develop a little bit of theory to figure out when that occurs, but this is a really key concept.
- Okay, so I guess if I was trying to put this into English, just to make certain I understand it: a version space is epsilon exhausted exactly in the case when everything that you might possibly choose from it has an error less than epsilon.
- Sure.
- Period. And so if there's anything in there that has error greater than epsilon, then it's not epsilon exhausted.
- Right. It's still epsilon energized.
- Right. That makes a lot of sense, epsilon energized. That's a pretty good name for a band. OK.
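The epsilon-exhaustion condition can be stated directly in code. A hypothetical sketch (not course code), reusing the idea of true error over a known distribution D:

```python
def epsilon_exhausted(version_space, concept, distribution, epsilon):
    """True iff every hypothesis still in the version space has
    true error at most epsilon under distribution D."""
    def err(h):
        # error_D(h): probability mass where h disagrees with the concept.
        return sum(p for x, p in distribution.items() if h(x) != concept(x))
    return all(err(h) <= epsilon for h in version_space)
```

For example, if the concept is XOR over two uniformly distributed bits and the version space still contains copy-X1 (which disagrees with XOR whenever the second bit is 1, so its true error is 0.5), the space is exhausted for epsilon of 0.5 but still "epsilon energized" for any smaller epsilon.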