1 of 52

THE MACHINE LEARNING LANDSCAPE

2 of 52

What Is Machine Learning?

  • Machine Learning is the science (and art) of programming computers so they can learn from data.
  • Here is a slightly more general definition:
  • [Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

—Arthur Samuel, 1959

  • And a more engineering-oriented one:
  • A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

—Tom Mitchell, 1997

3 of 52

Cont…

  • For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called “ham”) emails.
  • The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample).
  • In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy, and it is often used in classification tasks.
  • If you just download a copy of Wikipedia, your computer has a lot more data, but it is not suddenly better at any task. Thus, it is not Machine Learning.
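As a small illustration of the performance measure P described above, accuracy can be computed as the share of predictions that match the true labels. This is only a sketch: the function name `accuracy` and the five example labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly classified examples (the performance measure P)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for five emails: what users flagged vs. what the filter predicted.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
score = accuracy(actual, predicted)  # 4 of the 5 predictions are correct
```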

4 of 52

Why Use Machine Learning?

  • Consider how you would write a spam filter using traditional programming techniques (Figure 1-1):

1. First you would look at what spam typically looks like. You might notice that some words or phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject. Perhaps you would also notice a few other patterns in the sender’s name, the email’s body, and so on.

2. You would write a detection algorithm for each of the patterns that you noticed, and your program would flag emails as spam if a number of these patterns are detected.

3. You would test your program, and repeat steps 1 and 2 until it is good enough.
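The three steps above might look roughly like this in code. It is a deliberately naive sketch: the phrase list comes from the example in step 1, while the function names and the threshold of two matches are assumptions chosen for illustration.

```python
# Step 1: patterns noticed by hand (phrases taken from the example above).
SUSPICIOUS_PHRASES = ["4u", "credit card", "free", "amazing"]

def count_matches(subject):
    """Step 2: one simple substring detector per pattern, applied to the subject."""
    text = subject.lower()
    return sum(1 for phrase in SUSPICIOUS_PHRASES if phrase in text)

def is_spam(subject, threshold=2):
    """Flag the email when at least `threshold` patterns match (assumed cut-off)."""
    return count_matches(subject) >= threshold
```

Step 3 then amounts to running `is_spam` on known emails, inspecting the mistakes, and editing the phrase list or the threshold by hand, which is exactly what makes this approach hard to maintain.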

5 of 52

Cont…

  • Since the problem is not trivial, your program will likely become a long list of complex rules—pretty hard to maintain.
  • In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples (Figure 1-2). The program is much shorter, easier to maintain, and most likely more accurate.

6 of 52

Cont…

  • Moreover, if spammers notice that all their emails containing “4U” are blocked, they might start writing “For U” instead. A spam filter using traditional programming techniques would need to be updated to flag “For U” emails. If spammers keep working around your spam filter, you will need to keep writing new rules forever.
  • In contrast, a spam filter based on Machine Learning techniques automatically notices that “For U” has become unusually frequent in spam flagged by users, and it starts flagging them without your intervention (Figure 1-3).
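One way to picture "detecting unusually frequent patterns" is to compare how often each word appears in spam versus ham and keep the words with a high ratio. The sketch below is an assumption about how such a learner could work, not a description of any particular spam filter; the function names, the smoothing constant, the ratio threshold, and the example emails are all invented.

```python
from collections import Counter

def document_frequency(emails):
    """Fraction of emails in which each word appears at least once."""
    counts = Counter()
    for email in emails:
        counts.update(set(email.lower().split()))
    return {word: n / len(emails) for word, n in counts.items()}

def spam_predictors(spam, ham, min_ratio=3.0, smoothing=0.1):
    """Words at least `min_ratio` times more frequent in spam than in ham."""
    spam_freq = document_frequency(spam)
    ham_freq = document_frequency(ham)
    predictors = {}
    for word, freq in spam_freq.items():
        # Smoothing avoids dividing by zero for words never seen in ham.
        ratio = (freq + smoothing) / (ham_freq.get(word, 0.0) + smoothing)
        if ratio >= min_ratio:
            predictors[word] = ratio
    return predictors

# Made-up training examples, as if flagged by users.
spam_emails = ["free offer 4u", "amazing offer 4u", "free prize 4u"]
ham_emails = ["lunch at noon", "project offer letter", "notes from the meeting"]
learned = spam_predictors(spam_emails, ham_emails)

# Spammers switch to "for u"; retraining on newly flagged spam picks it up.
spam_emails += ["free prize for u", "amazing deal for u", "offer for u"]
relearned = spam_predictors(spam_emails, ham_emails)
```

No rule was edited between the two runs: the new predictors appear only because the training data changed.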

7 of 52

Cont…

  • Another area where Machine Learning shines is for problems that either are too complex for traditional approaches or have no known algorithm.
  • For example, consider speech recognition: say you want to start simple and write a program capable of distinguishing the words “one” and “two.” You might notice that the word “two” starts with a high-pitch sound (“T”), so you could hardcode an algorithm that measures high-pitch sound intensity and use that to distinguish ones and twos.

8 of 52

Cont…

  • Obviously this technique will not scale to thousands of words spoken by millions of very different people in noisy environments and in dozens of languages.
  • The best solution (at least today) is to write an algorithm that learns by itself, given many example recordings for each word.
  • Finally, Machine Learning can help humans learn (Figure 1-4): ML algorithms can be inspected to see what they have learned (although for some algorithms this can be tricky).
  • For instance, once the spam filter has been trained on enough spam, it can easily be inspected to reveal the list of words and combinations of words that it believes are the best predictors of spam. Sometimes this will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of the problem.

9 of 52

  • Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called data mining.

10 of 52

Cont…

  • To summarize, Machine Learning is great for:
  • Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
  • Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
  • Fluctuating environments: a Machine Learning system can adapt to new data.
  • Getting insights about complex problems and large amounts of data.

11 of 52

Types of Machine Learning Systems

  • There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:
  • Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
  • Whether or not they can learn incrementally on the fly (online versus batch learning)
  • Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

12 of 52

Supervised/Unsupervised Learning

  • Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.
  • Supervised learning
  • In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels (Figure 1-5).

13 of 52

Cont…

  • A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.
  • Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression (Figure 1-6). To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
  • In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.

14 of 52

Cont…

15 of 52

Cont…

  • Note that some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (e.g., 20% chance of being spam).
  • Here are some of the most important supervised learning algorithms (covered in this book):
  • k-Nearest Neighbors
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines (SVMs)
  • Decision Trees and Random Forests
  • Neural networks

16 of 52

Unsupervised learning

  • In unsupervised learning, as you might guess, the training data is unlabeled (Figure 1-7). The system tries to learn without a teacher.

17 of 52

Cont…

  • Here are some of the most important unsupervised learning algorithms:
  • Clustering
    • K-Means
    • DBSCAN
    • Hierarchical Cluster Analysis (HCA)
  • Anomaly detection and novelty detection
    • One-class SVM
    • Isolation Forest
  • Visualization and dimensionality reduction
    • Principal Component Analysis (PCA)
    • Kernel PCA
    • Locally-Linear Embedding (LLE)
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Association rule learning
    • Apriori
    • Eclat

18 of 52

Cont…

  • For example, say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors (Figure 1-8).
  • At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help.
  • For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends, and so on.
  • If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.
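To make the clustering idea concrete, here is a minimal K-Means sketch in plain Python. The visitor features (age and usual visiting hour), the data values, and the choice to start the centroids at the first k points are all assumptions made for illustration; a real project would normally rely on a library implementation.

```python
def kmeans(points, k, iterations=10):
    """Plain K-Means on tuples of numbers; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical visitors: (age, usual visiting hour). No group labels are given.
visitors = [(35, 21), (16, 10), (38, 20), (17, 11), (41, 22), (15, 9)]
groups, centers = kmeans(visitors, k=2)
```

The algorithm is never told which visitor belongs to which group; the grouping comes only from distances between the points.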

19 of 52

Cont…

  • Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted (Figure 1-9).
  • These algorithms try to preserve as much structure as they can (e.g., trying to keep separate clusters in the input space from overlapping in the visualization), so you can understand how the data is organized and perhaps identify unsuspected patterns.

20 of 52

Cont…

21 of 52

Cont…

  • A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.

  • It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better.
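The mileage and age example can be sketched by projecting the two correlated features onto their direction of greatest variance, which is what PCA does in the two-dimensional case. The function name and the car data below are invented, and the closed-form angle only applies to two features.

```python
import math

def first_principal_component(xs, ys):
    """Project 2-D points onto their direction of greatest variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    var_x = sum(d * d for d in dx) / n
    var_y = sum(d * d for d in dy) / n
    cov_xy = sum(a * b for a, b in zip(dx, dy)) / n
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    ux, uy = math.cos(angle), math.sin(angle)
    return [a * ux + b * uy for a, b in zip(dx, dy)]

# Made-up cars: age in years and mileage in units of 10,000 km.
ages = [1, 2, 3, 4, 5, 6]
mileages = [1.4, 3.1, 4.4, 6.2, 7.4, 9.1]
wear = first_principal_component(ages, mileages)  # one feature instead of two
```

Each car is now described by a single number that grows with both age and mileage, which can be read as a rough "wear and tear" score.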

22 of 52

Cont…

  • Yet another important unsupervised task is anomaly detection—for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.
  • The system is shown mostly normal instances during training, so it learns to recognize them and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly (see Figure 1-10).
  • A very similar task is novelty detection: the difference is that novelty detection algorithms expect to see only normal data during training, while anomaly detection algorithms are usually more tolerant and can often perform well even with a small percentage of outliers in the training set.
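A very simple stand-in for the idea of learning what normal instances look like is a z-score check: remember the mean and spread of the training values, then flag anything too far away. This is a basic statistical technique, not One-class SVM or Isolation Forest; the class name, the threshold of three standard deviations, and the transaction amounts are assumptions.

```python
from statistics import mean, stdev

class ZScoreDetector:
    """Flags values that lie far from the mean of the training data."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # assumed cut-off, in standard deviations

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = stdev(values)
        return self

    def is_anomaly(self, value):
        return abs(value - self.mean_) / self.std_ > self.threshold

# Hypothetical transaction amounts seen during training, all fairly normal.
amounts = [20, 25, 22, 30, 27, 24, 26, 23, 28, 25]
detector = ZScoreDetector().fit(amounts)
```

With these numbers, `detector.is_anomaly(26)` is false while `detector.is_anomaly(500)` is true.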

23 of 52

  • Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.
  • For example, suppose you own a supermarket. Running an association rule learning algorithm on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.
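Association rules are usually scored with two quantities, support and confidence, which algorithms such as Apriori search over efficiently. The sketch below only computes those two scores for a single hand-picked rule; the baskets are made up, and no rule search is performed.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Made-up sales log: one set of purchased items per basket.
baskets = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "beer"},
    {"barbecue sauce", "potato chips"},
    {"milk", "bread"},
    {"steak", "bread"},
]
rule_confidence = confidence(baskets, {"barbecue sauce", "potato chips"}, {"steak"})
```

Here two of the three baskets that contain both barbecue sauce and potato chips also contain steak, so the rule has a confidence of about 0.67.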

24 of 52

Semisupervised learning

  • Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.
  • Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering).
  • Now all the system needs is for you to tell it who these people are. Just one label per person, and it is able to name everyone in every photo, which is useful for searching photos.

25 of 52

Cont…

  • Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

26 of 52

Reinforcement Learning

  • Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12).
  • It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
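The agent, reward, and policy vocabulary can be illustrated with tabular Q-learning on a toy corridor. Everything here is an assumption made for the sketch: the five-cell environment, the reward values, the hyperparameters `alpha` and `gamma`, and the choice to try every state and action in fixed sweeps instead of exploring at random.

```python
N_STATES = 5          # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = (-1, +1)    # step left or step right
GOAL = N_STATES - 1

def step(state, action):
    """Environment: reaching the goal pays +1, any other move costs 0.1."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward

def q_learning(sweeps=200, alpha=0.5, gamma=0.9):
    """Tabular Q-learning; every (state, action) pair is tried once per sweep."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(GOAL):              # the goal state is terminal
            for a in ACTIONS:
                s2, r = step(s, a)
                best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
                q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q

q_table = q_learning()
# The policy: in each non-terminal state, pick the action with the highest value.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)}
```

The learned policy maps every position to "step right", the action that collects the most reward over time in this environment.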

27 of 52

Cont…

28 of 52

  • For example, many robots implement Reinforcement Learning algorithms to learn how to walk. DeepMind’s AlphaGo program is also a good example of Reinforcement Learning: it made the headlines in May 2017 when it beat the world champion Ke Jie at the game of Go.
  • It learned its winning policy by analyzing millions of games, and then playing many games against itself. Note that learning was turned off during the games against the champion; AlphaGo was just applying the policy it had learned.

29 of 52

Batch and Online Learning

  • In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.

  • In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.

30 of 52

Cont…

31 of 52

Cont…

  • Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
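  • Out-of-core learning is usually done offline (i.e., not on the live system), so “online learning” can be a confusing name. Think of it as incremental learning.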
  • One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.
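A minimal sketch of incremental training: a one-feature linear model updated by gradient descent on one mini-batch at a time, with the learning rate controlling the size of each update. The class and method names, the synthetic data stream, and the learning rate of 0.1 are assumptions chosen for illustration.

```python
class OnlineLinearModel:
    """y = w * x + b, updated one mini-batch at a time with gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def partial_fit(self, batch):
        """One cheap learning step on a mini-batch of (x, y) pairs."""
        n = len(batch)
        grad_w = sum(2 * (self.predict(x) - y) * x for x, y in batch) / n
        grad_b = sum(2 * (self.predict(x) - y) for x, y in batch) / n
        self.w -= self.learning_rate * grad_w
        self.b -= self.learning_rate * grad_b

def stream(n_batches, batch_size=4):
    """Stand-in for data arriving over time; the true relation is y = 2x + 1."""
    for i in range(n_batches):
        xs = [((i * batch_size + j) % 10) / 10 for j in range(batch_size)]
        yield [(x, 2 * x + 1) for x in xs]

model = OnlineLinearModel()
for mini_batch in stream(2000):
    model.partial_fit(mini_batch)  # each batch can be discarded afterwards
```

A higher learning rate makes the model adapt to new data faster but also forget older data sooner; a lower one makes it more stable but slower to react.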

32 of 52

Instance-Based Versus Model-Based Learning

  • There are two main approaches to generalization: instance-based learning and model-based learning.
  • Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails.
  • This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.
  • This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure.
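The word-counting similarity measure described above, combined with a majority vote among the most similar stored examples, could be sketched as follows. The function names, the choice of k = 3, and the example emails are assumptions.

```python
from collections import Counter

def similarity(email_a, email_b):
    """Very basic similarity: the number of distinct words two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

def classify(new_email, training_set, k=3):
    """Majority vote among the k training emails most similar to `new_email`."""
    ranked = sorted(training_set, key=lambda ex: similarity(new_email, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up labeled examples, learned "by heart" (no model is fitted).
examples = [
    ("win a free prize now", "spam"),
    ("free credit card offer", "spam"),
    ("claim your free prize today", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
    ("project status and agenda", "ham"),
]
```

Calling `classify("free prize inside", examples)` returns `"spam"` because the three most similar stored emails are all spam.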

33 of 52

Cont…

  • For example, in Figure 1-15 the new instance would be classified as a triangle because the majority of the most similar instances belong to that class.

34 of 52

Model-based learning

  • Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning.

35 of 52

Cont…

36 of 52

Cont…

  • There does seem to be a trend here! Although the data is noisy (i.e., partly random), it looks like life satisfaction goes up more or less linearly as the country’s GDP per capita increases. So you decide to model life satisfaction as a linear function of GDP per capita.
  • This step is called model selection: you selected a linear model of life satisfaction with just one attribute, GDP per capita.

  • This model has two model parameters, θ0 and θ1. By tweaking these parameters, you can make your model represent any linear function, as shown in Figure 1-18.

37 of 52

Cont…

38 of 52

Cont…

39 of 52

Cont…

  • If all went well, your model will make good predictions. If not, you may need to use more attributes (employment rate, health, air pollution, etc.), get more or better quality training data, or perhaps select a more powerful model (e.g., a Polynomial Regression model).
  • In summary:

• You studied the data.

• You selected a model.

• You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function).

• Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.
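The workflow summarized above can be sketched end to end for a linear model with parameters θ0 and θ1, using the closed-form least-squares solution as the learning algorithm and mean squared error as the cost function. The numbers below are invented for illustration only and are not the dataset behind the figures.

```python
def fit_linear(xs, ys):
    """Least-squares values of theta0 and theta1 for y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    theta1 = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

def mse(theta0, theta1, xs, ys):
    """Cost function: mean squared error of the model on (xs, ys)."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up training data: GDP per capita (thousands of USD) and life satisfaction.
gdp = [10, 20, 30, 40, 50]
satisfaction = [5.3, 5.9, 6.2, 6.8, 7.3]

theta0, theta1 = fit_linear(gdp, satisfaction)   # training
prediction = theta0 + theta1 * 25                # inference on a new case
```

Any other choice of θ0 and θ1 gives a higher `mse` on this training set, which is what "minimizing a cost function" means for this model.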

40 of 52

Main Challenges of Machine Learning

  • Since the main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.”
  • Insufficient Quantity of Training Data
  • Non-representative Training Data
  • Poor-Quality Data
  • Irrelevant Features
  • Overfitting the Training Data
  • Underfitting the Training Data
  • Stepping Back

41 of 52

Insufficient Quantity of Training Data

  • For a toddler to learn what an apple is, all it takes is for you to point to an apple and say “apple” (possibly repeating this procedure a few times).
  • Now the child is able to recognize apples in all sorts of colors and shapes. Genius.
  • Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly.
  • Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).

42 of 52

Cont…

43 of 52

Non-representative Training Data

  • For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.

44 of 52

Cont…

  • A Famous Example of Sampling Bias:
  • Perhaps the most famous example of sampling bias happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes.
  • Instead, Roosevelt won with 62% of the votes. The flaw was in the Literary Digest’s sampling method:

• First, to obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, club membership lists, and the like. All of these lists tend to favor wealthier people, who are more likely to vote Republican (hence Landon).

• Second, less than 25% of the people who received the poll answered. Again, this introduces a sampling bias, by ruling out people who don’t care much about politics, people who don’t like the Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias.

45 of 52

Poor-Quality Data

  • Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.
  • It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that.

46 of 52

Irrelevant Features

  • Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones.
  • A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
  • Feature selection: selecting the most useful features to train on among existing features.
  • Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
  • Creating new features by gathering new data.

47 of 52

Overfitting the Training Data

  • Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves.
  • Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful.
  • In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

48 of 52

Underfitting the Training Data

  • As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.
  • The main options to fix this problem are:

• Selecting a more powerful model, with more parameters

• Feeding better features to the learning algorithm (feature engineering)

• Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)

49 of 52

Stepping Back

  • Machine Learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules.

• There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based, and so on.

• In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as well. If the algorithm is instance-based, it just learns the examples by heart and generalizes to new instances by comparing them to the learned instances using a similarity measure.

• The system will not perform well if your training set is too small, or if the data is nonrepresentative, noisy, or polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit).

50 of 52

Testing and Validating

  • If you compare candidate models and hyperparameters directly on the test set, you measure the generalization error multiple times on that set, and you adapt the model and hyperparameters to produce the best model for that particular set.
  • This means that the model is unlikely to perform as well on new data.
  • A common solution to this problem is called holdout validation: you simply hold out part of the training set to evaluate several candidate models and select the best one. The new heldout set is called the validation set.
  • More specifically, you train multiple models with various hyperparameters on the reduced training set and you select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set and this gives you the final model.
  • Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.
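The holdout validation procedure described above can be sketched with a k-nearest-neighbors regressor, where the hyperparameter to select is k. The synthetic data, the split sizes, the candidate values, and the function names are assumptions. Since this kind of model simply stores its training data, "retraining on the full training set" here just means handing it the larger set.

```python
import random

def knn_predict(train, x, k):
    """Average target of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of a k-NN model that stores `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)                       # fixed seed so the run is repeatable
data = [(x / 10, 2 * (x / 10) + rng.gauss(0, 0.5)) for x in range(100)]
rng.shuffle(data)

test_set = data[:20]                         # touched only once, at the very end
validation_set = data[20:40]                 # held out from the training set
reduced_train = data[40:]
full_train = data[20:]                       # reduced training set + validation set

candidates = [1, 3, 5, 9]                    # hyperparameter values to compare
val_errors = {k: mse(reduced_train, validation_set, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# Final model: best hyperparameter, retrained on the full training set.
generalization_error = mse(full_train, test_set, best_k)
```

The test set plays no part in choosing `best_k`, so `generalization_error` is not biased by the selection step.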

51 of 52

Data Mismatch

  • Suppose you want to create a mobile app to take pictures of flowers and automatically determine their species.
  • You can easily download millions of pictures of flowers on the web, but they won’t be perfectly representative of the pictures that will actually be taken using the app on a mobile device.
  • Perhaps you only have 10,000 representative pictures (i.e., actually taken with the app). In this case, the most important rule to remember is that the validation set and the test set must be as representative as possible of the data you expect to use in production, so they should be composed exclusively of representative pictures: you can shuffle them and put half in the validation set, and half in the test set (making sure that no duplicates or near-duplicates end up in both sets).
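The shuffle-and-split step could be sketched as follows. The file names are hypothetical placeholders for the representative pictures, and the function name and seed are assumptions. The sketch removes exact duplicates only; catching near-duplicates would need an image-similarity check that is not shown here.

```python
import random

def split_validation_test(representative, seed=42):
    """Shuffle the representative examples and put half in each evaluation set."""
    unique = list(dict.fromkeys(representative))   # drop exact duplicates, keep order
    random.Random(seed).shuffle(unique)
    half = len(unique) // 2
    return unique[:half], unique[half:]

# Hypothetical identifiers standing in for the 10,000 pictures taken with the app.
app_pictures = [f"app_photo_{i:05d}.jpg" for i in range(10_000)]
validation_set, test_set = split_validation_test(app_pictures)
```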

52 of 52