
Chapter 4: Fitting a Model to Data


[Slide diagram] Where decision trees fit in the machine-learning landscape: machine learning splits into supervised learning, unsupervised learning, and reinforcement learning; problem types include regression, classification, clustering, and dimensionality reduction; example algorithms include decision trees, linear regression, logistic regression, support vector machines, principal component analysis, and the k-means algorithm.

Review

  • Illustrated machine learning


Object | Elevation (m) | Price per m² | Year | Bathrooms | Bedrooms | … | San Francisco / New York
1      | 50            | 20050        |      |           |          |   | New York
2      | 110           | 8340         |      |           |          |   | San Francisco

We want to build a predictive model that predicts whether a house is in New York or in San Francisco.


Fitting a model to data

  • Predictive modeling involves finding a model of the target variable in terms of other descriptive attributes.


  • In Chapter 3, we constructed a supervised segmentation model by recursively finding informative attributes on ever-more-precise subsets of the set of all instances, or, from the geometric perspective, ever-more-precise sub-regions of the instance space.


  • From the data we produced both the structure of the model (the particular tree model that resulted from the tree induction) and the numeric “parameters” of the model (the probability estimates at the leaf nodes).


  • An alternative method for learning a predictive model from a dataset is to start by specifying the structure of the model, with certain numeric parameters left unspecified.
  • Then the data mining procedure calculates the best parameter values given a particular set of training data.


  • The goal of data mining is to tune the parameters so that the model fits the data as well as possible.
  • This general approach is called parameter learning or parametric modeling.


  • Many data mining procedures fall within this general framework. We will illustrate with some of the most common; all of the methods in this chapter are based on linear models.
  • The crux of the fundamental concept of this chapter is fitting a model to data by finding “optimal” model parameters.

Classification via mathematical functions

Recall the instance-space view of tree models from Chapter 3 (Figure 4-1).

  • It shows the space broken up into regions by horizontal and vertical decision boundaries that partition the instance space into similar regions.
  • A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.

Figure 4-2. The raw data points of Figure 4-1, without decision lines.


  • This is called a linear classifier; it is essentially a weighted sum of the values of the various attributes.
  • Compare the entropy values in Figure 4-3 and Figure 4-1.

Figure 4-3. We can separate the instances almost perfectly (by class) if we are allowed to introduce a boundary that is still a straight line, but is not perpendicular to the axes.

Linear Discriminant Function

  • The general linear discriminant is a weighted sum of the attribute values: f(x) = w0 + w1x1 + w2x2 + ⋯
  • An instance x is classified by which side of the decision boundary f(x) = 0 it falls on (a small numeric sketch follows below).
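To make the notation concrete, here is a minimal sketch (not from the text) of evaluating such a linear discriminant; the weights and the example instance are made-up illustration values.

```python
# Minimal sketch of a linear discriminant f(x) = w0 + w1*x1 + w2*x2.
# The weights and the example instance below are made-up illustration values.
w0, w1, w2 = -5.0, 2.0, -1.0          # intercept and attribute weights
x = (1.5, 0.5)                        # one instance: (x1, x2)

f = w0 + w1 * x[0] + w2 * x[1]        # weighted sum of the attribute values
label = "+" if f > 0 else "-"         # classify by which side of f(x) = 0 we fall on
print(f, label)                       # here f = -2.5, so the instance is classified "-"
```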

  • There are actually many different linear discriminants that can separate the classes perfectly. They have different slopes and intercepts, and each represents a different model of the data.

Which is the “best” line?

Optimizing an objective function

STEP 1: Define an objective function that represents our goal.

STEP 2: The function can be calculated for a particular set of weights and a particular set of data.

STEP 3: Find the optimal values for the weights by maximizing or minimizing the objective function (sketched below).
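As a hedged illustration of these three steps, the sketch below fixes the model structure f(x) = w0 + w1x1 + w2x2, uses the average logistic loss as the objective function, and tunes the weights by plain gradient descent; the training data is synthetic and the learning rate and iteration count are arbitrary choices, not taken from the text.

```python
import numpy as np

# Sketch of "fitting by optimizing an objective function": the model structure
# f(x) = w0 + w1*x1 + w2*x2 is fixed, and gradient descent tunes the weights to
# minimize the objective (here, the average logistic loss). Data is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)      # "true" labels: 0 or 1

Xb = np.hstack([np.ones((len(X), 1)), X])           # prepend a column of 1s for w0
w = np.zeros(3)                                      # unspecified parameters, to be learned

for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))                # predicted class probabilities
    grad = Xb.T @ (p - y) / len(y)                   # gradient of the average logistic loss
    w -= 0.5 * grad                                  # step toward lower loss
print("learned weights:", w)
```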

An example of mining a linear discriminant from data

Data: sepal width, petal width
Types: Iris Setosa, Iris Versicolor

Two different separation lines:

  1. Logistic regression
  2. Support vector machine

The two methods produce different boundaries because they are optimizing different functions (see the sketch below).

Filled dots: Iris Setosa; circles: Iris Versicolor.
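A minimal sketch of this comparison, assuming scikit-learn is available: it fits logistic regression and a linear-kernel SVM to the same two Iris features (Setosa vs. Versicolor) and prints each method's separating line; the two lines generally differ because the objective functions differ.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Sketch (assumes scikit-learn): fit two linear discriminants to the same data
# and compare their boundaries. Setosa = class 0, Versicolor = class 1.
iris = load_iris()
mask = iris.target < 2                       # keep only Setosa and Versicolor
X = iris.data[mask][:, [1, 3]]               # sepal width, petal width
y = iris.target[mask]

for model in (LogisticRegression(), SVC(kernel="linear")):
    model.fit(X, y)
    (w1, w2), w0 = model.coef_[0], model.intercept_[0]
    # Boundary: w0 + w1*sepal_width + w2*petal_width = 0
    print(type(model).__name__, "boundary:",
          f"petal_width = {-w0/w2:.2f} + {-w1/w2:.2f} * sepal_width")
```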

Linear models for scoring and ranking instances

  • In Chapter 3, we explained how a tree model could produce an estimate of class-membership probability.
  • We can do this with linear models as well.


  • In some applications, we don’t need a precise probability estimate. We simply need to rank cases by their likelihood of belonging to one class or the other.
  • For example, for targeted marketing we may have a limited budget for targeting prospective customers.
  • We need a list of customers ranked by their likelihood of responding to our marketing offers.

Linear discriminant functions can give us such a ranking for free

  • See Figure 4-4 in the text.
  • Consider the + instances to be responders and the ∙ instances to be non-responders.
  • A linear discriminant function gives a line that separates the two classes.
  • If a new customer x happens to be on the line, i.e., f(x) = 0, then we are most uncertain about his/her class.
  • But if f(x) is positive and large, then we can be fairly certain that x will most likely be a responder.
  • Thus f(x) essentially gives a ranking (a small sketch follows below).
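A minimal sketch of ranking by f(x); the weights and customer feature values are made-up numbers, and the point is only that sorting by the score orders prospects without ever converting scores to probabilities.

```python
# Sketch: rank prospective customers by the linear score f(x), largest first.
# Weights and customer feature values are made-up illustration numbers.
w0, w = -1.0, [0.8, -0.3]                       # intercept and attribute weights

customers = {                                   # customer -> (x1, x2)
    "Ann":  (3.0, 1.0),
    "Bob":  (1.0, 2.0),
    "Cleo": (2.0, 0.5),
}

def f(x):                                       # distance-like score from the boundary
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

ranked = sorted(customers, key=lambda name: f(customers[name]), reverse=True)
for name in ranked:
    print(name, round(f(customers[name]), 2))   # most likely responders first
```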


Class probability estimation and logistic regression

  • Linear discriminants can be used to give estimates of class probability.
  • The most common procedure for doing so is called logistic regression.
  • Consider the problem of using the basic linear model to estimate the class probability: f(x) = w0 + w1x1 + w2x2 + ⋯


  • As we discussed, an instance being further from the separating boundary ought to lead to a higher probability of being in one class or the other, and the output of the linear function, f(x), gives the distance from the separating boundary.
  • However, this presents a problem: f(x) ranges from -∞ to ∞, whereas a probability should range from zero to one.


  • Probability ranges from zero to one, odds range from zero to ∞, and log-odds range from -∞ to ∞.

Table 4-2. Probabilities, odds, and the corresponding log-odds.

Probability   Odds                 Log-odds
0.5           50:50 or 1             0
0.9           90:10 or 9             2.19
0.999         999:1 or 999           6.9
0.01          1:99 or 0.0101        –4.6
0.001         1:999 or 0.001001     –6.9
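A small sketch of the conversions behind Table 4-2; it also applies the logistic function, which maps an unbounded log-odds value (such as the output f(x) of a linear model) back to a probability between zero and one.

```python
import math

# Sketch: probability <-> odds <-> log-odds, as in Table 4-2, plus the logistic
# function that maps an unbounded score (a log-odds) back to a probability.
for p in (0.5, 0.9, 0.999, 0.01, 0.001):
    odds = p / (1 - p)
    log_odds = math.log(odds)
    back = 1.0 / (1.0 + math.exp(-log_odds))     # logistic function: recovers p
    print(f"p={p:<6} odds={odds:<10.4g} log-odds={log_odds:6.2f} logistic(log-odds)={back:.3f}")
```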

[Figure] p+(x), the probability that a data item represented by feature vector x belongs to class +, plotted against the distance from the decision boundary.

Support vector machine

  • An SVM is a type of linear discriminant.
  • Instead of thinking about separating with a line, first fit the fattest bar between the classes, shown by the parallel dashed lines in the figure.
  • The SVM’s objective function incorporates the idea that a wider bar is better.
  • Once the widest bar is found, the linear discriminant will be the center line through the bar.


  • The distance between the dashed parallel lines is called the margin around the linear discriminant.
  • The objective is to maximize the margin.
  • The margin-maximizing boundary gives the maximal leeway for classifying points that fall close to the boundary (see the sketch below).
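A minimal sketch, assuming scikit-learn: for a linear SVM the width of the bar between the dashed lines equals 2/‖w‖, so maximizing the margin amounts to finding small-norm weights; the toy data below is made up.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch (assumes scikit-learn): for a linear SVM the margin between the dashed
# lines is 2 / ||w||, so maximizing the margin means finding small-norm weights.
# Tiny linearly separable toy data, for illustration only.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)      # very large C ~ hard margin
w = svm.coef_[0]
print("margin width:", 2.0 / np.linalg.norm(w))
print("support vectors:\n", svm.support_vectors_)
```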


  • Sometimes a single line cannot perfectly separate the data into classes, as the following figure shows.
  • The SVM’s objective function will penalize a training point for being on the wrong side of the decision boundary.


  • If the data are not linearly separable, the best fit is some balance between a fat margin and a low total error penalty.
  • The penalty for a misclassified point is proportional to its distance from the margin boundary.


The term “loss” is used as a general term for error penalty. Support vector machines use hinge loss. The hinge loss only becomes positive when an example is on the wrong side of the boundary and beyond the margin. Zero-one loss assigns a loss of zero for a correct decision and one for an incorrect decision (both are sketched below).
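A small sketch of the two loss functions for a single example with label y in {+1, −1} and score f(x); hinge loss is zero once the example is on the correct side and beyond the margin, while zero-one loss only looks at the sign of the decision.

```python
# Sketch: hinge loss vs. zero-one loss for a label y in {+1, -1} and a score f(x).
def hinge_loss(y, f):
    # zero when the example is on the correct side and beyond the margin (y*f >= 1)
    return max(0.0, 1.0 - y * f)

def zero_one_loss(y, f):
    # 0 for a correct decision, 1 for an incorrect one
    return 0.0 if y * f > 0 else 1.0

for f in (2.0, 0.5, 0.0, -1.0):                  # scores for a positive example (y = +1)
    print(f"f={f:+.1f}  hinge={hinge_loss(1, f):.1f}  zero-one={zero_one_loss(1, f):.1f}")
```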

Logistic regression vs. tree induction

  • A classification tree uses decision boundaries that are perpendicular to the instance-space axes, whereas the linear classifier can use decision boundaries of any direction or orientation.

(Compare Figure 4-1 with Figure 4-3.)

  • A classification tree is a “piecewise” classifier that segments the instance space recursively when it has to, using a divide-and-conquer approach. In principle, a classification tree can cut up the instance space arbitrarily finely into very small regions.
  • A linear classifier places a single decision surface through the entire space. It has great freedom in the orientation of the surface, but it is limited to a single division of the space into two segments (see the sketch below).
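A minimal sketch of this contrast, assuming scikit-learn: a shallow decision tree fit to two Iris features produces axis-parallel threshold splits, while logistic regression fit to the same data produces a single oblique linear boundary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Sketch (assumes scikit-learn): the tree carves the space with axis-parallel
# splits, while logistic regression draws one oblique line through it.
iris = load_iris()
X = iris.data[iris.target < 2][:, [1, 3]]          # sepal width, petal width
y = iris.target[iris.target < 2]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["sepal width", "petal width"]))

lr = LogisticRegression().fit(X, y)
(w1, w2), w0 = lr.coef_[0], lr.intercept_[0]
print(f"single linear boundary: {w0:.2f} + {w1:.2f}*sepal_width + {w2:.2f}*petal_width = 0")
```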


  • It is usually not easy to determine in advance which of these characteristics is a better match to a given dataset.
  • When applied to a business problem, there is also a difference in the comprehensibility of the models to stakeholders with different backgrounds.


  • What a logistic regression (LR) model is doing can be quite understandable to people with a strong background in statistics, but difficult to understand for those without one.
  • A decision tree (DT), if it is not too large, may be considerably more understandable to someone without a strong statistics or mathematics background.


  • Why is this important? For many business problems, the data science team does not have the ultimate say in which models are used or implemented.
  • Often there is at least one manager who must “sign off” on the use of a model in practice, and in many cases a whole set of stakeholders needs to be satisfied with the model.

A real example classified by LR and DT

  • Pages 104-107


Figure 4-11. One of the cell images from which the Wisconsin Breast Cancer dataset was derived. (Image courtesy of Nick Street and Bill Wolberg.)

Breast cancer classification

Table 4-3. The attributes of the Wisconsin Breast Cancer dataset.

Attribute name       Description
RADIUS               Mean of distances from center to points on the perimeter
TEXTURE              Standard deviation of grayscale values
PERIMETER            Perimeter of the mass
AREA                 Area of the mass
SMOOTHNESS           Local variation in radius lengths
COMPACTNESS          Computed as: perimeter² / area – 1.0
CONCAVITY            Severity of concave portions of the contour
CONCAVE POINTS       Number of concave portions of the contour
SYMMETRY             A measure of the nuclei’s symmetry
FRACTAL DIMENSION    ‘Coastline approximation’ – 1.0
DIAGNOSIS (Target)   Diagnosis of cell sample: malignant or benign


Table 4-4. Linear equation learned by logistic regression on the Wisconsin Breast Cancer dataset (see text and Table 4-3 for a description of the attributes).

Attribute           Weight (learned parameter)
SMOOTHNESS_worst      22.3
CONCAVE_mean          19.47
CONCAVE_worst         11.68
SYMMETRY_worst         4.99
CONCAVITY_worst        2.86
CONCAVITY_mean         2.34
RADIUS_worst           0.25
TEXTURE_worst          0.13
AREA_SE                0.06
TEXTURE_mean           0.03
TEXTURE_SE            –0.29
COMPACTNESS_mean      –7.1
COMPACTNESS_SE       –27.87
w0 (intercept)       –17.7
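To show how a learned linear equation such as the one in Table 4-4 would be applied, here is a hedged sketch using only a few of the weights from the table; the feature values of the "sample" are made-up numbers for illustration, not a real record from the dataset.

```python
import math

# Sketch: applying a learned linear equation such as Table 4-4.
# Only a few of the learned weights are used here, and the feature values for
# the "sample" below are made-up numbers, not a real record from the dataset.
weights = {"SMOOTHNESS_worst": 22.3, "CONCAVE_mean": 19.47, "COMPACTNESS_mean": -7.1}
w0 = -17.7                                   # intercept from Table 4-4

sample = {"SMOOTHNESS_worst": 0.16, "CONCAVE_mean": 0.09, "COMPACTNESS_mean": 0.13}

f = w0 + sum(weights[name] * sample[name] for name in weights)
p_malignant = 1.0 / (1.0 + math.exp(-f))     # logistic function: score -> probability
print(f"f(x) = {f:.2f}, estimated probability of malignant = {p_malignant:.3f}")
```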

Breast cancer classification by DT

Figure 4-13. Decision tree learned from the Wisconsin Breast Cancer dataset.

Nonlinear functions

Figure 4-12. The Iris dataset with a nonlinear feature. In this figure, logistic regression and support vector machine—both linear models—are provided an additional feature, Sepal width², which allows both the freedom to create more complex, nonlinear models (boundaries), as shown.
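A minimal sketch of the same trick, assuming scikit-learn: adding a Sepal width² column keeps the model linear in its parameters while letting its boundary in the original two-feature space curve.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Sketch (assumes scikit-learn): add a squared feature, as in Figure 4-12.
# The model stays linear in its parameters, but its boundary in the original
# (sepal width, petal width) space is now curved.
iris = load_iris()
X = iris.data[iris.target < 2][:, [1, 3]]          # sepal width, petal width
y = iris.target[iris.target < 2]

X_nonlinear = np.column_stack([X, X[:, 0] ** 2])   # extra column: sepal width squared
model = LogisticRegression().fit(X_nonlinear, y)
print("weights (sepal width, petal width, sepal width^2):", model.coef_[0])
print("intercept:", model.intercept_[0])
```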

Neural networks (NN)

  • Neural networks also implement complex nonlinear numeric functions.
  • We can think of a NN as a stack of models. On the bottom of the stack are the original features.


  • From these features, a variety of relatively simple models are learned.
  • Let’s say these are logistic regressions. Then each subsequent layer in the stack applies a simple model (let’s say, another logistic regression) to the outputs of the next layer down (a small sketch follows below).
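A small sketch of this "stack of models" picture; all of the weights below are made-up numbers rather than learned parameters, and each layer is just a logistic-regression-like unit applied to the outputs of the layer below.

```python
import numpy as np

# Sketch of a neural network as a stack of logistic-regression-like models.
# All weights below are made-up numbers, not learned from data.
def logistic_layer(inputs, weights, bias):
    return 1.0 / (1.0 + np.exp(-(weights @ inputs + bias)))

x = np.array([0.5, 1.5])                       # original features (bottom of the stack)

# First layer: two simple models over the original features.
h = logistic_layer(x, np.array([[1.0, -2.0], [0.5, 0.5]]), np.array([0.0, -1.0]))

# Next layer: another simple model applied to the outputs of the layer below.
output = logistic_layer(h, np.array([[2.0, -1.0]]), np.array([0.5]))
print("hidden outputs:", h, "final output:", output)
```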

Conclusions

  • This chapter introduced a second type of predictive modeling technique, called function fitting or parametric modeling.
  • In this case the model is a partially specified equation: a numeric function of the data attributes, with some unspecified numeric parameters.


  • Linear modeling techniques include linear discriminants such as support vector machines, logistic regression, and traditional linear regression.


  • The task of the data mining procedure is to “fit” the model to the data by finding the best set of parameters, in some sense of “best.”


  • Conceptually, the key difference between these techniques is their answer to a key question: what exactly do we mean by best fitting the data?
  • The goodness of fit is described by an “objective function,” and each technique uses a different function, so the resulting techniques may be quite different.


  • We have now seen two very different sorts of data modeling, tree induction and function fitting, and have compared them.
  • We have also introduced two criteria by which models can be evaluated: the predictive performance of a model and its intelligibility. It is often advantageous to build different sorts of models from a dataset to gain insight.


  • This chapter focused on the fundamental concept of optimizing a model’s fit to the data.
  • However, doing this leads to the most important fundamental problem with data mining: if you look hard enough, you will find structure in a dataset, even if it’s just there by chance.


  • This tendency is known as overfitting. Recognizing and avoiding overfitting is an important general topic in data science.

Overfitting

  • When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit (see the sketch below).
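A minimal sketch of that symptom, assuming scikit-learn: an unrestricted decision tree fit to noisy synthetic data scores near-perfectly on its own training set but noticeably worse on held-out test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Sketch (assumes scikit-learn): a model that memorizes noisy training data
# looks perfect on that data but much worse on held-out test data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
y = np.where(rng.random(400) < 0.25, 1 - y, y)        # flip 25% of labels (noise)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print("train accuracy:", tree.score(X_tr, y_tr))       # close to 1.0
print("test accuracy: ", tree.score(X_te, y_te))       # noticeably lower
```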
