Chapter 4: Fitting a model to data
Decision tree
Machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Regression problems
Classification problems
Clustering problems
Dimensionality reduction
Decision trees
Linear regression
Logistic regression
Support vector machines
Principal component analysis
k-means algorithm
Review
Item | Elevation (m) | Price per m² | Year | Bathrooms | Bedrooms | ……… | San Francisco / New York
1 | 50 | 20050 | | | | | New York
2 | 110 | 8340 | | | | | San Francisco
We want to build a predictive model that determines whether a house is in New York or in San Francisco.
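A minimal sketch of what such a model could look like, assuming scikit-learn is available; the data rows below are hypothetical and only echo the table above, not real listings.

```python
# Sketch: fit a small decision tree to a few hypothetical rows like those above.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [elevation in meters, price per square meter]
X = [[50, 20050], [110, 8340], [4, 15000], [80, 7000]]
y = ["New York", "San Francisco", "New York", "San Francisco"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["elevation_m", "price_per_m2"]))
print(tree.predict([[60, 12000]]))  # predicted city for a new listing
```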
Fitting a model to data
From the data we generate both the model's structure (the tree produced by tree induction) and the model's numerical "parameters" (the probability estimates at the leaf nodes).
Classification via mathematical function
Recall the instance-space view of tree models from Chapter 3: the tree partitions the instance space into regions of similar instances, using horizontal and vertical decision boundaries.
Figure 4-1
Figure 4-2. The raw data points of Figure 4-1, without decision lines.
We can separate the instances almost perfectly (by class) if we are allowed to introduce a boundary that is still a straight line, but is not perpendicular to the axes.
Figure 4-3
Linear Discriminant Function
Which is the “best” line?
Optimizing an objective function
Step 1: Define an objective function that represents our goal.
Step 2: The function can be calculated for a particular set of weights and a particular set of data.
Step 3: Find the optimal value for the weights by maximizing or minimizing the objective function.
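A minimal sketch of these three steps, assuming NumPy and SciPy are available; the toy data, the choice of a logistic-loss objective, and all variable names are illustrative, not the book's exact procedure.

```python
# Step 1/2: define an objective over the weights and evaluate it on data;
# Step 3: optimize the weights numerically.
import numpy as np
from scipy.optimize import minimize

# Toy data: two numeric features and a +1/-1 class label (hypothetical values).
X = np.array([[2.0, 4.5], [3.0, 5.0], [6.0, 2.0], [7.0, 1.5]])
y = np.array([1, 1, -1, -1])

def objective(w):
    """Logistic loss of the linear discriminant f(x) = w0 + w1*x1 + w2*x2."""
    f = w[0] + X @ w[1:]
    return np.mean(np.log(1 + np.exp(-y * f)))

result = minimize(objective, x0=np.zeros(3))  # find the loss-minimizing weights
print(result.x)                               # learned weights [w0, w1, w2]
```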
An example of mining a linear discriminant from data
Data: sepal width and petal width. Classes: Iris Setosa and Iris Versicolor.
Two different separation lines are shown (one labeled "support vector machine"); the two methods produce different boundaries because they are optimizing different functions.
Filled dots: Iris Setosa; circles: Iris Versicolor.
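A minimal sketch of this comparison, assuming scikit-learn and its bundled Iris data; the models use default settings, so the learned lines illustrate the idea rather than reproduce the figure exactly.

```python
# Fit two linear discriminants to the same two-feature Iris data and compare them.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2                     # keep Setosa (0) and Versicolor (1)
X = iris.data[mask][:, [1, 3]]             # sepal width, petal width
y = iris.target[mask]

for model in (LogisticRegression(), LinearSVC()):
    model.fit(X, y)
    # Each model defines a different line w1*x1 + w2*x2 + b = 0,
    # because each optimizes a different objective function.
    print(type(model).__name__, "weights:", model.coef_[0], "intercept:", model.intercept_[0])
```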
Linear models for scoring and ranking instances
Linear discriminant functions can give us such a ranking for free: the value of f(x) itself serves as the score, ranking instances by how far they fall on the positive side of the decision boundary.
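A small continuation of the Iris sketch above (it assumes that code has already run, so `model`, `X`, and `np` are in scope): the fitted model's raw output doubles as a ranking score.

```python
scores = model.decision_function(X)   # f(x): signed score for each instance
ranking = np.argsort(scores)[::-1]    # instances ordered from most to least "+"-like
print(ranking[:5])
```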
Class probability estimation and logistic regression
Table 4-2. Probabilities, odds, and the corresponding log-odds.
Probability | Odds | Log-odds
0.5 | 50:50, or 1 | 0
0.9 | 90:10, or 9 | 2.19
0.999 | 999:1, or 999 | 6.9
0.01 | 1:99, or 0.0101 | –4.6
0.001 | 1:999, or 0.001001 | –6.9
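A few lines of Python (illustrative only) reproduce these rows directly from the definitions odds = p / (1 − p) and log-odds = ln(odds).

```python
import math

for p in (0.5, 0.9, 0.999, 0.01, 0.001):
    odds = p / (1 - p)                 # e.g. 0.9 -> 9 (i.e. 90:10)
    print(f"p={p}: odds={odds:.6g}, log-odds={math.log(odds):.3g}")
```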
p+(x): the probability that a data item represented by features x belongs to class +.
(Figure: p+(x) plotted against distance from the decision boundary.)
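For reference, the standard logistic-regression relationship behind this plot links the linear function f(x) to the class probability through the log-odds:

$$
\log\frac{p_+(\mathbf{x})}{1-p_+(\mathbf{x})} \;=\; f(\mathbf{x}) \;=\; w_0 + w_1 x_1 + \dots + w_n x_n,
\qquad
p_+(\mathbf{x}) \;=\; \frac{1}{1+e^{-f(\mathbf{x})}}
$$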
Support vector machine
The SVM fits a separating band (the margin) between the classes; the SVM's linear discriminant is the center line through that band.
The margin-maximizing boundary gives the maximal leeway for classifying points that fall close to the boundary.
The SVM's objective function penalizes training points that fall on the wrong side of the decision boundary.
The penalty for such a point is proportional to its distance from the boundary.
The term “loss” is used as a general term for error penalty. Support vector machines use hinge loss: the hinge loss only becomes positive when an example is on the wrong side of the boundary and beyond the margin. Zero-one loss assigns a loss of zero for a correct decision and one for an incorrect decision.
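A minimal sketch of the two loss functions (illustration only), where y is the true class encoded as +1/−1 and f is the model's score f(x).

```python
def hinge_loss(y, f):
    # Positive only once the example falls inside the margin or on the wrong
    # side of the boundary (y * f < 1); zero otherwise.
    return max(0.0, 1.0 - y * f)

def zero_one_loss(y, f):
    # 0 for a correct decision, 1 for an incorrect one.
    return 0.0 if y * f > 0 else 1.0

for f in (2.0, 0.5, -0.5, -2.0):   # scores for a true positive example (y = +1)
    print(f, hinge_loss(1, f), zero_one_loss(1, f))
```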
Logistic regression (LR) vs. Tree induction
Figure 4-1
Figure 4-3
A real example classified by LR and DT
Figure 4-11. One of the cell images from which the Wisconsin Breast Cancer dataset was derived. (Image courtesy of Nick Street and Bill Wolberg.)
Breast cancer classification
Table 4-3. The attributes of the Wisconsin Breast Cancer dataset.
Attribute name | Description
RADIUS | Mean of distances from center to points on the perimeter
TEXTURE | Standard deviation of grayscale values
PERIMETER | Perimeter of the mass
AREA | Area of the mass
SMOOTHNESS | Local variation in radius lengths
COMPACTNESS | Computed as: perimeter² / area – 1.0
CONCAVITY | Severity of concave portions of the contour
CONCAVE POINTS | Number of concave portions of the contour
SYMMETRY | A measure of the nuclei's symmetry
FRACTAL DIMENSION | ‘Coastline approximation’ – 1.0
DIAGNOSIS (Target) | Diagnosis of cell sample: malignant or benign
Breast cancer classification
Table 4-4. Linear equation learned by logistic regression on the Wisconsin Breast Cancer dataset (see text and Table 4-3 for a description of the attributes).
Attribute | Weight (learned parameter)
SMOOTHNESS_worst | 22.3
CONCAVE_mean | 19.47
CONCAVE_worst | 11.68
SYMMETRY_worst | 4.99
CONCAVITY_worst | 2.86
CONCAVITY_mean | 2.34
RADIUS_worst | 0.25
TEXTURE_worst | 0.13
AREA_SE | 0.06
TEXTURE_mean | 0.03
TEXTURE_SE | –0.29
COMPACTNESS_mean | –7.1
COMPACTNESS_SE | –27.87
w0 (intercept) | –17.7
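A minimal sketch of this kind of fit, using the copy of the Wisconsin Breast Cancer data bundled with scikit-learn; the feature set and preprocessing differ from the book's, so the learned weights will not match Table 4-4.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()                      # 30 features, benign/malignant target
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# Inspect the learned parameters: one weight per attribute plus an intercept (w0).
lr = model.named_steps["logisticregression"]
for name, w in sorted(zip(data.feature_names, lr.coef_[0]), key=lambda t: -t[1]):
    print(f"{name:25s} {w:+.2f}")
print(f"{'intercept (w0)':25s} {lr.intercept_[0]:+.2f}")
```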
Breast cancer classification by DT
Figure 4-13. Decision tree learned from the Wisconsin Breast Cancer dataset.
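A companion sketch under the same assumptions as above (scikit-learn's copy of the data), fitting a small decision tree and printing its structure; it will not match Figure 4-13 exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# The printed rules are the tree's structure; the class distribution at each leaf
# supplies the numerical "parameters" (probability estimates) discussed earlier.
print(export_text(tree, feature_names=list(data.feature_names)))
```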
Nonlinear functions
Figure 4-12. The Iris dataset with a nonlinear feature. In this figure, logistic regression and support vector machine (both linear models) are provided an additional feature, Sepal width², which allows both the freedom to create more complex, nonlinear models (boundaries), as shown.
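A minimal sketch of the same idea, assuming scikit-learn's Iris data: adding the squared sepal-width feature gives both linear models a curved boundary in the original two-feature space.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2
X = iris.data[mask][:, [1, 3]]                  # sepal width, petal width
X_nl = np.column_stack([X, X[:, 0] ** 2])       # add the nonlinear feature: sepal width squared
y = iris.target[mask]

for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_nl, y)                          # still a linear model, now in 3 features
    print(type(model).__name__, model.coef_[0], model.intercept_)
```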
Neural networks (NN)
Conclusion
Overfitting