Overfitting and Its Avoidance
Chapter 5
Overfitting
Parametric modeling
Decision trees
Holdout data
Overfitting Examined
This figure shows how accuracy on the training set and on the holdout set changes as model complexity changes.
Figure 1. A typical fitting graph.
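The fitting graph can be reproduced with a few lines of code. The following is a minimal sketch, assuming scikit-learn; the synthetic dataset and the use of tree depth as the complexity axis are illustrative choices, not the book's.

```python
# Sketch of the two curves in a fitting graph: training accuracy keeps
# rising with model complexity, while holdout accuracy peaks and then falls.
# Dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

scores = {}  # depth -> (training accuracy, holdout accuracy)
for depth in [1, 3, 5, 10, 20]:  # increasing model complexity
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_hold, y_hold))
    print(depth, scores[depth])
```

Plotting both columns against depth yields the characteristic gap between the two curves.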
Overfitting in Tree Induction
Figure 3. A typical fitting graph for tree induction.
A tree allowed to grow until the leaves are pure tends to overfit.
At some point (the sweet spot) the tree starts to overfit: it acquires details of the training set that are not characteristic of the population in general, as represented by the holdout set.
Overfitting in Mathematical Functions
Example: Overfitting Linear Functions
Data: sepal width, petal width
Types: Iris Setosa, Iris Versicolor
Two different separation lines:
a. Logistic regression
b. Support vector machine
Figure 5-4
Figure 5-5
Figure 5-6
Example: Overfitting Non-Linear Functions
Figure 6
Figure 7
Adding the additional feature sepal width² gives a decision boundary that is a parabola.
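The squared-feature idea can be sketched as follows. The feature choice (sepal width, petal width) and classes follow the text; the choice of logistic regression as the linear model is an assumption for illustration.

```python
# Illustrative sketch: appending a squared feature to a linear model makes
# the decision boundary a parabola in the original two-feature space.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2                  # Iris Setosa (0) vs Versicolor (1)
X = iris.data[mask][:, [1, 3]]          # sepal width, petal width
y = iris.target[mask]

# Append sepal width^2 as a third feature.
X_sq = np.column_stack([X, X[:, 0] ** 2])

model = LogisticRegression(max_iter=1000).fit(X_sq, y)
# The linear boundary w0*s + w1*p + w2*s^2 + b = 0 in the augmented space
# is the parabola p = -(w2*s^2 + w0*s + b) / w1 in (sepal, petal) space.
print(model.coef_, model.intercept_)
```

The model is still linear in its (augmented) inputs; the curvature comes entirely from the engineered feature.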
Example: Why is Overfitting Bad?
Table 5-1. A small set of training examples

Instance  x  y  Class
1         p  r  c1
2         p  r  c1
3         p  r  c1
4         q  s  c1
5         p  s  c2
6         q  r  c2
7         q  s  c2
8         q  r  c2
We first obtain the tree shown on the left.
Error rate calculation
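The error rates for Table 5-1 can be checked directly: predict the majority class within each group of rows, and count the misclassified rows. This is a small sketch of that calculation; the grouping helper is ad hoc, not from any library.

```python
# Training-set error rates for Table 5-1: first splitting on x alone,
# then on both x and y (as the deeper tree does).
from collections import Counter

data = [                       # (x, y, class) rows of Table 5-1
    ("p", "r", "c1"), ("p", "r", "c1"), ("p", "r", "c1"), ("q", "s", "c1"),
    ("p", "s", "c2"), ("q", "r", "c2"), ("q", "s", "c2"), ("q", "r", "c2"),
]

def error_rate(rows, key):
    """Group rows by `key`, predict the majority class in each group,
    and return the fraction of rows that are misclassified."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[-1])
    errors = sum(len(labels) - Counter(labels).most_common(1)[0][1]
                 for labels in groups.values())
    return errors / len(rows)

print(error_rate(data, key=lambda r: r[0]))          # split on x: 2/8 = 0.25
print(error_rate(data, key=lambda r: (r[0], r[1])))  # split on x and y: 1/8 = 0.125
```

The deeper split lowers the training error, but as the chapter argues, that does not mean it will generalize better.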
Summary
Overfitting, a humorous take
From Holdout Evaluation to Cross-Validation
Cross-Validation
Holdout evaluation splits the data into only one training set and one holdout set.
Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing (k folds; typically k is 5 or 10).
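The k-fold procedure can be sketched in a few lines, assuming scikit-learn; the synthetic dataset and tree model are illustrative stand-ins for the chapter's churn example.

```python
# Sketch of k-fold cross-validation (k = 5): every example is used for
# testing exactly once, and the k fold scores are averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())  # average accuracy and spread across folds
```

The standard deviation across folds is a useful byproduct: it gives a sense of how stable the estimate is, which a single holdout split cannot.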
The Churn Dataset Revisited
Observations
“Example: Addressing the Churn Problem with Tree Induction” in Chapter 3.
Learning Curves
(for the telecommunications churn problem)
Machine learning learns an association from inputs X to outputs y, for example:
employee-related data → the probability that this employee will leave
customer-related data → the probability that this customer will churn
engine-related data → whether this engine needs maintenance
Avoiding Overfitting with Tree Induction
Tree induction commonly uses two techniques to avoid overfitting:
(i) stop growing the tree before it gets too complex, and
(ii) grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
To stop growing the tree before it gets too complex
To grow the tree until it is too large, then “prune” it back
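Both techniques can be sketched with scikit-learn trees; the parameter values and the mapping of (i) to `min_samples_leaf` and (ii) to cost-complexity pruning (`ccp_alpha`) are illustrative assumptions, not the book's specific method.

```python
# (i) pre-pruning: stop growth early by requiring a minimum leaf size;
# (ii) post-pruning: grow the full tree, then prune with cost-complexity.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

for name, tree in [("full", full), ("stopped", stopped), ("pruned", pruned)]:
    print(name, tree.get_n_leaves())  # both techniques yield smaller trees
```

Either way the resulting tree has fewer leaves, and thus lower complexity, than the fully grown one.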
A General Method for Avoiding Overfitting
Split the data into a training set and a test set (held out); the test set serves as the final holdout set.
Split the training set further into a training subset and a validation set.
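The three-way split can be sketched with two successive splits; the 60/20/20 proportions are an illustrative assumption.

```python
# Carve off the final holdout (test) set first, then split the remainder
# into a training subset and a validation set used for model selection.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Final holdout set: touched only once, at the very end.
X_rest, X_final, y_rest, y_final = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Training subset + validation set for tuning model complexity.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_final))  # 600 200 200
```

The validation set guides complexity choices; the final holdout set is reserved for an unbiased estimate of the chosen model's performance.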
Nested Cross-Validation
The original data are split into a training set and a test set.
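Nested cross-validation can be sketched as a cross-validation loop wrapped around a hyperparameter search; assuming scikit-learn, the inner `GridSearchCV` picks model complexity while the outer loop estimates the performance of that whole selection procedure. The dataset, model, and depth grid are illustrative.

```python
# Inner loop (GridSearchCV) selects hyperparameters on each outer training
# fold; the outer loop scores the selected model on the outer test fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [2, 4, 8]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)  # 5 outer folds
print(outer_scores.mean())
```

Because tuning happens only inside each outer training fold, the outer estimate is not biased by the hyperparameter search.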
Sequential forward selection
Sequential backward selection
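Both selection strategies can be sketched with scikit-learn's `SequentialFeatureSelector`: forward selection starts from no features and greedily adds the one that most improves cross-validated performance; backward selection starts from all features and greedily removes. The dataset, estimator, and target feature count are illustrative assumptions.

```python
# Greedy feature selection in both directions, scored by cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3,
    direction="forward", cv=3).fit(X, y)
backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3,
    direction="backward", cv=3).fit(X, y)

print(forward.get_support())   # boolean mask of the 3 selected features
print(backward.get_support())
```

The two directions can select different subsets, since each is a greedy search rather than an exhaustive one.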