Introduction
In this exercise, you'll apply target encoding to features in the Ames dataset.
First you'll need to choose which features
you want to apply a target encoding to.
Categorical features with a large number of categories are often good candidates.
Run this cell to see how many categories each categorical feature in the Ames dataset has.
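A minimal sketch of what such a cell might look like, assuming the Ames data has already been loaded into a DataFrame named df (the file path and variable name are illustrative):

```python
import pandas as pd

df = pd.read_csv("ames.csv")  # assumed path to the Ames data

# Number of unique categories in each categorical (object-dtype) feature
df.select_dtypes(["object"]).nunique().sort_values(ascending=False)
```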
We talked about how the M-estimate encoding uses smoothing
to improve estimates for rare categories.
To see how many times a category occurs in the dataset,
you can use the value_counts method.
This cell shows the counts for SaleType,
but you might want to consider others as well.
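A hedged example of such a cell, reusing the DataFrame df assumed above:

```python
# How often each SaleType category occurs; the rare categories benefit most from smoothing
df["SaleType"].value_counts()

# Other candidates are worth a look too, for example:
# df["Neighborhood"].value_counts()
```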
A warranty deed is a document in a real-estate transaction by which the seller (the grantor) guarantees to the buyer (the grantee) that the title to the property is free of defects.
In e-commerce and logistics, COD usually stands for "Cash on Delivery"; in the Ames data dictionary, however, it refers to "Court Officer Deed/Estate".
CWD means "Warranty Deed - Cash"
1) Choose Features for Encoding
Which features did you identify for target encoding?
The Neighborhood feature looks promising. It has the most categories of any feature, and several categories are rare.
Others that could be worth considering are SaleType, MSSubClass, Exterior1st, and Exterior2nd. In fact, almost any of the nominal (unordered categorical) features would be worth trying because of the prevalence of rare categories.
Now you'll apply a target encoding to your choice of feature.
As we discussed in the tutorial, to avoid overfitting,
we need to fit the encoder on data held out from the training set. Run this cell to create the encoding and training splits:
That is, the data is split into two parts: one to fit the encoder and one to train the model.
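A sketch of that split, assuming the target column is SalePrice; the 20% fraction and the names X_encode and X_pretrain are only illustrative:

```python
# Hold out ~20% of the rows for fitting the encoder
X_encode = df.sample(frac=0.20, random_state=0)
y_encode = X_encode.pop("SalePrice")

# Train the model on the remaining rows
X_pretrain = df.drop(X_encode.index)
y_train = X_pretrain.pop("SalePrice")
```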
2) Apply M-Estimate Encoding
Apply a target encoding to your choice of categorical features.
Also choose a value for the smoothing parameter m
(any value is okay for a correct answer).
Fit the encoder on the data and create the new target-encoded feature columns.
encoder.transform converts new data based on what was learned with the fit method; in particular, it converts categorical variables into a numeric form that machine-learning models can handle.
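Putting those two steps together, a sketch using MEstimateEncoder from the category_encoders package and the split sketched above; the encoded feature and the value of m here are just examples:

```python
from category_encoders import MEstimateEncoder

# Choose the features to encode and a smoothing parameter m
encoder = MEstimateEncoder(cols=["Neighborhood"], m=5.0)

# Fit the encoder on the hold-out (encoding) split...
encoder.fit(X_encode, y_encode)

# ...then transform the training split into numeric, target-encoded columns
X_train = encoder.transform(X_pretrain)
```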
The histogram of the feature created by this target encoding looks related to SalePrice, but model performance did not change.
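One way to make that visual comparison, assuming seaborn is available and X_train / y_train come from the cells sketched above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Overlay the encoded feature's distribution on the SalePrice distribution
ax = sns.kdeplot(y_train, label="SalePrice")
sns.kdeplot(X_train["Neighborhood"], color="r", ax=ax, label="Neighborhood (encoded)")
ax.legend()
plt.show()
```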
Do you think that target encoding was worthwhile in this case? Depending on which feature or features you chose,
you may have ended up with a score
significantly worse than the baseline.
In that case, it's likely the extra information gained
by the encoding couldn't make up for the loss of data
used for the encoding.
In this question, you'll explore the problem of overfitting with target encodings.
This will illustrate the importance of fitting target encoders
on data held out from the training set.
So let's see what happens when we fit the encoder and the model on the same dataset.
To emphasize how dramatic the overfitting can be,
we'll mean-encode a feature that should have no relationship with SalePrice,
a row count: 0, 1, 2, 3, 4, 5, ...
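A sketch of that experiment, assuming the full Ames data is in df; the encoder settings, model, and scoring metric are illustrative:

```python
import numpy as np
from category_encoders import MEstimateEncoder
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X = df.copy()
y = X.pop("SalePrice")

# An uninformative feature: just the row number
X["Count"] = range(len(X))

# The mistake: fit the encoder on the SAME data the model will be trained on
encoder = MEstimateEncoder(cols=["Count"], m=0)
X_encoded = encoder.fit_transform(X[["Count"]], y)

# Score against a log-transformed target (an RMSLE-style metric);
# the error comes out suspiciously close to zero
model = XGBRegressor()
mse = -cross_val_score(model, X_encoded, np.log(y), cv=5, scoring="neg_mean_squared_error")
print("RMSLE:", np.sqrt(mse.mean()))
```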
3) Overfitting with Target Encoders
Based on your understanding of how mean-encoding works,
can you explain how XGBoost was able to get an almost perfect fit
after mean-encoding the count feature?
Since Count never has any duplicate values,
the mean-encoded Count is essentially an exact copy of the target.
In other words, mean-encoding turned a completely meaningless feature into a perfect feature.
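A tiny illustration of that point: with no smoothing, the encoding for a category is the mean target over the rows in that category, and when every Count value is unique that mean is just the row's own SalePrice. (The toy numbers below are made up.)

```python
import pandas as pd

toy = pd.DataFrame({
    "Count": [0, 1, 2, 3],
    "SalePrice": [100, 250, 175, 300],
})

# Mean target per category -- with all-unique categories this reproduces the target exactly
print(toy.groupby("Count")["SalePrice"].transform("mean"))
```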
Now, the only reason this worked is because we trained XGBoost on the same set
we used to train the encoder.
If we had used a hold-out set instead, none of this "fake" encoding would have transferred to the training data.
The lesson is that when using a target encoder it's very important to use separate data sets for training the encoder and training the model.
Otherwise the results can be very disappointing!