1 of 18

1 What Is Feature Engineering

The most important step in building a machine learning model.

Let's learn feature engineering.

[Diagram: training data has features and a target (answers given); test data has only features, and the goal is to predict its target.]

2 of 18

feature engineering

We will learn the following five techniques:

・determine which features are the most important with mutual information (which features matter most for estimating the target)

・invent new features in several real-world problem domains (e.g., body-fat percentage from weight and height)

・encode high-cardinality categoricals with a target encoding (turn categorical features into numbers)

・create segmentation features with k-means clustering (build segmentation features by clustering)

・decompose a dataset's variation into features with principal component analysis (extract a small number of variables, the principal components, that retain most of the original information)
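As a preview of the first technique, here is a minimal sketch of ranking features by mutual information with scikit-learn. The toy data and column names ("relevant", "noise") are invented for illustration; the course applies this to real datasets.

```python
# Sketch: ranking features by mutual information with the target.
# The toy data below is invented for illustration only.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "relevant": rng.normal(size=500),  # the target depends on this column
    "noise": rng.normal(size=500),     # this column is independent of the target
})
# A nonlinear dependence: mutual information can still detect it.
y = X["relevant"] ** 2 + 0.1 * rng.normal(size=500)

scores = mutual_info_regression(X, y, random_state=0)
scores = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(scores)
```

The "relevant" column should score far higher than "noise", even though its relationship to the target is nonlinear, which is exactly why mutual information is a useful first screen.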

3 of 18

notebook

Jupyter Notebook is an interactive development environment that lets you write and run Python code in a web browser.

You write code in files called notebooks, run it step by step while checking the results, add explanations in Markdown, and display plots.

This makes it very useful for data analysis and for learning programming.

4 of 18

We will practice with the House Prices Getting Started competition.

5 of 18

The Goal of Feature Engineering

The goal of feature engineering is

simply to make your data

better suited to the problem at hand. 

Make the data you have better suited to the problem!

6 of 18

Consider "apparent temperature" measures

Apparent temperature: the temperature as humans actually feel it.

like the heat index and the wind chill.

These quantities attempt to measure

the perceived temperature to humans

based on air temperature, humidity, and wind speed, things which we can measure directly.

You could think of an apparent temperature as the result of

a kind of feature engineering,

an attempt to make the observed data more relevant to

what we actually care about: how it actually feels outside!

7 of 18

You might perform feature engineering to:

・improve a model's predictive performance

・reduce computational or data needs

・improve interpretability of the results

8 of 18

A Guiding Principle of Feature Engineering (a principle to follow)

For a feature to be useful,

it must have a relationship to the target

that your model is able to learn.

Linear models, for instance, are only able to learn linear relationships.

So, when using a linear model,

your goal is

to transform the features to make their relationship

to the target linear. We want features with a linear relationship to the target!

9 of 18

The key idea here is that

a transformation you apply to a feature

becomes in essence a part of the model itself.

Say you were trying to predict

the Price of square plots of land from the Length of one side.

Fitting a linear model directly to Length

gives poor results: the relationship is not linear.

[Plot: Price vs. Length of one side for square plots of land.]

10 of 18

If we square the Length feature to get 'Area', however,

we create a linear relationship.

Adding Area to the feature set means

this linear model can now fit a parabola.

Squaring a feature, in other words, gave the linear model

the ability to fit squared features.
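The Length/Area example above can be sketched in a few lines. The synthetic prices below are invented for illustration (price proportional to area, plus noise); the point is the gap between the two R² scores.

```python
# Sketch: a linear model fits Price ~ Length poorly when price grows with
# area, but adding Area = Length**2 makes the relationship learnable.
# Synthetic data invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
length = rng.uniform(10, 100, size=200)
price = 3.0 * length**2 + rng.normal(scale=50, size=200)  # price ~ area

X_raw = length.reshape(-1, 1)                       # Length only
X_aug = np.column_stack([length, length**2])        # Length + Area

r2_raw = LinearRegression().fit(X_raw, price).score(X_raw, price)
r2_aug = LinearRegression().fit(X_aug, price).score(X_aug, price)
print(f"R^2 with Length only:   {r2_raw:.3f}")
print(f"R^2 with Length + Area: {r2_aug:.3f}")
```

The model with the Area feature fits almost perfectly: the transformation has become, in effect, part of the model.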

11 of 18

This should show you

why there can be such a high return on time

invested in feature engineering.

Whatever relationships your model can't learn,

you can provide yourself through transformations.

As you develop your feature set,

think about what information your model could use

to achieve its best performance.

Big payoff: add features that are easy for the model to use.

12 of 18

Example - Concrete Formulations (concrete mix proportions)

To illustrate these ideas we'll see

how adding a few synthetic features to a dataset

can improve the predictive performance

of a random forest model.

The Concrete dataset contains

a variety of concrete formulations and

the resulting product's compressive strength,

which is a measure of how much load that kind of concrete can bear.

The task for this dataset is to predict

a concrete's compressive strength given its formulation.

With this formulation, you get this compressive strength.

13 of 18

[Table: rows of the Concrete dataset; CompressiveStrength is the target column.]

14 of 18

You can see here

the various ingredients going into each variety of concrete.

We'll see in a moment

how adding some additional synthetic features derived from these

can help a model to learn important relationships among them.

We'll first establish a baseline

by training the model on the un-augmented dataset.

This will help us determine whether our new features are actually useful.

Establishing baselines like this is good practice

at the start of the feature engineering process.

A baseline score can help you decide

whether your new features are worth keeping, or

whether you should discard them and possibly try something else.

Baseline value.
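Establishing a baseline might look like the sketch below. Synthetic stand-in data is used here; with the real Concrete dataset you would load the CSV and set `y = df["CompressiveStrength"]` with the ingredient columns as `X`.

```python
# Sketch of a baseline: cross-validated MAE of a random forest on the raw,
# un-augmented features. The data is a synthetic stand-in for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(300, 5))  # stand-in ingredient amounts
y = X[:, 0] / X[:, 1] + rng.normal(scale=0.05, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0)
baseline_mae = -cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
).mean()
print(f"Baseline MAE: {baseline_mae:.4f}")
```

Any feature you add later is judged against this number: keep it if the score improves, discard it otherwise.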

15 of 18

16 of 18

If you ever cook at home, you might know that

the ratio of ingredients in a recipe is

usually a better predictor of how the recipe turns out

than their absolute amounts.

We might reason then that

ratios of the features above

would be a good predictor of CompressiveStrength.

The cell below adds three new ratio features to the dataset.

Ratios matter!

17 of 18

Three ratio features were added, and the score improved!

MAE (Mean Absolute Error) is the mean of the absolute differences between predicted values and true values in a regression problem.
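A minimal sketch of the ratio-feature idea, again on synthetic stand-in data in which the target depends on an ingredient ratio. With the real Concrete dataset the ratios would be built from columns such as Water, Cement, FineAggregate, and CoarseAggregate.

```python
# Sketch: adding a ratio feature and re-scoring with cross-validated MAE.
# Synthetic stand-in data invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
water = rng.uniform(120, 250, size=400)
cement = rng.uniform(100, 500, size=400)
# Strength driven by the water/cement ratio, not the raw amounts.
y = 80.0 - 60.0 * (water / cement) + rng.normal(scale=1.0, size=400)

X_raw = np.column_stack([water, cement])
X_aug = np.column_stack([water, cement, water / cement])  # ratio feature

def cv_mae(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_absolute_error").mean()

mae_raw, mae_aug = cv_mae(X_raw), cv_mae(X_aug)
print(f"MAE without ratio: {mae_raw:.3f}")
print(f"MAE with ratio:    {mae_aug:.3f}")
```

The model with the ratio feature should score a noticeably lower MAE, mirroring the improvement the course observes on the Concrete dataset.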

18 of 18

And sure enough, performance improved!

This is evidence that these new ratio features

exposed important information

to the model that it wasn't detecting before.


We've seen that

engineering new features can improve model performance.

But how do you identify features in the dataset

that might be useful to combine?

Discover useful features with mutual information