1 What Is Feature Engineering
The most important step in building a machine learning model:
let's learn feature engineering.
[Figure: training data, with feature columns and a target column whose answers are known; (test) data, with the same feature columns but an unknown target we want to predict]
feature engineering
We will learn the following five things:
・determine which features are the most important with mutual
information (which features matter most for predicting the target)
・invent new features in several real-world problem domains
(e.g., body-fat percentage from weight and height)
・encode high-cardinality categoricals with a target encoding
(turn categorical features into numbers)
・create segmentation features with k-means clustering
(build categorical features via clustering)
・decompose a dataset's variation into features with principal
component analysis
(extract a small number of variables, the principal components, that retain most of the original information)
notebook
Jupyter Notebook is an interactive development environment
for writing and running Python code in a web browser.
You write code in files called notebooks,
run it step by step while checking the results,
add explanations in Markdown,
and display graphs,
which makes it very useful for data analysis and for learning programming.
We will practice with the
House Prices Getting Started competition.
The Goal of Feature Engineering
The goal of feature engineering is
simply to make your data
better suited to the problem at hand.
Make the data you have easier to work with!
Consider "apparent temperature" measures
(apparent temperature: the temperature as humans perceive it)
like the heat index and the wind chill.
These quantities attempt to measure
the perceived temperature to humans
based on air temperature, humidity, and wind speed, things which we can measure directly.
You could think of an apparent temperature as the result of
a kind of feature engineering,
an attempt to make the observed data more relevant to
what we actually care about: how it actually feels outside!
You might perform feature engineering to:
improve a model's predictive performance (better predictions)
reduce computational or data needs (less compute and data)
improve interpretability of the results (easier-to-interpret results)
A Guiding Principle of Feature Engineering (a principle to follow)
For a feature to be useful,
it must have a relationship to the target
that your model is able to learn.
Linear models, for instance, are only able to learn linear relationships.
So, when using a linear model,
your goal is
to transform the features to make their relationship
to the target linear. We want features with a linear relationship to the target!
The key idea here is that
a transformation you apply to a feature
becomes in essence a part of the model itself.
Say you were trying to predict
the Price of square plots of land from the Length of one side.
Fitting a linear model directly to Length
gives poor results: the relationship is not linear.
[Figure: price of a square plot of land vs. the length of one side]
If we square the Length feature to get 'Area', however,
we create a linear relationship.
Adding Area to the feature set means
this linear model can now fit a parabola.
Squaring a feature, in other words, gave the linear model
the ability to fit squared features.
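The Length → Area idea above can be sketched with scikit-learn. The data here is synthetic, invented only for illustration (a hypothetical quadratic price for square plots):

```python
# A minimal sketch of the Length -> Area transformation.
# The plot data is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
length = rng.uniform(10, 100, size=200)              # side length of a square plot
price = 5.0 * length**2 + rng.normal(0, 500, 200)    # price grows with area, plus noise

# Linear model on Length alone: cannot capture the quadratic relationship.
X_len = length.reshape(-1, 1)
r2_len = LinearRegression().fit(X_len, price).score(X_len, price)

# Add the squared feature Area = Length**2: the model can now fit a parabola.
X_area = np.column_stack([length, length**2])
r2_area = LinearRegression().fit(X_area, price).score(X_area, price)

print(f"R^2 with Length only:  {r2_len:.3f}")
print(f"R^2 with Length + Area: {r2_area:.3f}")
```

The transformation itself (squaring) carries the nonlinearity, so the linear model only has to learn a straight-line relationship to the new feature.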
This should show you
why there can be such a high return on time
invested in feature engineering.
Whatever relationships your model can't learn,
you can provide yourself through transformations.
As you develop your feature set,
think about what information your model could use
to achieve its best performance.
Big payoff:
add features
the model can
easily use.
Example - Concrete Formulations (concrete ingredient mixes)
To illustrate these ideas we'll see
how adding a few synthetic features to a dataset
can improve the predictive performance
of a random forest model.
The Concrete dataset contains
a variety of concrete formulations and
the resulting product's compressive strength,
which is a measure of how much load that kind of concrete can bear.
The task for this dataset is to predict
a concrete's compressive strength given its formulation.
[Figure: given this formulation (features) → this compressive strength (target)]
You can see here
the various ingredients going into each variety of concrete.
We'll see in a moment
how adding some additional synthetic features derived from these
can help a model to learn important relationships among them.
We'll first establish a baseline
by training the model on the un-augmented dataset.
This will help us determine whether our new features are actually useful.
Establishing baselines like this is good practice
at the start of the feature engineering process.
A baseline score can help you decide
whether your new features are worth keeping, or
whether you should discard them and possibly try something else.
baseline: the reference score
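A baseline might be established along these lines. This is a hedged sketch: a small synthetic DataFrame stands in for the real Concrete dataset, and the ingredient column names are assumptions made for illustration:

```python
# Sketch: baseline score on the un-augmented dataset.
# The DataFrame below is synthetic; column names are assumed stand-ins
# for the Concrete dataset's ingredient columns.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Cement": rng.uniform(100, 500, n),
    "Water": rng.uniform(120, 250, n),
    "CoarseAggregate": rng.uniform(800, 1150, n),
    "FineAggregate": rng.uniform(600, 1000, n),
})
# Fake target: strength rises with cement, falls with water.
df["CompressiveStrength"] = (
    0.1 * df["Cement"] - 0.2 * df["Water"] + rng.normal(0, 2, n)
)

X = df.drop(columns="CompressiveStrength")
y = df["CompressiveStrength"]

# Baseline MAE via cross-validation on the original features only.
model = RandomForestRegressor(n_estimators=50, random_state=0)
baseline_mae = -cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
).mean()
print(f"Baseline MAE: {baseline_mae:.3f}")
```

After adding new features, re-run the same scoring and compare against `baseline_mae` to decide whether to keep them.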
If you ever cook at home, you might know that
the ratio of ingredients in a recipe is
usually a better predictor of how the recipe turns out
than their absolute amounts.
We might reason then that
ratios of the features above
would be a good predictor of CompressiveStrength.
The cell below adds three new ratio features to the dataset.
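That cell is not reproduced in these notes; below is a hedged sketch of what three such ratio features could look like, assuming the Concrete dataset's usual ingredient column names and hypothetical feature names (the tiny DataFrame exists only to make the snippet runnable):

```python
# Sketch: three ratio features, assuming Concrete-style column names.
# The two rows of data are illustrative values, not the real dataset.
import pandas as pd

X = pd.DataFrame({
    "Cement": [540.0, 332.5],
    "Water": [162.0, 228.0],
    "CoarseAggregate": [1040.0, 932.0],
    "FineAggregate": [676.0, 594.0],
})

# Ratios often predict outcomes better than absolute amounts.
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]              # fine / coarse
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
X["WtrCmtRatio"] = X["Water"] / X["Cement"]                           # water / cement

print(X[["FCRatio", "AggCmtRatio", "WtrCmtRatio"]].round(3))
```

The water/cement ratio in particular is a well-known determinant of concrete strength, which is why ratio features are a natural thing to try here.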
Ratios matter:
add three
ratio features →
the score improves!
MAE (Mean Absolute Error) is,
for a regression problem, the average of the absolute differences
between the predicted and true values.
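A minimal check of that definition with scikit-learn, on hand-picked toy values:

```python
# MAE = mean of |y_true - y_pred|, here (0.5 + 0.5 + 0.0 + 1.0) / 4 = 0.5
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.5
```

Lower MAE is better, so "performance improved" below means the MAE went down after adding the ratio features.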
And sure enough, performance improved!
This is evidence that these new ratio features
exposed important information
to the model that it wasn't detecting before.
We've seen that
engineering new features can improve model performance.
But how do you identify features in the dataset
that might be useful to combine?
Next: discover useful features with mutual information.