1 of 17

Exercise: 2 Mutual Information

2 of 17

In this exercise you'll identify an initial set of features in the Ames dataset to develop using mutual information scores and interaction plots.

Run this cell to set everything up!

How to choose which features to focus on

3 of 17

The two functions used in the tutorial

4 of 17

Goal: predict house prices

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

5 of 17

MoSold: month sold

YearBuilt: year built

Screen Porch: screened-in porch

1) Understand Mutual Information

Based on the plots, which feature do you think would

have the highest mutual information with SalePrice?

6 of 17

The Ames dataset has seventy-eight features, which is a lot to work with all at once!

Fortunately, you can identify

the features with the most potential.

Use the make_mi_scores function

(introduced in the tutorial)

to compute mutual information scores for the Ames features:
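The slides do not show the body of make_mi_scores, so the following is only a sketch of what such a helper might look like, built on scikit-learn's mutual_info_regression. The encoding choices, the `random_state` default, and the small synthetic DataFrame (a stand-in, not the Ames data) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def make_mi_scores(X, y, random_state=0):
    """Return one mutual information score per column of X, highest first."""
    X = X.copy()
    # Label-encode every non-numeric column so the estimator can use it.
    for colname in X.select_dtypes(exclude="number").columns:
        X[colname], _ = X[colname].factorize()
    # Assumption: integer-typed columns (including the encoded ones) are discrete.
    discrete = np.array(
        [pd.api.types.is_integer_dtype(t) for t in X.dtypes], dtype=bool
    )
    scores = mutual_info_regression(
        X, y, discrete_features=discrete, random_state=random_state
    )
    scores = pd.Series(scores, index=X.columns, name="MI Scores")
    return scores.sort_values(ascending=False)


# Synthetic stand-in data: only GrLivArea actually drives SalePrice here.
rng = np.random.default_rng(0)
n = 500
area = rng.uniform(500, 3000, size=n)
X = pd.DataFrame(
    {
        "GrLivArea": area,
        "MoSold": rng.integers(1, 13, size=n),
        "BldgType": pd.Categorical(rng.choice(["1Fam", "Duplx", "TwnhsE"], size=n)),
    }
)
y = pd.Series(100 * area + rng.normal(0, 10_000, size=n), name="SalePrice")

mi_scores = make_mi_scores(X, y)
print(mi_scores)
```

On the real data the call would simply be `make_mi_scores(X, y)` with the Ames features as `X` and `SalePrice` as `y`; the scores are in nats and are only meaningful relative to one another.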

From the 78 features, we want to pick out the important ones

7 of 17

8 of 17

2) Examine MI Scores

Do the scores seem reasonable?

Do the high scoring features represent things you'd think most people would value in a home?

Do you notice any themes in what they describe?

Some common themes among most of these features are:

  • Location: Neighborhood
  • Size: all of the Area and SF features, and counts like FullBath and GarageCars
  • Quality: all of the Qual features
  • Year: YearBuilt and YearRemodAdd
  • Types: descriptions of features and styles like Foundation and GarageType

These are all the kinds of features you'll commonly see in real-estate listings (like on Zillow). It's good, then, that our mutual information metric scored them highly.

On the other hand, the lowest-ranked features seem to mostly represent things that are rare or exceptional in some way, and so wouldn't be relevant to the average home buyer.

9 of 17

  • Location: Neighborhood
  • Size: all of the Area and SF features, and counts like FullBath and GarageCars
  • Quality: all of the Qual features
  • Year: YearBuilt and YearRemodAdd (remodel date)
  • Types: descriptions of features and styles like Foundation and GarageType (Foundation: the base structure a house is built on)

These are all the kinds of features you'll commonly see in real-estate listings (like on Zillow, a major US real-estate listing site). It's good, then, that our mutual information metric scored them highly.

On the other hand, the lowest-ranked features seem to mostly represent things that are rare or exceptional in some way, and so wouldn't be relevant to the average home buyer.

Note: in the Ames dataset, "SF" in a feature name stands for square feet, so the SF features (such as TotalBsmtSF and FirstFlrSF) measure floor area.

10 of 17

In this step you'll investigate possible interaction effects for the `BldgType` feature. This feature describes the broad structure of the dwelling in five categories:

Bldg Type (Nominal): Type of dwelling

  1Fam    Single-family Detached
  2FmCon  Two-family Conversion; originally built as one-family dwelling
  Duplx   Duplex
  TwnhsE  Townhouse End Unit
  TwnhsI  Townhouse Inside Unit

The `BldgType` feature didn't get a very high MI score.

A plot confirms that the categories in `BldgType` don't do a good job of distinguishing values in `SalePrice`

(the distributions look fairly similar, in other words):
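As a numeric counterpart to the plot, one could compare the quartiles of `SalePrice` within each `BldgType` category. The sketch below uses a synthetic DataFrame (an assumption, not the Ames data) in which every category draws prices from the same distribution, so the quartiles come out close to one another, which is the pattern the plot is described as showing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-in: all categories share one SalePrice distribution,
# so BldgType carries no information about price in this toy data.
df = pd.DataFrame(
    {
        "BldgType": rng.choice(
            ["1Fam", "2FmCon", "Duplx", "TwnhsE", "TwnhsI"], size=n
        ),
        "SalePrice": rng.normal(180_000, 40_000, size=n),
    }
)

# One row per category, one column per quartile.
summary = df.groupby("BldgType")["SalePrice"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q25", "median", "q75"]
print(summary.round(0))
```

If the rows of `summary` look alike, the categories are doing little to separate prices on their own, which is consistent with a low MI score.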

Dwelling

11 of 17

Still, the type of a dwelling seems like

it should be important information.

Investigate whether BldgType produces a significant interaction

with either of the following:

GrLivArea # Above ground living area

MoSold # Month sold

Once the data is split by dwelling type, above-ground living area looks useful for predicting house price.

12 of 17

Month sold does not look useful for predicting house price.

13 of 17

3) Discover Interactions

From the plots,

does BldgType seem to exhibit an interaction effect

with either GrLivArea or MoSold?

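Interaction effect: a combined (synergistic) effect that appears only when two factors act together.

Synergy: when several factors act together and produce an effect larger than the sum of their individual effects.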

The trend lines for GrLivArea within each category of BldgType are clearly very different, indicating an interaction between these features.

Since knowing BldgType tells us more about how GrLivArea relates to SalePrice, we should consider including BldgType in our feature set.

The trend lines for MoSold, however, are almost all the same. This feature hasn't become more informative for knowing BldgType.
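The exercise judges interactions by eye from the plots. One way to put a number on the same idea, offered here only as a sketch, is to compare two least-squares fits: one with a single slope shared by all categories, and one with a separate slope per category. The helper names `r_squared` and `interaction_gain` are hypothetical, and the data is synthetic, constructed so that the price per unit of area depends on the building type while the month sold has no effect.

```python
import numpy as np
import pandas as pd


def r_squared(design, y):
    """R^2 of an ordinary least-squares fit of y on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()


def interaction_gain(df, feature, target="SalePrice", by="BldgType"):
    """Gain in R^2 from letting each category have its own trend-line slope."""
    dummies = pd.get_dummies(df[by]).to_numpy(dtype=float)  # per-category intercepts
    x = df[feature].to_numpy(dtype=float)
    y = df[target].to_numpy(dtype=float)
    shared = np.column_stack([dummies, x])                       # one common slope
    separate = np.column_stack([dummies, dummies * x[:, None]])  # slope per category
    return r_squared(separate, y) - r_squared(shared, y)


# Synthetic stand-in data (not the Ames data).
rng = np.random.default_rng(0)
n = 900
bldg = rng.choice(["1Fam", "Duplx", "TwnhsE"], size=n)
area = rng.uniform(500, 3000, size=n)
month = rng.integers(1, 13, size=n)
price_per_unit = pd.Series(bldg).map(
    {"1Fam": 120.0, "Duplx": 60.0, "TwnhsE": 90.0}
).to_numpy()
price = 50_000 + price_per_unit * area + rng.normal(0, 10_000, size=n)
df = pd.DataFrame(
    {"BldgType": bldg, "GrLivArea": area, "MoSold": month, "SalePrice": price}
)

area_gain = interaction_gain(df, "GrLivArea")
month_gain = interaction_gain(df, "MoSold")
print(f"GrLivArea gain: {area_gain:.4f}, MoSold gain: {month_gain:.4f}")
```

A clearly positive gain corresponds to trend lines that differ across categories; a gain near zero corresponds to trend lines that are almost all the same.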

14 of 17


15 of 17

Let's take a moment to make a list of features we might focus on. In the exercise in Lesson 3, you'll start to build up a more informative feature set through combinations of the original features you identified as having high potential.

You found that the ten features with the highest MI scores were:

OverallQual     0.581262
Neighborhood    0.569813
GrLivArea       0.496909
YearBuilt       0.437939
GarageArea      0.415014
TotalBsmtSF     0.390280
GarageCars      0.381467
FirstFlrSF      0.368825
BsmtQual        0.364779
KitchenQual     0.326194

Neighborhood: the area surrounding the property that forms a cohesive unit of human activity, such as daily life and commerce.

16 of 17

Do you recognize the themes here?

Location, size, and quality.
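To make the themes concrete, the ten scores listed earlier can be grouped by a simple name-based rule. The `theme_of` function below is a hypothetical rule of thumb that follows the themes named in this exercise; it is not part of the exercise itself.

```python
import pandas as pd

# MI scores copied from the top-ten list in this exercise.
mi_scores = pd.Series(
    {
        "OverallQual": 0.581262,
        "Neighborhood": 0.569813,
        "GrLivArea": 0.496909,
        "YearBuilt": 0.437939,
        "GarageArea": 0.415014,
        "TotalBsmtSF": 0.390280,
        "GarageCars": 0.381467,
        "FirstFlrSF": 0.368825,
        "BsmtQual": 0.364779,
        "KitchenQual": 0.326194,
    },
    name="MI Scores",
)


def theme_of(feature):
    """Hypothetical mapping from a feature name to one of the exercise's themes."""
    if feature == "Neighborhood":
        return "Location"
    if feature.endswith("Qual"):
        return "Quality"
    if feature.startswith("Year"):
        return "Year"
    if feature.endswith(("Area", "SF")) or feature in {"FullBath", "GarageCars"}:
        return "Size"
    return "Other"


themes = mi_scores.index.to_series().map(theme_of)
# Number of top-ten features in each theme, and the best score per theme.
summary = mi_scores.groupby(themes).agg(["count", "max"])
print(summary)
```

Under this rule, five of the ten features fall under size, three under quality, and one each under location and year.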

You needn't restrict development to only these top features, but you do now have a good place to start.

Combining these top features with other related features, especially those you've identified as creating interactions, is a good strategy for coming up with a highly informative set of features to train your model on.

17 of 17

Keep Going (on to the next section)

Start creating features 

and learn

what kinds of transformations

different models are most likely to benefit from.