END-TO-END MACHINE LEARNING PROJECT
Introduction
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.
Working with Real Data
• Popular open data repositories:
—UC Irvine Machine Learning Repository
—Kaggle datasets
—Amazon’s AWS datasets
• Meta portals (they list open data repositories):
—http://dataportals.org/
—http://opendatamonitor.eu/
—http://quandl.com/
• Other pages listing many popular open data repositories:
—Wikipedia’s list of Machine Learning datasets
—Quora.com question
—Datasets subreddit
Look at the Big Picture
Frame the Problem
Pipelines
Select a Performance Measure
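For a regression task such as predicting a house value, two commonly used performance measures are the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The sketch below is only illustrative: the function names and the sample values are made up for this example and do not come from the slides.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: squares errors, so large misses weigh more."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: less sensitive to outliers than RMSE."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

# Hypothetical house values and predictions (not real data)
y_true = [200_000, 150_000, 300_000, 250_000]
y_pred = [210_000, 140_000, 300_000, 330_000]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

A single large miss (the last prediction) pulls RMSE well above MAE, which is why MAE may be preferred when the data contains many outliers.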
Check the Assumptions
Get the Data
The code examples and datasets are available at https://github.com/ageron/handson-ml2.
Download the Data
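Automating the download makes it easy to refresh the data or set it up on another machine. The following is a minimal sketch using only the standard library. The function name, the local archive name, and the exact archive URL are assumptions made for illustration; check the repository for the actual path.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed location of the archive inside the repository (verify before use)
HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
               "master/datasets/housing/housing.tgz")

def fetch_and_extract(archive_url, dest_dir):
    """Download a .tgz archive and unpack it into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / "archive.tgz"  # hypothetical local file name
    urllib.request.urlretrieve(archive_url, archive_path)
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dest)
    return dest
```

Calling fetch_and_extract(HOUSING_URL, "datasets/housing") would place the extracted files in that directory, after which the CSV can be loaded with pandas.read_csv.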
Take a Quick Look at the Data Structure
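A first inspection usually relies on a handful of pandas methods: head() for the first rows, info() for column types and non-null counts, value_counts() for categorical columns, and describe() for summary statistics. The sketch below runs them on a tiny made-up DataFrame that stands in for the real housing data. The values, and the ocean_proximity column used as the categorical example, are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up sample standing in for the real housing DataFrame
housing_sample = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, None],   # one missing value
    "total_rooms": [880, 7099, 1467, 1274, 1627],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND"],
})

print(housing_sample.head(3))        # first rows
housing_sample.info()                # dtypes and non-null counts per column
print(housing_sample["ocean_proximity"].value_counts())  # category sizes
print(housing_sample.describe())     # count, mean, std, min, quartiles, max
```

Note that describe() ignores missing values, so the count row reveals which numerical columns have gaps.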
%matplotlib inline # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
Create a Test Set
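The simplest way to create a test set is to pick a random fraction of the rows and set them aside. The sketch below shows one possible implementation with a fixed seed so the split is reproducible. The function name, the seed, and the toy DataFrame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio, seed=42):
    """Randomly split a DataFrame into (train, test) using a fixed seed."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))       # shuffled row positions
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Toy data standing in for the housing DataFrame
toy = pd.DataFrame({"median_income": np.linspace(0.5, 15.0, 100)})
train_set, test_set = split_train_test(toy, 0.2)
print(len(train_set), "train rows,", len(test_set), "test rows")
```

A purely random split can misrepresent important subgroups when the dataset is small, which is what motivates stratified sampling on an income category.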
import numpy as np
import pandas as pd
# Bin median_income into five categories to stratify on
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()
from sklearn.model_selection import StratifiedShuffleSplit
# One stratified 80/20 split that preserves the income_cat proportions
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
>>> strat_test_set["income_cat"].value_counts() / len(strat_test_set)
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
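To see why stratification matters, the category proportions of a stratified test set can be compared with those of a purely random one. The sketch below does this on a toy dataset whose category counts are invented to roughly mimic the proportions above. It uses scikit-learn's train_test_split with its stratify argument as a convenient stand-in for StratifiedShuffleSplit.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented category counts (1000 rows) roughly mimicking the proportions above
counts = {1: 40, 2: 320, 3: 350, 4: 180, 5: 110}
toy = pd.DataFrame(
    {"income_cat": [cat for cat, n in counts.items() for _ in range(n)]}
)

def proportions(frame):
    """Share of rows in each income category, ordered by category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()

_, random_test = train_test_split(toy, test_size=0.2, random_state=42)
_, strat_test = train_test_split(toy, test_size=0.2, random_state=42,
                                 stratify=toy["income_cat"])

comparison = pd.DataFrame({
    "overall": proportions(toy),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column may drift, especially for the smallest categories.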
# income_cat was only needed for stratification; drop it from both sets
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
Discover and Visualize the Data to Gain Insights
Visualizing Geographical Data
housing.plot(kind="scatter", x="longitude", y="latitude")
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
Looking for Correlations
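The standard (Pearson) correlation coefficient ranges from -1 to 1 and only captures linear relationships. The sketch below illustrates this on synthetic data, with invented column names: a noisy linear dependence scores close to 1, while a purely quadratic dependence scores near 0 even though the two variables are clearly related.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)

# Synthetic columns: one linear in x (plus noise), one quadratic in x
toy = pd.DataFrame({
    "x": x,
    "linear": 3 * x + rng.normal(0, 0.1, 500),
    "quadratic": x ** 2,
})

corr = toy.corr()  # pairwise Pearson correlation coefficients
print(corr["x"].sort_values(ascending=False))
```

This is why a scatter matrix is a useful complement: it can reveal nonlinear patterns that a correlation coefficient misses.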
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
Experimenting with Attribute Combinations
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
>>> corr_matrix = housing.corr(numeric_only=True)  # numeric_only needs pandas >= 1.5
>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64
Prepare the Data for Machine Learning Algorithms