House Price Prediction Modeling
Regression Model Demo
Ren Hwai, 2024 July
Information Links
About Myself
https://github.com/Ren1990/house_price_reg_model
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
https://www.linkedin.com/in/renhwai-kong/
https://public.tableau.com/app/profile/kyloren.kong/viz/Demo_2024InvestmentPortfolio/DBPortfolio
https://renhwaichatbot.streamlit.app/
Hi! This is me, Ren Hwai, chilling in Iceland. Happy family trip during my career break!
After working at a top US semiconductor company for 8 years as a
Senior Technology Development Process Engineer & Smart Manufacturing Analyst (Eng. IV),
I am taking a long career break to sharpen my Python skills in data science and analysis, and to study for the CFA (Chartered Financial Analyst) exam, looking for new industry exposure and work opportunities.
“You can't connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future.” - Steve Jobs
Executive Summary
Introduction
Objective
Model Training
Data Cleaning
EDA
Transformation
Model Screening
Training
Hyperparameter Tuning with Optuna
Feature Selection & Engineering
Conclusion
Iteration
Dataset Overview
Statistic | SalePrice |
count | 1460 |
mean | $180,921 |
std | $79,442 |
min | $34,900 |
25% | $129,975 |
50% | $163,000 |
75% | $214,000 |
max | $755,000 |
Data Exploration
Correlation Score
Numeric Feature | Correlation to SalePrice |
OverallQual | 0.790982 |
GrLivArea | 0.708624 |
GarageCars | 0.640409 |
GarageArea | 0.623431 |
TotalBsmtSF | 0.613581 |
1stFlrSF | 0.605852 |
FullBath | 0.560664 |
TotRmsAbvGrd | 0.533723 |
YearBuilt | 0.522897 |
YearRemodAdd | 0.507101 |
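As a minimal sketch, the ranking above can be reproduced with pandas along these lines (the train.csv path and the use of pandas' default Pearson correlation are assumptions):

import pandas as pd

# Load the Kaggle training data (file path is an assumption).
df = pd.read_csv("train.csv")

# Pearson correlation of every numeric feature against SalePrice,
# sorted strongest first; the top 10 match the table above.
corr = (
    df.select_dtypes(include="number")
      .corr()["SalePrice"]
      .drop("SalePrice")
      .sort_values(ascending=False)
)
print(corr.head(10))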
Correlation Heatmap
Scatter Plot of SalePrice vs OverallQual
Data Exploration
Examples of T-Test Results
Feature | At Least One Pair Rejects H0 | Value 1 | Value 2 | p-value | t-statistic | Degrees of Freedom |
ExterQual | TRUE | TA | Ex | 3.50E-153 | -31.97 | 958 |
KitchenQual | TRUE | TA | Ex | 4.70E-148 | -32.12 | 835 |
BsmtQual | TRUE | TA | Ex | 3.60E-140 | -31.43 | 770 |
GarageFinish | TRUE | Unf | Fin | 1.50E-81 | -21.12 | 957 |
FireplaceQu | TRUE | nan | Gd | 1.40E-78 | -20.43 | 1070 |
Foundation | TRUE | PConc | CBlock | 3.70E-72 | 19.16 | 1281 |
Neighborhood | TRUE | NridgHt | NAmes | 6.50E-69 | 23.16 | 302 |
MasVnrType | TRUE | nan | Stone | 1.20E-57 | -17.09 | 1000 |
GarageType | TRUE | Attchd | Detchd | 8.90E-56 | 16.54 | 1257 |
HeatingQC | TRUE | Ex | TA | 1.10E-51 | 15.9 | 1169 |
SaleType | TRUE | WD | New | 1.40E-44 | -14.52 | 1389 |
SaleCondition | TRUE | Normal | Partial | 1.40E-42 | -14.17 | 1323 |
Distinct SalePrice Difference by ExterQual
*The initial plan was to use an F-test (ANOVA) to study SalePrice distinctions across all categorical features. During coding, automating the F-test proved difficult: scipy's f_oneway requires all subgroups to be passed at once, e.g. f_oneway(subgrp1, subgrp2, subgrp3), and the number of subgroups differs from one categorical feature to another. After hitting this bottleneck, the t-test was used instead, via a customized function that runs ttest_ind on every pair of subgroups, e.g. ttest_ind(subgrp1, subgrp2), then ttest_ind(subgrp1, subgrp3), then ttest_ind(subgrp2, subgrp3), and so on; a sketch follows below.
(Box plot legend: Excellent, Fair, Typical/Average, Good condition)
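A minimal sketch of the customized pairwise t-test routine described above, assuming scipy's ttest_ind and treating NaN as its own category (the function name and the 0.05 threshold are illustrative assumptions):

from itertools import combinations

import pandas as pd
from scipy.stats import ttest_ind

def pairwise_ttests(df, feat, target="SalePrice"):
    # Treat missing values as their own category (e.g. no fireplace -> 'nan').
    groups = df[target].groupby(df[feat].fillna("nan"))
    rows = []
    for (v1, g1), (v2, g2) in combinations(groups, 2):
        res = ttest_ind(g1, g2)  # Student's t-test, equal variances assumed
        rows.append({"feat": feat, "feat_val1": v1, "feat_val2": v2,
                     "p value": res.pvalue, "statistic": res.statistic,
                     "reject h0": res.pvalue < 0.05})
    return pd.DataFrame(rows).sort_values("p value")

# Example: all value pairs of ExterQual, mirroring the table above.
# print(pairwise_ttests(df, "ExterQual"))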
Data transformation is performed before fitting the data for model training (a code sketch follows each set of tables below):
1. Rating categorical features -> numeric values, based on the data description.
2. Binary features -> label encoder.
Before (rating scale from the data description):
Index | ExterQual |
0 | NaN (N/A) |
1 | Po (Poor) |
2 | Fa (Fair) |
3 | TA (Typical/Average) |
4 | Gd (Good) |
5 | Ex (Excellent) |
After (converted to numeric):
Index | ExterQual |
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
Before:
Index | CentralAir |
0 | N |
1 | Y |
2 | Y |
3 | N |
4 | N |
After (label-encoded):
Index | CentralAir |
0 | 0 |
1 | 1 |
2 | 1 |
3 | 0 |
4 | 0 |
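A minimal sketch of steps 1 and 2, assuming the rating scale from the tables above and scikit-learn's LabelEncoder (mapping NaN to 0 as shown):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# 1. Ordinal ratings -> integers, following the order in the data description.
rating_map = {"nan": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].fillna("nan").map(rating_map)

# 2. Binary feature -> 0/1 with a label encoder (classes sorted: N -> 0, Y -> 1).
df["CentralAir"] = LabelEncoder().fit_transform(df["CentralAir"])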
3. Multi-value categorical features -> one-hot encoder; new 0/1 columns are created, one per category (see the sketch after the tables below).
Before:
Index | Foundation |
0 | BrkTil |
1 | CBlock |
2 | PConc |
3 | Slab |
4 | Stone |
5 | Wood |
After (one-hot encoded):
Index | Foundation_BrkTil | Foundation_CBlock | Foundation_PConc | Foundation_Slab | Foundation_Stone | Foundation_Wood |
0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 1 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 1 |
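A minimal sketch of step 3 using pandas' get_dummies, which creates the new Foundation_* columns shown above:

import pandas as pd

# One new 0/1 column per category, e.g. Foundation_BrkTil ... Foundation_Wood.
df = pd.get_dummies(df, columns=["Foundation"], dtype=int)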
4. Numerical features -> robust scaler (see the sketch after the tables below).
GrLivArea (before scaling):
Count | 1460 |
Mean | 1515.46 |
Median | 1464 |
Std | 525.48 |
Min | 334 |
Max | 5642 |
GrLivArea (after RobustScaler):
Count | 1460 |
Mean | 0.0795 |
Median | 0.0000 |
Std | 0.8119 |
Min | -1.7458 |
Max | 6.4550 |
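A minimal sketch of step 4, assuming scikit-learn's RobustScaler; it centers on the median and scales by the interquartile range, which is why the scaled median above is exactly 0:

from sklearn.preprocessing import RobustScaler

# (x - median) / IQR: robust to outliers such as the 5,642 sq ft house.
df[["GrLivArea"]] = RobustScaler().fit_transform(df[["GrLivArea"]])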
Model Selection
Random Train-Test Split and K-fold Cross Validation
1. Split train.csv randomly into 5 folds.
2. For each round K=1 to K=5, hold out fold K as the validation data and train on the remaining 4 folds.
3. Assess the 5-fold model performance results.
(Diagram: 5-fold cross-validation grid, with the validation fold rotating from K=1 to K=5.)
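A minimal sketch of the 5-fold procedure with scikit-learn (the estimator defaults and random_state are assumptions):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]

# Steps 1-3: random 5-fold split, rotate the validation fold, assess RMSE.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
rmse = -cross_val_score(GradientBoostingRegressor(), X, y,
                        scoring="neg_root_mean_squared_error", cv=cv)
print(f"RMSE per fold: {rmse.round(0)}, mean: {rmse.mean():,.0f}")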
Train-Test Data Split
Model Selection Outcome: GBR Model Is Selected
Model | R2 | RMSE |
GradientBoost Reg | 0.8518 | $29,982 |
XGBReg | 0.8503 | $30,120 |
RandomForest Reg | 0.8380 | $31,532 |
LassoLars | 0.7653 | $36,483 |
Ridge | 0.7473 | $38,096 |
Lasso | 0.7284 | $39,127 |
DecisionTree Reg | 0.7128 | $41,853 |
SupportVector Reg | -0.0569 | $80,990 |
Linear Reg | -6721579… | $8156058… |
Lars | -22322870… | $5743938… |
1. Summary of average scores: GradientBoost Regression has the lowest average RMSE in the 5-fold assessment.
2. The box plot of the 5-fold results shows that GBR is among the models without abnormally wide spread/variation across the 5-fold CV.
Select RMSE as the Key Performance Metric
Defining Acceptance Criteria for RMSE
Statistic | SalePrice |
count | 1460 |
mean | $180,921 |
std | $79,442 |
min | $34,900 |
25% | $129,975 |
50% | $163,000 |
75% | $214,000 |
max | $755,000 |
New Age-Related Features Are Created
Age-related features are estimated from the year and month features in the dataset (e.g. YearBuilt, YearRemodAdd, YrSold, MoSold); a sketch follows below.
Example: ‘SalePrice’ vs. the new estimated house-age feature (in months)
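A hedged sketch of such features, assuming the dataset's YearBuilt, YearRemodAdd, YrSold and MoSold columns (the derived column names are illustrative):

# Age at sale in months; YearBuilt/YearRemodAdd carry no month,
# so January is assumed for them.
sale_month = df["YrSold"] * 12 + df["MoSold"]
df["HouseAgeMonths"] = sale_month - df["YearBuilt"] * 12
df["RemodAgeMonths"] = sale_month - df["YearRemodAdd"] * 12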
Assess Model Performance Using Train and Test Data
Train Data RMSE: $26,967
Test Data RMSE: $31,027
Create Base GBR Model (‘Model1’) with Optuna
Before Optimization RMSE: $31,027
After Optimization RMSE: $26,585
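A minimal Optuna sketch of how ‘Model1’ could be tuned (the search space below is an illustrative subset of the final_hp hyperparameters shown later; the 80/20 split and trial count are assumptions):

import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

def objective(trial):
    # Illustrative subset of the hyperparameters tuned for Model1.
    params = {
        "loss": trial.suggest_categorical("loss", ["squared_error", "huber"]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 50),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = GradientBoostingRegressor(**params, random_state=42)
    return -cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error").mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

Optuna's optuna.visualization.plot_param_importances(study) produces the kind of hyperparameter-importance chart discussed on the next slide.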
GBR Hyperparameter Importance of ‘Model1’
Model1 Hyperparameter Importance (chart)
Feature Importance of ‘Model1’
1. Top 20 important features (chart)
2. More than half of the features have zero importance
Optimized GBR Model
final_hp={'alpha': 0.9255961785026856,
'ccp_alpha': 91.41840526743391,
'criterion': 'squared_error',
'learning_rate': 0.08954087410410139,
'loss': 'huber',
'max_depth': 3,
'max_leaf_nodes': 13,
'min_impurity_decrease': 696.0681036379524,
'min_samples_leaf': 35,
'min_samples_split': 0.6231724861087317,
'min_weight_fraction_leaf': 0.01911842431225484,
'subsample': 0.8744747724971967,
'tol': 93.81456919882424,
'validation_fraction': 0.24390880714068283}
1. Top 20 features in the final model (chart)
2. The final model achieves an RMSE of $25,365 with the optimized hyperparameters
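As a sketch, the final model can be rebuilt directly from the final_hp dictionary above (random_state and the earlier train/test split are assumptions):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Rebuild the tuned model and evaluate on the held-out test split.
final_model = GradientBoostingRegressor(**final_hp, random_state=42)
final_model.fit(X_train, y_train)
rmse = mean_squared_error(y_test, final_model.predict(X_test)) ** 0.5
# final_model.feature_importances_ backs the top-20 feature chart.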
Model RMSE Is Below the Borderline of Acceptance
Model Enhancement Plan
A typical model enhancement would revisit the model training workflow and deep-dive into each step (data cleaning, transformation, feature engineering, model tuning). Those options would take longer to develop, so based on experience the following enhancements are planned:
Enhancement 1: Time-Based Split Train-Test
Enhancement 2: XGB Reg Model
Combining 1&2: Time-Based Split XGB Reg
Conclusion
Time-Based Data Split vs Random Data Split
Random Data Split: split train.csv randomly into 5 folds and rotate the validation fold through K=1 to K=5, as in the earlier cross-validation diagram.
Time-Based Data Split: split train.csv into 5 time periods along the time axis and cross-validate the model with each time period's (T) data; per the diagram, roughly 80% of the data up to each period is used for training and the following 20% for validation, with later data not used.
(Diagram: random 5-fold split vs. time-based split, with ‘Train Data’ and ‘Validation Data’ shading as in the earlier K-fold figure.)
Time-Based Data Split can be thought of as a backtesting model that retrains with rolling time-period data: deploy the model at T=1; when T=2 arrives, retrain the model with the new period's data and deploy again, and so on.
Time-Based Data Split might prioritize model stability (i.e. consistent performance across different time periods).
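A hedged sketch of a time-based split using scikit-learn's TimeSeriesSplit as one standard implementation (the deck's exact 80/20-per-period scheme may differ; the sorting keys are assumptions):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Order sales chronologically so training folds always precede validation.
df_t = df.sort_values(["YrSold", "MoSold"])
X_t, y_t = df_t.drop(columns=["SalePrice"]), df_t["SalePrice"]

# Expanding-window CV: fold T trains on periods 1..T, validates on T+1.
tscv = TimeSeriesSplit(n_splits=5)
rmse = -cross_val_score(GradientBoostingRegressor(), X_t, y_t,
                        scoring="neg_root_mean_squared_error", cv=tscv)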
(Charts: SalePrice distribution under Random Split vs. Time-Based Split)
Model | Random Split R2 | Random Split RMSE | Time-Based Split R2 | Time-Based Split RMSE |
GradientBoost Reg | 0.8518 | $29,982 | 0.7776 | $33,242 |
XGBReg | 0.8503 | $30,120 | 0.7211 | $37,899 |
RandomForest Reg | 0.8380 | $31,532 | 0.7992 | $33,376 |
LassoLars | 0.7653 | $36,483 | 0.6364 | $41,263 |
Ridge | 0.7473 | $38,096 | 0.6099 | $44,482 |
Lasso | 0.7284 | $39,127 | 0.4302 | $50,574 |
DecisionTree Reg | 0.7128 | $41,853 | 0.5649 | $55,748 |
SupportVector Reg | -0.0569 | $80,990 | -0.0495 | $83,509 |
Linear Reg | -6721579… | $8156058… | -54994157… | $314697… |
Lars | -22322870… | $5743938… | -86121627… | $104067… |
A distinct distribution difference appears at high sale prices.
Time-Based Split GBR Model Meets the RMSE Criteria
Random Split XGB Model Meets the RMSE Criteria
Time-Based Split XGB Model Also Meets the RMSE Criteria
Conclusion
Model | Random: Feature # | Random: Train RMSE | Random: Train Adj. R2 | Random: Test RMSE | Random: Test Adj. R2 | Time-Based: Feature # | Time-Based: Train RMSE | Time-Based: Train Adj. R2 | Time-Based: Test RMSE | Time-Based: Test Adj. R2 |
GBR | 67 | $21,012 | 0.9255 | $25,896 | 0.8637 | 53 | $22,671 | 0.8913 | $24,142 | 0.9050 |
XGBR | 53 | $8,493 | 0.9880 | $22,456 | 0.9036 | 44 | $22,536 | 0.9179 | $22,619 | 0.8957 |