1 of 33

After Repair Property Value Prediction Tool

(Patent Pending)

ARV Holdings LLC

2 of 33

Background and Theory

3 of 33

Real Estate Industry Background

  • Technology is rapidly revolutionizing every aspect of the real estate industry:
    • Property database access
    • Remotely signed contracts
    • Online market analyses
    • Aerial drone photos

  • One thing still eludes automation:

Accurate pricing estimations

4 of 33

An Industry-Wide Problem: Zestimate® Model Accuracy is Insufficient for Investors

Zillow® Errors Plot

  • 18% of Zillow® Zestimates® have an error greater than 20%
    • Profit margins of property renovators are typically only 10% to 12%, so an error of that size can erase the entire margin.

  • Why so hard?
    1. Property Location: (solvable)
    2. Property Features: (solvable)
    3. Property Condition: (insufficient data)

Median Error:

7.3% (± $20,200)

5 of 33

Solution: Automation of After Repair Valuation (ARV)

What is ARV?

“After-Repair Value” (ARV) is a pricing metric: the value of a property once it has been fully renovated.

Why ARV?

  • All professionally renovated properties have an identical condition (brand new).
    1. Property Location: (solvable)
    2. Property Features: (solvable)
    3. Property Condition: (insufficient data)

Hypothesis:

An ARV model trained only on renovated properties will have lower errors than a comparable price prediction model trained on all sold properties.

6 of 33

Implementation

7 of 33

Data: Red Cedar Real Estate®

  • What: A small local Real Estate brokerage firm in Maryland
  • Data Access: 6 years (2013 – 2019) of Maryland sold property data:
    • 800 fields
    • 330,000 rows

8 of 33

ARV Estimator Process

SQL call to remotely import Data

Process Property Data

Derive Renovation Term weights using TfidfVectorizer

Predict renovation status with classification model

Geo-coding Lat/Lon to Census Tract Script

Variables normalized by Tract

Predict ARV with regression model

Visualize ARVs on a Web Tool

Identifying Renos Vs. non-Renos

13% determined
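
The "variables normalized by Tract" step could be sketched as follows. The slides do not say how the normalization was done, so dividing each value by its census tract's median is an assumption, and the field names and rows are hypothetical.

```python
import statistics
from collections import defaultdict


def normalize_by_tract(rows, field):
    """Divide `field` by the median of that field within each census tract."""
    values_by_tract = defaultdict(list)
    for row in rows:
        values_by_tract[row["tract"]].append(row[field])
    medians = {tract: statistics.median(vals)
               for tract, vals in values_by_tract.items()}
    return [row[field] / medians[row["tract"]] for row in rows]


# Hypothetical rows; the tract IDs and square footages are made up.
rows = [
    {"tract": "A", "sqft": 1000},
    {"tract": "A", "sqft": 1500},
    {"tract": "A", "sqft": 2000},
    {"tract": "B", "sqft": 3000},
    {"tract": "B", "sqft": 5000},
]
print(normalize_by_tract(rows, "sqft"))
```

A value of 1.0 means the property matches its tract's median, which lets the regression compare properties across tracts with very different price levels.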

9 of 33

Deriving Ground Truth for Renovation Status

Identifying Renovations

Identifying Non Renovations

  • While there is no “Renovated” field, there is a condition field:
    • 626 unique tags
    • Usually left blank
  • Only the “Renov/Remod” tag consistently refers to properties that have been newly renovated.
  • The 2% of properties with this tag became our ground truth for Renovation = 1.

  • Similarly, there was a set of less flattering tags that typically denoted a property in poor condition:
    • “Major Rehab Needed”
    • “Needs Work”
    • “As-Is Condition”
  • The 11% of properties with these tags became our ground truth for Renovation = 0.
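
A minimal sketch of this labeling rule, assuming the condition field is stored as a comma-separated string of tags (the storage format is not stated in the slides). The tag names come from the slide; "Shows Well" is a hypothetical tag used only to show an unlabeled row.

```python
from typing import Optional

# Tag names are taken from the slides; the comma-separated storage
# format is an assumption made for illustration.
RENOVATED_TAGS = {"Renov/Remod"}
POOR_CONDITION_TAGS = {"Major Rehab Needed", "Needs Work", "As-Is Condition"}


def renovation_label(condition_field: Optional[str]) -> Optional[int]:
    """Return 1 (renovated), 0 (not renovated), or None (no ground truth)."""
    if not condition_field:
        return None  # the field is usually left blank
    tags = {tag.strip() for tag in condition_field.split(",")}
    has_reno = bool(tags & RENOVATED_TAGS)
    has_poor = bool(tags & POOR_CONDITION_TAGS)
    if has_reno == has_poor:
        return None  # neither kind of tag, or conflicting tags
    return 1 if has_reno else 0


rows = ["Renov/Remod", "Needs Work, As-Is Condition", "", "Shows Well", None]
labels = [renovation_label(r) for r in rows]
print(labels)  # [1, 0, None, None, None]
```

Rows that come back as None are the ones the classification model is later asked to predict.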

10 of 33

Renovation Classification Model Results

11 of 33

A Peek Under the Hood: Renovation Classification Models Tested

  • Linear Support Vector Classification (SVC) stood out as the best performing classifier.

  • F1 score: A single metric that balances precision and recall (higher is better)
    • The F1 score is the more important metric here because the classes are unbalanced (85/15)
  • Accuracy: The fraction of predictions that match the true labels (higher is better)
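
To illustrate why F1 matters more than accuracy with an 85/15 class split, here is a small sketch using synthetic labels (not the project's data). A classifier that never predicts "renovated" still scores 85% accuracy, but its F1 score is zero.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary labels (1 = renovated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Synthetic 85/15 class split, matching the imbalance noted above.
y_true = [0] * 85 + [1] * 15
always_negative = [0] * 100
print(accuracy_and_f1(y_true, always_negative))  # (0.85, 0.0)
```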

Classification Model             F1 Score   Accuracy
Linear SVC                       83.8%      95.0%
Logistic Regression Classifier   83.6%      94.8%
Extra Trees Classifier           81.8%      94.2%
SGD Classifier                   81.8%      93.8%
Random Forest Classifier         81.7%      94.4%

12 of 33

A Peek Under the Hood: Linear SVC Renovation Classification Performance

Accuracy: 0.950 ±0.008

F1score: 0.838 ±0.021

Methodology:

Results were summarized by averaging the scores obtained from a k-fold cross-validation run.
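
The averaging step can be sketched as below. The number of folds, the contiguous fold layout, and the reading of "±" as a standard deviation across folds are all assumptions; the per-fold scores are hypothetical and are not the project's results.

```python
import statistics


def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


folds = list(kfold_indices(10, 5))
print(folds[0][1])  # [0, 1]

# Hypothetical per-fold scores, not the project's actual results.
mean, spread = summarize([0.80, 0.90, 0.85, 0.75, 0.95])
print(f"{mean:.3f} ±{spread:.3f}")
```

Each sample lands in exactly one test fold, so every property contributes to the reported score once.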

13 of 33

A Peek Under the Hood: Linear SVC Renovation Feature Importances

Renovated.

Vibrant words like “granite”, “stunning”, “gorgeous”, or “stainless” are strong indicators of a renovated property.

Non-Renovated.

Characteristics of the sale itself like “estate sale”, “investor”, “opportunity”, or “sold” indicate a non-renovated property.

Property descriptions were fed into the TfidfVectorizer() function to derive term weights indicating renovation status.
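
To show what TF-IDF term weights look like, here is a standard-library sketch of the weighting that scikit-learn documents as the TfidfVectorizer default (smoothed IDF, L2 normalization). It is not the project's code: tokenization is simplified to a whitespace split, and the listing remarks are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1,
    and L2 normalization per document.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
               for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({term: w / norm for term, w in raw.items()})
    return weights


# Hypothetical listing remarks, written only for this illustration.
remarks = [
    "stunning kitchen with granite counters",
    "gorgeous kitchen with stainless appliances",
    "estate sale investor opportunity",
]
weights = tfidf(remarks)
print(weights[0]["granite"] > weights[0]["kitchen"])  # True
```

Terms that appear in fewer descriptions, such as "granite" here, receive more weight than common ones such as "kitchen", which is what lets a linear classifier pick up on renovation vocabulary.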

14 of 33

ARV Regression Model Results

15 of 33

A Peek Under the Hood: Raw ARV Regression Models Compared

  • Extra Trees stood out as the best performer for the ARV Regression model.

(Results trained/tested on an 80/20 split)

  • Percentile errors metric:
    • Typically the only metric that other home price prediction models report
  • R Squared (R2) metric:
    • The standard goodness-of-fit metric for evaluating regression models
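
Both metrics can be computed with the standard library, as sketched below. The choice of the "inclusive" quantile method is an assumption, and the prices and predictions are hypothetical.

```python
import statistics


def abs_percent_errors(actual, predicted):
    """Absolute error of each prediction as a fraction of the true price."""
    return [abs(p - a) / a for a, p in zip(actual, predicted)]


def error_quartiles(actual, predicted):
    """25th, 50th and 75th percentiles of the absolute percent errors."""
    errors = abs_percent_errors(actual, predicted)
    return statistics.quantiles(errors, n=4, method="inclusive")


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical sale prices and predictions, for illustration only.
actual = [100_000, 200_000, 300_000, 400_000, 500_000]
predicted = [110_000, 190_000, 300_000, 420_000, 450_000]
q25, q50, q75 = error_quartiles(actual, predicted)
print(f"median abs. error: {q50:.1%}, R2: {r_squared(actual, predicted):.3f}")
# median abs. error: 5.0%, R2: 0.969
```

The percentile errors are relative to each property's price, while R2 is driven by squared dollar errors, so the two metrics can rank models differently.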

Regression Model               R2 Score   50th Percentile of Absolute Errors   75th Percentile of Absolute Errors
Extra Trees Regression         0.851      6.62%                                12.85%
Random Forest Regression       0.840      6.87%                                13.24%
Gradient Boosting Regression   0.817      7.84%                                14.28%
KNN Regression                 0.737      8.85%                                16.73%
Linear Regression              0.780      10.80%                               19.97%

16 of 33

A Peek Under the Hood: Extra Trees Regression Model Performance

Percentile Errors:

25%: 0.0297

50%: 0.0662

75%: 0.129

R2Score: 0.983

Methodology:

Results were obtained using a train-test method on an 80/20 split.

[Figure: Bins of Properties by Percent Error. x-axis: Percent Error; y-axis: Number of Properties per Bin]

The predicted values of most renovated properties are within 5% to 10% of the true value, with a tail of properties that have larger errors.
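
The binning behind a histogram like this one could be sketched as follows. The 5-point bin width and the error values are assumptions for illustration; the slides do not state the bin width used.

```python
from collections import Counter


def bin_percent_errors(errors, bin_width_pct=5):
    """Count errors per bin; keys are each bin's lower edge in percent."""
    counts = Counter()
    for e in errors:
        lower_edge = int(abs(e) * 100 // bin_width_pct) * bin_width_pct
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical fractional errors (0.07 means a 7% error).
errors = [0.02, 0.03, 0.07, 0.08, 0.12, 0.31]
print(bin_percent_errors(errors))  # {0: 2, 5: 2, 10: 1, 30: 1}
```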

17 of 33

Comparison of ARV Model with Generic Value Models

18 of 33

Full Model Errors vs ARV Model Errors

  • The median price prediction error dropped sharply, from 10.38% to 6.62%, when the regression model was trained on only the renovated data rather than the full data set.
      • Why: The ARV model predicts a property’s post-renovation value, which is less error-prone than predicting its current value.

Data used for Extra Trees Regression Model   R2 Score   25th Percentile of Absolute Errors   50th Percentile of Absolute Errors   75th Percentile of Absolute Errors   Mean Dollar Error   Median Dollar Error
Full Data                                    0.854      4.55%                                10.38%                               21.66%                               $45,100             $26,138
Renovated Data                               0.851      2.93%                                6.62%                                12.85%                               $42,214             $20,425

ARV Median Error Rate: 6.62%

Full Model Median Error Rate: 10.38%
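
As a worked check on the size of the improvement, the median figures reported on this slide can be compared directly. Only the arithmetic is new here; the inputs are the slide's own numbers.

```python
# Median figures as reported on this slide.
full_median_pct, arv_median_pct = 10.38, 6.62
full_median_usd, arv_median_usd = 26_138, 20_425

point_drop = full_median_pct - arv_median_pct
relative_drop = point_drop / full_median_pct * 100
dollar_drop = full_median_usd - arv_median_usd

print(f"{point_drop:.2f} points, {relative_drop:.1f}% relative, ${dollar_drop:,}")
# 3.76 points, 36.2% relative, $5,713
```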

19 of 33

Full Model Errors vs ARV Model Errors vs Researched Model Errors

Median Error Rates of researched price prediction models: 7.3% -12.27%

ARV Median Error Rate: 6.62%

  • The ARV model’s median error rate was lower than that of every price prediction model found in the researched literature, including Zillow’s Zestimate model.

20 of 33

The End Game

21 of 33

Potential ARV Visualizations: Aggregated into a Tableau® Map

  • The map shows a snapshot of the ARV values of census tracts in Baltimore.
    • Blue tracts have higher ARV dollar amounts than red tracts.

  • Notice the clear difference in values between Federal Hill (the blue peninsula) and West Baltimore.

Note: Data displayed in mouseover is currently pre-COVID

22 of 33

Potential Visualizations: Posted Individually on Website

23 of 33

Big Lesson Learned

24 of 33

Big Lesson Learned: Complexities in Value From Variable Interactions

  • For most regression models, it is easy to determine the linear impact of a single variable on property value. It is harder to determine how interactions between variables affect property value.
  • Experimenting with new derived variables that account for these interactions produced big gains in price prediction accuracy.
  • The next two slides each describe one such derived variable.

25 of 33

Big Lesson Learned: Complexities in Value From Variable Interactions

diffFrom_Med_ARV_SqftPerc

  • This variable calculates the percentage difference in square footage between the subject property and the average of its immediate neighbors.

  • Observation: A house that is larger than its immediate neighbors sells for disproportionately more than square footage alone would predict.

  • Insight: People are willing to pay disproportionately more for a property just because it is bigger than its immediate neighbors.
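
A minimal sketch of how such a variable might be derived. The slide describes the baseline as the average of the immediate neighbors, so the mean is used here; how neighbors are selected is not stated, and the function name and numbers are hypothetical.

```python
import statistics


def diff_from_neighbors_sqft_perc(subject_sqft, neighbor_sqfts):
    """Percent difference between the subject and its neighbors' average."""
    baseline = statistics.fmean(neighbor_sqfts)
    return (subject_sqft - baseline) * 100 / baseline


# Hypothetical subject property and three immediate neighbors.
print(diff_from_neighbors_sqft_perc(2400, [1800, 2000, 2200]))  # 20.0
```

A positive value marks a house larger than its neighbors, which gives a tree-based model a direct signal it would otherwise have to infer from raw square footage and location.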

26 of 33

Big Lesson Learned: Complexities in Value From Variable Interactions

SqftPerBaths

  • This variable calculates the square footage per bathroom.

  • Observation: Properties with fewer square feet per bath sell for disproportionately more than the number of baths and square footage alone would predict.

  • Lesson Learned: People are willing to pay disproportionately more to minimize the number of people who have to share a bathroom.
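
The ratio itself is simple to derive; the sketch below adds a guard for missing or zero bath counts, which is an assumption about the data rather than something the slides describe.

```python
def sqft_per_bath(sqft, baths):
    """Square feet per bathroom; None when the bath count is missing or zero."""
    if not baths:
        return None
    return sqft / baths


# Two hypothetical homes of equal size but different bath counts.
print(sqft_per_bath(1800, 1))    # 1800.0
print(sqft_per_bath(1800, 2.5))  # 720.0
```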

27 of 33

Future of ARV Prediction Tool

  • Anticipated Impact: Makes it easier for real estate investors to find their next distressed house to renovate to its full potential.

  • Ideas for future optimizations:
    • A zero-shot classifier as a substitute for matching description text to renovated properties
    • Deep learning techniques in PyTorch to improve the renovation classification model
    • Posting the code and creating a Kaggle competition to crowdsource optimization of the ARV regression model

28 of 33

Questions?

  • I can be reached via email or LinkedIn for further discussion.
    • https://www.linkedin.com/in/joseph-girsch-6b664876/
    • realcashflowjoe@gmail.com

29 of 33

Appendix

30 of 33

Tools and Methods

  • SQL Code: Remote server interaction
  • Base Python: Data Cleaning
  • Natural Language Toolkit (NLTK): Agent remarks processing
  • Subject Matter Expertise: Determining how much to adjust for different property features
  • Machine Learning models (scikit-learn): Predicting renovation status
  • Census Geocoder Package: Converting latitude and longitude coordinates to Census Tracts
  • Tableau: Visualization platform for the ARVs

31 of 33

Median Absolute Error Rates of All Models

Model                                    Median Absolute Error
Kummerow’s OLS model                     12.27%
Clapp’s Local Regression                 11.31%
This Project’s Model (Full Data)         10.38%
Freddie Mac Model-Use Requirement        10.00%
Dubin’s Kriging Model                    8.34%
Case’s Homogeneous Districts             8.07%
Kintzel’s Random Forest                  7.60%
Zillow                                   7.30%
This Project’s Model (Renovation Data)   6.62%

32 of 33

Reflections and Future Work

  • Most rewarding insight: Realization that collaboration with investors yielded unconventional performance improvements

  • Future Work: Build a similar model to identify properties that need “full-gut” repairs.

33 of 33

Works Cited

  • Russell, S. (2020, April 11). Are Zillow Zestimates Accurate? Retrieved April 26, 2021, from Freestone Properties: https://www.freestoneproperties.com/blog/truth-zillow-zestimates/
  • Kintzel, J. (2019). Price Prediction and Computer Vision in the Real Estate Marketplace. Retrieved March 25, 2021, from Harvard Library: https://dash.harvard.edu/bitstream/handle/1/37365260/KINTZEL-DOCUMENT-2019.pdf?sequence=1&isAllowed=y
  • Modern Machine Learning Algorithms: Strengths and Weaknesses. (2019). Retrieved November 28, 2020, from Elite Data Science: https://elitedatascience.com/machine-learning-algorithms#classification
  • Ortner, A. (2020, May 28). Top 10 Binary Classification Algorithms [a Beginner’s Guide]. Retrieved November 28, 2020, from Medium: https://medium.com/@alex.ortner.1982/top-10-binary-classification-algorithms-a-beginners-guide-feeacbd7a3e2
  • Wake, J. (2016, July 8). Zillow’s Typical Error Is $18,000. Retrieved November 27, 2020, from Real Estate Decoded: https://realestatedecoded.com/zillows-typical-error/
  • Krigsman. (n.d.). Zillow: Machine learning and data disrupt real estate. Retrieved from ZDNet: https://www.zdnet.com/article/zillow-machine-learning-and-data-in-real-estate/
  • What Is The MLS? Multiple Listing Service 101: Real Estate Skills. (n.d.). Retrieved November 08, 2020, from https://www.realestateskills.com/blog/mls-multiple-listing-service
