Kuala Lumpur Property Listing Clustering Analysis
Carlos Alberto Lembono
Table of contents
Analysis Background
Data Preparation
Import libraries and read files
Import Libraries
Reading File
The analysis uses a single dataset
The file is stored in Google Sheets and read with Python
The size data is pre-cleaned in a separate sheet in Google Sheets
Read Original Dataset
Syntax
Read pre-cleaned ‘size’ data
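One way to read a Google Sheets tab from Python is through its CSV export URL. The sketch below assumes the sheet is shared so that anyone with the link can view it. The sheet ID and the tab `gid` values are placeholders, not the ones used in this analysis, and `sheet_csv_url` and `read_sheet` are hypothetical helper names.

```python
import pandas as pd


def sheet_csv_url(sheet_id: str, gid: int = 0) -> str:
    """Build the CSV export URL for one tab of a Google Sheets file."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )


def read_sheet(sheet_id: str, gid: int = 0) -> pd.DataFrame:
    """Download one tab as a DataFrame (requires network access)."""
    return pd.read_csv(sheet_csv_url(sheet_id, gid))


# Placeholder identifier: replace with the real sheet ID and tab gids.
SHEET_ID = "YOUR_SHEET_ID"
original_url = sheet_csv_url(SHEET_ID, gid=0)

# df = read_sheet(SHEET_ID, gid=0)               # original dataset tab
# size_df = read_sheet(SHEET_ID, gid=123456789)  # pre-cleaned 'size' tab
```

The two commented calls mirror the two reads on the slides: one for the original dataset and one for the pre-cleaned size sheet.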
Data Cleaning
Check and clean the data using Python
Handling Missing Values
Many values are missing, especially in the Car Parks and Furnishing columns
There are several types of furnishing
Fill the missing values with ‘unknown’
Check the data again for missing values
Because only a few rows still have missing values, those rows are excluded
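The missing-value steps above could look roughly like this in pandas. It is a sketch on toy rows: the column names and values are assumptions, and it only covers the furnishing fill and the final drop, since the slides do not say how the missing Car Parks values were treated.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the listings; column names are assumed.
df = pd.DataFrame({
    "Price": ["RM 500,000", "RM 750,000", None, "RM 1,200,000"],
    "Rooms": ["2", "3+1", "4", "3"],
    "Furnishing": ["Fully Furnished", np.nan, "Unfurnished", np.nan],
})

# Count the missing values per column.
missing_before = df.isna().sum()

# Fill missing furnishing information with the label 'unknown'.
df["Furnishing"] = df["Furnishing"].fillna("unknown")

# Check again, then drop the few rows that still have missing values.
missing_after = df.isna().sum()
df = df.dropna().reset_index(drop=True)
```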
Handling Duplicates
Some rows are duplicated; the duplicates are excluded
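A minimal pandas sketch of the duplicate check, using toy rows with assumed column names:

```python
import pandas as pd

# Toy rows; the second row repeats the first exactly.
df = pd.DataFrame({
    "Location": ["KLCC", "KLCC", "Cheras"],
    "Price": [1_000_000, 1_000_000, 450_000],
})

# Number of rows that are identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
```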
String Manipulation
Because every property is in Kuala Lumpur, the city name is removed from the location data
Location Data
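Removing the city name can be done with a small regular expression. The sketch assumes location strings end in ", Kuala Lumpur" (for example "KLCC, Kuala Lumpur"); that format and the helper name `strip_city` are assumptions.

```python
import re


def strip_city(location: str, city: str = "Kuala Lumpur") -> str:
    """Remove a trailing ', <city>' suffix from a location string."""
    pattern = rf",?\s*{re.escape(city)}\s*$"
    return re.sub(pattern, "", location, flags=re.IGNORECASE).strip()


cleaned = [strip_city(s) for s in ["KLCC, Kuala Lumpur", "Mont Kiara, Kuala Lumpur"]]
```

On a DataFrame the same helper could be applied with `df["Location"].map(strip_city)`.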
The price data is stored as strings
The room data is stored as strings
Price Data
Room Data
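The two string columns can be turned into numbers with small parsing helpers. This is a sketch under assumptions: prices are taken to look like "RM 1,250,000" with no decimal part, room strings like "3+1", and 'Total Room' is taken to be the sum of the numbers in the room string. The helper names are hypothetical.

```python
import re


def parse_price(text: str) -> int:
    """Turn a price string such as 'RM 1,250,000' into an integer."""
    digits = re.sub(r"[^\d]", "", text)  # drop currency code, spaces, commas
    return int(digits)                   # raises ValueError if no digits remain


def parse_total_rooms(text: str) -> int:
    """Sum the numbers in a room string such as '3+1'."""
    return sum(int(n) for n in re.findall(r"\d+", text))


price = parse_price("RM 1,250,000")
total_rooms = parse_total_rooms("3+1")
```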
Remove the parentheses and the text inside them
Size Data
Property Type Data
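Stripping parenthesised text can be sketched with one regular expression. The example value "Condominium (Corner)" is an assumed format, and `drop_parenthetical` is a hypothetical helper name.

```python
import re


def drop_parenthetical(text: str) -> str:
    """Remove every '(...)' group and tidy the leftover whitespace."""
    without = re.sub(r"\s*\([^)]*\)", "", text)
    return re.sub(r"\s+", " ", without).strip()


property_type = drop_parenthetical("Condominium (Corner)")
```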
Converting Data Types
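Once the strings are cleaned, the columns can be converted to numeric types. The sketch below uses `pd.to_numeric` on toy data; which columns were converted and how is not shown on the slides, so the column list is an assumption.

```python
import pandas as pd

# Toy data: cleaned values that are still stored as text.
df = pd.DataFrame({
    "Price": ["500000", "1250000"],
    "Bathrooms": ["2", "3"],
    "Size": ["850", "not available"],
})

numeric_columns = ["Price", "Bathrooms", "Size"]
for column in numeric_columns:
    # errors="coerce" turns unparseable text into NaN instead of raising.
    df[column] = pd.to_numeric(df[column], errors="coerce")
```

Values that fail to parse show up as missing, so a second missing-value check after conversion is worthwhile.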
Handling Outliers
Statistic | Price |
Mean | 1882735 |
Median | 990000 |
Mode | 1200000 |
Variance | 1.736036e+14 |
Stdev | 13175870 |
Skewness | 125.33 |
Range | 1979999692 |
Minimum | 308 |
Maximum | 1980000000 |
Count | 46180 |
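The statistics in these tables can be produced with pandas. The sketch below runs on toy prices, not the real data, and uses the pandas defaults (sample variance and standard deviation); whether the original figures used the same convention is not stated. A skewness far above zero, as reported for Price, indicates a long right tail caused by a few very large values.

```python
import pandas as pd


def describe_column(values: pd.Series) -> pd.Series:
    """Return the summary statistics used in the outlier tables."""
    return pd.Series({
        "Mean": values.mean(),
        "Median": values.median(),
        "Mode": values.mode().iloc[0],  # smallest value if several modes tie
        "Variance": values.var(),       # sample variance (ddof=1)
        "Stdev": values.std(),          # sample standard deviation (ddof=1)
        "Skewness": values.skew(),
        "Range": values.max() - values.min(),
        "Minimum": values.min(),
        "Maximum": values.max(),
        "Count": values.count(),
    })


# Toy prices with one extreme value to produce a right-skewed column.
toy_prices = pd.Series([300_000, 450_000, 450_000, 600_000, 9_000_000])
stats = describe_column(toy_prices)
```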
Statistic | Bathrooms |
Mean | 3.107 |
Median | 3 |
Mode | 2 |
Variance | 2.733 |
Stdev | 1.65 |
Skewness | 1.418 |
Range | 19 |
Minimum | 1 |
Maximum | 20 |
Count | 46180 |
Statistic | Car Parks |
Mean | 2.021 |
Median | 2 |
Mode | 2 |
Variance | 1.24 |
Stdev | 1.14 |
Skewness | 5.53 |
Range | 29 |
Minimum | 1 |
Maximum | 30 |
Count | 46180 |
Statistic | Size |
Mean | 46180 |
Median | 1432 |
Mode | 1650 |
Variance | 2.715885e+09 |
Stdev | 52114 |
Skewness | 203.3877 |
Range | 10999999 |
Minimum | 1 |
Maximum | 11000000 |
Count | 46180 |
Statistic | Total Room |
Mean | 3.71 |
Median | 4 |
Mode | 3 |
Variance | 2.265 |
Stdev | 1.51 |
Skewness | 0.651 |
Range | 17 |
Minimum | 1 |
Maximum | 18 |
Count | 46180 |
Data after removing outliers
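The slides do not state which rule was used to remove outliers. As an illustration only, the sketch below applies the common 1.5 × IQR rule to each numeric column; the rule, the factor `k`, the toy data, and the helper name `remove_iqr_outliers` are all assumptions.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    keep = pd.Series(True, index=df.index)
    for column in columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        keep &= df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Toy data with one extreme price.
toy = pd.DataFrame({
    "Price": [400_000, 500_000, 550_000, 600_000, 700_000, 1_980_000_000],
    "Size": [800, 900, 1_000, 1_100, 1_200, 1_300],
})
cleaned = remove_iqr_outliers(toy, ["Price", "Size"])
```

The bounds are computed from the full table before any rows are dropped, so the order of the columns does not change the result.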
Clustering Analysis
Cluster the properties
Decide the Features
Choose 'Price', 'Bathrooms', 'Car Parks', 'Size', 'Total Room', and 'Furnishing Type' as the features for clustering
Convert categorical data to numerical data
Convert 'Furnishing Type' to numerical data with
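The encoding used on the slide is not preserved in the text. One simple option, shown here purely as an assumption, is an explicit integer mapping; the category names and codes below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Furnishing Type": ["Fully Furnished", "unknown", "Unfurnished", "Partly Furnished"],
})

# Hypothetical mapping: the categories and their codes are assumptions.
furnishing_codes = {
    "unknown": 0,
    "Unfurnished": 1,
    "Partly Furnished": 2,
    "Fully Furnished": 3,
}
df["Furnishing Code"] = df["Furnishing Type"].map(furnishing_codes)

# Categories missing from the mapping would become NaN, so check for that.
unmapped = int(df["Furnishing Code"].isna().sum())
```

An integer mapping implies an order between categories. One-hot encoding with `pd.get_dummies` is an alternative when no order is intended.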
Create a dummy table for clustering
Create a dummy table that contains only the features
Standardize the data
Price has a much wider range than the other features, so the data needs to be standardized
Create a copy of the dummy table and standardize it with ‘preprocessing’ from sklearn
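A sketch of the standardization step with `sklearn.preprocessing`. It assumes `StandardScaler` (z = (x − mean) / standard deviation per column); the slides only name the `preprocessing` module, and the toy numbers are illustrative.

```python
import pandas as pd
from sklearn import preprocessing

# Toy feature table; the numbers are illustrative only.
features = pd.DataFrame({
    "Price": [400_000, 650_000, 900_000, 2_500_000],
    "Bathrooms": [1, 2, 2, 4],
    "Size": [700, 1_000, 1_300, 3_000],
})

# Build a standardized copy and leave the original table untouched.
scaler = preprocessing.StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
```

After scaling, every column has mean 0 and unit (population) standard deviation, so Price no longer dominates distance calculations.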
Determine the Number of Clusters
Use the elbow method as one option for determining the number of clusters
Potential number of clusters
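The elbow method can be sketched as follows. The clustering algorithm is not named on the slides, so K-means is an assumption here, and the data is synthetic (three well-separated groups) rather than the property listings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, already-scaled data with three loose groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [0.0, 3.0], [3.0, -3.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

# Plotting k against inertia shows the "elbow": the point after which
# adding more clusters stops reducing inertia sharply.
```

The elbow is read visually, which is why it is treated as only one of several options for choosing the number of clusters.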
Use visualisation for trial and error (try 2 to 5 clusters)
2 and 3 clusters are the potential choices
Clusters are nicely separated
Some clusters overlap with other clusters
[Figure: cluster plots for 2, 3, 4, and 5 clusters]
Check with the silhouette score
Choose 3 clusters
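A silhouette check over the candidate cluster counts might look like this. As before, K-means is an assumed algorithm and the data is synthetic, built to contain three groups, so the result illustrates the procedure and does not reproduce the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic, already-scaled data with three loose groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [0.0, 3.0], [3.0, -3.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # between -1 and 1, higher is better

# Pick the cluster count with the highest average silhouette score.
best_k = max(scores, key=scores.get)
```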
Conclusion
Thank you!