1 of 43

Kuala Lumpur Property Listing Clustering Analysis

Carlos Alberto Lembono

2 of 43

Table of contents

  • Analysis Background
  • Data Preparation
  • Data Cleaning
  • Clustering Analysis

Google Colab Link


3 of 43

Analysis Background


  • The given dataset contains property listings in Kuala Lumpur
  • This analysis continues the earlier Property Listing data exploration
  • Since the analysis uses Python, all rows in the dataset are included
  • The properties are clustered with the k-means method to support better customer targeting

4 of 43

Data Preparation

Import libraries and read files


5 of 43

Import Libraries


6 of 43

Reading File


Given one dataset

  • Property Listings in Kuala Lumpur

The file is stored in Google Sheets and is read with Python

The size data is pre-cleaned in Google Sheets and kept in a separate sheet
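A minimal sketch of reading a Google Sheets tab with pandas, assuming the sheet is shared by link and using the commonly used CSV export URL pattern. The sheet ID, the gid, the helper name `sheet_csv_url`, and the column names are hypothetical placeholders, not values from the original notebook.

```python
import io

import pandas as pd


def sheet_csv_url(sheet_id: str, gid: str = "0") -> str:
    """Build the CSV export URL for one tab of a Google Sheets file."""
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"


# Hypothetical ID and tab; replace with the real sheet ID and gid.
url = sheet_csv_url("SHEET_ID", gid="0")
# df = pd.read_csv(url)  # needs network access and a link-shared sheet

# Offline stand-in with the same call shape, so the sketch runs anywhere.
sample = io.StringIO('Location,Price\n"KLCC, Kuala Lumpur","RM 1,250,000"\n')
df = pd.read_csv(sample)
print(df.shape)
```

The pre-cleaned size sheet could be read the same way by passing that tab's gid.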

7 of 43

Reading File


Read Original Dataset

8 of 43

Syntax


Read pre-cleaned ‘size’ data

9 of 43

Data Cleaning

Check and clean the data using Python


10 of 43

Data Cleaning

  • Handling Missing Values
  • Removing Duplicate Values
  • Handling Typos (String Manipulation)
  • Converting Data Types
  • Handling Outliers


11 of 43

Handling Missing Values

There are many missing values in the data, especially for Car Parks and Furnishing

  • Look at the distribution of the Car Parks and Bathrooms data
  • There is a wide range between the minimum and the maximum values
  • Fill the missing values with the median of each column
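The median fill described above can be sketched as follows. The median is less sensitive to extreme values than the mean, which matters when the range is wide. The column names and sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values and one extreme value per column.
df = pd.DataFrame({
    "Car Parks": [1, 2, np.nan, 2, 30],
    "Bathrooms": [2, np.nan, 3, 2, 20],
})

# Fill each column's missing values with that column's own median.
for col in ["Car Parks", "Bathrooms"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```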


12 of 43

Handling Missing Values


13 of 43

Handling Missing Values

There are several types of furnishing

  • Fully Furnished
  • Partly Furnished
  • Unfurnished
  • Unknown

Fill the missing values with ‘Unknown’
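A short sketch of this fill, assuming the column is named `Furnishing` (the name is a guess, not taken from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; NaN marks a listing with no furnishing information.
df = pd.DataFrame({"Furnishing": ["Fully Furnished", np.nan, "Unfurnished", np.nan]})

# Treat missing furnishing as its own category instead of dropping the rows.
df["Furnishing"] = df["Furnishing"].fillna("Unknown")
print(df["Furnishing"].value_counts())
```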


14 of 43

Handling Missing Values


Check the data again for missing values

Since only a few rows still have missing values, those rows are excluded
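One possible way to run this check and drop the remaining rows, shown on made-up data with assumed column names:

```python
import pandas as pd

# Hypothetical sample: two rows still contain a missing value.
df = pd.DataFrame({
    "Price": ["RM 500,000", None, "RM 900,000"],
    "Size": ["1,000 sq. ft.", "800 sq. ft.", None],
})

# Count the missing values per column, then drop any row that has one.
missing_per_column = df.isna().sum()
print(missing_per_column)
df = df.dropna().reset_index(drop=True)
```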

15 of 43

Removing Duplicate Values

Some rows are duplicated; the duplicate rows are excluded
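A minimal sketch of counting and removing exact duplicate rows, using illustrative data:

```python
import pandas as pd

# Hypothetical sample: the first two rows are identical.
df = pd.DataFrame({
    "Location": ["KLCC", "KLCC", "Cheras"],
    "Price": [1000000, 1000000, 400000],
})

# duplicated() flags every repeat after the first occurrence.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(df))
```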

16 of 43

String Manipulation


Since all the properties are in Kuala Lumpur, the ‘Kuala Lumpur’ text is removed from the location data
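A sketch of this step, assuming the location values look like "Area, Kuala Lumpur" (the exact format in the dataset is an assumption):

```python
import pandas as pd

# Hypothetical location strings with the city name as a suffix.
df = pd.DataFrame({"Location": ["KLCC, Kuala Lumpur", "Mont Kiara, Kuala Lumpur"]})

# Remove the literal suffix, then trim any leftover whitespace.
df["Location"] = (
    df["Location"]
    .str.replace(", Kuala Lumpur", "", regex=False)
    .str.strip()
)
print(df["Location"].tolist())
```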

Location Data

17 of 43

String Manipulation

The price data is in string type

  • Remove the ‘RM’ string
  • Remove the ‘,’ string
  • Convert the data to integer
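The three price steps above can be sketched as a single chain. The sample strings are illustrative.

```python
import pandas as pd

# Hypothetical price strings in the 'RM 1,250,000' style.
df = pd.DataFrame({"Price": ["RM 1,250,000", "RM 480,000"]})

df["Price"] = (
    df["Price"]
    .str.replace("RM", "", regex=False)   # remove the currency prefix
    .str.replace(",", "", regex=False)    # remove thousands separators
    .str.strip()
    .astype(int)
)
print(df["Price"].tolist())
```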

The room data is in string type

  • The data is in ‘N+M’ format, where N is the number of main rooms and M is the number of additional rooms
  • Split the data on the delimiter ‘+’


Price Data

Room Data

18 of 43

String Manipulation

The room data is in string type

  • Add new columns named ‘main_room’ and ‘additional_room’ to the dataframe
  • For main room data, replace the few non-numeric strings with 1 (assume there is one main room)
  • For additional room data, fill the None values with 0
  • Convert the data to integer
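A possible sketch of the room split. The column name `Rooms` and the value "Studio" are assumptions. Instead of replacing specific strings one by one, this version uses `pd.to_numeric(..., errors="coerce")` as a stand-in, so any non-numeric main room value becomes 1.

```python
import pandas as pd

# Hypothetical room strings in 'N+M' format, plus one non-numeric value.
df = pd.DataFrame({"Rooms": ["3+1", "4", "Studio", "2+2"]})

# Split once on '+': column 0 is the main part, column 1 the additional part.
parts = df["Rooms"].str.split("+", n=1, expand=True)

# Non-numeric main rooms become NaN, then 1 (assume one main room).
df["main_room"] = pd.to_numeric(parts[0], errors="coerce").fillna(1).astype(int)
# A missing additional part means no additional room.
df["additional_room"] = pd.to_numeric(parts[1], errors="coerce").fillna(0).astype(int)
print(df)
```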


Room Data

19 of 43

String Manipulation

  • Remove rows where the size is 0 (logically, no property has zero size)
  • Remove size data that contains ‘sq.m.’ and ‘acres’

For the property type data, remove the text in parentheses along with the parentheses themselves
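A rough sketch of both clean-ups. The raw formats shown here ("1,335 sq. ft.", "Condominium (Corner)") are assumptions, since the pre-cleaned sheet is not shown in the slides.

```python
import pandas as pd

# Hypothetical raw values for illustration only.
df = pd.DataFrame({
    "Size": ["1,335 sq. ft.", "0 sq. ft.", "250 sq. m.", "2 acres", "900 sq. ft."],
    "Property Type": ["Condominium (Corner)", "Bungalow",
                      "Terrace (Intermediate)", "Land", "Apartment"],
})

# Drop rows whose size is given in square metres or acres.
is_other_unit = df["Size"].str.contains(r"sq\.?\s*m\.?|acres", case=False, regex=True)
df = df[~is_other_unit].copy()

# Keep the numeric part of the size, then drop zero-size rows.
numbers = df["Size"].str.replace(",", "", regex=False).str.extract(r"(\d+(?:\.\d+)?)")[0]
df["Size"] = pd.to_numeric(numbers)
df = df[df["Size"] > 0].reset_index(drop=True)

# Remove any '(...)' group, including the parentheses themselves.
df["Property Type"] = (
    df["Property Type"].str.replace(r"\s*\(.*?\)", "", regex=True).str.strip()
)
print(df)
```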


Size Data

Property Type Data

20 of 43

Converting Data Types

  • Several columns, such as Price and Rooms, have already been converted to integer
  • Convert the Bathrooms and Car Parks data to integer
  • All the data is now in the right type
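A small sketch of the conversion, assuming the two columns are floats after the median fill:

```python
import pandas as pd

# Hypothetical sample: median filling typically leaves float columns.
df = pd.DataFrame({"Bathrooms": [2.0, 3.0], "Car Parks": [1.0, 2.0]})

# Convert both columns in one call, then confirm the result.
df = df.astype({"Bathrooms": int, "Car Parks": int})
print(df.dtypes)
```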


21 of 43

Handling Outliers

  • Add a total rooms column by adding the main rooms and the additional rooms
  • Remove the columns that are no longer relevant, such as Room, Size, Main Room, and Additional Room
  • Show descriptive statistics with .describe() to spot outliers
  • Create box plots for visualization
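The steps above might look like the sketch below. The data and column names are invented for illustration, and the box plot call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Hypothetical sample with one very expensive listing.
df = pd.DataFrame({
    "Price": [400000, 550000, 990000, 1200000, 25000000],
    "main_room": [2, 3, 3, 4, 6],
    "additional_room": [0, 1, 0, 1, 2],
})

# Total rooms by addition, then drop the helper columns.
df["Total Room"] = df["main_room"] + df["additional_room"]
df = df.drop(columns=["main_room", "additional_room"])

# describe() gives count, mean, std, min, quartiles, and max; add skewness.
summary = df.describe().T
summary["skew"] = df.skew()
print(summary)

# df.boxplot(column="Price")  # visual check for outliers (needs matplotlib)
```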


22 of 43

Handling Outliers

  • There are potential outliers in the price data
  • The range between the minimum and the maximum values is very wide
  • The standard deviation of 13175870 is about seven times the mean, so the values are spread far from the mean
  • The data is highly positively skewed, as shown by the large positive skewness value

Price

  Statistic   Value
  Mean        1882735
  Median      990000
  Mode        1200000
  Variance    1.736036e+14
  Stdev       13175870
  Skewness    125.33
  Range       1979999692
  Minimum     308
  Maximum     1980000000
  Count       46180

23 of 43

Handling Outliers

  • There are potential outliers in the bathroom data
  • The range between the minimum and the maximum values is wide
  • The standard deviation of 1.65 shows that the values are somewhat spread around the mean, but not far from it
  • The data is positively skewed, as shown by the positive skewness value

Bathroom

  Statistic   Value
  Mean        3.107
  Median      3
  Mode        2
  Variance    2.733
  Stdev       1.65
  Skewness    1.418
  Range       19
  Minimum     1
  Maximum     20
  Count       46180

24 of 43

Handling Outliers

  • There are potential outliers in the car parks data
  • The range between the minimum and the maximum values is wide
  • The standard deviation of 1.14 shows that the values are somewhat spread around the mean, but not far from it
  • The data is positively skewed, as shown by the positive skewness value

Car Parks

  Statistic   Value
  Mean        2.021
  Median      2
  Mode        2
  Variance    1.24
  Stdev       1.14
  Skewness    5.53
  Range       29
  Minimum     1
  Maximum     30
  Count       46180

25 of 43

Handling Outliers

  • There are potential outliers in the size data
  • The range between the minimum and the maximum values is very wide
  • The standard deviation of 52114 shows that the values are spread far from the mean
  • The data is highly positively skewed, as shown by the large positive skewness value

Size

  Statistic   Value
  Mean        46180
  Median      1432
  Mode        1650
  Variance    2.715885e+09
  Stdev       52114
  Skewness    203.3877
  Range       10999999
  Minimum     1
  Maximum     11000000
  Count       46180

26 of 43

Handling Outliers

  • There are potential outliers in the total room data
  • The range between the minimum and the maximum values is wide
  • The standard deviation of 1.51 shows that the values are somewhat spread around the mean, but not far from it
  • The data is moderately positively skewed, as shown by the small positive skewness value (0.651)

Total Room

  Statistic   Value
  Mean        3.71
  Median      4
  Mode        3
  Variance    2.265
  Stdev       1.51
  Skewness    0.651
  Range       17
  Minimum     1
  Maximum     18
  Count       46180

27 of 43

Handling Outliers

  • Price, size, and car parks are highly skewed


28 of 43

Handling Outliers

  • Remove the outliers using the IQR method for Price and Size; for Car Parks, remove the values above 10
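A sketch of the IQR rule under one assumption: the conventional 1.5 x IQR fences. The slides do not state the multiplier, and the helper name `remove_iqr_outliers` is hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]


# Hypothetical sample: one extreme price and one extreme car park count.
df = pd.DataFrame({
    "Price": [400000, 500000, 600000, 700000, 800000, 50000000],
    "Size": [800, 900, 1000, 1100, 1200, 1300],
    "Car Parks": [1, 2, 2, 1, 12, 2],
})

# IQR filter for Price and Size, fixed threshold for Car Parks.
for col in ["Price", "Size"]:
    df = remove_iqr_outliers(df, col)
df = df[df["Car Parks"] <= 10].reset_index(drop=True)
print(df)
```

Filtering the columns one after another means the Size fences are computed on rows that already passed the Price filter; computing all fences on the original data first is an equally valid choice.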


29 of 43

Handling Outliers

Data after removing outliers


30 of 43

Clustering Analysis

Cluster the properties


31 of 43

Clustering Analysis

  • Decide the Features
  • Convert categorical data to numerical data
  • Create a dummy table for clustering
  • Standardize the data
  • Determine the number of clusters
  • Clustering Analysis
  • Conclusion


32 of 43

Decide the Features

Choose 'Price', 'Bathrooms', 'Car Parks', 'Size', 'Total Room', and 'Furnishing Type' as the features for clustering


33 of 43

Convert categorical data to numerical data

Convert 'Furnishing Type' to numerical data using the following codes

  • 0 = Fully Furnished
  • 1 = Partly Furnished
  • 2 = Unfurnished
  • 3 = Unknown
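The mapping above can be applied with a dictionary. The source column name `Furnishing` is an assumption.

```python
import pandas as pd

# Codes taken from the list above.
furnishing_codes = {
    "Fully Furnished": 0,
    "Partly Furnished": 1,
    "Unfurnished": 2,
    "Unknown": 3,
}

# Hypothetical sample column.
df = pd.DataFrame({
    "Furnishing": ["Unfurnished", "Fully Furnished", "Unknown", "Partly Furnished"],
})
df["Furnishing Type"] = df["Furnishing"].map(furnishing_codes)
print(df)
```

Integer codes give the categories an order and a distance that k-means will use, so "Unknown" ends up farthest from "Fully Furnished"; one-hot columns would be an alternative if that ordering is not wanted.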


34 of 43

Create a dummy table for clustering

The dummy table contains only the feature columns


35 of 43

Standardize the data

The range of the price data is far larger than that of the other features, hence the data needs to be standardized


36 of 43

Standardize the data

Create a copy of the dummy data and use ‘preprocessing’ from sklearn to standardize it
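A minimal sketch of this step using `preprocessing.StandardScaler`. Whether the original notebook used `StandardScaler` or `preprocessing.scale` is not shown, so this is one plausible choice; the feature values are invented.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical feature table on very different scales.
features = pd.DataFrame({
    "Price": [400000, 600000, 800000, 1000000],
    "Size": [800, 1000, 1200, 1400],
    "Total Room": [2, 3, 3, 4],
})

# Work on a copy so the original feature table stays untouched.
dummy = features.copy()
scaler = preprocessing.StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(dummy), columns=dummy.columns, index=dummy.index
)
print(scaled.round(3))
```

After scaling, each column has mean 0 and unit (population) standard deviation, so Price no longer dominates the distance used by k-means.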


37 of 43

Determine Number of Clusters

Use the elbow method as one option to determine the number of clusters
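A sketch of the elbow method on synthetic data (three well-separated blobs, not the property data). It records the k-means inertia, the within-cluster sum of squares, for each candidate number of clusters; the "elbow" is where the decrease flattens.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: three blobs around assumed centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for k = 1..6 and keep the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k (for example with matplotlib) gives the usual elbow chart.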


Potential number of clusters

38 of 43

Determine Number of Clusters

Use visualisation for trial and error (try 2 to 5 clusters)


39 of 43

Determine Number of Clusters

2 and 3 clusters are the potential choices

[Figure: cluster plots for 2, 3, 4, and 5 clusters, annotated "Clusters separated nicely" and "Some clusters overlap with another cluster"]

40 of 43

Determine Number of Clusters

Check with the silhouette score

Choose 3 clusters
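A sketch of the silhouette check on synthetic data (three blobs, not the property data). The silhouette score lies between -1 and 1, and higher values mean tighter, better separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three blobs around assumed centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Score each candidate number of clusters from 2 to 5.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```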


41 of 43

Clustering Analysis

  • 0: Expensive, partly furnished, big space, more than 4 rooms, 4 bathrooms, 2 car parks (Spacious Big Residence)
  • 1: Medium price, fully furnished, small space, 2 rooms, 1-2 bathrooms, 1 car park (Fully Furnished Personal Residence)
  • 2: Low price, unfurnished, medium space, 2-3 rooms, 2 bathrooms, 1 car park (Affordable Regular Residence)
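Cluster descriptions like these are typically read from per-cluster averages. The sketch below shows one way to build such a profile; the six listings are invented, the feature set is reduced for brevity, and the cluster numbering from k-means is arbitrary, so it will not necessarily match the labels 0, 1, and 2 above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical listings: two large, two compact, two mid-sized homes.
df = pd.DataFrame({
    "Price": [2500000, 2600000, 900000, 950000, 400000, 450000],
    "Size": [3000, 3100, 700, 750, 1100, 1200],
    "Total Room": [5, 5, 2, 2, 3, 3],
})

# Cluster on standardized features, then attach the labels to the raw data.
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Average of each feature per cluster, in the original units.
profile = df.groupby("cluster").mean()
print(profile)
```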


42 of 43

Conclusion

  • Cluster 0 (Spacious Big Residence) suits customers who need a big space and many rooms. The likely target consumers have a higher socioeconomic status (SES) and like to host gatherings with colleagues or family
  • Cluster 1 (Fully Furnished Personal Residence) suits individuals. The likely target consumer is an executive or office worker who needs a compact living space
  • Cluster 2 (Affordable Regular Residence) suits households of 3-4 people who are interested in decorating their own living space


43 of 43

Thank you!
