3 of 43

Analysis Background

Given: a dataset of property listings in Kuala Lumpur
This analysis continues the earlier exploration of the property listing data
Because the analysis uses Python, all rows in the dataset are included
The properties are clustered with the k-means method to support better customer targeting

4 of 43

Data Preparation

Import libraries and read files

5 of 43

Import Libraries

6 of 43

Reading File

Given one dataset

Property Listings in Kuala Lumpur

The file is stored in Google Sheets and is read with Python

The size data is pre-cleaned in Google Sheets, in a separate sheet
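
One way to read a Google Sheets tab with Python is through its CSV export URL. The sketch below is only an illustration: the sheet ID and gid values are hypothetical placeholders, the helper name sheet_csv_url is invented here, and it assumes the sheet is shared so that anyone with the link can view it. The slides do not state which method was actually used.

```python
import pandas as pd

def sheet_csv_url(sheet_id: str, gid: str = "0") -> str:
    """Build the CSV export URL for one tab of a Google Sheets document."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )

# Hypothetical placeholders: replace with the real sheet ID and tab gid.
SHEET_ID = "YOUR_SHEET_ID"
listing_url = sheet_csv_url(SHEET_ID, gid="0")

# Reading needs network access and a link-shared sheet, so it is commented out:
# df = pd.read_csv(listing_url)
```

The pre-cleaned size tab could be read the same way by passing that tab's gid.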

7 of 43

Reading File

Read Original Dataset

8 of 43

Syntax

Read pre-cleaned ‘size’ data

9 of 43

Data Cleaning

Check and clean the data using Python

10 of 43

Data Cleaning

Handling Missing Values
Removing Duplicate Values
Handling Typos (String Manipulation)
Converting Data Types
Handling Outliers

11 of 43

Handling Missing Values

There are many missing values in the data, especially for Car Parks and Furnishing

Look at the distribution of the Car Parks and Bathrooms data
There is a wide range between the minimum and the maximum values
Because of this, the missing values are filled with the median, which is less sensitive to extreme values than the mean
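
A minimal sketch of the median fill, assuming the columns are named 'Bathrooms' and 'Car Parks'. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with missing values and one extreme value per column.
df = pd.DataFrame({
    "Bathrooms": [2, np.nan, 3, 1, 20],
    "Car Parks": [1, 2, np.nan, np.nan, 30],
})

# Fill each column's missing values with that column's median.
for col in ["Bathrooms", "Car Parks"]:
    df[col] = df[col].fillna(df[col].median())
```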

12 of 43

Handling Missing Values

13 of 43

Handling Missing Values

There are several types of furnishing

Fully Furnished
Partly Furnished
Unfurnished
Unknown

Fill the missing values with ‘Unknown’
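
A minimal sketch of this step, assuming the column is named 'Furnishing'. The rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Furnishing": ["Fully Furnished", None, "Unfurnished", None],
})

# Treat a missing furnishing status as its own category.
df["Furnishing"] = df["Furnishing"].fillna("Unknown")
```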

14 of 43

Handling Missing Values

Check the data again for missing values

Since only a few rows still have missing values, those rows are excluded

15 of 43

Removing Duplicate Values

There are some duplicated rows; they are excluded
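
The two exclusion steps (rows that still have missing values, then duplicated rows) can be sketched as follows. The column names and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["Mont Kiara", "Mont Kiara", "Cheras", None],
    "Price": [1000000, 1000000, 450000, 300000],
})

rows_before = len(df)

# Drop rows with any missing value, then drop exact duplicate rows.
df = df.dropna().drop_duplicates().reset_index(drop=True)

print(f"Removed {rows_before - len(df)} rows")
```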

16 of 43

String Manipulation

Since all the properties are in Kuala Lumpur, the ‘Kuala Lumpur’ text is removed from the location data

Location Data
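
A minimal sketch of the location cleanup, assuming the column is named 'Location' and that values look like 'Area, Kuala Lumpur'. The exact format of the original strings is not shown in the slides.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["KLCC, Kuala Lumpur", "Mont Kiara, Kuala Lumpur", "Cheras"],
})

# Strip a trailing ", Kuala Lumpur" and any leftover whitespace.
df["Location"] = (
    df["Location"]
    .str.replace(r",\s*Kuala Lumpur$", "", regex=True)
    .str.strip()
)
```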

17 of 43

String Manipulation

The price data is in string type

Remove the ‘RM’ string
Remove the ‘,’ string
Convert the data to integer

The room data is in string type

The data is in ‘N+M’ format, where N is the number of main rooms and M is the number of additional rooms
Split the data with delimiter ‘+’

Price Data

Room Data
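
The price and room steps above can be sketched as follows. The column names 'Price' and 'Rooms' and the sample strings are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Price": ["RM 1,250,000", "RM 480,000"],
    "Rooms": ["2+1", "3"],
})

# Remove the currency prefix and thousands separators, then convert to int.
df["Price"] = (
    df["Price"]
    .str.replace("RM", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype(int)
)

# Split 'N+M' into two columns; rows without '+' get None in the second column.
room_parts = df["Rooms"].str.split("+", n=1, expand=True)
```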

18 of 43

String Manipulation

The room data is in string type

Add two new columns to the dataframe, named ‘main_room’ and ‘additional_room’
For the main room data, replace several non-numeric strings with 1 (assuming the property has one room)
For the additional room data, fill the None values with 0
Convert the data to integer

Room Data
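
A minimal sketch of building the two room columns. It assumes the source column is named 'Rooms' and that the non-numeric entries are labels such as 'Studio'; the slides do not list the actual strings. Note that this version treats every non-numeric main room value as 1, which is a simplification.

```python
import pandas as pd

df = pd.DataFrame({"Rooms": ["2+1", "3", "Studio", "4+2"]})

parts = df["Rooms"].str.split("+", n=1, expand=True)

# Non-numeric main room values (e.g. 'Studio') become NaN, then 1.
df["main_room"] = pd.to_numeric(parts[0], errors="coerce").fillna(1).astype(int)

# A missing additional room part means no additional rooms.
df["additional_room"] = pd.to_numeric(parts[1], errors="coerce").fillna(0).astype(int)
```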

19 of 43

String Manipulation

Remove rows with a size of 0 (logically, no property has zero size)
Remove rows whose size contains ‘sq.m.’ or ‘acres’

Remove the text in parentheses, together with the parentheses, from the property type

Size Data

Property Type Data
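
A minimal sketch of both cleanups, assuming the pre-cleaned size is a string column named 'Size' and the property type column is named 'Property Type'. The sample values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Size": ["1335", "0", "120 sq.m.", "2 acres", "900"],
    "Property Type": [
        "Condominium (Corner)",
        "Bungalow",
        "Apartment",
        "Bungalow (Intermediate)",
        "Serviced Residence",
    ],
})

# Drop rows whose size is given in square metres or acres.
unit_mask = df["Size"].str.contains(r"sq\.m\.|acres", regex=True)
df = df[~unit_mask].copy()

# Convert to integer and drop zero-size rows.
df["Size"] = df["Size"].astype(int)
df = df[df["Size"] > 0].reset_index(drop=True)

# Remove any parenthesised suffix from the property type.
df["Property Type"] = (
    df["Property Type"].str.replace(r"\s*\(.*?\)", "", regex=True).str.strip()
)
```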

20 of 43

Converting Data Types

Several columns, such as Price and Rooms, have already been converted to integer
Convert the Bathrooms and Car Parks data to integer
After this step, all columns have the right data type

21 of 43

Handling Outliers

Add a total rooms column by adding main rooms and additional rooms
Remove columns that are no longer needed, such as Room, Size, Main Room, and Additional Room
Show descriptive statistics with .describe() to spot outliers
Create box plots for visualization
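
The steps above can be sketched as follows. The column names and values are assumptions for illustration, and only the two room columns are dropped here.

```python
import pandas as pd

df = pd.DataFrame({
    "Price": [450000, 990000, 1200000, 25000000],
    "main_room": [2, 3, 4, 6],
    "additional_room": [0, 1, 1, 2],
})

# Total rooms = main rooms + additional rooms.
df["Total Room"] = df["main_room"] + df["additional_room"]
df = df.drop(columns=["main_room", "additional_room"])

# Descriptive statistics and skewness help to spot outliers.
summary = df.describe()
skewness = df.skew()
print(summary)
print(skewness)
```

With matplotlib installed, df.boxplot() would draw the box plots for the same columns.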

22 of 43

Handling Outliers

There are potential outliers in the price data
There is a wide range between the maximum and the minimum values
A standard deviation of 13,175,870 shows that the data is spread far from the mean
The data is highly positively skewed, as shown by the large positive skewness value

Price
Mean	1882735
Median	990000
Mode	1200000
Variance	1.736036e+14
Stdev	13175870
Skewness	125.33
Range	1979999692
Minimum	308
Maximum	1980000000
Count	46180

23 of 43

Handling Outliers

There are potential outliers in the bathroom data
There is a wide range between the maximum and the minimum values
A standard deviation of 1.65 shows that the data is spread around the mean, but not too far from it
The data is positively skewed, as shown by the positive skewness value

Bathroom
Mean	3.107
Median	3
Mode	2
Variance	2.733
Stdev	1.65
Skewness	1.418
Range	19
Minimum	1
Maximum	20
Count	46180

24 of 43

Handling Outliers

There are potential outliers in the car parks data
There is a wide range between the maximum and the minimum values
A standard deviation of 1.14 shows that the data is spread around the mean, but not too far from it
The data is positively skewed, as shown by the positive skewness value

Car Parks
Mean	2.021
Median	2
Mode	2
Variance	1.24
Stdev	1.14
Skewness	5.53
Range	29
Minimum	1
Maximum	30
Count	46180

25 of 43

Handling Outliers

There are potential outliers in the size data
There is a wide range between the maximum and the minimum values
A standard deviation of 52,114 shows that the data is spread far from the mean
The data is highly positively skewed, as shown by the large positive skewness value

Size
Mean	46180
Median	1432
Mode	1650
Variance	2.715885e+09
Stdev	52114
Skewness	203.3877
Range	10999999
Minimum	1
Maximum	11000000
Count	46180

26 of 43

Handling Outliers

There are potential outliers in the total room data
There is a wide range between the maximum and the minimum values
A standard deviation of 1.51 shows that the data is spread around the mean, but not too far from it
The data is moderately positively skewed, as shown by the skewness value of 0.651

Total Room
Mean	3.71
Median	4
Mode	3
Variance	2.265
Stdev	1.51
Skewness	0.651
Range	17
Minimum	1
Maximum	18
Count	46180

27 of 43

Handling Outliers

Price, Size, and Car Parks are highly skewed

28 of 43

Handling Outliers

Remove the outliers using the IQR method for Price and Size; for Car Parks, remove rows with a value above 10
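
A minimal sketch of this filtering. It assumes the conventional fences of 1.5 x IQR below Q1 and above Q3, which the slides do not state, and it applies all three conditions in one pass. The helper name iqr_mask and the data are invented for illustration.

```python
import pandas as pd

def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

df = pd.DataFrame({
    "Price": [400000, 500000, 600000, 700000, 800000, 900000, 1980000000],
    "Size": [800, 900, 1000, 1100, 1200, 1300, 1400],
    "Car Parks": [1, 2, 2, 1, 30, 2, 3],
})

# Keep rows inside the IQR fences for Price and Size, with at most 10 car parks.
keep = iqr_mask(df["Price"]) & iqr_mask(df["Size"]) & (df["Car Parks"] <= 10)
clean = df[keep].reset_index(drop=True)
```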

29 of 43

Handling Outliers

Data after removing outliers

30 of 43

Clustering Analysis

Cluster the properties

31 of 43

Clustering Analysis

Decide the Feature
Convert categorical data to numerical data
Create a dummy table for clustering
Standardize the data
Determine the number of clusters
Clustering Analysis
Conclusion

32 of 43

Decide the Features

Choose 'Price', 'Bathrooms', 'Car Parks', 'Size', 'Total Room', and 'Furnishing Type' as the features for clustering

33 of 43

Convert categorical data to numerical data

Convert 'Furnishing Type' to numerical data with the following codes

0 = Fully Furnished
1 = Partly Furnished
2 = Unfurnished
3 = Unknown
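
The mapping above can be sketched with a dictionary and Series.map. The source column name 'Furnishing' is an assumption.

```python
import pandas as pd

# Codes as listed above.
furnishing_codes = {
    "Fully Furnished": 0,
    "Partly Furnished": 1,
    "Unfurnished": 2,
    "Unknown": 3,
}

df = pd.DataFrame({
    "Furnishing": ["Unfurnished", "Fully Furnished", "Unknown", "Partly Furnished"],
})

df["Furnishing Type"] = df["Furnishing"].map(furnishing_codes)
```

These integer codes impose an order on the categories, which k-means will treat as distances; that is a property of this encoding choice rather than of the data.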

34 of 43

Create a dummy table for clustering

Create a dummy table that contains only the feature columns
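
A minimal sketch of building the feature-only table, using the feature names chosen earlier. The 'Location' column and all values are made up for illustration.

```python
import pandas as pd

feature_cols = [
    "Price", "Bathrooms", "Car Parks", "Size", "Total Room", "Furnishing Type",
]

df = pd.DataFrame({
    "Location": ["KLCC", "Cheras"],
    "Price": [1250000, 480000],
    "Bathrooms": [3, 2],
    "Car Parks": [2, 1],
    "Size": [1335, 900],
    "Total Room": [4, 3],
    "Furnishing Type": [0, 2],
})

# Copy only the feature columns so the original table stays untouched.
features = df[feature_cols].copy()
```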

35 of 43

Standardize the data

The scale of Price is far larger than that of the other features, hence the data needs to be standardized

36 of 43

Standardize the data

Create a copy of the dummy table and use ‘preprocessing’ from sklearn to standardize the data
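
A minimal sketch of the standardization, assuming StandardScaler is the part of sklearn.preprocessing that was used; the slides only name the module. The feature values are invented.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

features = pd.DataFrame({
    "Price": [450000, 990000, 1200000, 3000000],
    "Size": [700, 1100, 1400, 4000],
    "Total Room": [2, 3, 4, 6],
})

# Scale each column to zero mean and unit variance, keeping the labels.
scaler = preprocessing.StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
```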

37 of 43

Determine Number of Clusters

Use the elbow method as one option to determine the number of clusters

Potential number of clusters
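
A minimal sketch of the elbow method: fit k-means for several values of k and compare the inertia (within-cluster sum of squares). The data is synthetic, generated with make_blobs, and the KMeans settings (n_init, random_state) are assumptions, not the settings used in the analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardized feature table.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```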

38 of 43

Determine Number of Clusters

Use visualisation for trial and error (try 2 to 5 clusters)

39 of 43

Determine Number of Clusters

2 and 3 clusters are the candidate numbers of clusters

Clusters are separated nicely

Some clusters overlap with one another

40 of 43

Determine Number of Clusters

Check with the silhouette score

Choose 3 clusters
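
A minimal sketch of the silhouette check: compute the score for each candidate number of clusters and prefer the highest. The data is synthetic, with three well-separated groups placed at centers chosen for this example, so it does not reproduce the scores from the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups (illustrative only).
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```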

41 of 43

Clustering Analysis

0: Expensive, Partly Furnished, big space, more than 4 rooms, 4 bathrooms, 2 car parks (Spacious Big Residence)
1: Medium price, Furnished, small space, 2 rooms, 1-2 bathrooms, 1 car park (Fully Furnished Personal Residence)
2: Low price, Unfurnished, medium space, 2-3 rooms, 2 bathrooms, 1 car park (Affordable Regular Residence)
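
Cluster descriptions like these can be derived by fitting k-means with 3 clusters and averaging each feature per cluster. The sketch below uses invented data with three obvious groups and assumed KMeans settings; the cluster numbers it produces are arbitrary and will not match the numbering above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({
    "Price": [3000000, 2800000, 3200000, 900000, 950000, 1000000,
              400000, 450000, 500000],
    "Size": [4000, 3800, 4200, 700, 750, 800, 1100, 1200, 1300],
    "Total Room": [6, 6, 6, 2, 2, 2, 3, 3, 3],
})

# Cluster on standardized values, but profile on the original units.
scaled = StandardScaler().fit_transform(features)
model = KMeans(n_clusters=3, n_init=10, random_state=0)
features["cluster"] = model.fit_predict(scaled)

# Mean of every feature per cluster gives a readable profile.
profile = features.groupby("cluster").mean()
print(profile)
```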

42 of 43

Conclusion

Cluster 0 (Spacious Big Residence) is suitable for customers who need a big space and many rooms. The likely target consumers are people of higher SES who like to host gatherings with colleagues or family
Cluster 1 (Fully Furnished Personal Residence) is suitable for individuals. An executive or office worker who needs a compact living space is a possible target consumer
Cluster 2 (Affordable Regular Residence) is suitable for households of 3-4 people who are interested in decorating their own living space
