Data Cleaning in SQL

Obtaining Nashville Housing Data from:

https://www.kaggle.com/datasets/tmthyjames/nashville-housing-data

Loading the data into SQL Server and observing what we are starting with:
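A minimal sketch of that first look, assuming the data was imported into a table named NashvilleHousing (the table name used for all sketches below):

    -- Peek at the first rows to see the columns and their contents
    SELECT TOP (100) *
    FROM NashvilleHousing;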

From an initial observation, we can see several issues worth addressing.

Next, we want to standardise the date format so that we only have yyyy-MM-dd, without the timestamp.

Standardising Dates
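A sketch of one way to do this in SQL Server, assuming the date column is named SaleDate:

    -- Preview the conversion next to the original value
    SELECT SaleDate, CONVERT(Date, SaleDate) AS SaleDateConverted
    FROM NashvilleHousing;

    -- Converting in place does not always take, so a safe route is a new column
    ALTER TABLE NashvilleHousing
    ADD SaleDateConverted Date;

    UPDATE NashvilleHousing
    SET SaleDateConverted = CONVERT(Date, SaleDate);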

Dealing with NULLs

Within this data there are NULLs in the PropertyAddress column. We know that a property's address should, with very high probability, never change, barring the unusual case of land subdivision. Additionally, the data contains duplicate entries, but each carries a unique ID.

Example 1

Example 2

From here, we can begin to eliminate some NULL values by populating PropertyAddress from rows that share a ParcelID. Pictured below, we perform a self join of the data on ParcelID while ensuring that the UniqueIDs do not match. The output we’re interested in is the pairs of matching ParcelIDs where one PropertyAddress is populated and the other is NULL.
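A sketch of that self join (column names are assumptions):

    -- Pair rows that share a ParcelID but are different records,
    -- and propose the populated address for the NULL one
    SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress,
           ISNULL(a.PropertyAddress, b.PropertyAddress) AS ProposedAddress
    FROM NashvilleHousing a
    JOIN NashvilleHousing b
        ON a.ParcelID = b.ParcelID
        AND a.UniqueID <> b.UniqueID
    WHERE a.PropertyAddress IS NULL;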

To update the data by copying the address from one record to the other:
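A sketch of that update, reusing the same self join:

    UPDATE a
    SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
    FROM NashvilleHousing a
    JOIN NashvilleHousing b
        ON a.ParcelID = b.ParcelID
        AND a.UniqueID <> b.UniqueID
    WHERE a.PropertyAddress IS NULL;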

Running the earlier query again yields no results, meaning all NULL PropertyAddress values have been populated.

Separating Address Information

Inspecting this output, the only delimiter involved is the comma between the street address and the city.

Because of this, we know we can split the field on this delimiter. This can be achieved using the functions SUBSTRING and CHARINDEX. Splitting the city into another column requires the same functions; however, instead of taking everything before the delimiter, we take everything after it, through to the end of the field.
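A sketch of the split, assuming every row contains the comma delimiter:

    SELECT
        -- Everything before the comma: the street address
        SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1) AS SplitAddress,
        -- Everything after the comma: the city
        SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress)) AS SplitCity
    FROM NashvilleHousing;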

Now that these two values have been successfully derived, we can update the dataset to include them as columns (or attributes/fields). Having altered the table and created the new columns, we can output all data and observe that the two new columns have been appended to the end of the dataset.
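A sketch of the alter-and-update step (the new column names PropertySplitAddress and PropertySplitCity are assumptions):

    ALTER TABLE NashvilleHousing
    ADD PropertySplitAddress nvarchar(255),
        PropertySplitCity nvarchar(255);

    UPDATE NashvilleHousing
    SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1),
        PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress));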

Separating Owner Address Information

Turning our attention to the OwnerAddress field, we can see that it carries much the same issues as the PropertyAddress column.

Instead of using the same SUBSTRING approach, we can implement PARSENAME. PARSENAME splits on periods and counts the parts from right to left, so we first replace the commas with periods, then extract parts 3, 2, and 1 in descending order to obtain the street, city, and state.
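A sketch of the three extractions:

    SELECT
        PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3) AS OwnerSplitAddress, -- street
        PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2) AS OwnerSplitCity,    -- city
        PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1) AS OwnerSplitState    -- state
    FROM NashvilleHousing;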

We can now update the database with these columns.
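A sketch, with the new column names again being assumptions:

    ALTER TABLE NashvilleHousing
    ADD OwnerSplitAddress nvarchar(255),
        OwnerSplitCity nvarchar(255),
        OwnerSplitState nvarchar(255);

    UPDATE NashvilleHousing
    SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
        OwnerSplitCity    = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
        OwnerSplitState   = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1);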

Sold as Vacant Option Reduction

The SoldAsVacant column currently contains redundant values: two affirmative and two negative options. We can reduce these to just ‘Yes’ and ‘No’. This direction is chosen because a count yields Y = 52, N = 399, Yes = 4623, and No = 51403, so it is far quicker to convert the small number of single-character values to their spelled-out forms than the reverse. Afterwards, this change can be applied to the dataset.
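A sketch of the conversion using a CASE expression:

    -- Preview the mapping before applying it
    SELECT SoldAsVacant,
           CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
                WHEN SoldAsVacant = 'N' THEN 'No'
                ELSE SoldAsVacant
           END AS SoldAsVacantClean
    FROM NashvilleHousing;

    -- Apply it
    UPDATE NashvilleHousing
    SET SoldAsVacant = CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
                            WHEN SoldAsVacant = 'N' THEN 'No'
                            ELSE SoldAsVacant
                       END;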

Removing Duplicates

Partitioning the data will allow us to identify the duplicate rows. This can be achieved with the ROW_NUMBER function. The partition should be based on fields that, taken together, should be unique to each row, similar to a composite key. The fields that should not contain duplicates are:

This output yields a ‘2’ in the row_num column wherever all information apart from UniqueID is the same. Similar to a subquery, a common table expression (CTE) can be constructed to isolate the rows to be deleted from the dataset. In the following query, the duplicate entries (row_num = 2) will be deleted.
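A sketch of the CTE and delete; the columns in the PARTITION BY clause are an assumption about which fields together define a duplicate:

    WITH RowNumCTE AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
                   ORDER BY UniqueID
               ) AS row_num
        FROM NashvilleHousing
    )
    -- Keep the first record of each group, delete the repeats
    DELETE FROM RowNumCTE
    WHERE row_num > 1;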

Deleting Unused Columns

It is important to note that this step was not, nor would it be, performed on a live database or with raw data. After observing what data is necessary for this analysis, the following columns have been chosen to be removed:
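As a sketch only; the columns listed here are hypothetical stand-ins, though the raw address and date fields that now have cleaned replacements are the natural candidates:

    -- Hypothetical column list for illustration
    ALTER TABLE NashvilleHousing
    DROP COLUMN OwnerAddress, PropertyAddress, SaleDate, TaxDistrict;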

After this update, the dataset’s columns were reorganised to make the commonly used lookup values more accessible.

Should a real estate agent now want to use this information for KPI tracking based on lost profit, location, and land type for 2016, they can do so with the following query.
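A sketch of what such a query could look like. The column names and the definition of lost profit (total assessed value minus sale price, for sales below value) are assumptions:

    SELECT PropertySplitCity AS Location,
           LandUse,
           SUM(TotalValue - SalePrice) AS LostProfit
    FROM NashvilleHousing
    WHERE YEAR(SaleDateConverted) = 2016
      AND TotalValue > SalePrice   -- only sales that went below assessed value
    GROUP BY PropertySplitCity, LandUse
    ORDER BY LostProfit DESC;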