5
Cleaning and Manipulating Data with pandas
Working with Pandas
Data Cleaning
this jupyter notebook will show how to:
Data Cleaning 101
most simple guidelines on how to clean data
The approach is dependent on the data you have, and your end goal.
Despite this difficulty, this is the most simple guidelines on how to clean data.
Does What You’re Looking At Make Sense?
Does What You’re Looking At Make Sense?
Ask these questions:
Correct Values If You Have Enough Information
Use The Tool That Makes Sense
Thinking about the tool you need to achieve your goal can save you time from doing the process over again. There are pros and cons to each tool, and below are my rules of thumb:
When you might use Excel:
When you might use Python or another scripting language:
This also depends on what tool you’re more comfortable with. When I was starting to learn to programme, I leaned on Excel pretty heavily for when I needed to clean data. As I started working with larger datasets and had to routinize my processes — I shifted towards Python.
Communicate With The Source
Don’t be afraid of picking up a phone or shooting an email to the source of the data.
This is especially true if you are the client of the data source because you are entitled to clear information.
Having that communication can save a lot of heartburn in the long run.
See this page for more info: https://towardsdatascience.com/data-cleaning-101-948d22a92e4
Practice it with the Task#2
Now let’s do the Task#2 Lemonade Sale with pandas
https://github.com/asukul/mbds21/blob/main/MBDS21_Task2_LemonadeSale_with_Pandas.ipynb
Remove duplicate
df.drop_duplicates()
Syntax: DataFrame.drop_duplicates(subset=None, keep=’first’, inplace=False)
Parameters:
subset: Subset takes a column or list of column label. It’s default value is none. After passing columns, it will consider them only for duplicates.
keep: keep is to control how to consider duplicate value. It has only three distinct value and default is ‘first’.
inplace: Boolean values, removes rows with duplicates if True.
Return type: DataFrame with removed duplicate rows depending on Arguments passed.
Reset index
How to reset the index after drop_duplicates()
use :
df = df.reset_index(drop=True)
Fix the missing value, the manual way
Manually fix the missing value
Using .loc
Find missing value
#or you can identify the NaN by this way
df['Leaflets'].notnull()
Find and fill missing value at once
Using .fillna()
df['Leaflets'].fillna(9999.99)
Calculate the value to fill, and put it together
Add columns
Don’t use .ix
How to selecting specific columns
.loc and .iloc are the best way to slicing your dataframe
How to select rows (like a SQL query)
#if I want only Park location
df.loc[df['Location']=='Park',:]