1 of 20

5

Cleaning and Manipulating Data with pandas

2 of 20

Working with Pandas

3 of 20

Data Cleaning

Working-with-Pandas.ipynb

this jupyter notebook will show how to:

  • Read data to DataFrame with custom column name,
    • pd.read_table('data_file', names=user_cols)
  • Data Cleaning

4 of 20

Data Cleaning 101

5 of 20

most simple guidelines on how to clean data

The approach is dependent on the data you have, and your end goal.

Despite this difficulty, this is the most simple guidelines on how to clean data.

Does What You’re Looking At Make Sense?

6 of 20

Does What You’re Looking At Make Sense?

Ask these questions:

  • Does what I’m looking at making sense?
  • Does the data match the column label?
  • Does the data abide by the appropriate rules for its field?
  • Compute summary statistics for the numerical data. Do they make sense?
  • Look at how many values are nulls? Is the number of nulls acceptable? Is there a pattern as to where there are null values?
  • Are there duplicates and is that okay?

7 of 20

Correct Values If You Have Enough Information

8 of 20

Use The Tool That Makes Sense

Thinking about the tool you need to achieve your goal can save you time from doing the process over again. There are pros and cons to each tool, and below are my rules of thumb:

When you might use Excel:

  • You have fewer than 1 million records
  • You need to do the job quick and easy
  • There is a logical pattern for cleaning the data and it’s easy enough to clean using Excel functions
  • The logical pattern for cleaning the data is hard to define, and you need to clean the data manually

When you might use Python or another scripting language:

  • You need to document your process
  • You plan on doing the job on a repeat basis
  • There is a logical pattern for cleaning the data, but it is hard to implement with Excel functions

This also depends on what tool you’re more comfortable with. When I was starting to learn to programme, I leaned on Excel pretty heavily for when I needed to clean data. As I started working with larger datasets and had to routinize my processes — I shifted towards Python.

9 of 20

Communicate With The Source

Don’t be afraid of picking up a phone or shooting an email to the source of the data.

This is especially true if you are the client of the data source because you are entitled to clear information.

Having that communication can save a lot of heartburn in the long run.

10 of 20

Practice it with the Task#2

Now let’s do the Task#2 Lemonade Sale with pandas

https://github.com/asukul/mbds21/blob/main/MBDS21_Task2_LemonadeSale_with_Pandas.ipynb

11 of 20

Remove duplicate

df.drop_duplicates()

Syntax: DataFrame.drop_duplicates(subset=None, keep=’first’, inplace=False)

Parameters:

subset: Subset takes a column or list of column label. It’s default value is none. After passing columns, it will consider them only for duplicates.

keep: keep is to control how to consider duplicate value. It has only three distinct value and default is ‘first’.

  • If ‘first’, it considers first value as unique and rest of the same values as duplicate.
  • If ‘last’, it considers last value as unique and rest of the same values as duplicate.
  • If False, it consider all of the same values as duplicates

inplace: Boolean values, removes rows with duplicates if True.

Return type: DataFrame with removed duplicate rows depending on Arguments passed.

12 of 20

Reset index

How to reset the index after drop_duplicates()

use :

df = df.reset_index(drop=True)

13 of 20

Fix the missing value, the manual way

Manually fix the missing value

Using .loc

14 of 20

Find missing value

#or you can identify the NaN by this way

df['Leaflets'].notnull()

15 of 20

Find and fill missing value at once

Using .fillna()

df['Leaflets'].fillna(9999.99)

16 of 20

Calculate the value to fill, and put it together

17 of 20

Add columns

18 of 20

Don’t use .ix

19 of 20

How to selecting specific columns

.loc and .iloc are the best way to slicing your dataframe

20 of 20

How to select rows (like a SQL query)

#if I want only Park location

df.loc[df['Location']=='Park',:]