Week 2
Practical Data Transformation
Agenda
Break
Grading
70% - Final Project
20% - Assignments
10% - Live Quizzes
+10 Points for every invited person.
What is Google Collab?
What data are we gonna work today?
Common Data Science Formats
Spreadsheet Format
In spreadsheet format, data is stored in cells. Each cell is organized in rows and columns. A column in the spreadsheet file can have different types.
Street Name | Amount of Rooms | Area | Type | Price |
Glacisstrasse | 5 | 220 | Luxury | 2500000 |
Inffeldgasse | 3 | 120 | Standart | 450000 |
Kasernstrasse | 2 | 45 | Econom | 100000 |
Kreuzgasse | 6 | 250 | Luxury | 3000000 |
Common File Formats for Spreadsheets
How to read all this data formats?
How to read all this data formats?
How to read all this data formats?
CSV file format
Each line in CSV file represents an observation or commonly called a record. Each record may contain one or more fields which are separated by a comma.
CSV file format
Sometimes you may come across files where fields are not separated by using a comma but with other delimiter. For example: ‘ ; ’, ‘ \t ’, ‘ # ’ etc.
How to read CSV file?
XLSX file format
XLSX is a Microsoft Excel Open XML file format. It is an XML-based file format created by Microsoft Excel.
How to read XLSX?
JSON file format
JavaScript Object Notation(JSON) is a text-based open standard designed for exchanging the data over web.
How to read JSON?
HTML file format
Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser.
How to read HTML table?
Pandas IO tools
Ex: Load Dataset
df.shape quiz
Linear Regression - House Price/Number Rooms
Linear Regression - House Price/Number Rooms
Major Data Quality Issues
Duplicates
How to deal with duplicates?
Ex. shape of the array after removing dub
Missing Data
Type of Missing Values
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing not at Random (MNAR)
How to count Missing Data?
How to remove Missing Data?
SUM of missing values in column “X”
Incorrect Data
Inconsistent Formats
Insecure Data
Data security & privacy laws are being put into place giving business extra financial incentive to follow these newly placed laws.
With steep fines for non-compliance, insecure data is quickly becoming one of the most dangerous types of dirty data.
Insecure Data
How to deal with Insecure Data?
DROP NAME
Most common data types?
dtypes | Example |
float | 3.14, 5.16, 0.1111 |
int | 23, 111, 45, 69 |
datetime | 2019.01.01 30.03.01 |
timedelta | 23:59 |
Strings | ‘This’, ‘course’, ‘is amazing’ |
bool | True, False |
Exploratory Data Analysis
What is EDA?
Cleaning your data
Summarizing your data
Finding patterns
What is EDA?
Cleaning your data
Summarizing your data
Finding patterns
Telling a story with you data.
Why EDA?
Understand the quality of your data.
Gain some quick insights.
Find potential patterns.
Descriptive Statistics
Descriptive Statistics
The art of describing the basic features of your data.
The basis of all quantitative analysis of data.
Univariate Analysis
Salary |
54.000€ |
90.000€ |
67.000€ |
96.000€ |
-10.000€ |
67.000€ |
MAX: 9.000€
MEAN: 79.560€
MIN: 54.000€
Why is there a salary of -10.000€?
Technique #1 - Missing values?
Determine how many values are missing.
Salary |
54.000€ |
? |
67.000€ |
96.000€ |
? |
67.000€ |
2 Missing values
Sometimes missing values, are not obvious
JOB |
Data Scientist |
Not Known |
Data Scientist |
Potato Scientist |
Not Known |
Not Known |
3 Missing values
Technique #2 - Distribution of the data
Describe your data with basic statistical functions
Salary |
54.000€ |
65.000€ |
67.000€ |
96.000€ |
79.000€ |
67.000€ |
MAX: 96.000€
MEAN: 79.560€
MEDIAN: 79.560€
MIN: 54.000€
STD: 15.000€
What is the mean of this table?
Salary |
Data Scientist |
Data Scientist |
Data Scientist |
Potato Farmer |
Potato Farmer |
Potato Farmer |
???
What is the mean of this table?
Categorical data statistics are different.
Salary |
Data Scientist |
Data Scientist |
Data Scientist |
Potato Farmer |
Potato Farmer |
Potato Farmer |
Farmer Scientist?
???
What is the mean of this table?
Categorical data statistics are different.
Salary |
Data Scientist |
Data Scientist |
Data Scientist |
Potato Farmer |
Potato Farmer |
Potato Farmer |
Potato Scientist?
???
Categorical data statistics are different.
JOB |
Data Scientist |
Data Engineer |
Data Scientist |
Potato Scientist |
Data Engineer |
Data Engineer |
Unique Values:
Data Scientist
Potato Scientist
Data Engineer
Categorical data statistics are different.
JOB |
Data Scientist |
Data Engineer |
Data Scientist |
Potato Scientist |
Data Engineer |
Data Engineer |
Number of
Unique Values:
3
Categorical data statistics are different.
JOB |
Data Scientist |
Data Engineer |
Data Scientist |
Potato Scientist |
Data Engineer |
Data Engineer |
Value Counts:
Data Scientist: 3
Data Engineer: 3
Potato Scientist: 1
Technique #3 - Histogram Plot
Visualizing the shape tells you more about the data.
Salary |
54.000€ |
56.425€ |
67.000€ |
96.000€ |
69.420€ |
67.000€ |
Technique #4 - Count Plot
Visualizing the shape tells you more about the data.
JOB |
Data Scientist |
Data Engineer |
Data Scientist |
Potato Scientist |
Data Engineer |
Data Engineer |
Technique #5 - Identifying Extreme Points
Some data points stick out.
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
500.000€ |
The median salary is 74.000€?
500.000€ is way over the median.
We can use the previous techniques to identify these points as well.
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
500.000€ |
We can use the previous techniques to identify these points.
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
500.000€ |
Why is the salary so high?
It is hard to answer questions this question with one variable only.
Bivariate Analysis
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
|
JOB |
Data Scientist |
Data Scientist |
Data Engineer |
Data Engineer |
Data Analyst |
|
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
500.000€ |
JOB |
Data Scientist |
Data Scientist |
Data Engineer |
Data Engineer |
Data Analyst |
CEO |
Technique #6 - Compare Statistics
We can group previously analyzed statistics.
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
500.000€ |
JOB |
Data Scientist |
Data Scientist |
Data Engineer |
Data Engineer |
Data Analyst |
CEO |
Technique #6 - Compare Statistics
We can group previously analyzed statistics.
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
500.000€ |
JOB |
Data Scientist |
Data Scientist |
Data Engineer |
Data Engineer |
Data Analyst |
CEO |
Data Scientist
MAX: 74.000€
MEAN: 72.000€
CEO
MAX: 500.000€
MEAN: 500.000€
Technique #7 - Scatterplot
Great way to compare visually 2 continuous variables
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
56.000€ |
Age |
30 |
26 |
40 |
35 |
33 |
28 |
Technique #7 - Scatterplot
Great way to compare visually 2 continuous variables
SALARY |
70.000€ |
74.000€ |
79.000€ |
69.000€ |
65.000€ |
56.000€ |
Age |
30 |
26 |
40 |
35 |
33 |
28 |
Technique #8 - Linear Correlation
Linear correlation refers to straight-line relationships between two variables.
Ok, so what.
What does this help me?
Ok, so what.
What does this help me?
Correlation measures the extent to which variables:
Rainfall is positively correlated with amount of vegetation
Pollen count is positively correlated with bee activity.
Pollen count is positively correlated with bee activity.
Is bee activity the cause of the pollen count?