1 of 90

Week 2

Practical Data Transformation

2 of 90

Agenda

  1. Data Reading & Data Cleaning

Break

  • Exploratory Data Analysis

3 of 90

Grading

70% - Final Project

20% - Assignments

10% - Live Quizzes

+10 Points for every invited person.

4 of 90

What is Google Collab?

5 of 90

What data are we gonna work today?

6 of 90

7 of 90

Common Data Science Formats

  • Comma-separated values (csv)
  • XLSX
  • ZIP
  • Plain Text (txt)
  • JSON
  • HTML
  • SQL

  • Images
  • Hierarchical Data Format
  • PDF
  • DOCX
  • MP3
  • MP4

8 of 90

Spreadsheet Format

In spreadsheet format, data is stored in cells. Each cell is organized in rows and columns. A column in the spreadsheet file can have different types.

Street Name

Amount of Rooms

Area

Type

Price

Glacisstrasse

5

220

Luxury

2500000

Inffeldgasse

3

120

Standart

450000

Kasernstrasse

2

45

Econom

100000

Kreuzgasse

6

250

Luxury

3000000

9 of 90

Common File Formats for Spreadsheets

  • CSV
  • XLSX
  • JSON
  • HTML

10 of 90

How to read all this data formats?

11 of 90

How to read all this data formats?

12 of 90

How to read all this data formats?

13 of 90

CSV file format

Each line in CSV file represents an observation or commonly called a record. Each record may contain one or more fields which are separated by a comma.

14 of 90

CSV file format

Sometimes you may come across files where fields are not separated by using a comma but with other delimiter. For example: ‘ ; ’, ‘ \t ’, ‘ # ’ etc.

15 of 90

How to read CSV file?

16 of 90

XLSX file format

XLSX is a Microsoft Excel Open XML file format. It is an XML-based file format created by Microsoft Excel.

17 of 90

How to read XLSX?

18 of 90

JSON file format

JavaScript Object Notation(JSON) is a text-based open standard designed for exchanging the data over web.

19 of 90

How to read JSON?

20 of 90

HTML file format

Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser.

21 of 90

How to read HTML table?

22 of 90

Pandas IO tools

23 of 90

Ex: Load Dataset

24 of 90

df.shape quiz

25 of 90

26 of 90

Linear Regression - House Price/Number Rooms

27 of 90

Linear Regression - House Price/Number Rooms

28 of 90

29 of 90

30 of 90

31 of 90

32 of 90

Major Data Quality Issues

  1. Duplicates
  2. Missing Data
  3. Incorrect Data
  4. Inconsistent Formats
  5. Insecure Data

33 of 90

Duplicates

34 of 90

How to deal with duplicates?

35 of 90

Ex. shape of the array after removing dub

36 of 90

Missing Data

37 of 90

Type of Missing Values

  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing not at Random (MNAR)

38 of 90

Missing Completely at Random (MCAR)

  • Probability for a data point to be missing is completely random.
  • There’s no relationship between whether a data point is missing and any values in the data set, missing or observed.
  • The missing data are just a random subset of the data.

39 of 90

Missing at Random (MAR)

  • Probability for a data point to be missing is not related to the missing data, but it is related to some of the observed data.
  • Example: ‘Whether or not someone answered #13 on your survey has nothing to do with the missing values, but it does have to do with the values of some other variable.’

40 of 90

Missing not at Random (MNAR)

  • There is a relationship between the probability of a value to be missing and its values
  • Example: ‘Survey with regard to drug usage. Individuals being surveyed could potentially leave fields blank if they used drugs that are currently illegal out of fear of being prosecuted.’

41 of 90

How to count Missing Data?

42 of 90

How to remove Missing Data?

43 of 90

SUM of missing values in column “X”

44 of 90

Incorrect Data

45 of 90

Inconsistent Formats

46 of 90

Insecure Data

Data security & privacy laws are being put into place giving business extra financial incentive to follow these newly placed laws.

With steep fines for non-compliance, insecure data is quickly becoming one of the most dangerous types of dirty data.

47 of 90

Insecure Data

48 of 90

How to deal with Insecure Data?

49 of 90

DROP NAME

50 of 90

Most common data types?

dtypes

Example

float

3.14, 5.16, 0.1111

int

23, 111, 45, 69

datetime

2019.01.01 30.03.01

timedelta

23:59

Strings

‘This’, ‘course’, ‘is amazing’

bool

True, False

51 of 90

Exploratory Data Analysis

52 of 90

What is EDA?

Cleaning your data

Summarizing your data

Finding patterns

53 of 90

What is EDA?

Cleaning your data

Summarizing your data

Finding patterns

Telling a story with you data.

54 of 90

Why EDA?

Understand the quality of your data.

Gain some quick insights.

Find potential patterns.

55 of 90

Descriptive Statistics

56 of 90

Descriptive Statistics

The art of describing the basic features of your data.

The basis of all quantitative analysis of data.

57 of 90

Univariate Analysis

Salary

54.000€

90.000€

67.000€

96.000€

-10.000€

67.000€

MAX: 9.000€

MEAN: 79.560€

MIN: 54.000€

Why is there a salary of -10.000€?

58 of 90

Technique #1 - Missing values?

Determine how many values are missing.

Salary

54.000€

?

67.000€

96.000€

?

67.000€

2 Missing values

59 of 90

Sometimes missing values, are not obvious

JOB

Data Scientist

Not Known

Data Scientist

Potato Scientist

Not Known

Not Known

3 Missing values

60 of 90

Technique #2 - Distribution of the data

Describe your data with basic statistical functions

Salary

54.000€

65.000€

67.000€

96.000€

79.000€

67.000€

MAX: 96.000€

MEAN: 79.560€

MEDIAN: 79.560€

MIN: 54.000€

STD: 15.000€

61 of 90

What is the mean of this table?

Salary

Data Scientist

Data Scientist

Data Scientist

Potato Farmer

Potato Farmer

Potato Farmer

???

62 of 90

What is the mean of this table?

Categorical data statistics are different.

Salary

Data Scientist

Data Scientist

Data Scientist

Potato Farmer

Potato Farmer

Potato Farmer

Farmer Scientist?

???

63 of 90

What is the mean of this table?

Categorical data statistics are different.

Salary

Data Scientist

Data Scientist

Data Scientist

Potato Farmer

Potato Farmer

Potato Farmer

Potato Scientist?

???

64 of 90

Categorical data statistics are different.

JOB

Data Scientist

Data Engineer

Data Scientist

Potato Scientist

Data Engineer

Data Engineer

Unique Values:

Data Scientist

Potato Scientist

Data Engineer

65 of 90

Categorical data statistics are different.

JOB

Data Scientist

Data Engineer

Data Scientist

Potato Scientist

Data Engineer

Data Engineer

Number of

Unique Values:

3

66 of 90

Categorical data statistics are different.

JOB

Data Scientist

Data Engineer

Data Scientist

Potato Scientist

Data Engineer

Data Engineer

Value Counts:

Data Scientist: 3

Data Engineer: 3

Potato Scientist: 1

67 of 90

Technique #3 - Histogram Plot

Visualizing the shape tells you more about the data.

Salary

54.000€

56.425€

67.000€

96.000€

69.420€

67.000€

68 of 90

Technique #4 - Count Plot

Visualizing the shape tells you more about the data.

JOB

Data Scientist

Data Engineer

Data Scientist

Potato Scientist

Data Engineer

Data Engineer

69 of 90

Technique #5 - Identifying Extreme Points

Some data points stick out.

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

500.000€

The median salary is 74.000€?

500.000€ is way over the median.

70 of 90

We can use the previous techniques to identify these points as well.

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

500.000€

71 of 90

We can use the previous techniques to identify these points.

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

500.000€

72 of 90

Why is the salary so high?

It is hard to answer questions this question with one variable only.

73 of 90

Bivariate Analysis

74 of 90

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

JOB

Data Scientist

Data Scientist

Data Engineer

Data Engineer

Data Analyst

75 of 90

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

500.000€

JOB

Data Scientist

Data Scientist

Data Engineer

Data Engineer

Data Analyst

CEO

76 of 90

77 of 90

78 of 90

Technique #6 - Compare Statistics

We can group previously analyzed statistics.

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

500.000€

JOB

Data Scientist

Data Scientist

Data Engineer

Data Engineer

Data Analyst

CEO

79 of 90

Technique #6 - Compare Statistics

We can group previously analyzed statistics.

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

500.000€

JOB

Data Scientist

Data Scientist

Data Engineer

Data Engineer

Data Analyst

CEO

Data Scientist

MAX: 74.000€

MEAN: 72.000€

CEO

MAX: 500.000€

MEAN: 500.000€

80 of 90

Technique #7 - Scatterplot

Great way to compare visually 2 continuous variables

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

56.000€

Age

30

26

40

35

33

28

81 of 90

Technique #7 - Scatterplot

Great way to compare visually 2 continuous variables

SALARY

70.000€

74.000€

79.000€

69.000€

65.000€

56.000€

Age

30

26

40

35

33

28

82 of 90

Technique #8 - Linear Correlation

Linear correlation refers to straight-line relationships between two variables.

83 of 90

84 of 90

Ok, so what.

What does this help me?

85 of 90

Ok, so what.

What does this help me?

Correlation measures the extent to which variables:

  • depend on one another
  • predict one another

86 of 90

Rainfall is positively correlated with amount of vegetation

87 of 90

Pollen count is positively correlated with bee activity.

88 of 90

Pollen count is positively correlated with bee activity.

Is bee activity the cause of the pollen count?

89 of 90

90 of 90