What is Data Science?
STEM Fellowship
The Fourth Paradigm
Data
Big Data
Preprocessing for Data
Before conducting any analysis on data, you need to:
Clean Data
When compiling a dataset, incomplete or corrupted records can slip in by accident; these need to be found and corrected or removed.
Filter Data
We can create subgroups within a dataset to break up the data and better understand its complexity.
Classify Data
In particularly large datasets, classifying data that meets specific parameters and criteria can help with processing.
Remove Bias
The data collected may not be representative of the total population it is intended to represent.
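The preprocessing steps on this slide (clean, filter, classify, check for bias) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the records, field names, the 3.0 threshold, and the helper functions `clean`, `filter_by_region`, `classify`, and `region_counts` are all hypothetical and do not come from the presentation.

```python
from collections import Counter

# Hypothetical records; None marks a missing value, -1.0 a corrupted one.
records = [
    {"country": "A", "region": "Europe", "children_per_woman": 1.8},
    {"country": "B", "region": "Africa", "children_per_woman": 5.6},
    {"country": "C", "region": "Africa", "children_per_woman": None},
    {"country": "D", "region": "Asia", "children_per_woman": 2.4},
    {"country": "E", "region": "Asia", "children_per_woman": -1.0},
]


def clean(rows):
    """Clean: drop rows whose value is missing or impossible (negative)."""
    return [
        r for r in rows
        if r["children_per_woman"] is not None and r["children_per_woman"] >= 0
    ]


def filter_by_region(rows, region):
    """Filter: keep only the subgroup belonging to one region."""
    return [r for r in rows if r["region"] == region]


def classify(rows, threshold=3.0):
    """Classify: label each row 'high' or 'low' against an assumed threshold."""
    return [
        {**r, "fertility_class": "high" if r["children_per_woman"] >= threshold else "low"}
        for r in rows
    ]


def region_counts(rows):
    """Bias check: count rows per region to spot under-represented groups."""
    return dict(Counter(r["region"] for r in rows))


cleaned = clean(records)
africa = filter_by_region(cleaned, "Africa")
labelled = classify(cleaned)
counts = region_counts(cleaned)
```

Counting rows per group does not remove bias by itself; it only shows whether some groups are thinly represented, which is a reasonable first check before deciding how to resample or reweight.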
Datasets
Data Science
Machine Learning: A Quick Overview
Data Science Process
“I define data scientist as someone who finds solutions to problems by analyzing big or small data using appropriate tools and then tells stories to communicate her findings to relevant stakeholders.”
Murtaza Haider, Professor of Data Science at Ryerson University
Data Science Applications
Medical Field
Sports
Banking and Finance
Social Media
Marketing
Data Visualization
Bar Graph / Histogram
Very useful for quantitative data, to show how values are distributed across data points.
The above bar graph compares the number of children per woman, and the corresponding percentage figures, between 1960 and 2010.
This simple graph shows us that the Total Fertility Rate (TFR) worldwide has decreased since 1960.
Line Plots
This visualization is mostly used to show change over time.
The above line plot shows the change in life expectancy, in years, from 1970 to 2010.
The plot shows an upward trend, meaning that life expectancy has increased over this period.
Scatterplot
This visualization is used to show how the values of 2 different variables relate to each other.
The above scatterplot shows the correlation between children per woman and the child mortality rate.
This scatterplot shows a positive correlation between the total fertility rate (TFR) and the child mortality rate (CMR): where one is higher, the other tends to be higher too.
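A positive correlation like the one read off the scatterplot can also be put into a number. The sketch below computes the Pearson correlation coefficient by hand; a value close to +1 indicates a strong positive linear relationship. The two lists are made-up illustrative values, not the data behind the slide, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five countries (illustration only).
children_per_woman = [1.5, 2.5, 4.0, 5.5, 6.5]
child_mortality = [10, 25, 80, 150, 210]

r = pearson_r(children_per_woman, child_mortality)
print(f"Pearson r = {r:.2f}")
```

A high coefficient describes how closely the points follow a line; it does not by itself show that one variable causes the other.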
Dot Diagram
This visualization is used to represent the magnitude of a value through the size of each dot.
The above dot diagram shows the total fertility rate (TFR) and child mortality rate across the various regions of the world in 1960.
One takeaway from this diagram is that most countries had a TFR above 5 in 1960.
Analyzing Data: Tools
Interpreting and Presenting Findings
Any Questions?