1 of 20

What is Data Science?

STEM Fellowship

2 of 20

The Fourth Paradigm

  1. Empirical Science
  2. Theoretical Science
  3. Computational Science
    • the ability to “query” the world
  4. Data Science
    • the ability to “download” the world

3 of 20

Data

  • Data is simply information
    • It can be quantitative, qualitative, categorical, numerical and more
  • Metadata: data about data
    • Can answer how, where, when, etc. the data was collected

4 of 20

Big Data

  • Companies like Facebook, Instagram, Google, and YouTube generate massive amounts of data, known as big data
  • Larger datasets can give you more confidence in an analysis, as long as the data is representative
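
One way to see why more data can raise confidence is the standard error of a sample mean, which shrinks as the sample grows. The sketch below assumes independent observations drawn at random; the spread of 10 units and the sample sizes are hypothetical values chosen only for illustration.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of a sample mean, assuming n independent observations."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return std_dev / math.sqrt(n)


# Hypothetical spread (standard deviation) of 10 units, with growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: standard error = {standard_error(10.0, n)}")
```

Note that a larger sample narrows the uncertainty around an estimate but does not correct a sample that is unrepresentative to begin with.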

5 of 20

Preprocessing for Data

Before conducting any analysis on data, you need to:

Clean Data

When compiling a dataset, some records may accidentally end up incomplete or corrupted; these need to be corrected or removed.

Filter Data

In datasets, we can create subgroups that break up the data and help us understand the complexity of the dataset.

Classify Data

In particularly large datasets, classifying data that meets specific parameters and criteria can help with processing.

Remove Bias

The data collected may not be representative of the total population it is intended to represent.
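
The four preprocessing steps can be sketched in plain Python. This is a minimal illustration, not a prescribed workflow: the records, field names, the 0 to 120 age range, and the helper functions `is_clean` and `age_group` are all hypothetical.

```python
from collections import Counter

# Hypothetical survey records; None marks a missing value.
records = [
    {"name": "A", "age": 34, "region": "North"},
    {"name": "B", "age": None, "region": "South"},  # incomplete
    {"name": "C", "age": -5, "region": "North"},    # corrupted (impossible age)
    {"name": "D", "age": 17, "region": "South"},
    {"name": "E", "age": 52, "region": "North"},
]


def is_clean(row):
    """Keep rows whose age is present and within an assumed plausible range."""
    return row["age"] is not None and 0 <= row["age"] <= 120


def age_group(age):
    """Classify an age into a coarse category (assumed cutoff: 18)."""
    return "minor" if age < 18 else "adult"


# Clean: drop incomplete or corrupted rows.
cleaned = [r for r in records if is_clean(r)]

# Filter: build a subgroup for one region.
north = [r for r in cleaned if r["region"] == "North"]

# Classify: attach a category to every remaining row.
classified = [{**r, "group": age_group(r["age"])} for r in cleaned]

# Bias check: compare how many rows each region contributes.
region_counts = Counter(r["region"] for r in cleaned)

print(cleaned, north, classified, region_counts, sep="\n")
```

The region counts only reveal an imbalance; deciding whether it reflects real bias requires knowing the makeup of the population the data is meant to represent.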

6 of 20

Datasets

  • Data can be found everywhere
  • Compiled datasets can be found in several locations, such as:
    • Kaggle.com
    • Data.gov
    • toolbox.google.com/datasetsearch

7 of 20

Data Science

  • There are 3 main components of data science:
    • Data analytics
    • Machine learning
    • Traditional software

8 of 20

Machine Learning: A Quick Overview

9 of 20

Data Science Process

10 of 20

Data Science Process

11 of 20

I define data scientist as someone who finds solutions to problems by analyzing big or small data using appropriate tools and then tells stories to communicate her findings to relevant stakeholders.

Murtaza Haider, Professor of Data Science at Ryerson University

12 of 20

Data Science Applications

Medical Field

Sports

Banking and Finance

Social Media

Marketing

13 of 20

Data Visualization


14 of 20

Bar Graph / Histogram

Very useful for showing how quantitative data is distributed across categories or ranges of values.

The above bar graph compares the number of children per woman, along with the corresponding percentages, between 1960 and 2010.

This simple graph shows us that the Total Fertility Rate (TFR) worldwide has decreased since 1960.
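
The counting behind a histogram can be sketched without a plotting library. The values below are hypothetical children-per-woman figures, not the data from the slide, and `histogram` is an illustrative helper rather than a library function.

```python
import math
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical children-per-woman values for a handful of countries.
values = [1.5, 2.1, 2.4, 3.8, 5.2, 5.9, 6.3, 2.0]
bins = histogram(values, 1)

# Print a simple text version of the chart, one row per non-empty bin.
for start, count in bins.items():
    print(f"{start}-{start + 1}: {'#' * count}")
```

A plotting tool draws the same counts as bar heights; choosing a different bin width changes how coarse or detailed the distribution appears.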

15 of 20

Line Plots

This visualization is mostly used to show change over time.

The above line plot shows the change in life expectancy in years from 1970 to 2010.

The plot shows the upward trend in life expectancy, meaning that life expectancy has increased.

16 of 20

Scatterplot

This visualization plots each data point against two different variables to show how the variables relate.

The above scatterplot shows the relationship between children per woman and the child mortality rate (CMR).

This scatterplot shows a positive correlation between the total fertility rate and the CMR.
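
A correlation like this can be quantified with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below uses made-up fertility and child mortality values purely for illustration; they are not the data behind the scatterplot, and `pearson_r` is a hand-written helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: children per woman and child deaths per 1,000 births.
tfr = [1.8, 2.5, 3.9, 5.4, 6.7]
cmr = [8, 20, 65, 140, 210]

r = pearson_r(tfr, cmr)
print(f"Pearson r = {r:.3f}")  # a value near 1 indicates a strong positive correlation
```

A high coefficient indicates that the two variables rise together; it does not by itself show that one causes the other.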

17 of 20

Dot Diagram

This visualization is used to represent the incidence of a value via the size of the dot.

The above dot diagram shows the total fertility rate and child mortality rate across the regions of the world in 1960.

One takeaway from this diagram is that most countries had a TFR above 5 in 1960.

18 of 20

Analyzing Data: Tools

  • Python
  • RStudio
  • SQL
  • Julia
  • CODAP
    • https://codap.concord.org/
  • MATLAB
    • https://matlab.mathworks.com/
  • Jupyter Notebook
    • https://jupyter.org/

19 of 20

Interpreting and Presenting Findings

20 of 20

Any Questions?