1 of 69

Please go to the following website to vote!

2 of 69

3 of 69

Pick your favorite buzzword:

Data Science, Big Data, Machine Learning

Sahir Bhatnagar

@syfi_24

sahirbhatnagar.com

EBOH Research Day Student Keynote Presentation

March 16, 2018

http://etc.ch/Zf8v

4 of 69

5 of 69

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

6 of 69

Best Job in America Rankings - #1

https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm

2016, 2017, 2018

7 of 69

https://www.simplilearn.com/big-data-and-analytics/senior-data-scientist-masters-program-training

8 of 69

https://www.simplilearn.com/big-data-and-analytics/senior-data-scientist-masters-program-training

9 of 69

http://midas.umich.edu/dsi/

10 of 69

What is a Data Scientist?

11 of 69

http://www.datascienceassn.org/code-of-conduct.html

Data Scientist (n.)

A professional who uses scientific methods to liberate and create meaning from raw data.

12 of 69

13 of 69

http://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html

14 of 69

15 of 69

http://joelgrus.com/2013/06/09/post-prism-data-science-venn-diagram/

16 of 69

And it just keeps getting more ridiculous

  • “I think data-scientist is a sexed up term for a statistician” (Nate Silver)
  • "Applied statistics, but in San Francisco."
  • "The field of people who decide to print 'Data Scientist' on their business cards and get a salary bump."

17 of 69

18 of 69

Two different searches

  • Data Science, Data Scientist
  • (Bio)Statistics, (Bio)Statistician

Natural Language Processing

19 of 69

“Control” Dataset: Epidemiology, Epidemiologist

20 of 69

Data Science, Data Scientist

(Bio)Statistics, (Bio)Statistician

21 of 69

Marie Davidian - North Carolina State University (2013)

22 of 69

Bin Yu - University of California at Berkeley (2014)

23 of 69

Data science is statistics

Karl Broman - University of Wisconsin–Madison (2013)

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

24 of 69

How did we get here?

25 of 69

26 of 69

Donoho’s 6 Divisions of Data Science

1. Data Gathering, Preparation, and Exploration
2. Data Representation and Transformation
3. Computing with Data
4. Data Modeling (Breiman’s 2 Cultures)
5. Data Visualization and Presentation
6. Science about Data Science

Donoho D. 50 Years of Data Science. Journal of Computational and Graphical Statistics. 2017 Oct 2;26(4):745-66.

27 of 69

The Focus of Traditional Statistics has been on:

1. Data Gathering, Preparation, and Exploration
2. Data Representation and Transformation
3. Computing with Data
4. Data Modeling
5. Data Visualization and Presentation
6. Science about Data Science

Donoho D. 50 Years of Data Science. Journal of Computational and Graphical Statistics. 2017 Oct 2;26(4):745-66.

28 of 69

Hadley Wickham - RStudio (2015)

The fact that data science exists as a field is a colossal failure of statistics.

Data munging and manipulation is hard and statistics has just said that’s not our domain.

To me, that is what statistics is all about. It is gaining insight from data using modelling and visualization.

29 of 69

  1. Not Enough Emphasis on Programming and Sharing Code

30 of 69

How Many Papers Provide Code?

  • Scraped Pubmed abstracts from:
    • Biometrika, JASA, Ann Stat, Ann Appl Stat, JRSSB, Biostatistics, Biometrics, Stat Med, Stat. Methods Med. Res
  • Searched for exact matches using regex with:
    • package(s), software, script(s), code, R software, R package, R code, Comprehensive R Archive Network, CRAN, Bioconductor, GitHub, Bitbucket, Python, Julia, Matlab, Matlab toolkit, SAS, SAS macro
  • Code: https://github.com/sahirbhatnagar/talks
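
The matching step above can be sketched in a few lines. This is a minimal illustration, not the code from the linked repository: the function names `find_code_terms` and `share_with_match` are hypothetical, the two toy abstracts are invented, and case-sensitive whole-word matching is an assumption about what "exact matches" means.

```python
import re

# Search terms from the slide; "packages?" and "scripts?" cover the optional plural.
TERMS = [
    r"packages?", r"software", r"scripts?", r"code",
    r"R software", r"R package", r"R code",
    r"Comprehensive R Archive Network", r"CRAN", r"Bioconductor",
    r"GitHub", r"Bitbucket", r"Python", r"Julia",
    r"Matlab", r"Matlab toolkit", r"SAS", r"SAS macro",
]

# Word boundaries keep "code" from matching inside "encoded".
# Longer terms are tried first so "SAS macro" is preferred over "SAS".
# Matching is case-sensitive here (an assumption, not confirmed by the slide).
PATTERN = re.compile(
    r"\b(?:" + "|".join(sorted(TERMS, key=len, reverse=True)) + r")\b"
)


def find_code_terms(abstract):
    """Return every search-term hit in one abstract, in order of appearance."""
    return PATTERN.findall(abstract)


def share_with_match(abstracts):
    """Fraction of abstracts that contain at least one search term."""
    if not abstracts:
        return 0.0
    hits = sum(1 for text in abstracts if PATTERN.search(text))
    return hits / len(abstracts)


# Invented toy abstracts, for illustration only.
toy = [
    "We propose a penalized estimator. An R package is available on CRAN.",
    "We derive the asymptotic distribution of the encoded statistic.",
]

print(find_code_terms(toy[0]))
print(share_with_match(toy))
```

A match only shows that an abstract mentions one of the terms; whether the paper actually shares usable code would still need a manual check.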

31 of 69

14,614 Abstracts Scraped

32 of 69

1,312 (9%) of Abstracts with a match
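
As a quick check of the rounding, the share can be recomputed from the two counts on the slides. The variable names are only illustrative.

```python
# Counts reported on the slides.
scraped = 14_614  # abstracts scraped
matched = 1_312   # abstracts with at least one search-term match

share = matched / scraped
print(f"{share:.1%}")
```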

33 of 69

Distribution of the 1,312 matches

34 of 69

2. Outdated Course Material

35 of 69

First Day of a Data Science Course

36 of 69

First Day in a Statistics Course

37 of 69

Please go to the following website to vote!

38 of 69

39 of 69

3. Marketing and Perception

40 of 69

A Statistician's Perspective

Statistician

Data Scientist

41 of 69

A Data Scientist’s Perspective

Statistician

Data Scientist

42 of 69

Everyone Else’s Perspective

Statistician

Data Scientist

43 of 69

(Bio)Statistician / Epidemiologist

Objective: Mortality risk score to screen for palliative vs. curative care

  • Logistic regression
  • 20 features (age, gender, clinical features)
  • AUC = 0.87
  • Validated in independent cohort

Machine Learner

  • Deep Learning
  • 14,000 features from Electronic Health Records
  • AUC = 0.93
  • Algorithm not provided, No validation

This example was inspired by: http://www.fharrell.com/post/medml/#fn:2
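
Both models above are compared on AUC, so it may help to recall what that number measures: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The sketch below computes it directly from that definition. The function name `auc` and the toy cohort are invented for illustration and have nothing to do with the 0.87 and 0.93 figures on the slide.

```python
def auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical independent validation cohort: 1 = died, 0 = survived,
# paired with risks predicted by a model fitted on other data.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

print(round(auc(labels, scores), 3))
```

Computing the score on a cohort the model never saw is what "validated in independent cohort" refers to; an AUC computed on the training data tends to be optimistic, more so as the number of features grows.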

44 of 69

Mesmerized by Machine Learning?

45 of 69

How to become a Data Scientist in Canada?

46 of 69

Data Science Related Masters in Canada

[Figure: length of program (months) vs. cost for domestic students (CAD in thousands)]

  • UBC: Master of Data Science (32k)
  • Queen’s: Master of Management Analytics (45k)
  • SFU: Professional Master’s in Big Data (30k)
  • UofT: MSc in Applied Computing - Concentration in Data Science (22k)
  • Saint Mary’s: MSc in Computing & Data Analytics (17k)
  • Ryerson: MSc in Data Science and Analytics (11k)
  • Trent: MSc in Big Data Analytics (10k)
  • Waterloo: MSc in Statistics - Data Science Specialisation (9k)
  • Western: Master of Data Analytics (30k)

47 of 69

https://www.mcgill.ca/datascience/

48 of 69

  • “Data Science at Carleton is not a stand-alone program. Rather, it is a specialization that can be taken in combination with a participating Master’s degree”
  • Biology, Business (Business Analytics), Cognitive, Computer Science, Communications, Economics, Engineering, Electronics, Geography, Health Sciences, History, Information Technology and Psychology

49 of 69

What is being taught in these Masters programs?

50 of 69

https://masterdatascience.science.ubc.ca/

51 of 69

https://masterdatascience.science.ubc.ca/program/courses

52 of 69

What is being taught in Bootcamps?

53 of 69

Galvanize Data Science: 13 weeks, $16k USD

54 of 69

Metis Data Science Bootcamp: 12 weeks, $16k USD

55 of 69

NYC Data Science Academy: 12 weeks, $18k USD

56 of 69

What to do going forward?

57 of 69

Larry Wasserman - Carnegie Mellon (https://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/)

Data Science: The End of Statistics?

  1. Whining won’t help. We can complain that “data scientists” are ignoring biases, not computing standard errors, not stating and checking assumptions and so on. No one is listening.

  2. We need to make sure our students are competitive ... serious computing, data structures, distributed computing and multiple programming languages

58 of 69

  1. We must demonstrate the value of our degree to stay relevant and be valued

59 of 69

2. The importance of programming needs to be acknowledged

60 of 69

Acknowledgements

  • Kevin McGregor
  • Maxime Turgeon
  • Sahar Saeed
  • Julie Rouette
  • Devin Abrahami and Carla Doyle
  • EBOSS

Code: github.com/sahirbhatnagar/talks

Slides: sahirbhatnagar.com

61 of 69

I’m not denying CS people are doing good work. We are too. Stick to your epi/biostats guns.

62 of 69

63 of 69

Big Good Quality Data

64 of 69

Dream Selling

65 of 69

The activities of GDS are classified into six divisions

1. Data Gathering, Preparation, and Exploration
2. Data Representation and Transformation
3. Computing with Data
4. Data Modeling
5. Data Visualization and Presentation
6. Science about Data Science

66 of 69

Big Data

67 of 69

68 of 69

69 of 69

It’s all about perception

How many of us will end up in academia vs. industry? The people in this room can appreciate the value of our degree. But what about industry? What perception do they have of the value of our degree?

Why is it that they want to hire data scientists?

Here is one thought: (then show the slides about Netflix/FB/Google vs. urns)
