1 of 45

The data science curriculum

Ariel Rokem

HSI STEM HUB + WBDIH Data Science Training and Collaboration workshop, September 16th 2019

2 of 45

Data science: the fourth paradigm

Jim Gray

3 of 45

The first paradigm: empirical research

Experimental and observational

4 of 45

The second paradigm: mathematical theory

Maxwell’s equations

5 of 45

The third paradigm: computational simulations

6 of 45

The fourth paradigm: data-intensive research

Sloan Digital Sky Survey

Large Hadron Collider

7 of 45

  • The DNA in our cells encodes all of our hereditary information
  • 1953: Structure first elucidated
  • 1984: US government decides to sequence the human genome
  • 3.3 B base-pairs!
  • 1990: the Human Genome Project launches
  • 2001: first draft published

8 of 45

9 of 45

Tremendous impact (HGP fact sheet)

  • >1,800 human disease-related genes
  • >2,000 genetic tests for human conditions
  • >350 biotechnology companies

Recent estimate: $3.8B investment that drove $796B in economic impact.

10 of 45

But that was just the start!

11 of 45

<= laptop

<= x1000

<= x1M

<= x1B

12 of 45

Data-driven discovery is everywhere!

Social science

Josh Blumenstock et al. (Berkeley ISchool)

13 of 45

Data science for social good

Summer program

~16 students working on 4 projects for 10 weeks

Together with program lead + data scientist

14 of 45

Biomedical science

Nick Reder (UW Pathology), Adam Glaser, Jon Liu (UW Mech E)

15 of 45

Biomedical science

16 of 45

Even in the humanities!

Adam Anderson, UC Berkeley

17 of 45

Meanwhile, in industry

Web-scale data + new computing paradigms => data science

  • Business transactions => recommender systems
  • Social networks => social graphs
  • Search engines => PageRank (Larry Page and Sergey Brin => Google)
  • Large corpora of text: natural language processing => autotranslation, etc.
  • Large corpora of images => image search (self-driving cars?)

2009: Halevy, Norvig, Pereira (Google): “The unreasonable effectiveness of data”

~2008: Jeff Hammerbacher (Facebook) and DJ Patil (LinkedIn) invent the term “data science”

18 of 45

The data science Venn diagram

Drew Conway (2013)

19 of 45

20 of 45

21 of 45

50 years of data science (Donoho, 2017)

John Tukey

1962

1970

22 of 45

Tukey, 1962

23 of 45

24 of 45

25 of 45

Sounds good! But is it?

26 of 45

The data science curriculum

Statistics and machine learning

Computing

Data visualization and data explanation

The human aspect of data science

27 of 45

Statistical learning and data-driven discovery

  • Factor analysis: Spearman (1920s)
  • Regularized regression: Tikhonov (1960s), Hoerl and Kennard (1970), Tibshirani (1996)
  • Support vector machines: Vapnik et al. (1960s - 1990s)
  • Breiman (2001): The two cultures of statistical modeling
  • Neural networks: Perceptrons (1960s), PDP (1980s), Backpropagation (1980s), Deep learning (2010s)

28 of 45

Data management

  • Database systems
  • MapReduce => Distributed systems (e.g., Hadoop, Spark, …)
  • Software
    • Computer science
    • Engineering
  • Systems and operations
    • High-performance computing
    • Cloud computing
  • Standards and best practices
    • Reproducibility and open science
  • Open-source software

29 of 45

30 of 45

Open source software for science

31 of 45

32 of 45

33 of 45

34 of 45

35 of 45

Tools for teaching and learning

36 of 45

Understanding and explaining data

  • Data visualization

37 of 45

Understanding and explaining data

Explainable machine learning

Olah et al., 2018

38 of 45

The human side of data science

  • Data science in practice
    • Roles and responsibilities
    • Collaboration patterns and workflows
  • Human-centered design and engineering
    • “Data storytelling”
    • HCI
    • UX
  • Data science and ethics

39 of 45

How?

Standard curriculum (coming up next!): courses, degrees, ...

Nimble curriculum:

  • Workshops and bootcamps
  • Hackathons
  • Hackweeks

40 of 45

41 of 45

42 of 45

  • Astrohackweek
  • Geohackweek
  • Neurohackweek

43 of 45

NeuroHackademy

  • With Tal Yarkoni (UT Austin)
  • Two weeks
  • 60 participants
  • ~20 instructors

44 of 45

Maybe data science is not a discipline?

  • A framework for thinking about:
    • Communities of practice
    • Interactions between fields of research
    • And between sectors

45 of 45

Questions?