1 of 133

Mutually intensive data science learning as an economic and public health intervention

Jeff Leek

@jtleek

2 of 133

www.jtleek.com/talks

3 of 133

Talent is equally distributed . . .

4 of 133

CBDS+ Cohort 2 Scholars

Source: Aboozar Hadavand

5 of 133

. . . opportunity is not.

  • Leila Janah

6 of 133

Poverty is pervasive in East Baltimore, JHU neighborhood

Source: https://www.opportunityatlas.org/ via jtleek.com/talks

The median family income in this neighborhood is $18,000 for individuals in their mid-thirties.

7 of 133

8 of 133

Thanks

Students

Simina Boca, Hilary Parker, Andrew Jaffe, Alyssa Frazee, Nick Carchedi, Leo Collado Torres, Leslie Myint, Prasad Patil, Claire Ruberman, Jack Fu, Sara Wang, Kayode Sosina, Sarah McClymont + many visitors/interns/student collaborators!

Postdocs

Abhi Nellore, Kai Kammers, Shannon Ellis, Aboozar Hadavand, Lucy D’Agostino McGowan

Genomics Collaborators

Ben Langmead, Andrew Jaffe, Kasper Hansen, Margaret Taub, all of Hopkins Genomics + many more

JHU DaSL Collaborators

Roger Peng, Brian Caffo, Stephanie Hicks, John Muschelli, Leah Jager

JHU DaSL Staff

Ira Gooding, Jessica Crowell, Sean Kross, Nick Carchedi, Ashley Johnson, Simone Sawyer

Hopkins Admin

Karen Bandeen-Roche, Christy Wyskiel, Sukon Kanchanaraksa, Mike Klag, Ellen MacKenzie, Ron Daniels + many others

HEBCAC/YO/HeartSmiles

Ed Sabatino, Liz Torres Brown, Joni Hollifield

Problem Forward

Jamie McGovern, Kenny Morales, and more...

9 of 133

Thanks

Students

Simina Boca, Hilary Parker, Andrew Jaffe, Alyssa Frazee, Nick Carchedi, Leo Collado Torres, Leslie Myint, Prasad Patil, Claire Ruberman, Jack Fu, Sara Wang, Kayode Sosina, Sarah McClymont + many visitors/interns/student collaborators!

Postdocs

Abhi Nellore, Kai Kammers, Shannon Ellis, Aboozar Hadavand, Lucy D’Agostino McGowan

Mentors

Rafa Irizarry, Giovanni Parmigiani, John Storey, James Powell

Genomics Collaborators

Ben Langmead, Andrew Jaffe, Kasper Hansen, Margaret Taub, all of Hopkins Genomics + many more

JHU DaSL Collaborators

Roger Peng, Brian Caffo, Stephanie Hicks, John Muschelli, Leah Jager

JHU DaSL Staff

Ira Gooding, Jessica Crowell, Sean Kross, Nick Carchedi, Ashley Johnson

Hopkins Admin

Karen Bandeen-Roche, Christy Wyskiel, Sukon Kanchanaraksa, Mike Klag, Ellen MacKenzie + many others

Family/Friends

Way too many to name, but Kasper and Margaret and especially Leah, Dex and Hank for putting up with my lunatic schedule and being the best :)

10 of 133

Income inequality as a public health problem

11 of 133

Source: http://www.equality-of-opportunity.org/data/

12 of 133

Source: http://www.equality-of-opportunity.org/data/

13 of 133

Source: http://www.equality-of-opportunity.org/data/

14 of 133

15 of 133

16 of 133

17 of 133

Income mobility

18 of 133

Source: https://www.opportunityatlas.org/

19 of 133

Poverty is pervasive in East Baltimore, JHU neighborhood

Source: https://www.opportunityatlas.org/

April 2019

The median family income in this neighborhood is $18,000 for individuals in their mid-thirties.

20 of 133

21 of 133

Source: https://www.opportunityatlas.org/

22 of 133

What does this have to do with education?

23 of 133

Mobility Rate = (Access) x (Top Quintile Success Rate)

Source: https://www.nber.org/papers/w23618

24 of 133

Mobility Rate = (Access) x (Top Quintile Success Rate)

Source: https://www.nber.org/papers/w23618

Access: fraction of students with parents in the bottom quintile of income

Top Quintile Success Rate: fraction of those students in the top quintile of income by age 34

25 of 133

26 of 133

27 of 133

28 of 133

29 of 133

College                              Mobility Rate   Access   Success
Cal State University - LA            9.9%            33.1%    29.9%
Pace University - New York           8.4%            15.2%    55.6%
SUNY - Stony Brook                   8.4%            16.4%    51.2%
Technical Career Institutes          8.0%            40.3%    19.8%
University of Texas - Pan American   7.6%            38.7%    19.8%
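
As a rough check on the identity Mobility Rate = (Access) x (Top Quintile Success Rate), the sketch below recomputes the product from the rounded access and success figures on this slide. The data frame and its column names are illustrative assumptions, not part of the source. Because the inputs are already rounded to one decimal place, the recomputed product can differ slightly from the printed mobility rate.

```r
# Access and success rates as printed on the slide (in percent).
# The data frame and column names are hypothetical, chosen for illustration.
colleges <- data.frame(
  college = c("Cal State University - LA", "Pace University - New York",
              "SUNY - Stony Brook", "Technical Career Institutes",
              "University of Texas - Pan American"),
  access  = c(33.1, 15.2, 16.4, 40.3, 38.7),
  success = c(29.9, 55.6, 51.2, 19.8, 19.8),
  reported_mobility = c(9.9, 8.4, 8.4, 8.0, 7.6)
)

# Mobility rate = access x success; dividing by 100 keeps the result in percent.
colleges$computed_mobility <- colleges$access * colleges$success / 100

# Difference between the recomputed product and the figure on the slide.
colleges$gap <- colleges$computed_mobility - colleges$reported_mobility

print(colleges[, c("college", "reported_mobility", "computed_mobility", "gap")])
```

The gaps come from rounding in the slide's inputs rather than from a different formula.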

30 of 133

What does this have to do with scalable data science education?

31 of 133

Statistical Genomics

Online Ed

Ed Tech

Human Behavioral Data Science

Time

32 of 133

Source: https://spectrum.ieee.org/view-from-the-valley/at-work/tech-careers/desperate-for-data-scientists

33 of 133

Data Science

10 Courses

4.4M+ Enrollments

Executive Data Science

5 Courses

150K+ Enrollments

Genomic Data Science

9 Courses

230K+ Enrollments

MSD in R

6 Courses

35K+ Enrollments

34 of 133

Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3260695

35 of 133

Coursera JHU Survey Sample

Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3260695

36 of 133

Coursera JHU DSS Students

Coursera JHU Survey Sample

Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3260695

37 of 133

Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3260695

38 of 133

Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3260695

39 of 133

40 of 133

41 of 133

But MOOCs typically benefit the already well-educated

Our Coursera MOOCs and most other data science training programs

42 of 133

max{Mobility Rate} + λ1 (Cost) + λ2 (Time)

Source: https://www.nber.org/papers/w23618

43 of 133

max{Mobility Rate} + λ1 (Cost) + λ2 (Time)

Source: https://www.nber.org/papers/w23618

44 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

45 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

46 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

47 of 133

$450

48 of 133

Source: https://simplystatistics.org/2017/08/29/data-science-on-a-chromebook/

49 of 133

Source: http://slides.google.com

50 of 133

Source: http://sheets.google.com

51 of 133

Source: https://rstudio.cloud

52 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

53 of 133

Ph.D

Other Professional Degrees

Master's Degree

College Degree

Some College

Associate Degree

High School

Less than High School

54 of 133

Ph.D

Other Professional Degrees

Master's Degree

College Degree

Some College

Associate Degree

High School

Less than High School

55 of 133

Source: https://hebcac.org/

56 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

57 of 133

“The best minds of my generation are deleting commas from log files, and that makes me sad. A Ph.D. is a terrible thing to waste.”

Source: https://adage.com/article/digitalnext/dear-madison-avenue-set-data-scientists-free/298676/

58 of 133

rmarkdown

---
title: "My awesome website"
output:
  html_document:
    toc: true
    toc_float: true
    theme: cerulean
---
# This is Jeff's awesome website

![](https://media.giphy.com/media/drXGoW1iudhKw/giphy.gif)

59 of 133

flexdashboard

---
title: "How does your BMI measure up?"
output: flexdashboard::flex_dashboard
runtime: shiny
---

Inputs {.sidebar}
-------------------------------------

```{r}
library(flexdashboard); library(NHANES); library(plotly); library(dplyr)
sliderInput("height", "Height in inches", 0, 100, 72)
sliderInput("weight", "Weight in pounds", 0, 500, 100)
sliderInput("age", "Age in years", 0, 120, 50)
```

Column
-------------------------------------

### Chart 1

```{r}
nhanes = sample_n(NHANES, 100)
renderPlotly({
  df = data.frame(bmi = c(nhanes$BMI, input$weight*0.45/(input$height*0.025)^2),
                  age = c(nhanes$Age, input$age),
                  who = c(rep("nhanes", 100), "you"))
  ggplotly(ggplot(df) +
    geom_point(aes(x=age, y=bmi, color=who)) +
    scale_x_continuous(limits=c(0,90)) +
    scale_y_continuous(limits=c(0,60)) +
    theme_minimal()
  )
})
```
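
The dashboard computes BMI inline as weight in kilograms over height in meters squared, using rounded conversion factors (0.45 kg per pound, 0.025 m per inch). A minimal standalone sketch of that calculation follows. The function name `bmi_from_imperial` and its arguments are hypothetical, introduced only for illustration, and the defaults use the exact conversion factors rather than the slide's rounded ones.

```r
# BMI = weight (kg) / height (m)^2, starting from pounds and inches.
# bmi_from_imperial() is a hypothetical helper, not part of the dashboard.
bmi_from_imperial <- function(weight_lb, height_in,
                              kg_per_lb = 0.45359237, m_per_in = 0.0254) {
  (weight_lb * kg_per_lb) / (height_in * m_per_in)^2
}

# Same inputs as the slider defaults: 100 pounds, 72 inches.
exact   <- bmi_from_imperial(100, 72)
# The dashboard's rounded factors, passed in explicitly for comparison.
rounded <- bmi_from_imperial(100, 72, kg_per_lb = 0.45, m_per_in = 0.025)

print(c(exact = exact, rounded = rounded))
```

With the rounded factors the result is somewhat higher than with the exact ones, which is fine for a demo but worth knowing if the number is compared against NHANES values.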

60 of 133

dbplyr

library(DBI)        # dbConnect()
library(dplyr)      # %>%, tbl(), count()
library(bigrquery)

# authenticate with a service account key file
set_service_token("file.json")

con <- dbConnect(
  bigquery(),
  project = "project_name",
  dataset = "dataset_name"
)

# count the rows of a remote table; the query runs in BigQuery
unique_elements = con %>%
  tbl("dataset1") %>%
  count()

unique_elements
Running job 'job_id.US'...
Complete
Billed: 32.51 MB
Downloading 10 rows in 1 pages.
# Source:   lazy query [?? x 2]
# Database: BigQueryConnection
        n
    <int>
1 3700675

61 of 133

httr

library(httr)
library(dplyr)

username = 'janeeverydaydoe'

url_git = 'https://api.github.com/'

api_response =
  GET(url = paste0(url_git, 'users/', username, '/repos'))

content(api_response)[[1]]

$id
[1] 130377298
$node_id
[1] "MDEwOlJlcG9zaXRvcnkxMzAzNzcyOTg="
$name
[1] "first_project"
$full_name
[1] "JaneEverydayDoe/first_project"
$owner$gravatar_id
[1] ""
$owner$url
[1] "https://api.github.com/users/JaneEverydayDoe"

62 of 133

max{Mobility Rate} + λ1 (Cost) + λ2 (Time)

Source: https://www.nber.org/papers/w23618

63 of 133

Source: https://www.youtube.com/watch?v=_h1ooyyFkF0

64 of 133

Source: https://www.amazon.com/Episodic-Career-Thrive-Work-Disruption/dp/147675151X

65 of 133

Goals

  1. Program must be maintainable
  2. Program must be updatable
  3. Program must be creatable in distributed way
  4. Program must be accessible
  5. Program must be free or low cost

66 of 133

Goals

  • Program must be maintainable
  • Program must be updatable
  • Program must be creatable in distributed way
  • Program must be accessible
  • Program must be free or low cost

67 of 133

“The comment is about your presentation, and in particular the way you speak. You nearly always start squeezing your voice as a sentence goes on, many times with ‘vocal fry’ near the end.”

68 of 133

Goals

  • Program must be maintainable
  • Program must be updatable
  • Program must be creatable in distributed way
  • Program must be accessible
  • Program must be free or low cost

69 of 133

Source: https://github.com/search?utf8=%E2%9C%93&q=getting+and+cleaning+data&type=

70 of 133

Source: Wang et al. in prep

71 of 133

Source: Wang et al. in prep

Thanks Hadley!

72 of 133

Goals

  • Program must be maintainable
  • Program must be updatable
  • Program must be creatable in distributed way
  • Program must be accessible
  • Program must be free or low cost

73 of 133

Content Development

Administration & Tutoring

Technology

Leek

Ellis

Hadavand

Muschelli

Kross

Myint

Collado-Torres

McClymont

Jager

Johnson

74 of 133

Goals

  • Program must be maintainable
  • Program must be updatable
  • Program must be creatable in distributed way
  • Program must be accessible
  • Program must be free or low cost

75 of 133

Source: https://www.insidehighered.com/news/2017/03/06/u-california-berkeley-delete-publicly-available-educational-content

76 of 133

Goals

  • Program must be maintainable
  • Program must be updatable
  • Program must be creatable in distributed way
  • Program must be accessible
  • Program must be free or low cost

77 of 133

“But after promising a reordering of higher education, we see the field instead coalescing around a different, much older business model: helping universities outsource their online master's degrees for professionals”

Source: http://science.sciencemag.org/content/363/6423/130

78 of 133

Source: https://leanpub.com/datastyle

79 of 133

Leanpub

### A Data Science Project Example

For this example, we're going to use an example analysis from a data scientist named [Hilary Parker](https://hilaryparker.com/about-hilary-parker/). Her work can be found [on her blog](https://hilaryparker.com), and the specific project we'll be working through here is from 2013 and titled ["Hilary: the most poisoned baby name in US history"](https://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/).

80 of 133

Leanpub

{quiz, id: quiz_003_data_science_process, random-question-order: true}

### The Data Science Process quiz

{choose-answers: 4}
?1 Which of these is NOT an effective way to communicate the findings of your analysis?

C) save code locally on your computer
C) print code out and store in a desk drawer
o) write a blog post
o) publish a paper
o) publish a news article
o) write a report and share it with your team
o) write a report for your boss
o) give a talk at a conference and make materials available online

81 of 133

Source: https://aws.amazon.com/polly/

82 of 133

ari

library(ari)
# set up Polly credentials

# full.names = TRUE so the image paths include the ./img directory
ari_spin(images = list.files("./img", full.names = TRUE),
         paragraphs = readLines("script.md"),
         output = "output.mp4",
         voice = "joanna")

83 of 133

didactr

library(didactr)
course_dir = "."
check_structure(course_dir)

out <- check_course(course_dir)
created <- create_images(out)

out = check_course(course_dir)
vids_created <- create_videos(out)

out = check_course(course_dir)
vids_uploaded = vids_to_youtube(out, Course=Course)

a = update_youtube_link(out)

84 of 133

Source: https://swirlstats.com/

85 of 133

swirl

- Class: meta
  Course: CBDS Data Analysis
  Lesson: L04 Descriptive Analysis Q02 Swirl
  Author: Shannon Ellis
  Type: Standard
  Organization: Chromebook Data Science
  Version: 2.4.3

- Class: text
  Output: Now that we know there are data from 76 different college majors in this dataset, let's describe the shape of the variable we're interested in.

- Class: cmd_question
  Output: Load the `ggplot2` package so that we can look at the shape of the variable `percWomen`.
  CorrectAnswer: library(ggplot2)
  AnswerTests: any_of_exprs('library(ggplot2)','library("ggplot2")')
  Hint: The package has already been installed. Load `ggplot2` from the package library.

86 of 133

CBDS (Cloud Based Data Science)

Course 0: Introduction to CBDS
Course 1: How To Use a Chromebook
Course 2: Google & The Cloud
Course 3: Organizing Data Science Projects
Course 4: Version Control
Course 5: Introduction to R
Course 6: Data Tidying
Course 7: Data Visualization
Course 8: Getting Data
Course 9: Data Analysis
Course 10: Written and Oral Communication in Data Science
Course 11: Getting a Job in Data Science

https://www.clouddatascience.org/

87 of 133

CBDS (Chromebook Data Science)

Course 0: Introduction to CBDS
Course 1: How To Use a Chromebook
Course 2: Google & The Cloud
Course 3: Organizing Data Science Projects
Course 4: Version Control
Course 5: Introduction to R
Course 6: Data Tidying
Course 7: Data Visualization
Course 8: Getting Data
Course 9: Data Analysis
Course 10: Written and Oral Communication in Data Science
Course 11: Getting a Job in Data Science

https://www.clouddatascience.org/

88 of 133

CBDS (Chromebook Data Science)

Course 0: Introduction to CBDS
Course 1: How To Use a Chromebook
Course 2: Google & The Cloud
Course 3: Organizing Data Science Projects
Course 4: Version Control
Course 5: Introduction to R
Course 6: Data Tidying
Course 7: Data Visualization
Course 8: Getting Data
Course 9: Data Analysis
Course 10: Written and Oral Communication in Data Science
Course 11: Getting a Job in Data Science

https://www.clouddatascience.org/

89 of 133

CBDS (Chromebook Data Science)

Course 0: Introduction to CBDS
Course 1: How To Use a Chromebook
Course 2: Google & The Cloud
Course 3: Organizing Data Science Projects
Course 4: Version Control
Course 5: Introduction to R
Course 6: Data Tidying
Course 7: Data Visualization
Course 8: Getting Data
Course 9: Data Analysis
Course 10: Written and Oral Communication in Data Science
Course 11: Getting a Job in Data Science

https://www.clouddatascience.org/

90 of 133

CBDS (Chromebook Data Science)

Course 0: Introduction to CBDS
Course 1: How To Use a Chromebook
Course 2: Google & The Cloud
Course 3: Organizing Data Science Projects
Course 4: Version Control
Course 5: Introduction to R
Course 6: Data Tidying
Course 7: Data Visualization
Course 8: Getting Data
Course 9: Data Analysis
Course 10: Written and Oral Communication in Data Science
Course 11: Getting a Job in Data Science

https://www.clouddatascience.org/

91 of 133

The Timeline

Months shown: Feb 2018, April, May, June, July, Aug, Sept, Oct
Feb 2018: content development starts
first meeting with YO!
May 21st: Learning Begins!
Aug 31: Projected Course Set Completion
Course blocks #1 to #5: Courses 0-3, 4-5, 6-7, 8-9, 10-11

92 of 133

The Timeline

Months shown: Feb 2018, April, May, June, July, Aug, Sept, Oct
Feb 2018: content development starts
first meeting with YO!
May 21st: Learning Begins!
Aug 31: Projected Course Set Completion
Course blocks #1 to #5: Courses 0-3, 4-5, 6-7, 8-9, 10-11

93 of 133

CBDS (Chromebook Data Science)

Course 0: Introduction to CBDS
Course 1: How To Use a Chromebook
Course 2: Google & The Cloud
Course 3: Organizing Data Science Projects
Course 4: Version Control
Course 5: Introduction to R
Course 6: Data Tidying
Course 7: Data Visualization
Course 8: Getting Data
Course 9: Data Analysis
Course 10: Written and Oral Communication in Data Science
Course 11: Getting a Job in Data Science

94 of 133

95 of 133

Content Development

Administration & Tutoring

Technology

Leek

Ellis

Hadavand

Muschelli

Kross

Myint

Collado-Torres

McClymont

Jager

Johnson

96 of 133

97 of 133

Source: https://rstudio.cloud

98 of 133

Data Tidying

Data Visualization

Data Analysis

Credit: CBDS+ Learner

Credit: CBDS+ Learner

Credit: CBDS+ Learner

CBDS Projects

99 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

100 of 133

Job Search Assistance

Free Laptops

Online Support

Payment To Complete

In-Person Office Hours

CBDS+

101 of 133

Source: https://hebcac.org/

102 of 133

Community Members we Serve

Our Ideal Candidate

Eligibility Requirements

  • Baltimore City resident between the ages of 18-24
  • Registered member of Youth Opportunity (YO!) Baltimore (via HEBCAC)
  • Possesses a high school diploma or GED
  • Interest in computers, programming, coding or related subject matters
  • Ability to complete a 14-week course of study
  • Interested in full-time employment upon completion
  • Able to pass a background check and drug screen
  • Strong communication skills
  • Perseverant, self-motivated learner

103 of 133

CBDS+, a Scalable Public Health Intervention

104 of 133

Pilot

105 of 133

The Timeline

Months shown: Feb 2018, April, May, June, July, Aug, Sept, Oct
Feb 2018: content development starts
first meeting with YO!
May 21st: Learning Begins!
Aug 31: Projected Course Set Completion
Course blocks #1 to #5: Courses 0-3, 4-5, 6-7, 8-9, 10-11

106 of 133

107 of 133

The Timeline

Months shown: Feb 2018, April, May, June, July, Aug, Sept, Oct
Feb 2018: content development starts
first meeting with YO!
May 21st: Learning Begins!
Aug 31: Projected Course Set Completion
Course blocks #1 to #5: Courses 0-3, 4-5, 6-7, 8-9, 10-11

108 of 133

The Timeline

Months shown: Feb 2018, April, May, June, July, Aug, Sept, Oct
Feb 2018: content development starts
first meeting with YO!
May 21st: Learning Begins!
Course blocks #1 to #5: Courses 0-3, 4-5, 6-7, 8-9, 10-11
Oct 5th: Learners finish coursework!!!

109 of 133

Addressing Barriers to Mobility via Data Science Education

80% success rate to date!

110 of 133

111 of 133

112 of 133

113 of 133

114 of 133

115 of 133

116 of 133

“Data engineering & science as a service”

117 of 133

Source: https://www.opportunityatlas.org/

118 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

119 of 133

Know about data science

Income security

Expensive computer

Appropriate programs

Access to instruction

Right jobs being posted

Access to connections

120 of 133

121 of 133

122 of 133

College                              Mobility Rate   Access   Success
Cal State University - LA            9.9%            33.1%    29.9%
Pace University - New York           8.4%            15.2%    55.6%
SUNY - Stony Brook                   8.4%            16.4%    51.2%
Technical Career Institutes          8.0%            40.3%    19.8%
CBDS                                 83.3%           100%     83.3%

123 of 133

The Realities: Locally

Known Factors

  • Countless uber-talented, un(der)employed Black young adults (~ages 18-24) in Baltimore City
  • Limited income, resources, connections, available opportunities
  • Digital divide
  • Largely unaware of the industry

Observed Circumstances

  • Noted drive and applied excellence, all the while contending with:
      • Homelessness
      • Challenging family dynamics
      • Parenthood / Childcare
      • Transportation
      • Financial instability
      • Mental health
      • Tragic loss of family and friends
      • Challenges transitioning to the professional climate

124 of 133

The Rewards: Locally

General Outcomes

  • Develop skilled talent in a professional career track with growth potential
  • Increase household income and eventually the community’s and city’s income bases
  • Take incremental steps toward narrowing the digital divide within Baltimore City
  • Increase scholars’ exposure to other networks and opportunities

Observed, Personal Outcomes

  • Notable increase in self-confidence and self-efficacy
  • Housing and financial stability; in some cases, first independent home
  • Decreased stress and anxiety regarding associated life burdens
  • Increased interest in higher education
  • Health insurance for self and family

125 of 133

126 of 133

Talent is equally distributed . . .

127 of 133

CBDS+ Cohort 2 Scholars

Source: Aboozar Hadavand

128 of 133

. . . opportunity is not.

  • Leila Janah

129 of 133

Poverty is pervasive in East Baltimore, JHU neighborhood

Source: https://www.opportunityatlas.org/ via jtleek.com/talks

The median family income in this neighborhood is $18,000 for individuals in their mid-thirties.

130 of 133

131 of 133

132 of 133

133 of 133

Thank you!