Mutually intensive data science learning as an economic and public health intervention
Jeff Leek
@jtleek
www.jtleek.com/talks
Talent is equally distributed . . .
CBDS+ Cohort 2 Scholars
Source: Aboozar Hadavand
. . . opportunity is not.
Poverty is pervasive in East Baltimore, JHU neighborhood
Source: https://www.opportunityatlas.org/ via jtleek.com/talks
The median family income in this neighborhood is $18,000 for individuals in their mid-thirties.
Thanks
Students
Simina Boca, Hilary Parker, Andrew Jaffe, Alyssa Frazee, Nick Carchedi, Leo Collado Torres, Leslie Myint, Prasad Patil, Claire Ruberman, Jack Fu, Sara Wang, Kayode Sosina, Sarah McClymont + many visitors/interns/student collaborators!
Postdocs
Abhi Nellore, Kai Kammers, Shannon Ellis, Aboozar Hadavand, Lucy D’Agostino McGowan
Genomics Collaborators
Ben Langmead, Andrew Jaffe, Kasper Hansen, Margaret Taub, all of Hopkins Genomics + many more
JHU DaSL Collaborators
Roger Peng, Brian Caffo, Stephanie Hicks, John Muschelli, Leah Jager
JHU DaSL Staff
Ira Gooding, Jessica Crowell, Sean Kross, Nick Carchedi, Ashley Johnson, Simone Sawyer
Hopkins Admin
Karen Bandeen-Roche, Christy Wyskiel, Sukon Kanchanaraksa, Mike Klag, Ellen MacKenzie, Ron Daniels + many others
Hebcac/YO/HeartSmiles
Ed Sabatino, Liz Torres Brown, Joni Hollifield
Problem Forward
Jamie McGovern, Kenny Morales, and more...
Mentors
Rafa Irizarry, Giovanni Parmigiani, John Storey, James Powell
Family/Friends
Way too many to name, but Kasper and Margaret and especially Leah, Dex and Hank for putting up with my lunatic schedule and being the best :)
Income inequality as a public health problem
Source: http://www.equality-of-opportunity.org/data/
Income mobility
Source: https://www.opportunityatlas.org/
Poverty is pervasive in East Baltimore, JHU neighborhood
Source: https://www.opportunityatlas.org/
April 2019
The median family income in this neighborhood is $18,000 for individuals in their mid-thirties.
Source: https://www.opportunityatlas.org/
What does this have to do with education?
Mobility Rate = (Access) x (Top Quintile Success Rate)
Source: https://www.nber.org/papers/w23618
Access: fraction of students with parents in the bottom quintile of income
Top Quintile Success Rate: fraction of students in the upper quintile of income by age 34
College | Mobility Rate | Access | Success |
Cal State University - LA | 9.9% | 33.1% | 29.9% |
Pace University - New York | 8.4% | 15.2% | 55.6% |
SUNY - Stony Brook | 8.4% | 16.4% | 51.2% |
Technical Career Institutes | 8.0% | 40.3% | 19.8% |
University of Texas - Pan American | 7.6% | 38.7% | 19.8% |
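As a quick check of the formula against the table, the sketch below recomputes each mobility rate as access times success. The data frame and its column names are illustrative only; the numbers are copied from the table. Because the table's inputs are already rounded, the products match the reported rates only to within about a tenth of a percentage point.

```r
# Access and success rates copied from the table above (as proportions).
colleges <- data.frame(
  college = c("Cal State University - LA", "Pace University - New York",
              "SUNY - Stony Brook", "Technical Career Institutes",
              "University of Texas - Pan American"),
  access  = c(0.331, 0.152, 0.164, 0.403, 0.387),
  success = c(0.299, 0.556, 0.512, 0.198, 0.198),
  reported_mobility = c(0.099, 0.084, 0.084, 0.080, 0.076)
)

# Mobility Rate = (Access) x (Top Quintile Success Rate)
colleges$mobility <- colleges$access * colleges$success

# Gap between the recomputed and reported rates, in percentage points.
colleges$diff_pp <- round(100 * (colleges$mobility - colleges$reported_mobility), 2)
print(colleges[, c("college", "mobility", "reported_mobility", "diff_pp")])
```

The two factors trade off: Technical Career Institutes reaches a rate close to Pace University's with much higher access and much lower success.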
What does this have to do with scalable data science education?
Statistical Genomics
Online Ed
Ed Tech
Human Behavioral Data Science
Time
Source: https://spectrum.ieee.org/view-from-the-valley/at-work/tech-careers/desperate-for-data-scientists
Course Series | Courses | Enrollments |
Data Science | 10 | 4.4M+ |
Executive Data Science | 5 | 150K+ |
Genomic Data Science | 9 | 230K+ |
MSD in R | 6 | 35K+ |
Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3260695
Coursera JHU Survey Sample
Coursera JHU DSS Students
Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3260695
But, MOOCs typically benefit the already-well-educated
Our Coursera MOOCs and most other data science training programs
max{Mobility Rate} + λ1 (Cost) + λ2 (Time)
Source: https://www.nber.org/papers/w23618
Know about data science
Income security
Expensive computer
Appropriate programs
Access to instruction
Right jobs being posted
Access to connections
$450
Source: https://simplystatistics.org/2017/08/29/data-science-on-a-chromebook/
Source: http://slides.google.com
Source: http://sheets.google.com
Source: https://rstudio.cloud
Ph.D
Other Professional Degrees
Master's Degree
College Degree
Some College
Associate Degree
High School
Less than High School
Source: https://hebcac.org/
“The best minds of my generation are deleting commas from log files, and that makes me sad. A Ph.D. is a terrible thing to waste.”
Source: https://adage.com/article/digitalnext/dear-madison-avenue-set-data-scientists-free/298676/
rmarkdown
---
title: "My awesome website"
output:
  html_document:
    toc: true
    toc_float: true
    theme: cerulean
---
# This is Jeff's awesome website
flexdashboard
---
title: "How does your BMI measure up?"
output: flexdashboard::flex_dashboard
runtime: shiny
---

Inputs {.sidebar}
-------------------------------------

```{r}
library(flexdashboard); library(NHANES); library(plotly); library(dplyr)
sliderInput("height", "Height in inches", 0, 100, 72)
sliderInput("weight", "Weight in pounds", 0, 500, 100)
sliderInput("age", "Age in years", 0, 120, 50)
```

Column
-------------------------------------

### Chart 1

```{r}
nhanes = sample_n(NHANES, 100)
renderPlotly({
  # BMI = kg / m^2, using rough factors: 0.45 kg per pound, 0.025 m per inch
  df = data.frame(bmi = c(nhanes$BMI, input$weight*0.45/(input$height*0.025)^2),
                  age = c(nhanes$Age, input$age),
                  who = c(rep("nhanes", 100), "you"))
  ggplotly(ggplot(df) +
    geom_point(aes(x=age, y=bmi, color=who)) +
    scale_x_continuous(limits=c(0,90)) +
    scale_y_continuous(limits=c(0,60)) +
    theme_minimal()
  )
})
```
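The dashboard converts pounds and inches to metric with rounded factors before computing BMI. The sketch below isolates that step in a hypothetical helper, `bmi_from_us_units` (a name introduced here, not part of the dashboard), using the exact conversion factors, and compares it with the rounded version at the slider defaults.

```r
# Hypothetical helper (not part of the dashboard): BMI from US units.
# BMI = weight in kg / (height in m)^2
bmi_from_us_units <- function(weight_lb, height_in) {
  weight_kg <- weight_lb * 0.45359237  # exact pounds-to-kilograms factor
  height_m  <- height_in * 0.0254      # exact inches-to-metres factor
  weight_kg / height_m^2
}

# The slider defaults in the dashboard are 72 inches and 100 pounds.
exact  <- bmi_from_us_units(100, 72)
approx <- 100 * 0.45 / (72 * 0.025)^2  # rounded factors used in the dashboard
round(c(exact = exact, approx = approx), 1)
```

At these defaults the two versions differ by roughly 0.3 BMI units, which is small next to the spread of the NHANES points on the plot.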
dbplyr
library(bigrquery)
library(DBI)    # provides dbConnect()
library(dplyr)  # provides %>%, tbl() and count()
set_service_token("file.json")
con <- dbConnect(
  bigquery(),
  project = "project_name",
  dataset = "dataset_name"
)
# Count the rows of the remote table; the query runs in BigQuery
unique_elements = con %>%
  tbl("dataset1") %>%
  count()

unique_elements
Running job 'job_id.US'...
Complete
Billed: 32.51 MB
Downloading 10 rows in 1 pages.
# Source:   lazy query [?? x 2]
# Database: BigQueryConnection
        n
    <int>
1 3700675
httr
library(httr)
library(dplyr)

username = 'janeeverydaydoe'

url_git = 'https://api.github.com/'

api_response =
  GET(url = paste0(url_git, 'users/', username, '/repos'))
content(api_response)[[1]]

$id
[1] 130377298
$node_id
[1] "MDEwOlJlcG9zaXRvcnkxMzAzNzcyOTg="
$name
[1] "first_project"
$full_name
[1] "JaneEverydayDoe/first_project"
$owner$gravatar_id
[1] ""
$owner$url
[1] "https://api.github.com/users/JaneEverydayDoe"
…
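The parsed response is a plain R list with one element per repository, so ordinary list tools work on it. The sketch below uses a small hand-built list in place of `content(api_response)` so it runs without network access. The second entry and the variable names `repos` and `repo_names` are made up for illustration.

```r
# Hypothetical stand-in for content(api_response): a list with one element per
# repository, each holding named fields like those printed above.
repos <- list(
  list(id = 130377298, name = "first_project",
       full_name = "JaneEverydayDoe/first_project"),
  list(id = 1, name = "second_project",
       full_name = "JaneEverydayDoe/second_project")  # made-up entry
)

# Pull one field out of every repository record.
repo_names <- vapply(repos, function(repo) repo$name, character(1))
repo_names
```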
max{Mobility Rate} + λ1 (Cost) + λ2 (Time)
Source: https://www.nber.org/papers/w23618
Source: https://www.youtube.com/watch?v=_h1ooyyFkF0
Source: https://www.amazon.com/Episodic-Career-Thrive-Work-Disruption/dp/147675151X
Goals
“The comment is about your presentation, and in particular the way you speak. You nearly always start squeezing your voice as a sentence goes on, many times with ‘vocal fry’ near the end.”
Goals
Source: https://github.com/search?utf8=%E2%9C%93&q=getting+and+cleaning+data&type=
Source: Wang et al. in prep
Thanks Hadley!
Goals
Role | Leek | Ellis | Hadavand | Muschelli | Kross | Myint | Collado-Torres | McClymont | Jager | Johnson |
Content Development | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ |
Administration & Tutoring | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ |
Technology | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ | ⬤ |
Goals
Source: https://www.insidehighered.com/news/2017/03/06/u-california-berkeley-delete-publicly-available-educational-content
Goals
“But after promising a reordering of higher education, we see the field instead coalescing around a different, much older business model: helping universities outsource their online master's degrees for professionals”
Source: http://science.sciencemag.org/content/363/6423/130
Source: https://leanpub.com/datastyle
Leanpub
### A Data Science Project Example

For this example, we're going to use an example analysis from a data scientist named [Hilary Parker](https://hilaryparker.com/about-hilary-parker/). Her work can be found [on her blog](https://hilaryparker.com), and the specific project we'll be working through here is from 2013 and titled ["Hilary: the most poisoned baby name in US history"](https://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/).
Leanpub
{quiz, id: quiz_003_data_science_process, random-question-order: true}

### The Data Science Process quiz

{choose-answers: 4}
?1 Which of these is NOT an effective way to communicate the findings of your analysis?

C) save code locally on your computer
C) print code out and store in a desk drawer
o) write a blog post
o) publish a paper
o) publish a news article
o) write a report and share it with your team
o) write a report for your boss
o) give a talk at a conference and make materials available online
Source: https://aws.amazon.com/polly/
ari
library(ari)
# set up Polly credentials

# full.names = TRUE returns paths that include ./img, so the images can be found
ari_spin(images = list.files("./img", full.names = TRUE),
         paragraphs = readLines("script.md"),
         output = "output.mp4",
         voice = "joanna")
didactr
library(didactr)
course_dir = "."
check_structure(course_dir)

out <- check_course(course_dir)
created <- create_images(out)

out = check_course(course_dir)
vids_created <- create_videos(out)

out = check_course(course_dir)
vids_uploaded = vids_to_youtube(out, Course=Course)

a = update_youtube_link(out)
Source: https://swirlstats.com/
swirl
- Class: meta
  Course: CBDS Data Analysis
  Lesson: L04 Descriptive Analysis Q02 Swirl
  Author: Shannon Ellis
  Type: Standard
  Organization: Chromebook Data Science
  Version: 2.4.3

- Class: text
  Output: Now that we know there are data from 76 different college majors in this dataset, let's describe the shape of the variable we're interested in.

- Class: cmd_question
  Output: Load the `ggplot2` package so that we can look at the shape of the variable `percWomen`.
  CorrectAnswer: library(ggplot2)
  AnswerTests: any_of_exprs('library(ggplot2)','library("ggplot2")')
  Hint: The package has already been installed. Load `ggplot2` from the package library.
CBDS (Cloud Based Data Science, also Chromebook Data Science)
https://www.clouddatascience.org/
Course 0: Introduction to CBDS
Course 1: How To Use a Chromebook
Course 2: Google & The Cloud
Course 3: Organizing Data Science Projects
Course 4: Version Control
Course 5: Introduction to R
Course 6: Data Tidying
Course 7: Data Visualization
Course 8: Getting Data
Course 9: Data Analysis
Course 10: Written and Oral Communication in Data Science
Course 11: Getting a Job in Data Science
The Timeline
Feb 2018: content development starts
First meeting with YO
May 21st: Learning Begins!
Aug 31: Projected Course Set Completion
Months shown: April, May, June, July, Aug, Sept, Oct
Course blocks: #1 (Courses 0-3), #2 (Courses 4-5), #3 (Courses 6-7), #4 (Courses 8-9), #5 (Courses 10-11)
Source: https://rstudio.cloud
Data Tidying
Data Visualization
Data Analysis
Credit: CBDS+ Learner
Credit: CBDS+ Learner
Credit: CBDS+ Learner
CBDS Projects
Know about data science
Income security
Expensive computer
Appropriate programs
Access to instruction
Right jobs being posted
Access to connections
Job Search Assistance
Free
Laptops
Online Support
Payment
To Complete
In-Person
Office Hours
CBDS+
Source: https://hebcac.org/
Community Members we Serve
Our Ideal Candidate
Eligibility Requirements
CBDS+, a Scalable Public Health Intervention
Pilot
The Timeline
Feb 2018: content development starts
First meeting with YO
May 21st: Learning Begins!
Oct 5th: Learners finish coursework!!!
Months shown: April, May, June, July, Aug, Sept, Oct
Course blocks: #1 (Courses 0-3), #2 (Courses 4-5), #3 (Courses 6-7), #4 (Courses 8-9), #5 (Courses 10-11)
Addressing Barriers to Mobility via Data Science Education
80%
success rate to date!
“Data engineering & science as a service”
Source: https://www.opportunityatlas.org/
Know about data science
Income security
Expensive computer
Appropriate programs
Access to instruction
Right jobs being posted
Access to connections
College | Mobility Rate | Access | Success |
Cal State University - LA | 9.9% | 33.1% | 29.9% |
Pace University - New York | 8.4% | 15.2% | 55.6% |
SUNY - Stony Brook | 8.4% | 16.4% | 51.2% |
Technical Career Institutes | 8.0% | 40.3% | 19.8% |
CBDS | 83.3% | 100% | 83.3% |
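Reading the CBDS row with the same formula used for the colleges: access is 100%, so the mobility rate equals the success rate. This assumes the pilot's success figure can be set beside the colleges' top-quintile success rates, which the table implies but does not spell out.

```latex
\[
\text{Mobility Rate}_{\text{CBDS}}
  = \text{Access} \times \text{Success}
  = 1.00 \times 0.833
  = 0.833 \quad (83.3\%)
\]
```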
The Realities: Locally
Known Factors
Observed Circumstances
The Rewards: Locally
General Outcomes
Observed, Personal Outcomes
Talent is equally distributed . . .
CBDS+ Cohort 2 Scholars
Source: Aboozar Hadavand
. . . opportunity is not.
Poverty is pervasive in East Baltimore, JHU neighborhood
Source: https://www.opportunityatlas.org/ via jtleek.com/talks
The median family income in this neighborhood is $18,000 for individuals in their mid-thirties.
Thank you!