1 of 108

Reimagining Research Training Through Holistic Data Science Education

UW Biostats Seminar, May 4th

Carrie Wright

2 of 108

Presentations → Talks

CC-BY hutchdatascience.org

3 of 108

Research Education Innovator

Curious + Anxious + Perfectionist + Focused on Big Picture = Why do we do science like we do??

4 of 108

How did I get here?

Informatics Researcher → Research Education Innovator

5 of 108

About me

Background in Biomedical Science

    • psychiatric genomics
    • gene regulation (noncoding RNA)
    • methodology development for RNA quantification
    • imaging genetics

“Wet bench” and “Dry bench” research

6 of 108

My postdoc

https://media.giphy.com/media/q6RoNkLlFNjaw/giphy.gif

Psychiatric Genomics

7 of 108

I was always torn…

https://c.tenor.com/_5dWaIMHXwIAAAAM/torn-apart.gif

domain-specific work/knowledge

training myself to do it better with all the supportive practical skills

Perfectionism

8 of 108

Time for “extra” training???

https://media.giphy.com/media/26n6xBpxNXExDfuKc/giphy.gif

Anxious

9 of 108

Is there a better way?

https://media.istockphoto.com/id/1203616323/photo/old-way-or-new-way-roadmarking.jpg?b=1&s=170667a&w=0&k=20&c=04oO9l6sqZu84dG5UQBrxoUr3kcsm-h6UMGu5aaUh2Y=

Curious

10 of 108

Sometimes your side projects become a big part of your career!

11 of 108

I learned some lessons…

12 of 108

Train Smarter

https://media.giphy.com/media/7d8tndK1hVRNYtOWTJ/giphy.gif

Spending an “extra” hour today could save countless hours tomorrow

13 of 108

We can always get better…

Improvement Points:

↑Rigor & reproducibility

↑ Efficiency

↑ Ethical consciousness

↑ Proper use of new methods

↑ Flexibility

14 of 108

Self-training has holes

https://cdn.pixabay.com/photo/2014/12/21/23/34/cheese-575540__480.png

15 of 108

We don’t know what we don’t know

https://media.giphy.com/media/lgRdsvejP97CShX4UN/giphy.gif

16 of 108

We need expert-reviewed (or expert-written) training materials

17 of 108

We need accessible training materials (accessible to those outside the field)

18 of 108

It’s not just me

19 of 108

We all need help!

Even the brightest students don’t know how to train themselves on all the supportive skills they need

PIs have limited time to train trainees

PIs have limited time to train themselves

20 of 108

Examples

21 of 108

Flexibility with Data

Because of the pandemic, researchers and trainees from many disciplines wanted to use less-than-ideal data sources quickly.

  • How do I import and wrangle data from a PDF reproducibly?
  • How do I import data from various websites quickly?

22 of 108

Expanding Methods

Researchers want to ask questions using new methods.

  • How can they be supported to do it well (either through collaboration or on their own)?

https://en.wikipedia.org/wiki/Spatial_transcriptomics

23 of 108

Expanding data sources

Researchers want to ask questions using data from new populations.

  • How can they be supported to work with these populations/data responsively and responsibly?

24 of 108

The Problem: Researchers want to maximize their efforts but don’t always know how.

      • Self-directed training leads to holes
      • Many graduate programs lack exposure to comprehensive education
      • Research requires continual learning

25 of 108

A Possible Solution

Provide training opportunities to empower learners with more complete exposure to skills needed day-to-day for research

26 of 108

A Solution?

      • Expert-directed training
      • More comprehensive exposure
      • More sustained support through resources
      • Resources should be designed by experts (domain and education) and shared

      • Self-directed training leads to holes and is inefficient
      • Many graduate programs lack exposure to comprehensive education
      • Research requires continual learning
      • Designing effective training material takes time and expertise

27 of 108

For me, it all comes down to Data Science

28 of 108

My Definition of Data Science

Everything surrounding working with data to extract meaningful information and to utilize or communicate that information

29 of 108

Image created by Carrie Wright

Statistics

Computer Science

Scientific Communication

Data Cleaning

Ethics

Informatics

Data Security

Data Sharing

Data Science is Multifaceted

Hard and Soft skills

30 of 108

Innovate new education tools:

Image by ar130405 from Pixabay

More challenging/realistic examples

Greater exposure to context and active experiences

More resources for those actively doing research

→ More realistic perspective on real-world applications

→ More comprehensive understanding

→ More continued support

31 of 108

Mission: Enhance data science thinking everywhere and make data science accessible.

32 of 108

Innovative Education Initiatives

Aims to equip young adults from underserved communities with the necessary skills to work in data science.

🤩

33 of 108

JHDSL Reach

  • Biostatistics
  • Biomedical Engineering
  • Biomedical Sciences
  • Public health

High school students

High school graduates or equivalent

Undergraduate students

Postdocs

Graduate students

Researchers / Clinicians

Lay audiences and more…

MOOCs - collectively reached 8 million learners

34 of 108

MPH Capstone Advisor

Intro to R for Public Health

University of Washington Short Course

* Including Education Research and Tools

* Including Education Tool Development

35 of 108

Fred Hutch

Mission: Coordinate data science activities, build community, make data easier to use, and create value for Fred Hutch scientists through data resources, partnerships, philanthropy and infrastructure.

36 of 108

Who does DaSL serve?

Data Science Journey

Lay Audiences & Citizen Scientists

Community-based Organizations & Nonprofits

Pre-baccalaureate & GED Earners

Undergraduates

Scientists (Professional Development)

Research Trainees (Graduate Students, Postdocs)

Instructors (“Train the Trainer”)

Self-guided Learners

37 of 108

What topics do we cover?

Pedagogy & Meta-research (“Research on Research”)

Data Ethics

  • Sharing
  • Sovereignty
  • IDARE (Inclusion, Diversity, Anti-Racism, Equity)

Programming Skills

  • R, Python, LaTeX, WDL, etc.
  • Data processing
  • Software Development

Research Practices

  • Reproducibility
  • Workflows
  • Data Management
  • Cloud Computing

Informatics

  • Genomics
  • Public Health
  • Imaging
  • Clinical data

38 of 108

Democratizing education material for informatics holds great power to improve diversity in science and medicine

https://c.tenor.com/lOM2TVfL0joAAAAM/democracy-mypostcard.gif

39 of 108

Challenges

  • Most learners have limited time
  • Some learners are extremely intimidated
  • Most learners do not realize what they do not know
  • Many of these skills take time and practice
  • Many of these skills are learned through the experience of more realistic application

40 of 108

Innovating Education Initiatives

  • Contextualized training materials and experiences
  • Provide resources (open source) for broader use & modification
  • Improve existing material based on feedback - collaborate
  • Modularize training to improve accessibility
  • Create resources to help individuals while they work
  • Equip researchers and trainees with broader skills and knowledge, and more conscious mindsets about the implications of their work
  • Utilize technology to assist with the education process

41 of 108

Three Initiatives

Experiential data analysis guides with a focus on public health

Students work with community-based organizations to address social issues

Cancer informatics resources and workshops

42 of 108

43 of 108

The overall need …

Empower students with skills for:

  • more flexibility
  • more efficiency
  • more reproducibility
  • more effective communication of results

... with their own data!

Use Unusual sources & Difficult data

Work with Multiple files simultaneously

Write code for others to Easily Reuse

New ways to visualize data!

44 of 108

A possible solution:

Open Case Studies = educational archive of case studies

What is a case study?

  • An experiential guide
  • Relevant and timely problem
  • Complete (decision process)
  • Navigable
  • Vetted
  • Easy to share!

45 of 108

For: Instructors - Students - Independent Learners

Inside the classroom

Outside the classroom

  • High school students/Undergraduates/Graduates/Self-learners
  • Public Health/Statistics/Data Science/Scientific Communication/Programming/…

46 of 108

Leah Jager

Margaret Taub

Carrie Wright

Stephanie Hicks

John Muschelli

47 of 108

Bloomberg American Health Initiative High Impact Project

10 Public Health Focused Case Studies

https://americanhealth.jhu.edu/open-case-studies

48 of 108

Obesity and the Food System

  • Obesity across Rural and Urban Regions
  • Dietary Behaviors and Health Risks

Environmental Challenges

  • CO2 Emissions across Time
  • Predicting Air Pollution

Addiction and Overdose

  • Opioids in the United States
  • Vaping Behaviors in American Youth

Violence

  • Right-to-Carry Gun Laws
  • School Shootings in the United States

Adolescent Health

  • Disparities in Youth Disconnection
  • Mental Health of American Youth

49 of 108

Question Type

  • Changes over time?
  • Differences in groups? regions?
  • Differences in groups over time?
  • Differences in paired groups?
  • Predict outcomes for new data?
  • Does this influence my data?
  • Relationships between variables?
  • Display data for others to find, interpret and easily use?

Data Type

  • PDF
  • CSV
  • Website
  • Excel
  • Text in images
  • API
  • Google Sheets
  • Survey data / Code books
  • Multiple files!

Wrangling Methods

  • Extracting data from a PDF
  • Geocoding data
  • Joining data
  • Filtering data
  • Reshaping data
  • Transforming data
  • Working with text data
  • Repetitive processes!

50 of 108

Intro Data Viz

Data Viz

Analysis Methods

Data visualization styles

Facet plots

Adding labels & annotations

Adding error bars

Combining multiple plots

Interactive plots

Interactive maps

Interactive dashboards!

Percentages with missing data

t-tests

Correlation and causation

ANOVA

Linear regression

Chi-squared test of independence

Mann-Kendall Trend test

Machine learning!

Interpretable tables

Scatter plots, line plots, bar plots

Pie chart / waffle plots

Heat maps

Correlation plots

Visualize missing data

Creating maps of your own data!

51 of 108

JHSPH Faculty Experts

  • Jessica Fanzo, PhD
  • Brendan Saloner, PhD
  • Megan Latshaw, MHS, PhD
  • Renee M. Johnson, PhD, MPH
  • Daniel Webster, MPH, ScD
  • Elizabeth Stuart, PhD
  • Joshua Sharfstein, MD

At least one JHSPH faculty member helped with the major direction of each case study from a public health point of view

External Review Panel

  • Leslie Myint, PhD – Macalester College
  • Shannon E. Ellis, PhD – University of California – San Diego
  • Christina Knudson, PhD – University of St. Thomas
  • Michael Love, PhD – University of North Carolina
  • Nicholas Horton, ScD – Amherst College
  • Mine Çetinkaya-Rundel, PhD – University of Edinburgh, Duke University, RStudio

Each case study is reviewed by 2 external statistics and data science reviewers

52 of 108

Can easily Navigate

Entice students by showing them what they will learn

53 of 108

Explain why we care! - Each case study is motivated by a recent report or study.

54 of 108

Data Science Skills

Statistical Skills

Public Health Questions

*?*

55 of 108

  • More accessible

  • More usable

  • More engaging

  • Assess Open Case Study use:
    • Survey
    • Google Analytics

Thesis: https://jscholarship.library.jhu.edu/handle/1774.2/66820

56 of 108

Getting started with OCS

  • Example uses of case studies - Shannon Ellis, Jeff Leek, Roger Peng
  • How to modify a case study
  • How to create a new case study using our template
  • Guidance on how to contribute case studies to the OCS libraries
    • Official Library
    • Community Library

57 of 108

https://www.opencasestudies.org/OCS_Guide/

58 of 108

Easier to find!

59 of 108

Easier to translate!

60 of 108

https://github.com/opencasestudies/OCSdata

  • Enables easy use of our 10 public health datasets
  • Enables modular use of case studies
  • Makes practicing data import easier

Easier to use!

3600+ downloads!

61 of 108

62 of 108

Interactive Feedback!

More Engaging!

63 of 108

Interactive feedback for code - including hints!

Easier to use!

More Engaging!

64 of 108

Self-learners

65 of 108

Educators

66 of 108

Students

67 of 108

Students

68 of 108

“Open Case Studies is something I wish I had back in college! … Open Case Studies is intuitive, informative, and easy to access. I am excited to see where the project goes and will definitely use this for my personal research and education.”

69 of 108

Next Steps

  • More analysis of surveys
  • More analysis of use in courses and comparison to other methods of teaching
  • Manuscript under review

Wright, C., Meng, Q., Breshock, M.R., Atta, L., Taub, M.A., Jager, L.R., Muschelli, J., and Hicks, S.C. (2023). Open Case Studies: Statistics and Data Science Education through Real-World Applications. arXiv:2301.05298

70 of 108

Baltimore Community Data Science (BCDS)

Michael Rosenblum

Ava Hoffman

71 of 108

SOURCE: the community engagement and service-learning center at the JHSPH

Community engagement should focus on:

Social Change / Authentic Relationships / Redistributing Power

72 of 108

Rationale

Data science should be performed with intention and mindfulness, while critically reflecting on our position in society and its influence on our work.

73 of 108

Philosophy

Partner with Community Based Organizations (CBOs) to address their data-related needs

Create data products for immediate benefit

Create infrastructure and provide documentation and education for CBOs to take their own data-related goals further

Change the way the students approach work - appreciate other perspectives & reflect on their impact

74 of 108

Benefits

  • Contribute to a meaningful project
  • Data science product
  • Hands-on experience in a different context
  • Training on new topics in data science and data ethics
    • Context of JHU in Baltimore
    • What is the impact of the work?
    • How can the CBO sustain data products?

75 of 108

Skills and Concepts

Critical Reflection

Text extraction (OCR)

Critical Service Learning

Sentiment analysis

Contextualization

Social media data analysis

Project/Time management

Shiny app development

Collaborative work

Data Visualization

Organization

Interactive plots/maps (GIS and R)

Communication

Version control (Git/GitHub)

Instruction/Training

Sustainable Design

Data Privacy/Ethics

Reproducibility

Problem Solving

Documentation

76 of 108

Advocating equitable access to public transit

Narrative on Gun Violence

Leadership opportunities for underserved youth

77 of 108

Class Structure

Training

Preparation

Product Development

Implementation

Sustainability

Weeks 1 to 16

78 of 108

  • Survey data wrangling
  • Data visualizations
  • Shiny app
  • Interactive maps
  • Data communication

Products:

  1. Brochure of narrative
  2. Shiny app for exploring data

  • Social Media Data Analysis
  • Testimony analysis (text screenshots)
  • Word cloud generation
  • Merging data across Google Forms

Products:

  1. Website revitalization
  2. Map for optimal pickup routes
  3. Data collection infrastructure

  • Gathering data
  • Joining and wrangling data
  • Shiny app
  • Interactive maps

Products:

  • Shiny app for exploring data of surrounding Baltimore city

79 of 108

Considerations & Challenges

  • Need to take time with CBOs to identify data needs
  • Need to allow time for getting data from CBOs
    • Special considerations for data privacy
    • Adapt to adjustments
  • Need to work with CBOs consciously
    • CBO needs come first
    • Consideration of their time & effort
    • Meeting them where they are & honoring their perspective
    • Sustainability
  • Students need to learn to focus on solving dynamic & evolving problems in a brief amount of time

80 of 108

Student Reflection

“[The class has] broaden[ed] my academic perspective…in spades.”

“I’ve done more reflection in this class than I have probably done in my entire academic career to date.”

81 of 108

Next Steps

  • More evaluation of long-term impact on students
  • Create resources on lessons learned
  • Expand paradigm if useful

82 of 108

ITCR Training Network (ITN)

83 of 108

What is the ITN?

ITCR Training Network

Catalyzing informatics research through training opportunities

84 of 108

Informatics Technology for Cancer Research (ITCR)

85 of 108

User preparedness

Gap

Tool usability

Cancer informatics is hindered by a gap between different types of experts

CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/

86 of 108

User preparedness

Gap

Tool usability

Catalyzing Informatics for Cancer Research

87 of 108

Informatics Technology for Cancer Research (ITCR) Training Network

Enhance awareness & access for informatics resources

Enhance usability for cancer informatics tools

Improve practices for informatics work

CC-BY jhudatascience.org

88 of 108

Elements of ITN:

  • Make courses about informatics

  • Make infrastructure and tools for course creation

  • Provide live education opportunities

  • Enhance community engagement in cancer research

89 of 108

The ITN Team

90 of 108

Main Goals

  • Collaborate with experts to create courses
  • Update and improve courses over time
  • Create tools to help others create and update education content

91 of 108

Wanted to reach the widest audience possible!

92 of 108

Wanted to allow other scientists to create content!

93 of 108

Wanted updates to be easy for us!

updated!

94 of 108

Wanted updates to be easy for others!

updated!

95 of 108

OTTR: Open-source Tools for Training Resources

96 of 108

Write once, publish three times!

97 of 108

OTTR: Open-source Tools for Training Resources

Templates

R packages

Guides

GitHub, GitHub Actions, Docker, YAML, Bookdown/RMarkdown → Maybe Quarto!

98 of 108

OTTR: Open-source Tools for Training Resources

And follow this guide: https://www.ottrproject.org/

99 of 108

OTTR: Open-source Tools for Training Resources

OTTR requires Pull Requests

100 of 108

OTTR relies on GitHub Actions

101 of 108

102 of 108

103 of 108

Current ITN Classes:

  • Leadership for Cancer Informatics Research
  • Computing for Cancer Informatics
  • Documentation & Usability
  • Introduction to Reproducibility
  • Advanced Reproducibility
  • Ethical Data Handling
  • Choosing Genomics Tools
  • Scientific Software Development Beyond Coding
  • Overleaf & LaTeX for Scientific Articles

104 of 108

Proposed courses

  • Cancer Imaging Informatics
  • Cancer Clinical Informatics
  • Cancer Informatics Data Visualization
  • Machine Learning for Cancer Informatics
  • Dissemination and Engagement

105 of 108

Next Steps

  • Update courses with additional expert reviewers
  • Hands-on workshops (with surveys)
  • Google Analytics and survey analysis about course material
  • Continue to think of new ways of increasing awareness of ITCR tools, to prepare tool users, and to make tools more accessible

106 of 108

Thank you!

107 of 108

Presentations → Talks

108 of 108

Abstract

As science continues to become more multidisciplinary, cultivating a versatile set of skills beyond field-specific expertise empowers researchers and trainees to be more adaptable. However, training for researchers has predominantly focused on domain knowledge, with limited emphasis on ethics and data science. Yet trainees and researchers in biomedical engineering, biostatistics, biomedical science, and public health can benefit from wider exposure to adjacent and supportive hard and soft skill sets. Greater understanding of how to critically reflect on the downstream implications of work can lead to more conscious and responsible science. Greater appreciation of, and capacity to adequately support, data privacy and sovereignty can lead to more responsive science. Greater flexibility with novel and difficult data sources can lead to faster breakthroughs without the limitations of traditional data sources. Greater awareness of practices that enhance transparency, reproducibility, and scientific communication can ultimately lead to greater trust among varied audiences, more rapid advancement, and greater rigor. Greater capacity to leverage cutting-edge tools for working with data can lead to faster advancement in multidisciplinary areas.

Through several education initiatives, we have introduced more contextualized training materials and experiences to help equip researchers and trainees with broader skills and knowledge, as well as more conscious mindsets about the implications of their work. For example, initiatives like the Informatics Technology for Cancer Research (ITCR) Training Network (https://www.itcrtraining.org/), the Open Case Studies project (https://www.opencasestudies.org/), and the Baltimore Community Data Science course (https://jhudatascience.org/Baltimore_Community_Course/) offer resources and opportunities for interdisciplinary learning. The ITCR Training Network provides access to a wide range of cancer informatics resources and workshops, the Open Case Studies project creates experiential data analysis guides with a focus on public health, and the Baltimore Community Data Science course allows students to work with community-based organizations to address social issues. Through analysis of such novel training initiatives, we hope to ultimately better support researchers and trainees in maximizing their research endeavors.