Reimagining Research Training Through Holistic Data Science Education
UW Biostats Seminar, May 4th
Carrie Wright
CC-BY hutchdatascience.org
Research Education Innovator
Curious + Anxious + Perfectionist + Focused on Big Picture = Why do we do science the way we do?
How did I get here?
Informatics Researcher → Research Education Innovator
About me
Background in Biomedical Science
“Wet bench” and “Dry bench” research
My postdoc
https://media.giphy.com/media/q6RoNkLlFNjaw/giphy.gif
Psychiatric Genomics
I was always torn…
https://c.tenor.com/_5dWaIMHXwIAAAAM/torn-apart.gif
domain-specific work/knowledge
training myself to do it better with all the supportive practical skills
Perfectionism
Time for “extra” training???
https://media.giphy.com/media/26n6xBpxNXExDfuKc/giphy.gif
Anxious
Is there a better way?
https://media.istockphoto.com/id/1203616323/photo/old-way-or-new-way-roadmarking.jpg?b=1&s=170667a&w=0&k=20&c=04oO9l6sqZu84dG5UQBrxoUr3kcsm-h6UMGu5aaUh2Y=
Curious
Sometimes your side projects become a big part of your career!
I learned some lessons…
Train Smarter
https://media.giphy.com/media/7d8tndK1hVRNYtOWTJ/giphy.gif
Spending an “extra” hour today could save countless hours tomorrow
We can always get better…
Improvement Points:
↑ Rigor & reproducibility
↑ Efficiency
↑ Ethical consciousness
↑ Proper use of new methods
↑ Flexibility
…
Self-training has holes
https://cdn.pixabay.com/photo/2014/12/21/23/34/cheese-575540__480.png
We don’t know what we don’t know
https://media.giphy.com/media/lgRdsvejP97CShX4UN/giphy.gif
We need expert-reviewed (or expert-written) training materials
We need accessible training materials (accessible to those outside the field)
It’s not just me
We all need help!
Even the brightest students don’t know how to train themselves on all the supportive skills they need
PIs have limited time to train trainees
PIs have limited time to train themselves
Examples
Flexibility with Data
Because of the pandemic, researchers and trainees from many disciplines wanted to use less-than-ideal data sources quickly.
Expanding Methods
Researchers want to ask questions using new methods.
https://en.wikipedia.org/wiki/Spatial_transcriptomics
Expanding data sources
Researchers want to ask questions using data from new populations.
The Problem: Researchers want to maximize their efforts but don’t always know how.
A Possible Solution
Provide training opportunities to empower learners with more complete exposure to skills needed day-to-day for research
A Solution?
For me, it all comes down to Data Science
My Definition of Data Science
Everything surrounding working with data to extract meaningful information and to utilize or communicate that information
Image created by Carrie Wright
Statistics
Computer Science
Scientific Communication
Data Cleaning
Ethics
Informatics
Data Security
Data Sharing
Data Science is Multifaceted
Hard and Soft skills
Innovate new education tools:
More challenging/realistic examples
Greater exposure to context and active experiences
More resources for those actively doing research
→ More realistic perspective on real-world applications
→ More comprehensive understanding
→ More continued support
Mission: Enhance data science thinking everywhere and make data science accessible.
Innovative Education Initiatives
Aims to equip young adults from underserved communities with the necessary skills to work in data science.
🤩
JHDSL Reach
High school students
High school graduates or equivalent
Undergraduate students
Postdocs
Graduate students
Researchers / Clinicians
Lay audiences and more…
MOOCs - collectively reached 8 million learners
MPH Capstone Advisor
Intro to R for Public Health
University of Washington Short Course
* Including Education Research and Tools
* Including Education Tool Development
Fred Hutch
Mission: Coordinate data science activities, build community, make data easier to use, and create value for Fred Hutch scientists through data resources, partnerships, philanthropy and infrastructure.
Who does DaSL serve?
Data Science Journey
Lay Audiences & Citizen Scientists
Community-based Organizations & Nonprofits
Pre-baccalaureate & GED Earners
Undergraduates
Scientists (Professional Development)
Research Trainees (Graduate Students, Postdocs)
Instructors (“Train the Trainer”)
Self-guided Learners
What topics do we cover?
Pedagogy & Meta-research (“Research on Research”)
Data Ethics (Inclusion, Diversity, Anti-Racism, Equity)
Programming Skills
Research Practices
Informatics
Democratizing educational materials for informatics holds great power to improve diversity in science and medicine
https://c.tenor.com/lOM2TVfL0joAAAAM/democracy-mypostcard.gif
Challenges
Innovating Education Initiatives
Three Initiatives
Experiential data analysis guides with a focus on public health
Students work with community-based organizations to address social issues
Cancer informatics resources and workshops
The overall need …
Empower students with skills for:
... with their own data!
Use Unusual sources & Difficult data
Work with Multiple files simultaneously
Write code for others to Easily Reuse
New ways to visualize data!
Open Case Studies = educational archive of case studies
What is a case study?
A possible solution:
For: Instructors - Students - Independent Learners
Inside the classroom
Outside the classroom
Leah Jager
Margaret Taub
Carrie Wright
Stephanie Hicks
John Muschelli
Bloomberg American Health Initiative High Impact Project
10 Public Health Focused Case Studies
https://americanhealth.jhu.edu/open-case-studies
Obesity and the Food System
Environmental Challenges
Addiction and Overdose
Violence
Adolescent Health
Question Type
Data Type
Wrangling Methods
Extracting data from a PDF
Geocoding data
Joining data
Filtering data
Reshaping data
Transforming data
Working with text data
Repetitive processes!
CSV
Website
Excel
Text in images
API
Google sheets
Survey data / Code books
Multiple files!
Changes over time?
Differences in groups? regions?
Differences in groups over time?
Differences in paired groups?
Predict outcomes for new data?
Does this influence my data?
Relationships between variables?
Display data for others to find, interpret and easily use?
Intro Data Viz
Data Viz
Analysis Methods
Data visualization styles
Facet plots
Adding labels & annotations
Adding error bars
Combining multiple plots
Interactive plots
Interactive maps
Interactive dashboards!
Percentages with missing data
t-tests
Correlation and causation
ANOVA
Linear regression
Chi-squared test of independence
Mann-Kendall Trend test
Machine learning!
Interpretable tables
Scatter plots, line plots, bar plots
Pie chart / waffle plots
Heat maps
Correlation plots
Visualize missing data
Creating maps of your own data!
External Review Panel
JHSPH Faculty Experts
Each case study is reviewed by 2 external statistics and data science reviewers
At least one JHSPH faculty member helped with the major direction of each case study from a public health point of view
Easy to navigate
Entice students and show them what they will learn
Explain why we care! - Each case study is motivated by a recent report or study.
Data Science Skills
Statistical Skills
Public Health Questions
*?*
Thesis: https://jscholarship.library.jhu.edu/handle/1774.2/66820
Getting started with OCS
https://www.opencasestudies.org/OCS_Guide/
Easier to find!
Easier to translate!
https://github.com/opencasestudies/OCSdata
Easier to use!
3600+ downloads!
Interactive Feedback!
More Engaging!
Interactive feedback for code - including hints!
Easier to use!
More Engaging!
Self-learners
Educators
Students
“Open Case Studies is something I wish I had back in college! … Open Case Studies is intuitive, informative, and easy to access. I am excited to see where the project goes and will definitely use this for my personal research and education.”
Next Steps
Wright, C., Meng, Q., Breshock, M.R., Atta, L., Taub, M.A., Jager, L.R., Muschelli, J., and Hicks, S.C. (2023). Open Case Studies: Statistics and Data Science Education through Real-World Applications. arXiv:2301.05298
Baltimore Community Data Science (BCDS)
Michael Rosenblum
Ava Hoffman
SOURCE: the community engagement and service-learning center at the JHSPH
Community engagement should focus on:
Social Change / Authentic Relationships / Redistributing Power
Rationale
Data science should be performed with intention and mindfulness, while critically reflecting on our position in society and its influence on our work.
Photo by Toa Heftiba on Unsplash
Philosophy
Partner with Community Based Organizations (CBOs) to address their data-related needs
Create data products for immediate benefit
Create infrastructure and provide documentation and education for CBOs to take their own data-related goals further
Change the way the students approach work - appreciate other perspectives & reflect on their impact
Benefits
Photo by Alexander Sinn on Unsplash
Skills and Concepts | |
Critical Reflection | Text extraction (OCR) |
Critical Service Learning | Sentiment analysis |
Contextualization | Social media data analysis |
Project/Time management | Shiny app development |
Collaborative work | Data Visualization |
Organization | Interactive plots/maps (GIS and R) |
Communication | Version control (Git/GitHub) |
Instruction/Training | Sustainable Design |
Data Privacy/Ethics | Reproducibility |
Problem Solving | Documentation |
Advocating equitable access to public transit
Narrative on Gun Violence
Leadership opportunities for underserved youth
Class Structure
Training
Preparation
Product Development
Implementation
Sustainability
Weeks 1-16
Products:
Considerations & Challenges
Student Reflection
“[The class has] broaden[ed] my academic perspective…in spades.”
“I’ve done more reflection in this class than I have probably done in my entire academic career to date.”
Next Steps
ITCR Training Network (ITN)
What is the ITN?
ITCR Training Network
Catalyzing informatics research through training opportunities
Informatics Technology for Cancer Research (ITCR)
User preparedness
Gap
Tool usability
Cancer informatics is hindered by a gap between different types of experts
CC-BY jhudatascience.org - Image made by Candace Savonone using https://getavataaars.com/ and https://thenounproject.com/
Catalyzing Informatics for Cancer Research
Informatics Technology for Cancer Research (ITCR) Training Network
Enhance awareness of & access to informatics resources
Enhance usability of cancer informatics tools
Improve practices for informatics work
Elements of ITN:
The ITN Team
Main Goals
Wanted to reach the widest audience possible!
Wanted to allow other scientists to create content!
Wanted updates to be easy for us!
updated!
Wanted updates to be easy for others!
updated!
OTTR: Open-source Tools for Training Resources
Write once, publish three times!
Templates
R packages
Guides
GitHub, GitHub Actions, Docker, YAML, Bookdown/RMarkdown → Maybe Quarto!
Follow this guide: https://www.ottrproject.org/
OTTR requires Pull Requests
OTTR relies on GitHub Actions
Current ITN Classes:
Proposed courses
Next Steps
Thank you!
Abstract
As science continues to become more multidisciplinary, cultivating a versatile set of skills beyond field-specific expertise empowers researchers and trainees to be more adaptable. However, training for researchers has predominantly focused on domain knowledge with limited emphasis on ethics and data science. Yet trainees and researchers in biomedical engineering, biostatistics, biomedical science, and public health can benefit from wider exposure to adjacent and supportive hard and soft skill sets. Greater understanding of how to critically reflect on the downstream implications of work can lead to more conscious and responsible science. Greater appreciation and capacity to adequately support data privacy and sovereignty can lead to more responsive science. Greater flexibility with novel and difficult data sources can lead to faster breakthroughs without the limitations of traditional data sources. Greater awareness of practices that enhance transparency, reproducibility, and scientific communication can ultimately lead to greater trust among varied audiences, more rapid advancement, and greater rigor. Greater capacity to leverage cutting-edge tools for working with data can lead to faster advancement in multidisciplinary areas.
Through several education initiatives, we have introduced more contextualized training materials and experiences to help equip researchers and trainees with broader skills and knowledge, as well as more conscious mindsets about the implications of their work. For example, initiatives like the Informatics Technology for Cancer Research (ITCR) Training Network (https://www.itcrtraining.org/), the Open Case Studies project (https://www.opencasestudies.org/), and the Baltimore Community Data Science course (https://jhudatascience.org/Baltimore_Community_Course/) offer resources and opportunities for interdisciplinary learning. The ITCR Training Network provides access to a wide range of cancer informatics resources and workshops, the Open Case Studies project creates experiential data analysis guides with a focus on public health, and the Baltimore Community Data Science course allows students to work with community-based organizations to address social issues. Through analysis of such novel training initiatives, we hope to ultimately better support researchers and trainees in maximizing their research endeavors.