Stat 159/259
Collaborative and Reproducible Data Science
Fernando Pérez
Jimmy Butler
Sequoia Andrade
Your awesome TAs
A bit about me
IPython and Jupyter - a lesson in the power of open communities
IPython: "an afternoon hack", Oct 2001
But...
Janko Hauser
@lugensade
Nathan Gray
@n8gray
Notebooks: Computational Narratives
Image credit: Scriberia, for the Turing Way Book Dashes.
Jupyter is an open community dedicated to modular, platform-agnostic tools for interactive computing
Scientific Software: more than code
Content and Services (Notebooks)
Software
Standards and Protocols
Community
Content & Services
Explosive Adoption
The Jupyter architecture in a nutshell
Taken from our IPython in Depth tutorial, where you can find a lot more details.
A language agnostic protocol
~100 different kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
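To make the "language agnostic" point concrete, here is a minimal sketch of the JSON envelope a front end sends to ask a kernel to run code. The field names follow the published Jupyter messaging specification; the helper name `make_execute_request`, the `"student"` username, and the session string are assumptions made up for this illustration. HMAC signing and the ZeroMQ multipart framing used on the real wire are left out.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="student"):
    """Build the dict form of an `execute_request` message (illustrative only)."""
    header = {
        "msg_id": uuid.uuid4().hex,          # unique per message
        "session": session_id,               # identifies the client session
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                    # protocol version, not kernel version
    }
    content = {
        "code": code,                        # opaque text: the kernel decides the language
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},                 # empty: this message starts a new exchange
        "metadata": {},
        "content": content,
    }


# The same envelope carries Python, R, or Julia source; only `code` changes.
msg = make_execute_request("1 + 1", session_id="demo-session")
wire_text = json.dumps(msg)
```

Because the protocol treats `code` as plain text, a new language only needs a kernel that understands these messages; the notebook front end does not change.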
Community: formalized governance
Large, inclusive, multi-stakeholder community. jupyter.org/governance
Governance structure
Community recognition
2024 Executive Council Members
Community: where all the credit goes
And thousands more open source contributors!
Scientific Python: a whole ecosystem
More than code, woven into science
Services and content: impact
Software
Standards and Protocols: ecosystem
Community: innovation & resiliency
People
Ideas
Tools
Stories
Tools for the Life-Cycle of research ideas
Markdown, code, computation…
Data
Community-extensible: NeuroHackademy 2018
Credit:
Anisha Keshavan
Nate Vack
Chris Gorgolewski
Real-Time Collaboration!
Notebooks, text files, …
Jupyter-AI: vendor agnostic, fully open…
Production: cloud or HPC!
Harnessing the power of cloud computing to study the whole Earth interactively
Interactivity
Distributed computing
Data models / numerics
https://mystmd.org
Slides credit:
Rowan Cockett (CurveNote)
The future of publishing/sharing
New publication workflows…
Impact:
science, education, industry, …
Black holes: LIGO, Sept 14, 2015
2019: More Black Holes! Swiss Open Science Day @ EPFL
Katie Bouman
Caltech
Data8:
Computational and Inferential Thinking
The focus of the course is on reasoning, visualization, and interpretation, rather than calculations or the use of software packages. This approach is inspired by the boldly innovative Statistics by Freedman et al. (1978), a textbook that transformed the way the field of statistics was introduced to undergraduates at Berkeley and around the world.
[...]
However, the presentation of these classical lessons changed sharply when we adopted computation as a central tool
[...]
When students open up a browser-based Jupyter Notebook in a data science class, that action is as natural to them as breathing, regardless of their background or academic interests.
Data Science Education at Scale
Data 100:
~1,300
Data 8:
~2,000 students
Funding
(Open Science Program)
Credit: Maryam Zaringhalam, NIH/NLM/OSTP
Embraced by Industry (large & small!)
OpenAI card for o1 model
Security evaluation, p. 17.
With this open architecture,
we are building in new spaces!
Agile Metabolic Health & JupyterHealth
Jupyter Health architecture: research and operations
Pilot: Yr 3 and 4
Platform Build
Prototyping
SMART-on-FHIR visualization app
Jupyter Health
Runtime AI/ML platform
FHIR server
“Dev”
“Ops”
Actual workflow with my data
Jupyter, in the cloud, with my BP data
Collaborative, cloud-native 3D modeling
GeoJupyter
Open Platform for Earth/Geospatial Science
Geospatial data: the "crude oil" of modern data.
Image: GPT-4o
My concern: proprietary capture!
US hazards data is locked behind proprietary ESRI tools because FEMA integrated its Hazards Model for Natural Catastrophes into the ESRI ecosystem.
JupyterGIS - rapid prototyping
Try it in your browser! github.com/geojupyter/jupytergis
2023
NASA’s Year of
Open Science
NASA Transform to Open Science Mission
Dr. Chelle Gentemann, Science Lead
Yvonne Ivey, Equity Lead
Cyndi Hall, Community Coordinator
Isabella Martinez, Content Coordinator
Dr. Yaitza Luna-Cruz, TOPS Program Officer
Dr. Paige Martin, TOPS Program Officer
Kevin Murphy, Chief Science Data Officer SMD
Katie Baynes, Deputy Chief Science Data Officer SMD
Dr. Steve Crawford, Science Policy Officer SMD
Andy Mitchell,
Dr. Elena Steponaitis, SMD Development Program Executive
Amy (Uyen) Truong, Chief Science Data Office Coordinator
Dr. Rachel Paseka, OSSI Program Officer
Dr. J.L. Galache, OSSI Program Officer
Dr. Demitri Muna, OSSI Program Officer
Molly Adams, OSSI Coordinator
TOPS Email List
TOPS Website
SCIENCE MISSION DIRECTORATE
Open-source science at the forefront
A quick reflection on the explosive growth of the AI/ML scene
(obviously the $$$ of industry investment helped a little 😉)
A culture of rapid, open iteration
Code
Data
Publications
Platforms
"Open Source is eating our lunch"
GitHub: Python, Jupyter and AI
STAT 159/259: Collaborative and Reproducible Data Science
Philip Stark
Statistics (‘21)
Eli Ben-Michael
Statistics (‘17)
Daniel McAndrews
iSchool (‘21)
Facu Sapienza
Statistics (‘22)
Facu is at AGU in person. Talk to him!
Why? Ideas: readings
What/How?
Practical skills, underlying concepts
| Fundamental Idea | Technical Implementation Today |
|---|---|
| Version Control | Git and GitHub |
| Programming | Python |
| Process Automation | Make |
| Data Analysis | NumPy, Pandas, Matplotlib, Xarray, ... |
| Software Testing | pytest |
| Documentation and Publishing | Markdown, Sphinx and Jupyter Book |
| Continuous Integration | GitHub Actions |
| Reproducible Containers | Binder (uses Docker) |
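As one illustration of the "Software Testing" row, here is a minimal sketch of a function together with tests written in the style pytest collects. The function `celsius_to_kelvin`, its tests, and the file name suggested below are hypothetical examples, not course material.

```python
ABSOLUTE_ZERO_C = -273.15  # lowest physically meaningful Celsius temperature


def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    if temp_c < ABSOLUTE_ZERO_C:
        raise ValueError(f"{temp_c} C is below absolute zero")
    return temp_c - ABSOLUTE_ZERO_C


def test_freezing_point_of_water():
    # Exact here because 0.0 + 273.15 needs no rounding.
    assert celsius_to_kelvin(0.0) == 273.15


def test_absolute_zero_maps_to_zero_kelvin():
    assert celsius_to_kelvin(ABSOLUTE_ZERO_C) == 0.0


def test_rejects_temperatures_below_absolute_zero():
    # Plain try/except keeps the sketch free of imports; with pytest
    # installed, `pytest.raises(ValueError)` is the more common idiom.
    try:
        celsius_to_kelvin(-300.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for -300.0 C")
```

Saved in a file such as `test_units.py`, running `pytest` in that directory discovers and runs every function whose name starts with `test_`. The same command can then be reused unchanged in a continuous integration workflow.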
Computational hygiene: a daily habit
GitHub Classroom for everything
Explicit dependency management
Open publishing with Jupyter Book
Open interactive textbooks with Jupyter Notebooks
How can we make the notebook a publishable document?
mybinder.org: shareable reproducibility
Explicit Dependencies
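To show what "explicit" can mean in practice, here is a minimal sketch that reads exact version pins and reports any that differ from what is installed. It assumes `name==version` pins in the style of a pip requirements file; the course may well declare dependencies differently (for example in a conda environment file), and the helper names `parse_pins` and `check_pins` are invented for this example.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"not an exact pin: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins, installed_version=metadata.version):
    """Return {name: (wanted, found)} for every pin that does not match.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Hypothetical pins, for illustration only.
example_pins = parse_pins([
    "# analysis stack",
    "numpy==1.26.4",
    "pandas==2.2.0  # exact pin",
    "",
])
```

Calling `check_pins(example_pins)` in a fresh environment returns an empty dict only when every pinned package is present at exactly the stated version, which is the property a service like Binder relies on when it rebuilds an environment from a repository.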
Continuous Integration with GH Actions
We use a (standard, canned) workflow, and simply edit, commit and push…
GH Actions automatically does the rest: it runs our deploy-book workflow, which triggers a bot's GH Pages website deployment.
A theme for the course: earth & climate
Homework: Real-World Reproducibility
Final Project: full research compendium
A “Standard Playbook”
A complete scientific narrative
Backed by analysis details
With supporting tools and tests
GitHub repo, reproducible on Binder
Made possible by our amazing team!
And many more!
Thank You!
Scan & fill out if you need to be added to bCourses: https://forms.gle/rqox3mTXp3EGvvZk6