1 of 74

Stat 159/259

Collaborative and Reproducible Data Science

Fernando Pérez

Jimmy Butler

Sequoia Andrade

2 of 74

Your awesome TAs

3 of 74

A bit about me

Physics undergrad in Colombia:

Open science (Linux, arXiv, …).

Particle Physics (QCD) in Colorado:

Software "on the side"

A long journey through applied math, neuroscience, data science…

Woven with open source software

Today faculty @ Berkeley, LBNL, direct several initiatives where open source tools are essential.

4 of 74

IPython and Jupyter - a lesson in the power of open communities

5 of 74

IPython: "an afternoon hack", Oct 2001

But...

IPython
Interactive Python Prompt - J. Hauser
Lazy Python - N. Gray

Janko Hauser

@lugensade

Nathan Gray

@n8gray

6 of 74

Notebooks: Computational Narratives

Interactive exploration is...
a conversation with the computer,
and with humans,
woven into a shareable, reproducible narrative.

10.1109/MCSE.2021.3059263

7 of 74

Image credit: Scriberia, for the Turing Way Book Dashes.

Jupyter is an open community dedicated to modular, platform-agnostic tools for interactive computing

8 of 74

Scientific Software: more than code

Content and Services (Notebooks)

Software

Standards and Protocols

Community

9 of 74

Content & Services

earthcube.org/notebooks

10 of 74

Explosive Adoption

github.com/earthcube2022/ec22_markowsky_etal

github.blog/news-insights/octoverse/octoverse-2024

11 of 74

The Jupyter architecture in a nutshell

Taken from our IPython in Depth tutorial, where you can find a lot more details.

12 of 74

A language agnostic protocol

~100 different kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
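
To make the "language agnostic" point concrete, here is a minimal sketch of the kind of JSON message a frontend sends to a kernel when it asks for code to be run. The field names follow the Jupyter messaging specification as best I recall it; the helper name `make_execute_request` and the default username are hypothetical, and the sketch leaves out the ZeroMQ transport and message signing that a real client needs.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="student"):
    """Build a dict shaped like a Jupyter `execute_request` message.

    Hypothetical helper for illustration only: it builds the message
    body but does not sign it or send it anywhere.
    """
    header = {
        "msg_id": uuid.uuid4().hex,       # unique per message
        "session": session_id,            # shared by all messages of one client session
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                 # protocol version, assumed here
    }
    content = {
        "code": code,                     # opaque text: only the kernel interprets it
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},
        "metadata": {},
        "content": content,
    }


msg = make_execute_request("1 + 1", session_id=uuid.uuid4().hex)
print(json.dumps(msg, indent=2))
```

Nothing in the message depends on Python: the `code` field is just a string, so a kernel for R, Julia or any other language can answer the same request.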

13 of 74

Community: formalized governance

Large, inclusive, multi-stakeholder community. jupyter.org/governance

Governance structure

Elected Executive Council (6 people)
Software Steering Council (large)
Jupyter Foundation Governance Board (external organizations - funding and partnerships)

Community recognition

Distinguished Contributors

2024 Executive Council Members

14 of 74

Community: where all the credit goes

And thousands more open source contributors!

15 of 74

Scientific Python: a whole ecosystem

16 of 74

More than code, woven into science

Services and content: impact

Software

Standards and Protocols: ecosystem

Community: innovation & resiliency

People

Ideas

Tools

Stories

17 of 74

Tools for the Life-Cycle of research ideas

Exploring:
Collaborating:
Production:
Sharing:
Teaching:

18 of 74

Markdown, code, computation…

19 of 74

Data

20 of 74

Community-extensible: NeuroHackademy 2018

Credit:

Anisha Keshavan

Nate Vack

Chris Gorgolewski

21 of 74

Real-Time Collaboration!

Notebooks, text files, …

22 of 74

Jupyter-AI: vendor agnostic, fully open…

blog.jupyter.org/generative-ai-in-jupyter-3f7174824862

23 of 74

Production: cloud or HPC!

24 of 74

Harnessing the power of cloud computing to study the whole Earth interactively

Interactivity

Distributed computing

Data models / numerics

25 of 74

https://mystmd.org

Markup Language
Specification
Open Source Tools

Slides credit:

Rowan Cockett (CurveNote)

26 of 74

27 of 74

The future of publishing/sharing

28 of 74

New publication workflows…

29 of 74

Impact:

science, education, industry, …

30 of 74

Black holes: LIGO, Sept 14, 2015

31 of 74

2019: More Black Holes! Swiss Open Science Day @ EPFL

Katie Bouman

Caltech

32 of 74

Data8:

Computational and Inferential Thinking

The focus of the course is on reasoning, visualization, and interpretation, rather than calculations or the use of software packages. This approach is inspired by the boldly innovative Statistics by Freedman et al. (1978), a textbook that transformed the way the field of statistics was introduced to undergraduates at Berkeley and around the world.

[...]

However, the presentation of these classical lessons changed sharply when we adopted computation as a central tool

[...]

When students open up a browser-based Jupyter Notebook in a data science class, that action is as natural to them as breathing, regardless of their background or academic interests.

hdsr.mitpress.mit.edu/pub/e69066t4/release/3

33 of 74

Data Science Education at Scale

Data 100:

~1,300

Data 8:

~2,000 students

34 of 74

data.berkeley.edu/2023workshop

35 of 74

Non-profit spinoff

(Berkeley, Columbia, UBC)

Business model: service provider for interactive cloud infrastructure.
Serves research and education, supports Open Source.

Funding

(Open Science Program)

36 of 74

Fortran (1957)
FFT (1965)
Biological databases (1965)
General Circulation Model (1969)
BLAS - Numerical Linear Algebra (1979)
NIH Image (1987)
BLAST - Gene sequencing (1990)
arXiv.org - Preprints (1991)
IPython/Jupyter notebook (2011)
AlexNet - Deep Learning (2012)

https://www.nature.com/articles/d41586-021-00075-2

37 of 74

Credit: Maryam Zaringhalam, NIH/NLM/OSTP

38 of 74

Embraced by Industry (large & small!)

OpenAI card for o1 model

Security evaluation, p. 17.

39 of 74

With this open architecture,

we are building in new spaces!

40 of 74

Agile Metabolic Health & JupyterHealth

41 of 74

Jupyter Health architecture: research and operations

Pilot: Yr 3 and 4

Platform Build

Prototyping

SMART-on-FHIR visualization app

Jupyter Health

Runtime AI/ML platform

FHIR server

“Dev”

“Ops”

42 of 74

Actual workflow with my data

43 of 74

Jupyter, in the cloud, with my BP data

jupyter-health.2i2c.cloud

44 of 74

Collaborative, cloud-native 3D modeling

45 of 74

GeoJupyter

Open Platform for Earth/Geospatial Science

46 of 74

Geospatial data: the "crude oil" of modern data.

Science

Climate
Ecology
Remote sensing
Engineering
…

Government & policy

Urban planners
Transportation managers
Emergency response agencies
…

Industry

Agriculture
Finance
Logistics
…

Image: GPT-4o

47 of 74

My concern: proprietary capture!

US hazards data is locked behind proprietary ESRI tools, b/c FEMA integrated their Hazards Model for Natural Catastrophes into the ESRI ecosystem.

48 of 74

JupyterGIS - rapid prototyping

Try it in your browser! github.com/geojupyter/jupytergis

49 of 74

2023

NASA’s Year of

Open Science

NASA Transform to Open Science Mission

Dr. Chelle Gentemann, Science Lead

Yvonne Ivey, Equity Lead

Cyndi Hall, Community Coordinator

Isabella Martinez, Content Coordinator

Dr. Yaitza Luna-Cruz, TOPS Program Officer

Dr. Paige Martin, TOPS Program Officer

Kevin Murphy, Chief Science Data Officer SMD

Katie Baynes, Deputy Chief Science Data Officer SMD

Dr. Steve Crawford, Science Policy Officer SMD

Andy Mitchell,

Dr. Elena Steponaitis, SMD Development Program Executive

Amy (Uyen) Truong, Chief Science Data Office Coordinator

Dr. Rachel Paseka, OSSI Program Officer

Dr. J.L. Galache, OSSI Program Officer

Dr. Demitri Muna, OSSI Program Officer

Molly Adams, OSSI Coordinator

TOPS Email List

TOPS Website

SCIENCE MISSION DIRECTORATE

50 of 74

Open-source science at the forefront

github.com/learnopenscience

ChelleGentemann

51 of 74

A quick reflection on the explosive growth of the AI/ML scene

(obviously the $$$ of industry investment helped a little 😉)

52 of 74

A culture of rapid, open iteration

Code

Data

Publications

Platforms

53 of 74

"Open Source is eating our lunch"

Leaked internal Google Document, May 4, 2023

semianalysis.com/p/google-we-have-no-moat-and-neither

54 of 74

GitHub: Python, Jupyter and AI

55 of 74

Jupyter-AI: vendor agnostic, fully open…

blog.jupyter.org/generative-ai-in-jupyter-3f7174824862

56 of 74

STAT 159/259: Collaborative and Reproducible Data Science

Why?

An essential concern of modern computational research.
Social and scientific implications of lack of reproducibility.
Frame the problem in practical, ethical and epistemological terms.

What?

Core ideas: data access, computation, statistical analysis and publication.

How?

Skills and habits necessary to make a practice of reproducibility.
An everyday practice, not a “publication time” concept.

57 of 74

STAT 159/259: Collaborative and Reproducible Data Science

Schedule: 3h Lectures, 2h lab
Prerequisites: foundations in computation, probability and statistical modeling
Enrollment: ~ 50 undergrads, 10-20 grads, multiple majors.
Grading: weekly readings, quizzes, homework and 3 projects.

bit.ly/stat159-f17, bit.ly/stat159-sp21, bit.ly/stat159-sp22

Philip Stark

Statistics ('21)

Eli Ben-Michael

Statistics (‘17)

Daniel McAndrews

iSchool (‘21)

Facu Sapienza

Statistics (‘22)

Facu is at AGU in person. Talk to him!

58 of 74

Why? Ideas: readings

A few papers/blog posts/videos each week.
Reflections on computing, reproducible research and open science.
Tackle:

epistemological questions on scientific validity
challenges in reproducibility
changes in modern scientific practices
questions about the scientific community, incentives, …

Basis for weekly ~½ h discussion during lecture.

59 of 74

What/How?

Practical skills, underlying concepts

Fundamental Idea	Technical Implementation Today
Version Control	Git and GitHub
Programming	Python
Process Automation	Make
Data Analysis	NumPy, Pandas, Matplotlib, Xarray, ...
Software Testing	pytest
Documentation and Publishing	Markdown, Sphinx and Jupyter Book
Continuous Integration	GitHub Actions
Reproducible Containers	Binder (uses Docker)
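
As one concrete instance of the "Software Testing" row, here is a small sketch of a function with two tests written in the plain-`assert` style that pytest collects. The function `anomalies` and the temperature values are made up for illustration and are not taken from the course materials.

```python
import math


def anomalies(values):
    """Return each value minus the mean of the series."""
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def test_anomalies_sum_to_zero():
    # Deviations from the mean should cancel out, up to rounding error.
    result = anomalies([14.1, 14.3, 14.8, 15.2])
    assert math.isclose(sum(result), 0.0, abs_tol=1e-9)


def test_anomalies_of_constant_series_are_zero():
    # A flat series has no deviation from its own mean.
    assert anomalies([3.0, 3.0, 3.0]) == [0.0, 0.0, 0.0]
```

Saved as a file such as `test_anomalies.py`, running `pytest` in the same directory should discover and run both `test_` functions.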

60 of 74

Computational hygiene: a daily habit

GitHub Classroom for everything

61 of 74

Explicit dependency management

62 of 74

Open publishing with Jupyter Book

Open interactive textbooks with Jupyter Notebooks

How can we make the notebook a publishable document?

63 of 74

mybinder.org: shareable reproducibility

Explicit Dependencies

+

64 of 74

Continuous Integration with GH Actions

We use a (standard, canned) workflow, and simply edit, commit and push…

GH Actions automatically does the rest: it runs our deploy-book workflow, which triggers a bot's GH Pages website deployment.

65 of 74

A theme for the course: earth & climate

A unifying thread: open science, reproducibility, science in the cloud
More than ML for industry
Beyond CSVs / Data Frames
First-principles models meet data-driven methods

ChelleGentemann

66 of 74

Homework: Real-World Reproducibility

67 of 74

Final Project: full research compendium

Main narrative notebook ("the paper"): summarizes and discusses results.

All analysis notebooks and custom code included.

Data: included in repo or linked if too large.
Tested, installable library support code.
Reproducibility support: Makefile and environment.yml
Good repository practices: README.md, LICENSE, .gitignore.
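
The checklist above lends itself to a quick automated check. The sketch below looks for the support files named on this slide at the top level of a repository; the function name `missing_files` and the assumption that every file lives in the repository root are mine, not a course requirement.

```python
import tempfile
from pathlib import Path

# File names taken from the checklist above; their location in the
# repository root is an assumption of this sketch.
REQUIRED_FILES = ["README.md", "LICENSE", ".gitignore", "Makefile", "environment.yml"]


def missing_files(repo_root):
    """Return the required files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo on a throwaway directory that contains only a README.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "README.md").write_text("# My project\n")
    print("Missing:", missing_files(tmp))
```

Pointing `missing_files` at a real project directory gives a fast sanity check before submitting, and the same function could be called from a test or a Makefile target.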

A “Standard Playbook”

68 of 74

A complete scientific narrative

69 of 74

Backed by analysis details

70 of 74

With supporting tools and tests

71 of 74

GitHub repo, reproducible on Binder

72 of 74

Made possible by our amazing team!

And many more!

73 of 74

Thank You!

74 of 74

Scan & fill out if you need to be added to bcourses: https://forms.gle/rqox3mTXp3EGvvZk6