1 of 74

Stat 159/259

Collaborative and Reproducible Data Science

Fernando Pérez

Jimmy Butler

Sequoia Andrade

2 of 74

Your awesome TAs

3 of 74

A bit about me

  • Physics undergrad in Colombia:
    • Open science (Linux, arXiv, …).
  • Particle Physics (QCD) in Colorado:
    • Software "on the side"
  • A long journey through applied math, neuroscience, data science…
    • Woven with open source software
  • Today: faculty at Berkeley and LBNL, directing several initiatives where open source tools are essential.

4 of 74

IPython and Jupyter - a lesson in the power of open communities

5 of 74

IPython: "an afternoon hack", Oct 2001

But...

  • IPython
  • Interactive Python Prompt - J. Hauser
  • Lazy Python - N. Gray

Janko Hauser

@lugensade

Nathan Gray

@n8gray

6 of 74

Notebooks: Computational Narratives

  • Interactive exploration is...
  • a conversation with the computer,
  • and with humans,
  • woven into a shareable, reproducible narrative.

7 of 74

Image credit: Scriberia, for the Turing Way Book Dashes.

Jupyter is an open community dedicated to modular, platform-agnostic tools for interactive computing

8 of 74

Scientific Software: more than code

Content and Services (Notebooks)

Software

Standards and Protocols

Community

9 of 74

Content & Services

10 of 74

Explosive Adoption

11 of 74

The Jupyter architecture in a nutshell

Taken from our IPython in Depth tutorial, where you can find a lot more details.

12 of 74

A language agnostic protocol
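
To make "language agnostic" concrete, here is a minimal sketch of what a client sends when you run a cell: an execute_request message, built as a plain dictionary. The code to run is just a string, so the same message shape works for a kernel in any language. This sketch only builds and prints the message; real clients also sign it and send it to the kernel over ZeroMQ. The function name, the default username and the version string are illustrative assumptions.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="student"):
    """Build an execute_request message as a plain dictionary (nothing is sent)."""
    header = {
        "msg_id": uuid.uuid4().hex,       # unique id for this message
        "session": session_id,            # id shared by all messages from one client session
        "username": username,             # assumption: any label works here
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                 # assumption: a 5.x protocol version
    }
    content = {
        "code": code,                     # source text; the kernel decides what language it is
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},              # empty: this message is not a reply
        "metadata": {},
        "content": content,
    }


msg = make_execute_request("1 + 1", session_id=uuid.uuid4().hex)
print(json.dumps(msg, indent=2))
```

Swapping "1 + 1" for a line of R or Julia changes nothing else in the message, which is the point of the protocol.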

13 of 74

Community: formalized governance

Large, inclusive, multi-stakeholder community. jupyter.org/governance

Governance structure

  • Elected Executive Council (6 people)
  • Software Steering Council (large)
  • Jupyter Foundation Governance Board (external organizations - funding and partnerships)

Community recognition

  • Distinguished Contributors

2024 Executive Council Members

14 of 74

Community: where all the credit goes

And thousands more open source contributors!

15 of 74

Scientific Python: a whole ecosystem

16 of 74

More than code, woven into science

Services and content: impact

Software

Standards and Protocols: ecosystem

Community: innovation & resiliency

People

Ideas

Tools

Stories

17 of 74

Tools for the Life-Cycle of research ideas

  • Exploring:
  • Collaborating:
  • Production:
  • Sharing:
  • Teaching:

18 of 74

Markdown, code, computation…
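
A notebook file is plain JSON that stores Markdown and code cells side by side. The sketch below builds a minimal two-cell notebook by hand using only the standard library, to show what is inside an .ipynb file. The helper names and cell contents are made up for illustration; in practice the nbformat library or the Jupyter interface writes these files for you.

```python
import json


def markdown_cell(text):
    """A narrative cell: just Markdown source."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """A code cell that has not been executed yet (no outputs, no execution count)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


notebook = {
    "cells": [
        markdown_cell("# A tiny computational narrative"),
        code_cell("print(2 + 2)"),
    ],
    "metadata": {},
    "nbformat": 4,        # major version of the notebook format
    "nbformat_minor": 4,  # assumption: minor version 4, which does not require cell ids
}

notebook_text = json.dumps(notebook, indent=1)
print(notebook_text)
```

Saving notebook_text to a file ending in .ipynb should give something Jupyter can open, which is also why notebooks work with ordinary tools like Git.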

19 of 74

Data

20 of 74

Community-extensible: NeuroHackademy 2018

Credit:

Anisha Keshavan

Nate Vack

Chris Gorgolewski

21 of 74

Real-Time Collaboration!

Notebooks, text files, …

22 of 74

Jupyter-AI: vendor agnostic, fully open…

23 of 74

Production: cloud or HPC!

24 of 74

Harnessing the power of cloud computing to study the whole Earth interactively

Interactivity

Distributed computing

Data models / numerics

25 of 74

https://mystmd.org

  1. Markup Language
  2. Specification
  3. Open Source Tools

Slides credit:

Rowan Cockett (CurveNote)

26 of 74

27 of 74

The future of publishing/sharing

28 of 74

New publication workflows…

29 of 74

Impact:

science, education, industry, …

30 of 74

Black holes: LIGO, Sept 14, 2015

31 of 74

2019: More Black Holes! Swiss Open Science Day @ EPFL

Katie Bouman

Caltech

32 of 74

Data8:

Computational and Inferential Thinking

The focus of the course is on reasoning, visualization, and interpretation, rather than calculations or the use of software packages. This approach is inspired by the boldly innovative Statistics by Freedman et al. (1978), a textbook that transformed the way the field of statistics was introduced to undergraduates at Berkeley and around the world.

[...]

However, the presentation of these classical lessons changed sharply when we adopted computation as a central tool

[...]

When students open up a browser-based Jupyter Notebook in a data science class, that action is as natural to them as breathing, regardless of their background or academic interests.

33 of 74

Data Science Education at Scale

Data 100:

~1,300

Data 8:

~2,000 students

34 of 74

35 of 74

  • Non-profit spinoff
    • (Berkeley, Columbia, UBC)
  • Business model: service provider for interactive cloud infrastructure.
  • Serves research and education, supports Open Source.

Funding

(Open Science Program)

36 of 74

  1. Fortran (1957)
  2. FFT (1965)
  3. Biological databases (1965)
  4. Global Circulation Model (1969)
  5. BLAS - Numerical Linear Algebra (1979)
  6. NIH Image (1987)
  7. BLAST - Gene sequencing (1990)
  8. arXiv.org - Preprints (1991)
  9. IPython/Jupyter notebook (2011)
  10. AlexNet - Deep Learning (2012)

37 of 74

Credit: Maryam Zaringhalam, NIH/NLM/OSTP

38 of 74

Embraced by Industry (large & small!)

OpenAI system card for the o1 model

Security evaluation, p. 17.

39 of 74

With this open architecture,

we are building in new spaces!

40 of 74

Agile Metabolic Health & JupyterHealth

41 of 74

Jupyter Health architecture: research and operations

Pilot: Yr 3 and 4

Platform Build

Prototyping

SMART-on-FHIR visualization app

Jupyter Health

Runtime AI/ML platform

FHIR server

“Dev”

“Ops”

42 of 74

Actual workflow with my data

43 of 74

Jupyter, in the cloud, with my BP data

44 of 74

Collaborative, cloud-native 3d modeling

45 of 74

GeoJupyter

Open Platform for Earth/Geospatial Science

46 of 74

GeoSpatial data: the "crude oil" of modern data.

  • Science
    • Climate
    • Ecology
    • Remote sensing
    • Engineering
  • Government & policy
    • Urban planners
    • Transportation managers
    • Emergency response agencies
  • Industry
    • Agriculture
    • Finance
    • Logistics

Image: GPT-4o

47 of 74

My concern: proprietary capture!

US hazards data is locked behind proprietary ESRI tools, b/c FEMA integrated their Hazards Model for Natural Catastrophes into the ESRI ecosystem.

48 of 74

JupyterGIS - rapid prototyping

Try it in your browser! github.com/geojupyter/jupytergis

49 of 74

2023

NASA’s Year of

Open Science

NASA Transform to Open Science Mission

Dr. Chelle Gentemann, Science Lead

Yvonne Ivey, Equity Lead

Cyndi Hall, Community Coordinator

Isabella Martinez, Content Coordinator

Dr. Yaitza Luna-Cruz, TOPS Program Officer

Dr. Paige Martin, TOPS Program Officer

Kevin Murphy, Chief Science Data Officer SMD

Katie Baynes, Deputy Chief Science Data Officer SMD

Dr. Steve Crawford, Science Policy Officer SMD

Andy Mitchell,

Dr. Elena Steponaitis, SMD Development Program Executive

Amy (Uyen) Truong, Chief Science Data Office Coordinator

Dr. Rachel Paseka, OSSI Program Officer

Dr. J.L. Galache, OSSI Program Officer

Dr. Demitri Muna, OSSI Program Officer

Molly Adams, OSSI Coordinator

TOPS Email List

TOPS Website

SCIENCE MISSION DIRECTORATE

50 of 74

Open-source science at the forefront

51 of 74

A quick reflection on the explosive growth of the AI/ML scene

(obviously the $$$ of industry investment helped a little 😉)

52 of 74

A culture of rapid, open iteration

Code

Data

Publications

Platforms

53 of 74

"Open Source is eating our lunch"

Leaked internal Google Document, May 4, 2023

semianalysis.com/p/google-we-have-no-moat-and-neither

54 of 74

GitHub: Python, Jupyter and AI

55 of 74

Jupyter-AI: vendor agnostic, fully open…

56 of 74

STAT 159/259: Collaborative and Reproducible Data Science

  • Why?
    • An essential concern of modern computational research.
    • Social and scientific implications of lack of reproducibility.
    • Frame the problem in practical, ethical and epistemological terms.
  • What?
    • Core ideas: data access, computation, statistical analysis and publication.
  • How?
    • Skills and habits necessary to make a practice of reproducibility.
    • An everyday practice, not a “publication time” concept.

57 of 74

STAT 159/259: Collaborative and Reproducible Data Science

  • Schedule: 3h Lectures, 2h lab
  • Prerequisites: foundations in computation, probability and statistical modeling
  • Enrollment: ~ 50 undergrads, 10-20 grads, multiple majors.
  • Grading: weekly readings, quizzes, homework and 3 projects.

Philip Stark

Statistics (‘21)

Eli Ben-Michael

Statistics (‘17)

Daniel McAndrews

iSchool (‘21)

Facu Sapienza

Statistics (‘22)

Facu is at AGU in person. Talk to him!

58 of 74

Why? Ideas: readings

  • A few papers/blog posts/videos each week.
  • Reflections on computing, reproducible research and open science.
  • Tackle
    • epistemological questions on scientific validity
    • challenges in reproducibility
    • changes in modern scientific practices
    • questions about the scientific community, incentives, …
  • Basis for weekly ~½ h discussion during lecture.

59 of 74

What/How?

Practical skills, underlying concepts

Fundamental Idea               Technical Implementation Today
Version Control                Git and GitHub
Programming                    Python
Process Automation             Make
Data Analysis                  Numpy, Pandas, Matplotlib, Xarray, ...
Software Testing               PyTest
Documentation and Publishing   Markdown, Sphinx and JupyterBook
Continuous Integration         GitHub Actions
Reproducible Containers        Binder (uses Docker)
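
For the software testing entry above, a minimal sketch of what a tested function can look like. The standardize function and its tests are hypothetical examples, not course code. Functions whose names start with test_ are the ones pytest collects when you run it on the file.

```python
import math
import statistics


def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mean) / sd for v in values]


def test_standardize_mean_and_spread():
    # After standardizing, the mean should be 0 and the spread 1 (up to rounding).
    z = standardize([2.0, 4.0, 6.0, 8.0])
    assert math.isclose(statistics.fmean(z), 0.0, abs_tol=1e-12)
    assert math.isclose(statistics.pstdev(z), 1.0, rel_tol=1e-12)


def test_standardize_rejects_constant_data():
    # Constant data has zero spread, so the function should refuse it.
    try:
        standardize([3.0, 3.0, 3.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for constant data")
```

The second test checks the failure case on purpose: tests that pin down what should go wrong are as useful as tests for what should go right.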

60 of 74

Computational hygiene: a daily habit

GitHub Classroom for everything

61 of 74

Explicit dependency management
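
One way to see what "explicit" means here: write down exactly which package versions your analysis ran with. The sketch below lists the packages visible to the current Python interpreter as name==version lines, using only the standard library. The function names are illustrative assumptions; in day-to-day work, tools such as pip freeze or conda env export do this job, and the result is kept in a file that lives in the repository.

```python
from importlib import metadata


def pin_lines(packages):
    """Turn (name, version) pairs into sorted 'name==version' lines."""
    return sorted(f"{name}=={version}" for name, version in packages)


def installed_packages():
    """Yield (name, version) for each distribution visible to this interpreter."""
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            yield name, dist.version


for line in pin_lines(installed_packages()):
    print(line)
```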

62 of 74

Open publishing with Jupyter Book

Open interactive textbooks with Jupyter Notebooks

How can we make the notebook a publishable document?

63 of 74

mybinder.org: shareable reproducibility

Explicit Dependencies
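
What makes this shareable is that a public GitHub repository with explicit dependencies maps to a single link. A minimal sketch that builds such a link, assuming the mybinder.org URL pattern /v2/gh/<org>/<repo>/<ref>; the organization and repository names below are placeholders, not real projects.

```python
from urllib.parse import quote


def binder_url(org, repo, ref="HEAD"):
    """Build a mybinder.org launch link for a GitHub repository.

    ref is a branch, tag or commit; "HEAD" means the default branch.
    """
    parts = (quote(part, safe="") for part in (org, repo, ref))
    return "https://mybinder.org/v2/gh/" + "/".join(parts)


# Placeholder names for illustration only.
print(binder_url("example-org", "example-repo"))
```

Pinning ref to a tag or commit instead of the default branch gives readers the exact version of the analysis you shared.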

64 of 74

Continuous Integration with GH Actions

We use a (standard, canned) workflow, and simply edit, commit and push…

GH Actions automatically does the rest: it runs our deploy-book workflow, which triggers a bot's GH Pages website deployment.

65 of 74

A theme for the course: earth & climate

  • A unifying thread: open science, reproducibility, science in the cloud
  • More than ML for industry
  • Beyond CSVs / Data Frames
  • First-principles models meet data-driven methods

66 of 74

Homework: Real-World Reproducibility

67 of 74

Final Project: full research compendium

  • Main narrative notebook ("the paper"): summarizes and discusses results.
    • All analysis notebooks and custom code included.
  • Data: included in repo or linked if too large.
  • Tested, installable library support code.
  • Reproducibility support: Makefile and environment.yml
  • Good repository practices: README.md, LICENSE, .gitignore.
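
The checklist above can be partly automated. Here is a small sketch that reports which of the files named in the list are missing from a project folder. The function name is hypothetical, and the demo runs on a throwaway temporary folder rather than a real repository; the check covers file presence only, not content.

```python
import tempfile
from pathlib import Path

# File names taken from the checklist above.
REQUIRED_FILES = ("README.md", "LICENSE", ".gitignore", "Makefile", "environment.yml")


def missing_files(repo_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in repo_dir."""
    root = Path(repo_dir)
    return [name for name in required if not (root / name).is_file()]


# Demo: a temporary folder that contains only two of the five files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "README.md").write_text("# Demo project\n")
    (Path(tmp) / "Makefile").write_text("all:\n")
    demo_missing = missing_files(tmp)

print(demo_missing)
```

Calling missing_files(".") from the top of your own repository gives a quick pre-submission check.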

A “Standard Playbook”

68 of 74

A complete scientific narrative

69 of 74

Backed by analysis details

70 of 74

With supporting tools and tests

71 of 74

GitHub repo, reproducible on Binder

72 of 74

Made possible by our amazing team!

And many more!

73 of 74

Thank You!

74 of 74

Scan & fill out if you need to be added to bcourses: https://forms.gle/rqox3mTXp3EGvvZk6