Scientific Open Source? Understanding and Supporting Software in Science
Fernando Pérez, UC Berkeley
These slides: bit.ly/cern23-oss-fperez
A bit about me
🇨🇴
Today, software is essential to science.
Do we treat it as such?
IPython and Jupyter - a lesson in the power of open communities
In the midst of a PhD crisis - Open Source?
IPython: "an afternoon hack", Oct 2001
But...
Janko Hauser
@lugensade
Nathan Gray
@n8gray
Image credit: Scriberia, for the Turing Way Book Dashes.
Notebooks: Computational Narratives
Jupyter Notebooks: ~11M public notebooks on GitHub
Scientific Python: a whole ecosystem
Impact:
science, education, industry, …
A long time ago in a galaxy far, far away…
Einstein’s Field Equations of General Relativity, Annalen der Physik, 1916
Black holes: LIGO, Sept 14, 2015
2019: More Black Holes! Swiss Open Science Day @ EPFL
Katie Bouman
Caltech
Talk Video: https://youtu.be/TSgpIiktkwc
April 2021: Ingenuity on Mars
JRC Big Data Analytics Platform (JEODPP)
2020: Microsoft Planetary Computer
Planetary Computer Hub
The Planetary Computer Hub is a convenient option for computing on the data provided by the Planetary Computer. The Hub is a JupyterHub deployment that includes a set of commonly used packages for geospatial and sustainability data analysis. It’s enabled with Dask for scalable computing.
COVID-19
"This project showcases how you can use fastpages to create a static dashboard that update regularly using Jupyter Notebooks." https://covid19dashboards.com
Data 8: Computational and Inferential Thinking
The focus of the course is on reasoning, visualization, and interpretation, rather than calculations or the use of software packages. This approach is inspired by the boldly innovative Statistics by Freedman et al. (1978), a textbook that transformed the way the field of statistics was introduced to undergraduates at Berkeley and around the world.
[...]
However, the presentation of these classical lessons changed sharply when we adopted computation as a central tool
[...]
When students open up a browser-based Jupyter Notebook in a data science class, that action is as natural to them as breathing, regardless of their background or academic interests.
Berkeley’s Data Science Courses
Data Science Education at Scale
First class was yesterday!
Data 100: ~1,200 students
Data 8: ~2,000 students
Data 8 in Fall 2018
Annual combined numbers
At steady state, will easily reach ~50% of campus!
Fastest growing courses in Berkeley history
Timeline
2015 - 2016: JupyterHub. Initial release of JupyterHub.
2017: Data 8. Datahub piloted for the first time at Berkeley as part of Data 8 coursework in Spring.
2018: RStudio. Added RStudio to support courses using R.
2019: Data 8x. Introduced Data 8x Hub to support Berkeley’s online course teaching foundations of Data Science.
2020: 2i2c. Org dedicated to making open tools for interactive computing.
2021: 10K users. Service grew to about 10k users.
UC Berkeley Datahub - users in a semester
Engagement Metrics:
Summer
Fall 21 Semester
5000 daily active users
: key technology
Wide industrial adoption
This would never have happened under "traditional academic models"
Tools built by scientists, for their science!
A fluid dynamicist in India, an EE entrepreneur in Texas…
Physics PhD: Lattice QCD simulations
Matplotlib: ~2002-2003
John Hunter, Department of Pediatric Neurology, University of Chicago.
Career paths?
Traditional software infrastructure funding
Yes, it’s true, the budget is gone again… But you can’t deny that now, we get here in an instant!
Quino (Argentinian cartoonist)
“The Stack”: a complete ecosystem
Domain-agnostic backbone/trunk
Contrasts in culture and incentives
| | Open Source | Academia |
|---|---|---|
| Credit | Distributed | PI & hierarchy |
| Output/artifacts | Continuous & project-specific | Discrete papers |
| Collaborators | Fluid: professionals, volunteers, … | Structured, funding-dependent |
| Governance, decision making | Open, community based | Top-down, PI |
| Authorship | Fluid, roles can evolve, no clear “first/senior” author | Need to say more? |
| Peer review | Continuous, open, pervasive, friendly | The opposite |
| Value metric | Utility, need, impact | “Novel and transformative” |
To build this, we need to recognize that there's a lot more than code…
Scientific Software: more than code
Content and Services (Notebooks)
Software
Standards and Protocols
Community
Content & Services
A language agnostic protocol
~100 different kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
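To make "language agnostic" concrete, here is a minimal sketch of what a client sends to a kernel when it asks for code to be run. It assumes the publicly documented Jupyter messaging format (an `execute_request` message with `header`, `parent_header`, `metadata` and `content` parts). The helper name `make_execute_request`, the `"demo"` username and the protocol version string are illustrative assumptions, not part of any Jupyter library. The real wire format also adds ZeroMQ framing and an HMAC signature, which this sketch leaves out.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed protocol version string, used here for illustration only.
PROTOCOL_VERSION = "5.3"


def make_execute_request(code, session_id, username="demo"):
    """Build an execute_request message as a plain dictionary.

    Hypothetical helper: real clients normally rely on a client library
    rather than assembling messages by hand.
    """
    header = {
        "msg_id": uuid.uuid4().hex,       # unique id for this message
        "session": session_id,            # id shared by one client session
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": PROTOCOL_VERSION,
    }
    content = {
        "code": code,                     # source text in the kernel's language
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},              # empty: this message starts a exchange
        "metadata": {},
        "content": content,
    }


# The same structure carries Python, R, Julia, or any other kernel's code:
# only the string in content["code"] changes.
msg = make_execute_request("1 + 1", session_id=uuid.uuid4().hex)
wire = json.dumps(msg)
```

Because the message is plain JSON and the code is just a string, a kernel can be written in any language that can parse JSON and speak the transport, which is what allows so many kernels to share one set of front ends.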
Community: formalized governance
Formal fiscal sponsorship
Large, inclusive, multi-stakeholder community. jupyter.org/governance
Governance structure
Community recognition
2023 Executive Council Members
Community: where all the credit goes
And thousands more open source contributors!
More than code, woven into science
Services and content: impact
Software
Standards and Protocols: ecosystem
Community: innovation & resiliency
People
Ideas
Tools
Stories
So, what next?
Computers are scientific instruments: teach scientists how to use them well!
Careers? Papers vs Software:
"Economic model" differences
Some bright funding/policy lights
Incentives & policies support culture
It's about building healthy & productive communities.
THANK YOU
These slides: bit.ly/cern23-oss-fperez
@fperez
fernando.perez@berkeley.edu
@fperez_org