1 of 50

Scientific Open Source? Understanding and Supporting Software in Science

Fernando Pérez, UC Berkeley

2 of 50

A bit about me

  • Physics undergrad in Colombia:
    • Open science (Linux, arXiv, …).
  • Particle Physics (QCD) in Colorado:
    • Software "on the side"
  • A long journey through applied math, neuroscience, data science…
    • Woven with open source software
  • Today: faculty @ Berkeley and LBNL, directing several initiatives where open source tools are essential.

🇨🇴

3 of 50

Today, software is essential to science.

Do we treat it as such?

4 of 50

IPython and Jupyter - a lesson in the power of open communities

5 of 50

In the midst of a PhD crisis - Open Source?

  • Ethics: access and equity

  • Community: collaboration

  • Epistemology: proprietary science???

  • Technical: Python!! ❤️🐍

6 of 50

IPython: "an afternoon hack", Oct 2001

But...

  • IPython
  • Interactive Python Prompt - J. Hauser
  • Lazy Python - N. Gray

Janko Hauser

@lugensade

Nathan Gray

@n8gray

7 of 50

8 of 50

Image credit: Scriberia, for the Turing Way Book Dashes.

9 of 50

Notebooks: Computational Narratives

  • Interactive exploration is...
  • a conversation with the computer,
  • and with humans,
  • woven into a shareable, reproducible narrative.

10 of 50

Jupyter Notebooks: ~11M on GitHub

~11M public notebooks on GitHub

11 of 50

Scientific Python: a whole ecosystem

12 of 50

Impact:

science, education, industry, …

13 of 50

A long time ago in a galaxy far, far away…

Einstein’s Field Equations of General Relativity, Annalen der Physik, 1916

14 of 50

Black holes: LIGO, Sept 14, 2015

15 of 50

2019: More Black Holes! Swiss Open Science Day @ EPFL

Katie Bouman

Caltech

16 of 50

April 2021: Ingenuity on Mars

17 of 50

JRC Big Data Analytics Platform (JEODPP)

18 of 50

2020: Microsoft Planetary Computer

Planetary Computer Hub

The Planetary Computer Hub is a convenient option for computing on the data provided by the Planetary Computer. The Hub is a JupyterHub deployment that includes a set of commonly used packages for geospatial and sustainability data analysis. It’s enabled with Dask for scalable computing.

19 of 50

COVID-19

"This project showcases how you can use fastpages to create a static dashboard that update regularly using Jupyter Notebooks." https://covid19dashboards.com

20 of 50

  1. Fortran (1957)
  2. FFT (1965)
  3. Biological databases (1965)
  4. General Circulation Model (1969)
  5. BLAS - Numerical Linear Algebra (1979)
  6. NIH Image (1987)
  7. BLAST - Sequence search (1990)
  8. arXiv.org - Preprints (1991)
  9. IPython/Jupyter notebook (2011)
  10. AlexNet - Deep Learning (2012)

21 of 50

Data8:

Computational and Inferential Thinking

The focus of the course is on reasoning, visualization, and interpretation, rather than calculations or the use of software packages. This approach is inspired by the boldly innovative Statistics by Freedman et al. (1978), a textbook that transformed the way the field of statistics was introduced to undergraduates at Berkeley and around the world.

[...]

However, the presentation of these classical lessons changed sharply when we adopted computation as a central tool

[...]

When students open up a browser-based Jupyter Notebook in a data science class, that action is as natural to them as breathing, regardless of their background or academic interests.

22 of 50

Berkeley’s Data Science Courses

23 of 50

Data Science Education at Scale

First class was yesterday!

  • Data 100: ~1,200 students
  • Data 8: ~2,000 students

24 of 50

Data 8 in Fall 2018

  • ~ 1,300 enrolled students
  • ~ 200 waitlisted

Annual combined numbers

  • Data 8: ~ 3,000 students
  • UC Berkeley: ~ 7,500

At steady state, will easily reach ~50% of campus!

Fastest growing courses in Berkeley history

25 of 50

Timeline

  • 2015-2016, JupyterHub: initial release of JupyterHub.
  • 2017, Data 8: Datahub piloted for the first time at Berkeley as part of Data 8 coursework in Spring.
  • 2018, RStudio: added RStudio to support courses using R.
  • 2019, Data 8x: introduced the Data 8x Hub to support Berkeley’s online course teaching the foundations of data science.
  • 2020, 2i2c: organization dedicated to making open tools for interactive computing.
  • 2021, 10K users: service grew to roughly 10k users.

26 of 50

UC Berkeley Datahub - users in a semester

Engagement Metrics:

  • 4000+ Daily Active Users (DAU) during peak days
  • 8400+ Monthly Active Users (MAU)
  • 60+ courses across the campus using Datahub during Fall 2021
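
As a rough illustration of how metrics like DAU and MAU are typically derived, here is a minimal sketch that counts distinct users per day and per calendar month. The event log, the user names, and the `active_users` helper are all hypothetical; this is not how Datahub actually computes its numbers, and some teams define MAU over a rolling 30-day window instead of a calendar month.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity log: (user id, day on which the user was active).
events = [
    ("ana", date(2021, 9, 1)),
    ("ana", date(2021, 9, 1)),   # repeat visits on the same day count once
    ("ben", date(2021, 9, 1)),
    ("ana", date(2021, 9, 2)),
    ("cat", date(2021, 9, 15)),
    ("ben", date(2021, 10, 3)),
]


def active_users(events, key):
    """Count distinct users per bucket; `key` maps a date to its bucket."""
    buckets = defaultdict(set)
    for user, day in events:
        buckets[key(day)].add(user)
    return {bucket: len(users) for bucket, users in buckets.items()}


dau = active_users(events, key=lambda d: d)                  # one bucket per day
mau = active_users(events, key=lambda d: (d.year, d.month))  # one bucket per calendar month
peak_dau = max(dau.values())                                 # busiest single day
```

Using sets per bucket means a user who logs in many times in one day or month is still counted once, which is the usual meaning of "active users".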

[Chart labels: Summer; Fall 21 Semester; 5000 daily active users]

27 of 50

28 of 50

JupyterHub: key technology

  • Open, modular, flexible: empower users
  • Hosting for equity: all students have the same tools
  • Replicates local experience
    • But without the tech support nightmare
  • Real world tools, not "teaching toys"

29 of 50

Wide industrial adoption

30 of 50

This would never have happened under "traditional academic models"

Tools built by scientists, for their science!

31 of 50

A fluid dynamicist in India, an EE entrepreneur in Texas…

Physics PhD: Lattice QCD

Simulations

32 of 50

Matplotlib: ~2002-2003

John Hunter, Department of Pediatric Neurology, University of Chicago.

33 of 50

Career paths?

34 of 50

35 of 50

Traditional software infrastructure funding

Yes, it’s true, the budget is gone again… But you can’t deny that now, we get here in an instant!

Quino (Argentinian cartoonist)

36 of 50

“The Stack”: a complete ecosystem

Domain-agnostic backbone/trunk

  • Not “real CS”
  • Not “real research”
  • Nobody’s problem
  • Critical for everything else

37 of 50

Contrasts in culture and incentives

  • Credit
    • Open Source: Distributed
    • Academia: PI & hierarchy
  • Output/artifacts
    • Open Source: Continuous & Project-specific
    • Academia: Discrete papers
  • Collaborators
    • Open Source: Fluid: professionals, volunteers, …
    • Academia: Structured, funding-dependent
  • Governance, decision making
    • Open Source: Open, community based
    • Academia: Top-down, PI
  • Authorship
    • Open Source: Fluid, roles can evolve, no clear “first/senior” author
    • Academia: Need to say more?
  • Peer review
    • Open Source: Continuous, open, pervasive, friendly
    • Academia: The opposite
  • Value metric
    • Open Source: Utility, need, impact
    • Academia: “Novel and transformative”

38 of 50

To build this, we need to recognize that there's a lot more than code…

39 of 50

Scientific Software: more than code

Content and Services (Notebooks)

Software

Standards and Protocols

Community

40 of 50

Content & Services

41 of 50

A language agnostic protocol
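
To make "language agnostic" concrete, here is a minimal sketch of the general shape of a Jupyter protocol message, built with the Python standard library only. The `make_message` helper, the user name, and the sample code string are assumptions for illustration. A real client also sends messages over ZeroMQ with a signature, which this sketch leaves out.

```python
import json
import uuid
from datetime import datetime, timezone


def make_message(msg_type, content, session, username="demo-user", parent_header=None):
    """Build a dict shaped like a Jupyter protocol message (illustrative only)."""
    header = {
        "msg_id": uuid.uuid4().hex,        # unique id for this message
        "session": session,                # id shared by all messages of one client session
        "username": username,              # hypothetical user name
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": msg_type,
        "version": "5.3",                  # protocol version string
    }
    return {
        "header": header,
        "parent_header": parent_header or {},  # header of the message being replied to, if any
        "metadata": {},
        "content": content,
    }


session_id = uuid.uuid4().hex
request = make_message(
    "execute_request",
    {
        "code": "1 + 1",          # plain text: the kernel decides how to run it
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    },
    session=session_id,
)

# The message is plain JSON, so any language with a JSON library can produce or consume it.
wire_text = json.dumps(request)
decoded = json.loads(wire_text)
```

Nothing in the message depends on Python: the `code` field is just a string, so the same request shape can be sent to a kernel for R, Julia, or any other language.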

42 of 50

Community: formalized governance

Formal fiscal sponsorship

Large, inclusive, multi-stakeholder community. jupyter.org/governance

Governance structure

  • Elected Executive Council (6 people)
  • Software Steering Council (large)
  • Community Advisory Panel (external input)

Community recognition

  • Distinguished Contributors

2023 Executive Council Members

43 of 50

Community: where all the credit goes

And thousands more open source contributors!

44 of 50

More than code, woven into science

Services and content: impact

Software

Standards and Protocols: ecosystem

Community: innovation & resiliency

People

Ideas

Tools

Stories

45 of 50

So, what next?

46 of 50

Computers are scientific instruments -

teach scientists how to use them well!

47 of 50

Careers? Papers vs Software:

"Economic model" differences

  • Publications: an asset that pays dividends
    • Benefits to authors increase monotonically.
    • (Mostly) "fire and forget" - little need to maintain old papers.
    • In some countries (🇨🇴), economic benefit is direct, cumulative and perpetual.
      • Incentivizes all kinds of toxic behavior and creates lasting institutional burdens.
  • Scientific Software projects: a liability with recurring payments
    • Open source 'is free like a puppy is free'
    • May have very long life cycles (e.g. for me - IPython: 2001-today)
    • Many/most project activities don't map cleanly to traditional academic metrics

48 of 50

Some bright funding/policy lights

49 of 50

Incentives & policies support culture

  • Papers ≠ measure of intellectual contributions.
  • Run away from narrow, one-dimensional metrics
    • h-index, impact factor and the like are toxic and detrimental to good science.
  • Beware of "novel and transformative" claims in proposals.
    • Blindly chasing novelty often leads to little lasting value.
  • Support the entire spectrum of activities in OSS projects
    • See e.g. "All Contributors" framework.

It's about building healthy & productive communities.

50 of 50

THANK YOU

@fperez

fernando.perez@berkeley.edu

@fperez_org