Understanding notebooks using code-based visualization techniques
Colin Brown
Computational Notebooks
Emacs IPython Notebook
Jupyter Notebook
Adapts computational notebooks into a cell-based model.
Set of inputs mapped to a set of outputs.
Background
Worked on an “active version” of creating cell-based visualizations in the Dataflow Notebook project
Interested in other ways of increasing understanding in the notebook area
The Problem
Jupyter Notebook has a number of issues with reproducibility:
Focus on knowledge gap
[Kang 2021]
[Head 2019]
Basic numpy tutorial
Versioning
Not much additional complexity
Only really added some extra prints
More calls to arr
Nbdime
There are some great tools out there for comparing notebooks
Notebooks are not like source code
Set of pieces, loss of semantics
Does it help?
Nbdime can be useful...
Alternate Approaches
Decompose code into abstract foundational elements
Variables “travel from cell to cell”
Cell to Cell relationships compose the structure of the notebook.
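A minimal sketch of how these cell-to-cell relationships could be recovered statically, assuming each cell is plain Python and using only the standard `ast` module. The helper names `cell_names` and `cell_edges` are hypothetical and are not taken from the project. Imports, function definitions, and IPython magics are ignored here.

```python
import ast

def cell_names(source):
    """Return (defined, used) variable names for one cell's source."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return defined, used

def cell_edges(cells):
    """Link each cell to the most recent earlier cell that assigned a name it reads."""
    last_def = {}  # variable name -> index of the cell that last assigned it
    edges = set()
    for i, src in enumerate(cells):
        defined, used = cell_names(src)
        for name in used:
            if name in last_def:
                edges.add((last_def[name], i, name))
        for name in defined:
            last_def[name] = i
    return edges

cells = [
    "arr = [1, 2, 3]",
    "total = sum(arr)",
    "print(total, arr)",
]
for src, dst, name in sorted(cell_edges(cells)):
    print(f"cell {src} -> cell {dst} via {name}")
```

Each edge records which variable "travelled" between two cells, which is the raw material for a graph of the notebook's structure.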
Combine two notebooks together
Adjacency Matrices?
Create a unified diff
Calculate similarity based on how many lines are the same (a very simple metric)
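The line-based metric above can be sketched as a matrix of cell-to-cell scores between two notebook versions. This is only an illustration: the choice of Jaccard overlap (shared lines divided by all distinct lines) and the names `line_similarity` and `similarity_matrix` are assumptions, not the project's actual implementation.

```python
def line_similarity(cell_a, cell_b):
    """Share of distinct non-blank lines that two cells have in common (Jaccard)."""
    lines_a = {ln.strip() for ln in cell_a.splitlines() if ln.strip()}
    lines_b = {ln.strip() for ln in cell_b.splitlines() if ln.strip()}
    union = lines_a | lines_b
    if not union:
        return 0.0  # two empty cells: treat as no evidence of similarity
    return len(lines_a & lines_b) / len(union)

def similarity_matrix(old_cells, new_cells):
    """Rows are cells of the original notebook, columns are cells of the update."""
    return [[line_similarity(a, b) for b in new_cells] for a in old_cells]

old = ["import numpy as np", "arr = np.arange(6)\nprint(arr)"]
new = ["import numpy as np", "arr = np.arange(6)\nprint(arr)\nprint(arr.shape)"]

for row in similarity_matrix(old, new):
    print([round(score, 2) for score in row])
```

A score of 1.0 means the two cells share every line; an added print lowers the score without dropping it to zero.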
Something more practical
Part of Dataflow Notebook project had to create some examples
Grabbed a tutorial notebook from scikit-learn
Even tutorials can change significantly.
Everything is different!
How crazy could this be?
Some similarity here but many gaps
Look at the first cell that has a high value
It looks pretty similar
Tokenizing
More nuance
Better picture of what’s going on
Most connected part is mostly the same
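One way to get this extra nuance is to compare token sequences instead of whole lines, so that a small edit inside a line still counts as mostly similar. The sketch below uses the standard `tokenize` and `difflib` modules. The function names and the fallback to whitespace splitting for cells that do not tokenize cleanly are assumptions made for illustration.

```python
import difflib
import io
import tokenize

# Token types that carry layout or comments rather than code content.
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}

def cell_tokens(source):
    """Return the code tokens of a cell, ignoring comments and layout."""
    try:
        reader = io.StringIO(source).readline
        return [tok.string for tok in tokenize.generate_tokens(reader)
                if tok.type not in _SKIP]
    except (tokenize.TokenError, SyntaxError):
        # Assumed fallback for cells the tokenizer rejects.
        return source.split()

def token_similarity(cell_a, cell_b):
    """Ratio in [0, 1] of matching tokens between two cells."""
    matcher = difflib.SequenceMatcher(
        None, cell_tokens(cell_a), cell_tokens(cell_b), autojunk=False)
    return matcher.ratio()

old_cell = "clf = SVC(gamma=0.001)"
new_cell = "clf = SVC(gamma=0.001, C=100.)  # tuned"
print(round(token_similarity(old_cell, new_cell), 2))
```

A line-based comparison would treat these two cells as completely different, while the token-based ratio shows that most of the statement is unchanged.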
Evolving from Diffs
Diffs don’t really feel right in this space.
Colors representing different things make sense inside diff
Why not do better?
Capture shape of each notebook in overlay
What does this get us?
We can see more subtlety about the structure
Important part: We build these graphs based on relationships between cells
The new notebook view captures structural similarity
Still nuance to the structure:
Left: The original notebook Right: The updated version
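The idea of overlaying the shape of two notebooks can be sketched as a comparison of their cell-relationship edges. This assumes the cells of both versions have already been matched to shared labels, which the slides do not describe. The function names and the example edges are hypothetical.

```python
def overlay(old_edges, new_edges):
    """Classify cell-to-cell edges by which notebook version contains them."""
    old_edges, new_edges = set(old_edges), set(new_edges)
    return {
        "shared": old_edges & new_edges,   # structure kept in both versions
        "removed": old_edges - new_edges,  # only in the original notebook
        "added": new_edges - old_edges,    # only in the updated notebook
    }

def structural_similarity(old_edges, new_edges):
    """Fraction of all edges that appear in both versions (assumed metric)."""
    parts = overlay(old_edges, new_edges)
    total = sum(len(group) for group in parts.values())
    return len(parts["shared"]) / total if total else 0.0

# Hypothetical edges: (source cell label, target cell label)
original = {("load", "split"), ("split", "fit"), ("fit", "plot")}
updated = {("load", "split"), ("split", "fit"), ("fit", "report")}

for kind, edges in overlay(original, updated).items():
    print(kind, sorted(edges))
print(structural_similarity(original, updated))
```

Each of the three groups could be drawn in its own color in a single combined graph, rather than placing two separate graphs side by side.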
Related Work and Thoughts
Versioning your own work is a problem, but snippets are a large problem in Jupyter Notebook.
Sigvardsson reports that 53.9% of all Jupyter notebook files consist of snippets that are found in other notebooks. [Sigvardsson 2019]
Little has been done in static analysis of notebooks compared to dynamic concepts.
Work done by John Wenskovitch is similar, but that project went in a different direction.
[Wenskovitch 2019]
Future Aims
Full integration with Jupyter Notebook/Lab
Right now everything is passed over AJAX from Python
Concerns about scaling
Closing Thoughts
This is a still-evolving project, but the end goal is a more unified way to explore notebook files and get a better sense of what’s going on.
Improving understanding often seems to be an afterthought, which defeats the purpose of computational notebooks: they were always intended to give a better glimpse of what was going on in the author’s head. I think this is a way to move toward better understanding.