1 of 20

Understanding notebooks using code-based visualization techniques

Colin Brown

2 of 20

Computational Notebooks

Virtual environment for literate programming
Provides insight into the author’s process and creation of code
Combines the functionality of word-processing software with that of a shell and kernel

Emacs IPython Notebook

3 of 20

Jupyter Notebook

Adapts computational notebooks into a cell-based model.

Set of inputs mapped to a set of outputs.
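As a rough illustration of the cell model, the sketch below reads a tiny hand-written notebook in the nbformat 4 JSON layout and pairs each code cell's input with its outputs. The notebook content and the helper name `code_cells` are made up for this example.

```python
import json

# A minimal, hand-written notebook in the nbformat 4 JSON layout (illustrative only).
raw = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
})


def code_cells(notebook):
    """Return (source, number_of_outputs) for every code cell."""
    pairs = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        # "source" may be a string or a list of strings; join handles both.
        source = "".join(cell["source"])
        pairs.append((source, len(cell.get("outputs", []))))
    return pairs


pairs = code_cells(json.loads(raw))
```

Only code cells carry outputs; markdown cells are skipped here because they have no input-to-output mapping.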

4 of 20

Background

Worked on an “active version” of creating cell-based visualizations in the Dataflow Notebook project

Interested in other ways of increasing understanding in the notebook area

5 of 20

The Problem

Jupyter Notebook has a number of issues with reproducibility:

Cell Ordering
Dependencies
Understanding

Focus on knowledge gap

People have problems understanding notebooks
Tools exist that try to address this

Microsoft Gather
ToonNote
Others like nbsafety

To better understand, we visualize

[Kang 2021]

[Head 2019]

6 of 20

Basic numpy tutorial

7 of 20

Versioning

Not much additional complexity

Only really added some extra prints

More calls to arr

8 of 20

Nbdime

There are some great tools out there for comparing notebooks

Notebooks are not like source code

Set of pieces, loss of semantics

9 of 20

Does it help?

Nbdime can be useful...

10 of 20

Alternate Approaches

Decompose code into abstract foundational elements

Variables “travel from cell to cell”

Cell to Cell relationships compose the structure of the notebook.
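One possible way to recover these cell-to-cell relationships statically is sketched below with Python's `ast` module. This is an assumption about how such a graph could be built, not the project's actual implementation. The helper names are hypothetical, and the analysis ignores statement order inside a cell, so it can over-report edges.

```python
import ast


def names_defined_and_used(source):
    """Return (defined, used) sets of names for one cell's source code."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            # "import numpy as np" defines "np"; "import os.path" defines "os".
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
    return defined, used


def cell_edges(cells):
    """Return {(from_cell, to_cell, name)} for names that travel between cells."""
    last_definer = {}
    edges = set()
    for index, source in enumerate(cells):
        defined, used = names_defined_and_used(source)
        for name in used:
            if name in last_definer:
                edges.add((last_definer[name], index, name))
        for name in defined:
            last_definer[name] = index
    return edges


cells = ["import numpy as np", "arr = np.arange(6)", "print(arr.sum())"]
edges = cell_edges(cells)
```

Builtins such as `print` never appear as edges because no cell defines them.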

Combine two notebooks together

Lay common variables on top of each other
Apply some diffing colors
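A minimal sketch of the overlay idea, assuming each notebook has already been reduced to a set of variable names. The color choices and the function name are placeholders, not the ones used in the talk's figures.

```python
# Hypothetical color choices for the overlay.
COLORS = {"both": "lightgrey", "old_only": "red", "new_only": "green"}


def overlay_variables(old_vars, new_vars):
    """Merge two notebooks' variable sets into one {name: color} mapping."""
    merged = {}
    for name in sorted(old_vars | new_vars):
        if name in old_vars and name in new_vars:
            merged[name] = COLORS["both"]      # common variables sit on top of each other
        elif name in old_vars:
            merged[name] = COLORS["old_only"]  # only in the original notebook
        else:
            merged[name] = COLORS["new_only"]  # only in the updated notebook
    return merged


old_vars = {"X_train", "y_train", "clf"}
new_vars = {"X_train", "y_train", "clf", "ax"}
merged = overlay_variables(old_vars, new_vars)
```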

11 of 20

Adjacency Matrices?

Create a unified diff

Calculate similarity based on how many lines are the same (a very simple metric)
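The line-based metric could look roughly like the sketch below, which uses the standard library's `difflib` to score every pair of cells and fill a matrix. Using `SequenceMatcher.ratio()` as the score is an assumption; the talk only says the metric counts matching lines. The example cells are invented.

```python
import difflib


def line_similarity(old_source, new_source):
    """Score two cells by shared lines: 2 * matches / total lines."""
    old_lines, new_lines = old_source.splitlines(), new_source.splitlines()
    if not old_lines and not new_lines:
        return 1.0
    return difflib.SequenceMatcher(None, old_lines, new_lines).ratio()


def similarity_matrix(old_cells, new_cells):
    """Rows are cells of the old notebook, columns are cells of the new one."""
    return [[line_similarity(o, n) for n in new_cells] for o in old_cells]


old_cells = ["import numpy as np\narr = np.arange(6)", "print(arr)"]
new_cells = [
    "import numpy as np\narr = np.arange(6)\nprint(arr)",
    "print(arr)\nprint(arr.shape)",
]
matrix = similarity_matrix(old_cells, new_cells)

# The unified diff for one pair of cells, as a list of text lines.
diff = list(
    difflib.unified_diff(
        old_cells[0].splitlines(),
        new_cells[0].splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    )
)
```

A changed character anywhere in a line makes the whole line count as different, which is why this metric is so coarse.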

12 of 20

Something more practical

As part of the Dataflow Notebook project, I had to create some examples

Grabbed a tutorial notebook from scikit-learn

Even tutorials can change significantly.

13 of 20

Everything is different!

14 of 20

How crazy could this be?

Some similarity here but many gaps

Look at the first cell that has a high value

It looks pretty similar

15 of 20

Tokenizing

More nuance

Better picture of what’s going on

Most connected part is mostly the same
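To show why tokens give more nuance than lines, here is a sketch that compares cells by their Python tokens using the standard `tokenize` module. The function names are hypothetical, and cells containing shell escapes or incomplete code may fail to tokenize, which this sketch does not handle.

```python
import difflib
import io
import tokenize

# Token types that carry layout rather than content.
LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER,
}


def cell_tokens(source):
    """Return the token strings of a cell, ignoring layout and comments."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in LAYOUT]


def token_similarity(old_source, new_source):
    """Score two cells by shared tokens instead of shared lines."""
    return difflib.SequenceMatcher(
        None, cell_tokens(old_source), cell_tokens(new_source)
    ).ratio()


old_cell = "arr = np.arange(6)"
new_cell = "arr = np.arange(12)"
score = token_similarity(old_cell, new_cell)
```

The two cells above share no identical line, so a line-based score would be 0, while seven of their eight tokens match.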

16 of 20

Evolving from Diffs

Diffs don’t really feel right in this space.

Colors representing different things make sense inside a diff

Why not do better?

Capture the shape of each notebook in an overlay

17 of 20

What does this get us?

We can see more subtlety about the structure

Important part: We build these graphs based on relationships between cells

The new notebook captures structural similarity

Still nuance to the structure:

y_train and X_train never leave the cell
ax only in new notebook
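Observations like "never leaves the cell" can be checked mechanically. The sketch below flags names that are assigned in one cell and never mentioned in any other, again using `ast`. The cells are invented stand-ins rather than the scikit-learn tutorial itself, and the helper names are hypothetical.

```python
import ast


def stored_and_loaded(source):
    """Return (stored, loaded) name sets for one cell."""
    stored, loaded = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            target = stored if isinstance(node.ctx, ast.Store) else loaded
            target.add(node.id)
    return stored, loaded


def cell_local_names(cells):
    """Names assigned in a cell and never stored or loaded in any other cell."""
    per_cell = [stored_and_loaded(source) for source in cells]
    local = set()
    for i, (stored, _) in enumerate(per_cell):
        for name in stored:
            elsewhere = any(
                name in other_stored or name in other_loaded
                for j, (other_stored, other_loaded) in enumerate(per_cell)
                if j != i
            )
            if not elsewhere:
                local.add(name)
    return local


cells = [
    "X_train, y_train = [[0], [1]], [0, 1]\nsize = len(X_train) + len(y_train)",
    "print(size)",
]
local = cell_local_names(cells)
```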

Left: the original notebook. Right: the updated version.

18 of 20

Related Work and Thoughts

Versioning your own work is a problem, but snippets are also a large problem in Jupyter Notebook.

Sigvardsson reports that 53.9% of all Jupyter notebook files are composed of snippets that are found in other notebooks. [Sigvardsson 2019]

Little has been done in static analysis on notebooks compared to dynamic concepts.

Work done by John Wenskovitch is similar, but that project went in a different direction.

[Wenskovitch 2019]

19 of 20

Future Aims

Fully integrate with Jupyter Notebook/Lab

Right now everything is passed over AJAX from Python

Concerns about scaling

20 of 20

Closing Thoughts

This is a still-evolving project, but the end goal is a more unified way to explore notebook files and get a better sense of what’s going on.

Improving understanding seems to have become an afterthought, which defeats the purpose of computational notebooks: they were always intended to give a better glimpse of what was going on in the author’s head. I think this is a way to move toward better understanding.