1 of 15

Tutorial compute platform

Jim Pivarski

2 of 15

What and why?

  • Our tutorials teach students how to use software, and the most powerful learning is learning-by-doing.
    • We want students to be able to follow a demonstration on their own (“shift-enter in Jupyter”), so that they can try deviating from the script to enlarge their understanding (though this is rare).
    • We want to give students mini-exercises in the midst of a lecture, to break up the lecture (wake up the students) and verify that they really understand it, rather than just thinking they understand it. For this, they need a tightly controlled working copy of the environment.
    • We want students to work on major projects, for which they only need all of the packages to be installed (unless it’s an exercise in installing packages!).

3 of 15

| Method | Failure modes | P(works for everyone) | Reusable afterward |
| --- | --- | --- | --- |
| Have students install everything on their own laptops: venv, conda-forge, Docker | 1. Windows. 2. Not having the software to install the software. 3. ¯\_(ツ)_/¯ | 1 − 0.9^N | yes-ish |
| Public cloud-based Binder (mybinder.org) | 1. Stuck loading image. 2. Crashes without persistence. | 0.8 | yes |
| GitHub Codespaces | 1. Images too large. 2. Boots in VSCode, not Jupyter. | 0.95 | yes |
| Google Colab (with GPUs!) | 1. Persistence. 2. Fake Jupyter. | 0.95 | yes |
| CERN Swan | 1. CERN accounts. | 0.8 | yes |
| Paid cloud solution: AWS, SaturnCloud | 1. Authentication (slips of paper!). | 0.95 | no |
| In-browser JupyterLite | 1. Not all packages can use it. | 1 | yes |
| Self-hosted JupyterHub/BinderHub | 1. Authentication. 2. GPUs. | 0.9 | depends |
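
The laptop-install estimate depends on class size. As a rough illustration, assuming that 0.9 is a per-student chance that a laptop install succeeds and that failures are independent (both are assumptions, not measurements), the chance that all N installs work is 0.9^N, and its complement, 1 − 0.9^N, is the chance that at least one student gets stuck. The function names below are hypothetical.

```python
def p_all_succeed(p_single: float, n_students: int) -> float:
    """Probability that every one of n independent installs succeeds."""
    return p_single ** n_students


def p_any_failure(p_single: float, n_students: int) -> float:
    """Probability that at least one of n independent installs fails."""
    return 1.0 - p_all_succeed(p_single, n_students)


if __name__ == "__main__":
    # 0.9 is the assumed per-student success rate; N is the class size.
    for n in (1, 10, 30, 50):
        print(
            f"N={n:3d}  all succeed: {p_all_succeed(0.9, n):.3f}  "
            f"at least one failure: {p_any_failure(0.9, n):.3f}"
        )
```

Under these assumptions, a failure somewhere in the room becomes close to certain once the class reaches a few dozen students.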

4 of 15

Authentication (ahead of time or day of the event)

  • Login pathway is confusing (“Don’t click on ATLAS IAM!”).
  • Inclusiveness of CILogon’s authentication methods is good and bad:
    • Good: students can always find a way to connect, even if their university is not on the list, through Google, GitHub, ORCID, etc.
    • Bad: since CILogon is so open, the event’s BinderHub has to be limited to an allowlist of email addresses.
      • Students might connect to CILogon with a different email address than the allowed one…
        • because they weren’t told which one to use
        • because they didn’t pay attention when told which one to use
        • because their institution uses a different address than expected: ID@sas.upenn.edu
        • because an authentication method is associated with a nonsense email address: GITHUB-ID@github.com
        • because CILogon isn’t working for their institution (e.g., U of Delhi’s was down in May)
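
Several of these mismatches differ from an allowlisted address only in letter case or domain. A minimal sketch of a check that flags such near misses for a facilitator to review, assuming a hypothetical `check_login` helper and made-up example.edu addresses (this is not BinderHub's or the login service's actual logic):

```python
from __future__ import annotations


def normalize(email: str) -> str:
    """Trim whitespace and lowercase, so case differences do not matter."""
    return email.strip().lower()


def check_login(email: str, allowlist: set[str]) -> tuple[str, list[str]]:
    """Classify a login as 'allowed', 'near_miss', or 'unknown'.

    A near miss shares its local part (the text before '@') with one or
    more allowlisted addresses; those candidates are returned for review.
    """
    login = normalize(email)
    allowed = {normalize(a) for a in allowlist}
    if login in allowed:
        return "allowed", []
    local = login.partition("@")[0]
    candidates = sorted(a for a in allowed if a.partition("@")[0] == local)
    if candidates:
        return "near_miss", candidates
    return "unknown", []


if __name__ == "__main__":
    allowlist = {"jdoe@example.edu"}  # hypothetical registration address
    print(check_login("jdoe@physics.example.edu", allowlist))
    print(check_login("some-github-id@github.com", allowlist))
```

A near miss should be shown to a facilitator rather than approved automatically, since the same local part can belong to different people at different institutions.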

5 of 15

Authentication (months after the event)

  • A small fraction of students want to follow up and repeat exercises, long after the event is over.

We want to encourage this!

  • But it’s also a way for cryptocurrency miners to get free CPUs.

We want to discourage this!

  • Since this is rare, resources are protected by an allowlist, and we have everyone’s full identity through their institution (exception: Google, etc.), can we leave them open indefinitely?

6 of 15

Ways to seed the allowlist

  • Manually, as we’re doing now.
    • Error-prone and time-consuming when we’re trying to get everyone up and running.
  • Password shared by word of mouth in the event, like a Wi-Fi password?
    • Downside: what’s to keep the students from sharing the password with a friend who really, really needs to mine some Bitcoin?
  • An open-enrollment period that facilitators can turn on and off? That is, everyone who attempts to connect in the first 10 minutes of the event has perpetual access (time is the password!).
    • Access can’t propagate through a friend network.
    • Can’t be exploited by an interloper months after the event: they would have to know the schedule and attack during the enrollment window.
    • Facilitators can see the allowlist, to see if there are any unexpected names/institutions (with special consideration for Google).
    • The open-enrollment period could auto-timeout.
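
The open-enrollment idea can be sketched as a small state machine. This is an illustration only, assuming a hypothetical `OpenEnrollment` class and an in-memory allowlist; it does not use any real JupyterHub or BinderHub authenticator API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class OpenEnrollment:
    """Allowlist that accepts new identities only while a window is open."""

    allowlist: set = field(default_factory=set)
    closes_at: Optional[datetime] = None  # None means enrollment is closed

    def open(self, now: datetime, minutes: int = 10) -> None:
        """Facilitator opens the window; it times out after `minutes`."""
        self.closes_at = now + timedelta(minutes=minutes)

    def close(self) -> None:
        """Facilitator closes the window early."""
        self.closes_at = None

    def is_open(self, now: datetime) -> bool:
        return self.closes_at is not None and now < self.closes_at

    def try_login(self, email: str, now: datetime) -> bool:
        """Admit known users always; enroll new users only while open."""
        key = email.strip().lower()
        if key in self.allowlist:
            return True
        if self.is_open(now):
            self.allowlist.add(key)
            return True
        return False


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # made-up date
    event = OpenEnrollment()
    event.open(start)
    print(event.try_login("student@example.edu", start + timedelta(minutes=3)))
    print(event.try_login("latecomer@example.com", start + timedelta(days=60)))
    print(sorted(event.allowlist))  # facilitators can review this list
```

Anyone enrolled during the window keeps access afterward, while a first-time login after the timeout is refused, which matches the "time is the password" behavior described above.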

7 of 15

Failure modes in the current authentication system

  • Confusing workflow:
    • Students (usually ATLAS) pressing the “ATLAS IAM” button, not CILogon.
    • Students getting to the “set up BinderHub” page and thinking they need to put a GitHub repo URL there, instead of clicking on an already set-up URL provided by teachers.
    • Students pressing the big “Start Server” button instead of following the BinderHub URL from the teachers.
    • Students having 5 running servers without knowing it or how to stop them.
  • In July, the CILogon authentication method was sticky: once a student had chosen a method, we had to put that identity on the allowlist, rather than having the student switch to the allowlisted one.
    • Doing the first try in Incognito Mode is not a solution, since clicking on a link from email (not in Incognito Mode) is a step in the workflow.

8 of 15

Large images and GPUs

  • Issue:
    • O(50-100) students vs 8 GPUs at SSL (plus many available on NRP)
    • Images including CUDA are big. Including libraries like cuDF makes them big++.
  • Bad experiences starting up sessions with large images: we resorted to selecting a few students to run with GPUs and telling them when to start their instances, one by one (limited success, as students typically don’t want to wait their turn or share).
    • This means supporting more images during an event, since GPU images require special care. There is tension between wanting teachers to share as few kitchen-sink images as possible and having many images, one per tutorial.
  • Wi-Fi instabilities mean that connections get lost. GPUs wind up “lost” until a system timeout gives them back (we believe this has improved since we last ran a tutorial).
  • Generally (99%), we can install everything through conda-forge on a minimal Linux distribution, including compilers, git, and command-line tools.
    • Recommendations?
    • Can we take advantage of layering?
    • (We don’t want “conda install everything” to be the first step for N students, which presumably triggers the same overload issues as pulling images, or perhaps worse.)
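
The one-by-one session startup described on this slide could be planned ahead of time instead of announced ad hoc. A minimal sketch, assuming a hypothetical batch size and interval chosen by the facilitators (neither value is measured from any cluster):

```python
from __future__ import annotations


def stagger_starts(
    students: list[str], batch_size: int, interval_s: int
) -> dict[str, int]:
    """Map each student to a start offset in seconds.

    Students are grouped into consecutive batches of `batch_size`; each
    batch starts `interval_s` seconds after the previous one, so that only
    one batch is pulling the large image at any time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {
        name: (index // batch_size) * interval_s
        for index, name in enumerate(students)
    }


if __name__ == "__main__":
    roster = [f"student{i:02d}" for i in range(1, 9)]  # made-up names
    for name, offset in stagger_starts(roster, batch_size=4, interval_s=90).items():
        print(f"{name}: start at +{offset} s")
```

Publishing the schedule before the session at least tells students when their turn is, although it does not remove the wait itself.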

9 of 15

Other topics

Ideas for how to save work in a binder session? (short of manual downloading)

10 of 15

Known future events

December 16-20 → HSF-India

January 13-18 → HSF-India

July → CoDaS-HEP

Institute interest in re-doing US-CMS/US-ATLAS events (that did not use BinderHub this time around)

11 of 15

Backup
