1 of 49

There and back again: a short introduction to virtualization technologies

Python for Psychologists - Winter term 2022

Peer Herholz (he/him)

Research affiliate - NeuroDataScience-ORIGAMI lab at MNI, MIT, McGill & BRAMS

Member - BIDS, ReproNim, Brainhack, UNIQUE, CNeuroMod

@peerherholz

28/10/2022

Michael Ernst

PhD student - Neurocognitive Psychology at Goethe University Frankfurt

M-earnest

2 of 49

Recap of the last session

  • interacting with your computer
    • GUIs vs. CLIs
    • shell/terminal, jupyter notebooks, IDEs

3 of 49

Recap of the last session

4 of 49

GUI vs. CLI - an example

https://giphy.com/gifs/colbertlateshow-stephen-colbert-surprise-late-show-l0HlO3BJ8LALPW4sE

We will dive right in!

Everyone gets 5 min to do the following:

  • create a directory structure (folders/subfolders) on your desktop

“addams_family/pugsley”

“addams_family/gomez”

“addams_family/fester”

  • within each subfolder create 10 text files called “scooby_doo_[1-10].txt”
  • move text files with even numbers to a new folder called “the_new_yorker” and add their origin folder to their name (e.g. scooby_doo_2_pugsley.txt)
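For comparison with doing this by hand in a GUI, the whole exercise can be sketched in a few lines of Python. This is only one possible approach, not the course's reference solution: the function name `build_addams_family` is made up for illustration, and it assumes that “the_new_yorker” should sit next to “addams_family”.

```python
from pathlib import Path
import tempfile


def build_addams_family(base: Path) -> Path:
    """Create the exercise folders and files below `base`, then move the even-numbered files."""
    family = base / "addams_family"
    target = base / "the_new_yorker"  # assumption: placed next to addams_family
    target.mkdir(parents=True, exist_ok=True)

    for member in ("pugsley", "gomez", "fester"):
        folder = family / member
        folder.mkdir(parents=True, exist_ok=True)
        for number in range(1, 11):
            # touch() creates an empty file, like the shell command of the same name
            (folder / f"scooby_doo_{number}.txt").touch()
        for number in range(2, 11, 2):
            source = folder / f"scooby_doo_{number}.txt"
            # rename() moves the file; the origin folder is appended to the name
            source.rename(target / f"scooby_doo_{number}_{member}.txt")
    return target


# Runs in a throwaway directory; pass Path.home() / "Desktop" to use the desktop instead
with tempfile.TemporaryDirectory() as tmp:
    moved = build_addams_family(Path(tmp))
    print(len(list(moved.iterdir())), "files moved")
```

The same steps map onto the shell commands mkdir, touch and mv.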

5 of 49

New session - old and new “friends”

Open a shell/terminal.

How do you check where you currently are wrt paths on your machine?

pwd

How can one check the contents of a given directory?

ls

How can you navigate/move to the Desktop?

cd /path/to/your/desktop

Path for windows users:

/Users/user_name -> /mnt/c/Users/user_name

How do you create a directory and a file?

mkdir my_cool_directory

touch my_cool_file.txt (or my_cool_file.py)

Further questions:

Are paths identical across different machines & OS?

Are BASH commands identical across machines & OS?

Why should you avoid spaces in directory & file names?
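As a small illustration of why paths are not identical across operating systems, the sketch below translates a Windows-style path into the /mnt/c/... form shown for windows users. It assumes the drive is mounted under /mnt/<drive letter>, as in that mapping; the helper name `windows_to_wsl` is hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path: str) -> str:
    """Translate a Windows path such as C:\\Users\\name into its /mnt/<drive>/... form."""
    win = PureWindowsPath(path)
    drive = win.drive.rstrip(":").lower()  # "C:" -> "c"
    # win.parts[0] is the drive root ("C:\\"); the remaining parts are the folder names
    return str(PurePosixPath("/mnt", drive, *win.parts[1:]))


print(windows_to_wsl(r"C:\Users\user_name"))  # /mnt/c/Users/user_name
```

The "pure" path classes only manipulate the text of a path, so this runs on any OS without touching the file system.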

6 of 49

Recap of the last session

  • interacting with your computer
    • GUIs vs. CLIs
    • shell/terminal, jupyter notebooks, IDEs
  • shell
    • CLI, direct communication with computer
    • many different shells, we’ll focus on BASH
    • vast capabilities regarding lots of processes
    • important for course: navigating directories, creating/moving/handling of files & directories
      • pwd, cd, ls, mkdir, touch, mv

7 of 49

Your expectations for this session

What are your expectations for this session?

https://i.imgflip.com/2w2rmv.jpg

8 of 49

Objectives for this session

https://media.makeameme.org/created/look-at-the-25f4b725ed.jpg

  • get to know problems wrt computational analysis & reproducibility
  • learn about virtualization and its different options
  • experiment with python virtualization options
  • Ask and answer questions
  • Have a great time

9 of 49

Outline for this session

  • The problem statement
  • introduction to virtualization
  • virtualization using python
    • venv
    • conda
  • Outro/Q&A

https://twitter.com/OfficeMemes_/status/1298982572848869380/photo/1

10 of 49

Standardization/Virtualization

The course I - Saving science?

https://giphy.com/gifs/twitter-3bznFj6OB5381BEjDu

https://reproducibilitea.org/

11 of 49

dimensions of (academic) research

[Figure: axes labelled scale (single researcher/lab to consortia), researcher (student to PI) and project (collaboration, continuation); both project directions are annotated with workflow transfer, dataset transfer, …. and reproducibility]

Introduction to virtualization

12 of 49

https://giphy.com/gifs/colbertlateshow-stephen-colbert-surprise-late-show-l0HlO3BJ8LALPW4sE

We will dive right in!

The problem statement

13 of 49

Once

VIRTUALIZATION WARS

The problem statement

Imagine you want to conduct an analysis of some demographic data, including obtaining & reading data, filtering & descriptive analyses of data, inferential statistics and visualization.

A colleague has a python script that does all of these things ready to go and shares it with you.

Everything is ok….

14 of 49

  • within the introduction directory of the course repository you downloaded, you’ll find a script called

fancy_analyzes.py

  • using the shell, navigate to the respective folder and run the script via:

“python fancy_analyzes.py”

  • within the shell & the directory you should get several outputs (including .png)

The problem statement

15 of 49

Waaaaaiiiit a hot Montreal minute!

The script doesn’t run? The script leads to different results? What went wrong?

Let’s gather some errors here ...

The problem statement

16 of 49

The problem statement

17 of 49

What is happening?

  • (un)intentional mistakes
  • garden of forking paths
  • questionable research practices/p-hacking
  • fraud
  • publication bias

*adapted from Felix Schönbrodt

The problem statement

18 of 49

The problem statement

Freesurfer: Inter-Build Differences

Surface maps of mean absolute difference, standard-deviation of absolute difference, t-statistics and RFT significance values showing regions where the cortical thickness extracted with Freesurfer differs for build 1 and build 2

Freesurfer: Inter-OS Differences

Surface maps of mean absolute difference, standard-deviation of absolute difference, t-statistics and RFT significance values showing regions where the cortical thickness extracted with Freesurfer differs for cluster A and cluster B

19 of 49

Science reproducibility

  • each and every single project in a lab depends on complex software environments:
    • operating system
    • drivers
    • software dependencies: Python, R, MATLAB + libraries
  • we try to avoid:
    • the computer I used was shut down a year ago, I can’t rerun the analyses from my publication… (looking at everyone)
    • the analyses were run by my student, I have no idea where and how... (looking at the PIs)

[Diagram: Machine 1 and Machine 2, each with its own stack of Operating system (OS), Libraries/Binaries and Applications]
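To make these layers visible on your own machine, a short script can report the operating system, the Python version and the versions of selected libraries. This is a minimal sketch that assumes Python 3.8 or newer (for `importlib.metadata`); the function name `describe_environment` and the package names passed to it are only examples.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Collect OS, Python and package versions for the given distribution names."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in describe_environment(["pandas", "requests"]).items():
    print(f"{key}: {value}")
```

Running this on two machines and comparing the output is a quick way to see where their computing environments differ.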

The problem statement

20 of 49

Collaboration with your colleagues and everyone else

  • sharing your code or using a repository might not be enough (spoiler: it won’t be enough) due to the aforementioned reasons
  • we try to avoid:

    • Well, I forgot to mention that you have to use Clang and gcc never worked for me ...
    • I don’t see any reason why it shouldn’t work on Windows … (I actually have no idea about Windows, but won’t say it … (I’m honest here: I have no idea about Windows))
    • etc.

[Diagram: the Operating system (OS), Libraries/Binaries and Applications stack of Machine 1 next to that of Machine 2, with an X between them]

The problem statement

21 of 49

Freedom to experiment

  • universal install script from xkcd: The failures usually don’t hurt anything … And usually all your old programs work ...
  • we try to avoid:

    • I just want to undo the last five hours of my (lab/work) life (virtualization won’t solve comparable problems in other life situations)

The problem statement

22 of 49

Are we all doomed to live in an unreproducible world, forced to painfully adapt and check every script we find?

Well… maybe, but you could also learn to utilize virtualization techniques...

The problem statement

23 of 49

Outline for this session

  • The problem statement
  • introduction to virtualization
  • virtualization using python
    • venv
    • conda
  • Outro/Q&A

https://twitter.com/OfficeMemes_/status/1298982572848869380/photo/1

24 of 49

Virtualization technologies aim to

  • isolate the computing environment

  • provide a mechanism to encapsulate environments in a self-contained unit that can run anywhere
    • reconstructing computing environments
    • sharing computing environments

[Diagram: computing env 1, computing env 2 and Machine 3, each with its own stack of Operating system (OS), Libraries/Binaries and Applications]

Introduction to virtualization

25 of 49

Virtualization technologies have 3 main types:

  • python virtualization (this session)
    • venv
    • conda
  • containers
    • Docker
    • Singularity
  • virtual machines
    • Virtualbox
    • VMware

Introduction to virtualization

26 of 49

Outline for this session

  • The problem statement
  • Introduction to virtualization
  • Virtualization using python
    • venv
    • conda
  • Outro/Q&A

https://twitter.com/OfficeMemes_/status/1298982572848869380/photo/1

27 of 49

Once

VIRTUALIZATION WARS

Virtualization using venv & conda

The research galaxy went down a dark path of non-existent reproducibility. A small alliance of brave python-based resources aims to bring back the balance and asks you to join their movement ….

28 of 49

Virtual environments in python

  • keep the dependencies required by different projects in separate places

  • allow you to work with specific versions of libraries or Python itself without affecting other Python projects

  • conda: an environment manager and package manager (for python and beyond)

  • within a terminal type “which conda”; if it’s not installed, install it now

Virtualization using python

  • venv: an environment manager for Python 3.4 and up, usually preinstalled

  • within a terminal type “python -m venv

[Diagram: a computing env with its stack of Operating system (OS), Libraries/Binaries and Applications]
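The venv module can also be driven from Python itself, which shows what a virtual environment is on disk: a directory with its own configuration file and interpreter links. A minimal sketch, assuming a throwaway directory; the environment name `my_env` and the helper `make_env` are made up for illustration.

```python
import tempfile
import venv
from pathlib import Path


def make_env(env_dir: Path) -> Path:
    """Create a virtual environment in `env_dir` and return the path to its config file."""
    # with_pip=False keeps the sketch fast; use with_pip=True to be able to install packages
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    config = make_env(Path(tmp) / "my_env")
    print(config.read_text())
```

The printed pyvenv.cfg records which base Python installation the environment was created from.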

29 of 49

  • let’s first create a directory for our journey, this time it’s called “mos_eisley”:

mkdir /Users/path/Desktop/mos_eisley

cd /Users/path/Desktop/mos_eisley

  • next, move our fancy_analyzes.py into this folder

  • then we’ll create our virtual environment, which should be easy and straightforward through conda, with the general syntax being:

conda create -n *name* *python_version* *libraries*

where *name* is the name of your virtual environment, *python_version* the python version you want to use (written as python=version) and *libraries* the libraries you want to install
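The general conda create syntax on this slide can be spelled out as a small helper that assembles the command from its three parts. This is a sketch only: `conda_create_command` is a hypothetical helper, the `-y` flag (skip the confirmation prompt) is an addition to the general syntax, and nothing is executed.

```python
import shlex


def conda_create_command(name, python_version, libraries):
    """Build the argument list for `conda create`; nothing is executed here."""
    return ["conda", "create", "-y", "-n", name, f"python={python_version}", *libraries]


command = conda_create_command("my_env", "3.9", ["numpy"])
print(shlex.join(command))  # conda create -y -n my_env python=3.9 numpy
```

Passing the list to `subprocess.run(command, check=True)` would actually create the environment, provided conda is installed and on your path.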

Virtualization using python - conda

30 of 49

  • adapted to our mission, this looks as follows (naming our environment “r2d2”, installing python 3.7 and the pandas package):

conda create -y -n r2d2 python=3.7 pandas

  • conda, by default, already installs a fair amount of libraries as compared to venv

  • let's activate our newly created conda environment; the steps and syntax are very similar to what we've done before, yet slightly different due to conda:

conda activate r2d2

Virtualization using python - conda

31 of 49

  • we can check that the environment is active via “which python”
  • or even better using a conda command that additionally lists all available conda environments:

conda info --envs

conda environments are created within the conda installation path.
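The same check can be made from inside Python, which is handy when you are unsure which interpreter a script is actually using. A minimal sketch; the function name `active_environment` is made up, and it relies on conda setting the CONDA_DEFAULT_ENV variable when an environment is activated.

```python
import os
import sys


def active_environment():
    """Report which interpreter is running and which environment it belongs to."""
    return {
        "executable": sys.executable,
        # conda sets CONDA_DEFAULT_ENV on activation; None means no conda env is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # inside a venv, sys.prefix differs from sys.base_prefix
        "in_venv": sys.prefix != sys.base_prefix,
    }


print(active_environment())
```

After “conda activate r2d2”, the executable path should point into the conda installation path and conda_env should read r2d2.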

Virtualization using python - conda

32 of 49

  • while this is amazing and brings us further towards our goal with lightspeed, we actually didn’t test whether our fancy analyses work:

python fancy_analyzes.py

conda is powerful but still requires caution.

conda activate

conda deactivate

Virtualization using venv & conda

I find your lack of controlling and evaluating installation processes disturbing.

We installed pandas and not the missing requests library. You have to evaluate the libraries you need!

33 of 49

  • in order to check what python libraries we need, we can simply open our fancy_analyzes.py script in VS Code

  • check all lines with an “import” statement and gather the respective list of libraries that are imported and thus needed to run the script/pipeline
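Reading the import lines by eye works for a short script; for longer ones the search can be automated with Python's built-in ast module. The sketch below is one possible approach, with a made-up function name and a made-up example string standing in for the script's contents.

```python
import ast


def imported_modules(source: str) -> list:
    """Return the sorted top-level module names imported in Python source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level > 0 marks relative imports, which are not external libraries
            modules.add(node.module.split(".")[0])
    return sorted(modules)


example = "import pandas as pd\nfrom matplotlib import pyplot as plt\nimport os.path\n"
print(imported_modules(example))  # ['matplotlib', 'os', 'pandas']
```

To apply it to a real script, pass in the file's text, e.g. `Path("fancy_analyzes.py").read_text()`. Note that the result also lists standard-library modules such as os, and that an import name is not always identical to the name of the package you have to install.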

Virtualization using venv & conda

[Diagram: a computing env with its stack of Operating system (OS), Libraries/Binaries and Applications]

34 of 49

  • after reaching the end of our script/pipeline we gathered the following list of libraries: requests, pandas, matplotlib, plotly, ptitprince, seaborn, pingouin, statsmodels

  • installing them and adding them to our environment is way easier than the Kessel Run:

conda install requests pandas matplotlib plotly ptitprince seaborn=0.11.0 pingouin statsmodels

  • this might take a while
    • the reason for that is that conda not only gathers all the requested libraries but also their dependencies (thus addressing dependency issues)
    • it additionally gathers those versions of all libraries that can work together in forming the computing environment (thus addressing version issues)
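Before and after installing, it can help to check which of the gathered libraries the active environment can actually import. A minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the list is the one gathered from the script.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


needed = ["requests", "pandas", "matplotlib", "plotly", "ptitprince",
          "seaborn", "pingouin", "statsmodels"]
print("missing:", missing_modules(needed))
```

An empty list means every library was found; it does not guarantee that the installed versions work together, which is the part conda's dependency resolution takes care of.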

Virtualization using venv & conda

35 of 49

  • the force of conda appears to be mighty but we still have to test it

python fancy_analyzes.py

conda is powerful but still requires caution.

conda activate

conda deactivate

Virtualization using venv & conda

https://giphy.com/gifs/disneyplus-the-mandalorian-mando-themandalorian-AcfTF7tyikWyroP0x73

36 of 49

  • we actually did, we accomplished our mission to make the script/analyses work on (most of) your machines and earned a victory against the non-reproducibility empire with the help of the virtualization master conda

  • having obtained the knowledge necessary to fight back, we want to share our plans with other members of our rebel alliance, i.e. sharing our computing environment

conda env export > environment.yml

  • the result will be a .yml (a form of text) file that contains not only the python version we used but all libraries and their respective versions as well as the installation channels through which we downloaded them
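To give a feeling for the structure of such a file, the sketch below renders a minimal, hand-written environment file. It is not the output of conda env export, which additionally pins the exact versions and builds of every installed library; the function name and the example values are assumptions for illustration.

```python
def environment_yaml(name, channels, dependencies):
    """Render a minimal conda environment file as text."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dependency}" for dependency in dependencies]
    return "\n".join(lines) + "\n"


print(environment_yaml("r2d2", ["defaults"], ["python=3.7", "pandas"]))
```

Someone who receives an environment.yml can recreate the environment with “conda env create -f environment.yml”.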

Virtualization using venv & conda

37 of 49

Virtualization using python - venv & conda

conda very powerful is as environment (comparable to venv) and package (comparable to pip) manager it combines.

Sharing the specific python version, builds and channels it does.

Be aware of differences between conda & pip and other non-python dependencies you must be.

38 of 49

Outline for this session

  • The problem statement
  • Introduction to virtualization
  • Virtualization using python
    • venv
    • conda
  • Outro/Q&A

https://twitter.com/OfficeMemes_/status/1298982572848869380/photo/1

39 of 49

Outro/Q&A - The return of reproducibility?

  • no matter if you went on a selfish dark path of basically non-existent reproducibility and want to join the light side

40 of 49

Outro/Q&A - The return of reproducibility?

  • or if you were always on the light side of the force using it for the greater good

41 of 49

Outro/Q&A - The return of reproducibility?

  • we hope that this lecture/session provided you with some understanding of why and how to utilize virtualization within your research workflow
  • remember, virtualization is great for:
    • sharing code/scripts/functions/pipelines with colleagues and everyone else without dependency issues (except the virtualization software itself)
    • automating large parts of processing
    • rerunning analyses with identical or changed parameters
  • virtualization is important for:
    • reproducibility of results
    • evaluation of soft-/hardware parameters

42 of 49

Outro/Q&A - Recap for this session

  • computing environments, reproducibility, transfer & sharing
    • every analysis depends on super complex and multi-layered computing environments
    • the same script may not run or may produce different results when transferred between machines
    • tremendous variance across OSs

  • virtualization technologies as a solution
    • recreate, isolate and share computing environments
    • different levels: python, containers, virtual machines
    • each has advantages/disadvantages, and the choice of virtualization technique also depends on the project & analyses

43 of 49

Outro/Q&A

reproducible/scalable/efficient research

44 of 49

Outro/Q&A - Questions you could/should ask based on this session

Is virtualization required for each project no matter the scale?

When should virtualization be integrated into the workflow?

What are limitations & disadvantages of virtualization?

How should virtualized computing environments be provided?

Do I really need to use virtualization, even if I don’t share scripts?

What other factors contribute to the mentioned problems and can they also be addressed via virtualization?

45 of 49

https://giphy.com/gifs/season-17-the-simpsons-17x6-xT5LMB2WiOdjpB7K4o

46 of 49

  • the non-reproducibility empire strikes back and demands the first homework assignment: create a new conda environment called “bb8” with python 3.9 and pandas, nilearn, jupyter
  • export it to a .yml file & send it via e-mail
  • Bonus Points: Go through the "presentation adventure" and show me the results of the "fancy analysis" script. Screenshots are fine; simply attach them to the e-mail!

  • deadline: 14/11/2022, 11:59 PM EST

Remember your training:

  1. conda create -n *name* *python=version* *libraries*
  2. conda activate *name*
  3. conda env export > environment.yml

Outro/Q&A - homework assignment

47 of 49

Outro/Q&A - The return of reproducibility?

I’M VIRTUALIZATION

NOOOOOOOOOOOOO

48 of 49

Outro/Q&A - The return of reproducibility?

[Figure: virtualization tools arranged by interaction style (GUI, CLI) and by what is shared (SW, OS): Binder, conda, container, Binder/VMs]

49 of 49

Outro/Q&A - Readings/add-on material for this session

  • further reading:

Project T(eaching) I(ntegrity in) E(mpirical) R(esearch)

The Turing Way project illustration by Scriberia. Original version on Zenodo. http://doi.org/10.5281/zenodo.3695300.