
TRIC: Traceability and Reproducibility through Individual Containerization

Dominic Kennedy∗, Paula Olaya∗, Jay Lofstead†, Rodrigo Vargas‡, Michela Taufer∗

∗ University of Tennessee, Knoxville, TN, USA

† Sandia National Laboratories, Albuquerque, NM, USA

‡ University of Delaware, Newark, DE, USA


NSDF All-Hands meeting October 2022


Trustworthiness in Scientific Workflows

Computational workflows play a key role in scientific discovery, and these workflows are growing more complex.

For scientists who rely on these workflows in their research, trusting the data, methods, software, and hardware is more necessary than ever.


[Figure: a scientific workflow (data collection, data preprocessing, machine learning model, data visualization suite) used by scientist(s) on heterogeneous infrastructure]


Traceability and Explainability

Scientists achieve trust when they can:

  • Trace back through an in-depth data lineage
  • Explain computational methods and their output through an execution record trail

[Figure: workflow diagram with steps numbered 01 to 12 and components labeled Data and App]


Containerized Scientific Workflows

  • Traditionally, when containers are used in scientific workflows, the entire workflow is deployed into one monolithic container
  • This coarse-grained approach makes it difficult to precisely track the thread of execution


  • No identification and preservation of intermediate data
  • No records of workflow component interactions
  • No reusability; no composability


Fine-Grained Containerization for Workflow Trust


  • A fine-grained approach decouples each workflow component and encapsulates it in its own independent container
  • Automatic annotation of the workflow with data provenance and execution record trails


Plugin for Fine-Grained Containerized Workflows

We develop a Singularity/Apptainer plugin that extends the Singularity/Apptainer runtime to support fine-grained containerized workflows

Our plugin transforms a monolithic workflow into a collection of fine-grained containers that host applications and data separately, which enables:

  • Identification of all components and their interactions
  • Automatic annotation of the workflow with data provenance and execution record trails



Augmented Functionalities

workflow --create

  • Lets users adapt their workflows to a fine-grained containerized environment
    • Via a web interface
    • Or via a predefined workflow definition
  • Initializes containers with metadata partitions
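As a rough illustration of the creation step, the sketch below takes a small workflow definition and produces one container record per dataset and per application, each with an initialized metadata section. The definition layout, the field names, and the function create_containers are all hypothetical; the plugin's actual definition format is not shown in these slides.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical workflow definition (assumed layout, not the plugin's real format).
WORKFLOW_DEFINITION = {
    "data": ["input"],
    "steps": [
        {"app": "preprocess", "inputs": ["input"], "output": "cleaned"},
        {"app": "model", "inputs": ["cleaned"], "output": "predictions"},
    ],
}


def create_containers(definition):
    """Return one container record per dataset and per application."""
    containers = {}

    def new_container(name, kind):
        # Each record starts with an initialized metadata section,
        # standing in for a container's metadata partition.
        containers[name] = {
            "kind": kind,
            "metadata": {
                "uuid": str(uuid.uuid4()),
                "name": name,
                "created": datetime.now(timezone.utc).isoformat(),
            },
        }

    for name in definition["data"]:
        new_container(name, "data")
    for step in definition["steps"]:
        new_container(step["app"], "application")
        new_container(step["output"], "data")
    return containers


containers = create_containers(WORKFLOW_DEFINITION)
for name, record in containers.items():
    print(record["kind"], name)
```

In this sketch every intermediate dataset ("cleaned") gets its own record, which mirrors the goal of identifying and preserving intermediate data.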



Fine-Grained Workflow Creation


Our fine-grained containerized environment provides traceability and explainability

  • We decouple data and applications of traditionally monolithic workflows
  • We encapsulate each component into individual containers
  • We annotate containers with execution metadata

A data container follows a file-system-in-a-file model and includes an individual dataset (i.e., input, intermediary, or output data)

The application container includes the executable or script with the respective software stack (i.e., OS, libraries, and packages)

The execution metadata exposes a unique identifier (UUID), the container name, the creation time, the command line, and the record trail
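The fields listed above could be modeled as a small record, as in the following sketch. The class name ExecutionMetadata, the field names, the JSON serialization, and the example values are assumptions made for illustration; the slides do not specify how the plugin stores this metadata.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class ExecutionMetadata:
    """Hypothetical record holding the metadata fields named on the slide."""
    container_name: str
    command_line: str
    # Assumed here to be an ordered list of human-readable execution steps.
    record_trail: list = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid4()))
    creation_time: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Example values are invented for illustration only.
record = ExecutionMetadata(
    container_name="output_data",
    command_line="workflow --exec",
    record_trail=["preprocess(input) -> cleaned", "model(cleaned) -> output_data"],
)
print(json.dumps(asdict(record), indent=2))
```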


Augmented Functionalities

workflow --exec

  • Executes fine-grained containerized workflows by connecting data and application containers
  • Collects execution metadata that builds:
    • Data lineage
    • Execution record trail
  • Annotates the output container with this metadata
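To show how collected metadata can support tracing back through a data lineage, here is a minimal sketch. It assumes each output data container records which application and which input data containers produced it; the PRODUCED_BY mapping, the container names, and trace_lineage are hypothetical and not part of the plugin.

```python
# Hypothetical metadata: output container -> (application, input containers).
PRODUCED_BY = {
    "cleaned": ("preprocess", ["raw"]),
    "predictions": ("model", ["cleaned"]),
}


def trace_lineage(name, produced_by):
    """Walk backward from a data container to the source data it derives from."""
    lineage = []
    pending = [name]
    seen = set()
    while pending:
        current = pending.pop()
        # Source datasets have no producer entry, so the walk stops there.
        if current in seen or current not in produced_by:
            continue
        seen.add(current)
        app, inputs = produced_by[current]
        lineage.append((current, app, inputs))
        pending.extend(inputs)
    return lineage


lineage = trace_lineage("predictions", PRODUCED_BY)
for output, app, inputs in lineage:
    print(f"{output} was produced by {app} from {inputs}")
```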



Fine-Grained Workflow Execution

  • Locates all the workflow containers
  • Connects and executes the containers, with zero-copy data transfer
  • Annotates containers with dynamically generated metadata
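The three steps above can be sketched with plain Python stand-ins: functions play the role of application containers and dictionaries play the role of data containers. Everything here (run_workflow, the step layout, the metadata keys) is an assumption for illustration and does not reflect the plugin's implementation or the Singularity/Apptainer API.

```python
from datetime import datetime, timezone


def run_workflow(steps, data, apps):
    """Run steps in order and annotate each output with how it was produced."""
    trail = []
    for step in steps:
        app = apps[step["app"]]  # locate the application stand-in
        # Connect: pass references to the input payloads rather than copies.
        inputs = [data[name]["payload"] for name in step["inputs"]]
        result = app(*inputs)  # execute
        entry = step["app"] + "(" + ", ".join(step["inputs"]) + ") -> " + step["output"]
        trail.append(entry)
        # Annotate the output with metadata generated during execution.
        data[step["output"]] = {
            "payload": result,
            "metadata": {
                "created": datetime.now(timezone.utc).isoformat(),
                "command_line": entry,
                "record_trail": list(trail),
            },
        }
    return data


apps = {"double": lambda xs: [2 * x for x in xs], "total": sum}
data = {"raw": {"payload": [1, 2, 3], "metadata": {}}}
steps = [
    {"app": "double", "inputs": ["raw"], "output": "doubled"},
    {"app": "total", "inputs": ["doubled"], "output": "result"},
]

data = run_workflow(steps, data, apps)
print(data["result"]["payload"])
print(data["result"]["metadata"]["record_trail"])
```

Because each output carries the trail accumulated so far, the final output in this sketch records every step that led to it.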



Running an Earth Science Workflow

  • We demonstrate the applicability of our environment for SOMOSPIE [1] (an earth science workflow)


[1] Paula Olaya, Dominic Kennedy, Ricardo Llamas, Leobardo Valera, Rodrigo Vargas, Jay Lofstead, and Michela Taufer, “Building Trust in Earth Science Findings through Data Traceability and Results Explainability,” IEEE Transactions on Parallel and Distributed Systems (TPDS).

Example metadata from real workflow


Check Out Our Repository!


github.com/TauferLab/ContainerizedEnv