1 of 179

2024 MOC Alliance Workshop

Lunch

2 of 179

The benefits of sharing and collaboration in the MOC

"If I have seen further it is by standing on the shoulders of Giants" Issac Newton

Larry Rudolph, Distinguished Fellow, MOC and Two Sigma, LP

3 of 179

Disclaimer

This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity.

The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa

Important Legal Information

3

4 of 179

Data (DB)

Compute

In the beginning there was

Is Compute the king, or is Data the king? (Both)

Marriage ⇒ data pipelines

Data Pipeline

5 of 179

Imagine a world where

Before you can use data

first download it or make a copy in the cloud

did you get enough storage (can you afford it?)

did you get the latest version

clean and normalize the data yourself

think about checkpoints, saved intermediate results

make sure there is enough storage or available servers / GPUs

re-execute everything for each modification of parameter choice

Repeat until you are exhausted …

For many of us, the world is even messier than this

6 of 179

Imagine a world in which

immediately leverage the related work

every time you run a program,

it supports the easy sharing of data and code

it is reproducible

it automatically recovers from failures

it runs more efficiently

the cost of producing may be offset by those consuming

the marriage is sustainable

Supports community culture

and natural pedagogy

7 of 179

Consider developing AI models

  1. Download or copy raw data
  2. Perform preprocessing steps:
    1. tokenization
    2. cleaning
    3. normalization
  3. Save the transformed data somewhere, call it DC1

Depict a series of gears and levers transforming raw data into neatly organized and standardized inputs. Describe each gear representing a different preprocessing step like tokenization, cleaning, and normalization, emphasizing the transformation process.

This is the Google Gemini prompt (as with all other pictures)

Where to download, how often, when to delete

Write it myself or find code made available by a kind soul or “buy” it

Each time a new error is found,

start from the beginning

Maybe I want to name each version, depending on versions of above steps

Step 1: Data Collection
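One way to keep track of "which version of the above steps produced this output" is to derive the intermediate's name from the raw source and the step versions, so a changed choice never silently overwrites an old result. A minimal sketch (the names and fields are illustrative, not from the talk):

```python
# A minimal sketch (not from the talk) of naming a preprocessed artifact ("DC1")
# after the raw source and the versions of the steps that produced it, so a
# re-run with different choices never silently overwrites an older result.
import hashlib
import json

def artifact_name(raw_source: str, steps: dict) -> str:
    """Derive a stable name for the transformed data from its inputs.

    raw_source -- identifier of the downloaded raw data (e.g. a URL or snapshot id)
    steps      -- mapping of preprocessing step -> version/parameters used
    """
    spec = json.dumps({"raw": raw_source, "steps": steps}, sort_keys=True)
    digest = hashlib.sha256(spec.encode()).hexdigest()[:12]
    return f"DC1-{digest}"

# Example: the same raw data with a new tokenizer version gets a new name.
print(artifact_name("s3://bucket/raw-2024-01-01",
                    {"tokenize": "v2", "clean": "v1", "normalize": {"lowercase": True}}))
```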

8 of 179

Consider developing AI models

For each source of data

Perform all the steps in data collection

Giving each one a different output name

Check for new versions

Make a picture of a scene where a conveyor belt is moving through a diverse landscape, collecting various sources of data such as books, articles, websites, and social media posts. Mention the variety and richness of the data being collected.

Each source may need account/password/api-key

Data collection steps may fail. Decide how much to re-run

Do I run this on many of the versions from data collection? How do I remember what input was used for each output?

Step 2: Preprocessing

9 of 179

Consider developing AI models

Build a model in some code

Rent GPUs

Take all the preprocessed data

Break into batches

Feed them into the GPUs to train the model

Save the model in some other file

Draw a caricature of a commons field with data and computers are grazing in the style of Boston Commons

Lots of decisions

Crashes or bugs mean either paying for GPUs while debugging (and redoing all the earlier steps), or releasing and re-acquiring GPUs

Learn by trial and error

Cloud costs quickly add up

Step 3: Training

10 of 179

You get the idea

A lot of steps in this meta pipeline (recursive pipelines)

  1. Data Collection
  2. Preprocessing
  3. Training
  4. Fine Tuning
  5. Deployment
  6. Inference
  7. Evaluation
  8. Monitoring and Maintenance

Data pipelines are usually DAGs

with nodes and edges crossing domain boundaries
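To make "pipeline as a DAG" concrete, here is a tiny illustrative sketch (the stage names are hypothetical, not from the talk) that orders the stages with a topological sort:

```python
# Tiny illustration of a pipeline as a DAG; stage names are made up for the example.
from graphlib import TopologicalSorter

# mapping: stage -> set of stages it depends on
pipeline = {
    "preprocess": {"collect"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# A valid execution order respects every dependency edge.
print(list(TopologicalSorter(pipeline).static_order()))
# e.g. ['collect', 'preprocess', 'train', 'evaluate', 'deploy']
```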

Not only humans in the loop, but unknown humans

Upstream decisions affect unknown downstream users

11 of 179

No Application is an Island

12 of 179

No Data is an Island

13 of 179

No Server is an Island

14 of 179

Data, Pipeline, HW at Scale

Data represents “some capture of the real world” → complex system

Number and size of data sets @ scale → complex system

Servers, Network, Disks, Power, Heat @ scale → complex system

“feature of complex systems” → unanticipated emergent properties

Solution:

Share

Collaborate

15 of 179

Tenant’s view:

Pipeline (DAG part) is the application

16 of 179

Provider’s view:

The data center is the computer

17 of 179

Goal: Optimize like we do for apps running on a computer

18 of 179

Note

I am mixing two different, but similar ideas

  • Data + Compute
  • Tenant + Provider

Both are about more sharing and more collaboration

(marriage between couples?)

19 of 179

Imagine a world where there is

a data center with

plenty of storage, networking, and compute

ability for any server to access any data

probes measuring network congestion, server load, etc.

multi-level caching throughout

dynamic security partitions, usage accounting

anyone can try out new h/w

no single controlling commercial entity

profit is not the motive (helping the tenant is the goal)

Hmm, this already exists ==> a great platform for sharing

draw a caricature of a data center with storage compute, and networking with probes to measure congestions

20 of 179

Sharing Data

(Dataverse, Norwegian Research Commons)

FAIR (Findable, Accessible, Interoperable, Reusable)

i.e. data without compute is like a day without sunshine

No need for everyone to have their own copy of the data

data immutability w/ security

no need for everyone to pay for its storage copy (while the data center does de-duplication)

No need for everyone to redo data collection, pre-processing, …

saving time, energy

Access to lots of tools that run near the data

Draw a caricature of a commons field with data and computers are grazing in style of Boston Commons

Functional Semantic Types (advertisement)

21 of 179

Sharing DAG-Metadata ⇔ H/W State

Collect metadata provenance of data pipelines / workflows

Learn attributes to do better allocation, placement, execution

Learn which data or code is reused for attribution

Use metadata and runtime info for better object caching

Draw a caricature of the provenance of a data pipeline

Shared Caching

(community caching)

draw several caches with data and code

(Bringing Data Close to Compute, D4N)

22 of 179

Leveraging Shared Memory

Future? (research request)

A pool of disaggregated (DRAM) memory directly accessed by any server in the rack (e.g. CXL 3.0)

Not great for current workloads (Google publication)

Might be great for shared data

Map dataframes, tables to active jobs running in same rack

draw leveraging shared memory

23 of 179

Reproducible

Memento (Two Sigma) & Semantic Cloud (Red Hat)

During data pipeline execution, the results of data transforms may be memoized

In the future, when the same transformation runs on the same data, with the same parameters, and in the same environment,

there is no need to re-execute; just return the memoized value

Sharing of transforms, intermediate values
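A toy sketch of that memoization idea, keyed on the transform, its version, the data, the parameters, and the environment (this is the concept only, not the Memento or Semantic Cloud implementation; all names are made up):

```python
# Toy sketch of the idea above: memoize a transform keyed on the code version,
# the input data, the parameters, and the environment. Concept only -- not the
# actual Memento / Semantic Cloud implementation.
import hashlib, json, platform

_memo = {}  # in practice this would be shared, durable storage

def expensive_compute(data_id, params):
    # placeholder for the real transform
    return f"result({data_id}, {params})"

def run_transform(transform_name, transform_version, data_id, params):
    env = platform.python_version()  # stand-in for a full environment spec
    key = hashlib.sha256(json.dumps(
        [transform_name, transform_version, data_id, params, env],
        sort_keys=True).encode()).hexdigest()
    if key in _memo:                 # same transform, data, params, environment
        return _memo[key]            # -> no need to re-execute
    result = expensive_compute(data_id, params)
    _memo[key] = result
    return result
```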

draw a caricature of mementos

24 of 179

The vision is sharing

The place is the MOC-A

Production-level facility – appetite to share / save $$

Lots of data to share

Researchers want to share with

themselves while debugging

teammates while exploring

collaborators or future ones

all with right attribution, protection, cost sharing model

draw a caricature of a bunch of computers helping out another computer

Thank you!!

Now it’s time for the Data+Compute

engagement party

25 of 179

Thank you

Any Questions?

26 of 179

Emre Keskin, University Research Data Officer

Wayne Gilmore, Executive Director, Research Computing Services

Scott Yockel, University Research Computing Officer

The Building Blocks of Cloud = Research Enablement

27 of 179

HPC TO CLOUD

Why the shift in tech?

BEYOND TECH

Why not let researchers build everything?

RESEARCH COMPUTING

What is this service?

28 of 179

Dawn of the Research Facilitator - 2014 ACI-REF

“Facilitator” - makes an action or process easier

Researchers are faced with:

  • Rapidly changing technology (HPC, specialized processors, data storage, distributed networks, …)
  • Rapidly evolving analytical software
  • Spread of technology to new research disciplines
  • AI is taking over everything

Advanced Cyberinfrastructure - Research and Educational Facilitation: Campus-Based Computational Research Support - NSF Award # 1341935

29 of 179

Research Computing (RC) is at the intersection of providing leading technical solutions & supporting researchers in the scholarly process of discovery and innovation

Co-learn

RC professionals co-learn unique domain-specific problems alongside researchers

Co-create

RC professionals co-create solutions (technical, architectural, pipeline, software, …) alongside researchers

30 of 179

What is HPC and why is it important?

High-performance computing (or Supercomputing from the 90s).

  • Fixed computing environment deployed (and controlled) by Systems Group
  • Large scale computing via batch processing (queuing) system
  • Basic command-line interface
  • Tailored to parallel processing
  • Tailored to large centralized shared scratch storage

Dating back to the Manhattan Project, solving physics, chemistry, and engineering problems has predominantly created data on the fly, at runtime.

In 1966, in his Nobel Prize acceptance speech, Robert Mulliken said: “I would like to emphasize strongly my belief that the era of computing chemists, when hundreds if not thousands of chemists will go to the computing machine instead of the laboratory for increasingly many facets of information, is already at hand.”

31 of 179

Example: Quantum Chemistry Ĥ Ψ = E Ψ

In 1926, physicist Erwin Schrödinger gave us a partial differential equation that describes how the quantum state of a physical system changes with time.

electron-proton

proton-proton

electron-electron
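For reference, the three labels above are the interaction terms of the standard electronic molecular Hamiltonian (Born-Oppenheimer approximation, atomic units); a textbook form, not taken from the slides, is:

```latex
\hat{H} = -\sum_{i} \tfrac{1}{2}\nabla_i^{2}
          - \sum_{i,A} \frac{Z_A}{r_{iA}}        % electron-nucleus (electron-proton) attraction
          + \sum_{i<j} \frac{1}{r_{ij}}          % electron-electron repulsion
          + \sum_{A<B} \frac{Z_A Z_B}{R_{AB}}    % nucleus-nucleus (proton-proton) repulsion
```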

32 of 179

Example: Quantum Chemistry Ĥ Ψ = E Ψ

How do electronic structure programs represent the wavefunction Ψ ?

33 of 179

Example: Quantum Chemistry Ĥ Ψ = E Ψ

How do electronic structure programs represent the wavefunction Ψ ?

10^45 combinations!!!

34 of 179

2013 BU HPC (11M CPU Hours)

968 researchers / 34 departments

107 students / 4 courses / 0 trainings

2023 BU HPC (90M CPU Hours)

3,136 researchers / 95 departments

2,134 students / 53 courses / 60 trainings

35 of 179

The nature of research is changing. RC must adapt!

36 of 179

HPC TO CLOUD

Why the shift in tech?

BEYOND TECH

Why not let researchers build everything?

RESEARCH COMPUTING

What is this service?

37 of 179

From HPC to Cloud

High-performance computing (aka Supercomputing from the 90s).

Fixed computing environment deployed (and controlled) by Systems Group

  • Large scale computing via batch processing (queuing) system
  • Basic command-line interface
  • Tailored to parallel processing
  • Tailored to large centralized shared scratch storage

Cloud native

Flexible (customizable) computing environments controlled by researcher

  • Orchestrated on-demand computing services with a variety of resource types: IaaS, PaaS, SaaS
  • Variety of user interfaces
  • Tailored to many single core tasks
  • Tailored to object or distributed storage
  • Ideal for scale out on-demand computing
  • Basis for modern product deployment

38 of 179

Why was this important?

Strategic Priority

Scale of Need

Data Science Platforms

Collaborative Research

Digital Humanities &

Data Driven Visualization

Restricted Data Analytics

University Product Development

Systems/Cloud Research

Domain Research Platforms/Gateways

Burst Resources

Academics/Training

39 of 179

So this is what we built…

$90M State, BU, Harvard, MIT, NEU, UMass

$5M NSF + Consortium

$2M MTC, $2M BU/HU

These are the investments

$500M RedHat

40 of 179

Who is using NERC?

Institutions: 11

PIs: 67

Total Projects: 121

Users: 771

41 of 179

Data becomes a first class citizen in the cloud

42 of 179

43 of 179

National Studies on Air Pollution and Health

Compute Space

CSV

FST

DAT

RAW Data: Air pollution | Census | Health Data (Medicare & Medicaid)

Open-Source Release: https://github.com/NSAPH-Data-Platform

Michael Bouzinier

User

1. data request

2. extract

3. analysis

Load Database

Data Exploration

44 of 179

HPC TO CLOUD

Why the shift in tech?

BEYOND TECH

Why not let researchers build everything?

RESEARCH COMPUTING

What is this service?

45 of 179

So much data!!!

So many tools!!!

46 of 179

The Problem (as we see it)

  • Lack of data-centric tools throughout the research lifecycle
  • Lack of interoperability between tools

Data lifecycle stages (from the diagram): Planning (creating data); Discovery (acquisition); Storing (data transfer, raw data, reference data); Analysis (data wrangling, data analysis, data sharing); Management (data repositories, data preservation, disposal)

47 of 179

The Problem (as we see it)

Data lifecycle stages (from the diagram): Planning (creating data); Discovery (acquisition); Storing (data transfer, raw data, reference data); Analysis (data wrangling, data analysis, data sharing); Management (data repositories, data preservation, disposal)

  • Easing the organization of large data
  • Facilitating the reproducibility of data
  • Sharing active data internally and externally
  • Supporting data-retention compliance
  • Tracking the provenance of data
  • Providing metrics on data access and use

48 of 179

Harvard Research Data Connect - Vision

An ecosystem of applications, services, and resources, integrated by a standards-based, service-oriented framework and populated by Library Services, University RC and school-based RC services, and Office of the VP for Research support, as well as by researchers themselves working in partnership.

49 of 179

  • Support the entire data lifecycle for conducting research (ease of access to data, publications, etc.) with support for delegation of common activities to automated processes.

  • Support creation, sharing, and curation of digital content and digital scholarship by making it easier to discover and analyze data, as well as providing tools for authoring and publishing scholarship. Resources will include:
    • data access for findability,
    • computation, virtual experimental facilities (data science),
    • virtual observational facilities (computer simulation as observation),
    • publication access for data sharing and improved reproducibility
    • ease of access to data science services.

  • Built on Service Oriented Architecture (SOA): joining independent services, provide integrated capabilities.
  • It will be extendable with new and domain specific tools (from astrophysics to digital humanities).
  • Modes of access available to any researcher (access through web browsers)

50 of 179

51 of 179

Thank you

Any Questions?

52 of 179

Bringing Data Close to Compute at Harvard Dataverse

Stefano M. Iacus

Senior Research Scientist & Director of Data Science and Product Research, IQSS, Harvard University

53 of 179

What is Dataverse?

An open-source platform that provides a generalist repository to publish, cite, and archive research data

Built to support multiple types of data, users, and workflows

Supports FAIR principles and Signposting.

Developed mainly at Harvard’s Institute for Quantitative Social Science (IQSS) since 2006 + key contributors from our large community

Started as a data-sharing platform for the social sciences; it now covers almost all disciplines.

54 of 179

Who is using Dataverse?

  • datasets
    • 90K Harvard DV
    • [427K whole DV network]

  • files
    • 1.6M Harvard DV
    • [5.84M whole DV network]

  • storage
    • ~ 70TB Harvard DV

Contributed by 70K users [Harvard DV]

  • 115 installations (+14 new installations in 2023)

55 of 179

Who is using Dataverse?

recent containerization is giving a boost

56 of 179

The FAIR Guiding Principles (Wilkinson et al. 2016)

FINDABLE

Increases visibility, citations, and impact of research

Supports knowledge discovery and innovation

ACCESSIBLE

Streamlines and maximizes ability to build upon previous research results

Attracts partnerships with researchers and business in allied disciplines

REUSABLE

Promotes use and reuse of data allowing resources to be allocated wisely

Improves reproducibility and reliability of research results

INTEROPERABLE

Supports and promotes inter- and cross- disciplinary data and reuse

FAIR

Who makes DATA fair? Repositories (and researchers)!

  • Assigning persistent identifiers (DOI, ORCID, etc)
  • Structuring metadata according to disciplinary standards or schemas
  • Indexing data as searchable resources
  • Retrieving datasets according to open protocols
  • Preserving data files and metadata
  • Tracking provenance and versions

🥐-ML (Croissant, a metadata format for ML datasets)

57 of 179

Discoverability & Interoperability

58 of 179

Traditional repositories' web pages are not optimized for use by machine agents that navigate the scholarly web.

How can a robot determine which link on a landing page leads to content and which to metadata?

How can a bot distinguish those links from the myriad of other links on the page?

Signposting exposes this information to bots in a standards-based way.

Signposting and Discoverability

https://tinyurl.com/FAIR-Signposting-GREI

FAIR Signposting “Level 1”
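As a concrete illustration of how a bot consumes Signposting, the sketch below issues an HTTP HEAD request to a landing page and reads the typed links from the Link header (the URL is a placeholder; rel values such as cite-as, describedby, and item are the ones FAIR Signposting defines):

```python
# Minimal sketch: read FAIR Signposting typed links from a landing page's
# HTTP Link header. The landing-page URL is a placeholder.
import requests

resp = requests.head("https://example.org/dataset/landing-page", allow_redirects=True)

# requests parses the Link header into a dict keyed by the rel value,
# e.g. "cite-as" (persistent identifier), "describedby" (metadata), "item" (content).
for rel, link in resp.links.items():
    print(rel, "->", link.get("url"))
```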

59 of 179

Harvard Dataverse through the data life cycle

(present and planned integrations)

60 of 179

Large data support at Harvard Dataverse

GB

TB

PB

Upload through Dataverse

Direct upload/download to S3

Globus Transfer to S3

Reference Data in Remote Stores (HTTP -> Globus)

Sensitive

Globus Transfer to File/Tape

Part of the Harvard Data Commons Project

61 of 179

Globus Endpoints

Dataverse

Server

Globus Service

Researcher’s Browser

Dataverse Dataset

Transfer In/Out or Reference

launch

reliable parallel transfer

Dataverse- Globus Transfer App

Managed

Globus Endpoint

(e.g. over tape storage)

Globus Store(s)

S3 Store

File Store

Remote Store

manage ACLs

launch

notify

request transfer

monitor transfers

62 of 179

How to compute on data stored at Harvard Dataverse?

Previewers and AI tools to interrogate data are integrated into the Dataverse UI but work mostly for small data.

So far, large data must be downloaded over the net in order to enable computing.
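For orientation, a client can at least inspect a dataset's file list before committing to a large download; a hedged sketch using the Dataverse native API (the DOI below is a placeholder):

```python
# Sketch: list the files in a Harvard Dataverse dataset via the native API
# before downloading anything. The DOI below is a placeholder.
import requests

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/EXAMPLE"   # placeholder persistent identifier

meta = requests.get(f"{BASE}/api/datasets/:persistentId/",
                    params={"persistentId": DOI}).json()

for f in meta["data"]["latestVersion"]["files"]:
    df = f["dataFile"]
    print(df["filename"], df.get("filesize"), "bytes", "-> id", df["id"])
    # a single file could then be fetched from f"{BASE}/api/access/datafile/{df['id']}"
```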

63 of 179

Dataverse supports Globus in different ways:

  • Managed Globus file store (Globus endpoint: File/Tape) – DV controls access: True; Globus transfer to/from: True; Ingest/previews/HTTP download: False
  • Managed Globus S3 store (Globus endpoint: S3 connector) – DV controls access: True; Globus transfer to/from: True; Ingest/previews/HTTP download: True
  • Remote Globus store (Globus endpoint: any trusted) – DV controls access: False; Globus transfer to/from: N/A (reference only); Ingest/previews/HTTP download: HTTP, possibly at the remote endpoint

64 of 179

The MOC version of Harvard Dataverse (PoC)

65 of 179

The MOC version of Harvard Dataverse (PoC)

happy Harvard Dataverse user

(only NERC PIs will be able to run compute)

tape

disk

66 of 179

The MOC version of Harvard Dataverse (PoC)

67 of 179

Dataverse’s Globus application (Disk/Tape)

68 of 179

How Dataverse manages Globus transfer

  • The “Upload with Globus” button is available on the Upload Files panel after the dataset has been created

  • Transfer is done in the Dataverse Globus application
    • Handles user login to Globus, file selection, and initiating the transfer

    • Coordinates with Dataverse to handle access control and for Dataverse to monitor the transfer/update the dataset

  • Dataverse Globus Transfer API
    • Used by the Dataverse-Globus app
    • Available for use by other tools

69 of 179

Globus - Dataverse Transfer Tool

Globus Directory Connection

Dataverse transfer space

  1. Authenticate to Globus
  2. Connect to Globus storage
  3. Select files for transfer
  4. Submit the transfer to a dataset in Dataverse.

70 of 179

Notification mechanism

71 of 179

Once the process is complete, the data is published

NESE tape storage

Valid DOI

72 of 179

Python notebook

Valid DOI

Computing on data

73 of 179

This will spin up the JupyterLab container with the pre-loaded notebook taken from the dataset.

All files in this collection are seen as local to the Jupyter instance. Python will simply load them into memory for computing purposes.
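A hedged sketch of what that looks like from inside the notebook (the file name is hypothetical): because the dataset's files are exposed as local paths, plain pandas calls are all that is needed.

```python
# Inside the JupyterLab container the dataset's files appear as local paths,
# so ordinary pandas I/O is enough. The file name is hypothetical.
import pandas as pd

df = pd.read_csv("events.csv")      # a tabular file from the Dataverse dataset
print(df.describe())                # ...and normal in-memory analysis follows
```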

74 of 179

NERC endpoint for the containerized storage (which exists on NESE)

75 of 179

Automatic mapping of local file names (local to the python notebook) to Harvard Dataverse file pointers on NESE

76 of 179

Then some nice computation happens

77 of 179

Towards AI integration

This chatbot only sees the tabular data but is clueless about the metadata

78 of 179

Tell me what is this data about

Cool but poor

This chatbot only sees the tabular data but is clueless about the metadata

79 of 179

tell me the range of latitudes and longitudes with the highest number of events

ok-ish

This chatbot only sees the tabular data but is clueless about the metadata

80 of 179

map the range of latitudes and longitudes with the highest number of events to the names of countries

LLM kicks in

This chatbot only sees the tabular data but is clueless about the metadata

81 of 179

Traditional Dataverse search based on Solr (Apache Lucene)

(Keywords) Query: “covid cases in Italy”

82 of 179

Search via embeddings

(NLP) Query: “datasets about covid cases in Italy”

83 of 179

Search via embeddings

(NLP) Query: “datasets su casi di covid in Italia” (“datasets about covid cases in Italy”) [LLM kicks in]
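A minimal sketch of the embedding-based search idea (assuming the sentence-transformers package and a multilingual model; the dataset descriptions are made up, and this is not the production Dataverse indexer). Both the English and the Italian query land near the same dataset because they embed close to its description:

```python
# Toy sketch of search via embeddings. Assumes the sentence-transformers
# package; the model choice and descriptions are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

descriptions = [                      # made-up dataset descriptions
    "Daily COVID-19 cases reported by Italian regions, 2020-2022",
    "Air pollution measurements for US zip codes",
]
doc_emb = model.encode(descriptions, convert_to_tensor=True)

for query in ["datasets about covid cases in Italy",
              "datasets su casi di covid in Italia"]:
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]
    best = int(scores.argmax())
    print(query, "->", descriptions[best])
```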

84 of 179

Roadmap

Challenge: Long term sustainability

  • add support for R language
  • add support for configuring Jupyter VM CPU/GPU/RAM to suit different workflows
  • from “NERC PIs only” to more general user base ($$$)
  • establish a billing mechanism for compute
  • move MOC Dataverse from OpenStack to OpenShift
  • moving to production
  • HDV is essentially cost-free for all academic users, but we are going to build a business model to sustain long term costs (e.g. charging for compute, PT of storage, etc)

  • MOC (NERC/NESE) will be the key to sustainability for Dataverse, both technologically and administratively (e.g. managing external users).

85 of 179

Thanks to:

Any Questions?

L. Andreev, P. Durbin, C. Boyd, G. Durand, S. Barbosa (IQSS-Dataverse), J. Myers (GDCC), O. Bertuch (FZJ)

S. Yockel, M. Munakami, M. Kupcevic, M. Shad, F. Pontiggia (NESE, NERC, HUIT)

D. Shi (Red Hat), O. Krieger (BU)

Harvard Dataverse

Dataverse Project

86 of 179

2024 MOC Alliance Workshop

Break and Networking Time

87 of 179

Norwegian Research Commons and Implications for the MOC

Rory Macneil

Founder and CEO, Research Space

88 of 179

The Norwegian Research Commons

as a model for a 

NERC Research Commons

Rory Macneil, Research Space

MOC Alliance Workshop

February 28, 2024 – Boston

89 of 179

Overview

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

90 of 179

Overview

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

91 of 179

Context

92 of 179

Problem

Lack of data-centric tools

Lack of interoperability between tools

Impeding FAIR

Creating friction for researchers

Siloed data

93 of 179

Solution? Research Commons

”Bring together data with cloud computing infrastructure and commonly used software, services and applications for managing, analyzing and sharing data to create an interoperable resource for a research community”

Scott Yockel, Harvard University

Towards a Data Commons at Harvard

94 of 179

Who will provide Research Commons?

  • Universities

  • Harvard

  • National organizations

  • Canada’s Digital Research Alliance

  • Supra national organizations

  • EOSC: EUDAT Collaborative Data Infrastructure

95 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

96 of 179

Global Open Research Commons International Model

  • Developed by Research Data Alliance working group
  • Released November 2023
  • Model/guideline for Research Commons
  • Three core technical elements and surrounding process/governance wrap
  • Many countries are looking at adopting – Canada, Sweden, Netherlands, Germany, UK, etc.

97 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

98 of 179

Reason: Technical core

  • Core elements are Dataverse, iRODS and RSpace, which are already integrated
  • Includes integration of these and other resources with storage and compute infrastructure
  • Focus of proposed work is on further enhancing existing interoperability between tools
  • Objective: integrated infrastructure that facilitates data and metadata flow throughout the research data lifecycle

99 of 179

”Bring together data with cloud computing infrastructure and commonly used software, services and applications for managing, analyzing and sharing data to create an interoperable resource for a research community”

Scott Yockel, Harvard University

100 of 179

REASON as a model for other Research Commons

  • Comprehensive use of GORC elements
  • Detailed instantiation of Services and Tools Element of the GORC model
  • Built around a group of complementary generalist tools, all designed to enhance FAIRification of data
  • Focus on interoperability of tools, data and metadata
  • Encompasses need to interoperate with existing generalist and domain specific research infrastructure

101 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

102 of 179

The NERC as a

Research Commons

103 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

104 of 179

Dataverse

  • Generalist data repository
  • Harvard + Community
  • Part of GREI
  • Integration with compute – Dataverse work @Harvard

105 of 179

RSpace: Data-centric digital research platform designed to interoperate with and connect research infrastructure

106 of 179

RSpace – Dataverse Integration

107 of 179

NERC Research Commons starting with Dataverse and

RSpace

108 of 179

Dataverse and RSpace as initial core of NERC Research Commons

Active data management + repository / Ecosystem of connected tools / Integration with storage / Integration with compute

109 of 179

RSpace is now being deployed on the NERC!

110 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

111 of 179

NERC Research Commons as trigger for federation with other Open Clouds and Research Commons

112 of 179

Resources and contact

Research Commons

GORC International Model

  • https://zenodo.org/records/10694490

REASON

  • https://zenodo.org/records/10410202

113 of 179

Institutional example 1: UCL

Connectivity with UCL infrastructure

114 of 179

Institutional example 1: UCL

Powering a FAIR research

data/metadata flow at the institutional level

115 of 179

Institutional example 2: Harvard

116 of 179

National example 1: European Open Science Cloud

EUDAT Collaborative Data Infrastructure

117 of 179

National example 2: Canada

Digital Research Alliance Research Commons

118 of 179

Stage 2: Export data from RSpace to iRODS

  • Suggested by SURF

  • What: Push data from RSpace to iRODS

  • How: Use existing RSpace data export mechanism and interface

  • Why: Enable management of RSpace data in iRODS and association of RSpace data with other data managed in iRODS

  • Benefits: Enhanced FAIRification of data from RSpace and other data in iRODS

119 of 179

Federated Systems

We have seen siloed systems – connected through a central hub or portal, but with data and processes perhaps in walled gardens in proprietary formats or held behind subscription-based services.

This type of system requires continued budget for the subscriptions as well as vigilance to make sure your data is portable.

To plan for truly federated services requires more time and energy, but can result in a more robust ecosystem of interoperable pieces, resilient to shifting budgets and the consistent changing of underlying technologies.

Diplomacy (and interoperability) between sovereign systems is a more mature, slow, iterative process. It is how infrastructure should behave. It is best suited to be powered by open source solutions and well-understood formats and protocols.

120 of 179

Thank you

Any Questions?

121 of 179

Two Sigma Memento: Why Good Artifact Naming Matters

Mark Roth

Managing Director, Data Engineering, Two Sigma Investments, LP

122 of 179

Important Legal Information

This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity.

The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

Disclaimer

123 of 179

“Real names tell you the story of the things they belong to in my language, in the Old Entish as you might say”

  • J.R.R. Tolkien, The Two Towers

Image credit: https://www.etsy.com/listing/715203686/a4-treebeard-poster-lord-of-the-rings

124 of 179

What Can Memento Do For Me?

  • Better Provenance
    • Maintain and communicate accurate provenance
  • Better Reproducibility
    • Make research more reproducible and enable continuing research
  • Better Ergonomics
    • Organize research notebooks and name intermediate artifacts
    • Reduce cost and research time by never repeating the same computation twice

On some level, these are all about naming!

125 of 179

Naming is hard

126 of 179

  1. cache invalidation
  2. naming things
  3. off-by-one errors

Credit: Phil Karlton, Leon Bambrick, https://martinfowler.com/bliki/TwoHardThings.html

There are only two hard things in Computer Science:

Memento helps with 2 of 3!

127 of 179

Hypothesis

Data Needed

Zip codes in NYC with higher median incomes show higher Citi Bike usage

  1. Citi Bike Usage by ZIP
  2. Median Income by ZIP

128 of 179

Final Data

Local Store

Questions

  1. What are the original data sources?
  2. What methodology (code) was used to process this?
  3. Where are the intermediate artifacts?
  4. Can I reproduce this data?
  5. How do I generate data for 2021?

2020.csv

date, zip, income, usage

2020-02-01, 10013, 150675, 113
2020-02-01, 10014, 147267, 107
...

129 of 179

Step 1. Ingest the Data

Runtime Env

?

?

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

?

?

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

2020MM-citibike-tripdata.csv.zip

2020_

irs_gov_statistics_wealth.xlsx

step1_ingest_data(yyyymm)

ingest_s3(url, pathRegEx)

ingest_citibike(year, month)

A

B

C

130 of 179

Step 2. Normalize the Data

Runtime Env

?

?

Local Store

Local Store

2020MM-citibike-tripdata.csv.zip

2020_irs_gov_statistics_wealth.xlsx

?

?

normalize_citibike(year)

normalize_wealth(year)

citibike-2020.csv

wealth-2020.csv

step2_citibike_csv(yyyy)

merge_zip_to_csv(src, dest, **params)

normalize_citibike(year)

B

C

A

131 of 179

Step 3. Join the Data

Runtime Env

join_citibike_wealth(year, tolerance)

Local Store

Local Store

citibike-2020.csv

wealth-2020.csv

2020.csv

132 of 179

The Whole Pipeline

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

2020_

irs_gov_statistics_wealth.xlsx

2020MM-citibike-tripdata.csv.zip

Runtime Env

normalize_citibike(year)

normalize_wealth(year)

Local Store

citibike-2020.csv

wealth-2020.csv

Runtime Env

join_citibike_wealth(year, tolerance)

2020.csv

Local Store

133 of 179

Change Management

Runtime Env

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

2020_

irs_gov_statistics_wealth.xlsx

2020MM-citibike-tripdata.csv.zip

normalize_citibike(year)

normalize_wealth(year)

Local Store

citibike-2020.csv

wealth-2020.csv

join_citibike_wealth(year, tolerance)

2020.csv

Local Store

New Data!

New Tolerance

Fix Normalization

New file format!

New CPU Architecture

134 of 179

Better Naming (Old Entish approach)

2020.csv

cpu_intel-env-gpu_nvidia-env-pandas_2.2.0-env-amazon_s3_tripdata-20230101-asof-2020-1-ingest_citibike-2020-normalize_citibike-irs_gov_statistics-20230101-asof-2020-ingest_irs_gov_statistics-2020-normalize_weath_v2-2020-0.7-join_citibike_wealth.csv

135 of 179

The Ents could have saved themselves a lot of time if they knew about hashing.

136 of 179

Better Naming (Memento Approach)

2020.csv

+

join_citibike_wealth(2020, 0.7)

#99872fbb
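A toy sketch of why the short hash can replace the Old Entish file name (this is illustrative only, not the actual Memento hashing scheme): the hash is computed over the function, its parameters, the environment, and the hashes of everything upstream, so any change anywhere in the lineage yields a new name.

```python
# Toy illustration: a short invocation hash stands in for the full
# lineage-encoding name because it is derived from the function, its parameters,
# the environment, and the hashes of its inputs. Not the actual Memento scheme.
import hashlib, json

def invocation_hash(func_name, func_version, params, input_hashes, env):
    spec = json.dumps([func_name, func_version, params, sorted(input_hashes), env],
                      sort_keys=True)
    return hashlib.sha256(spec.encode()).hexdigest()[:8]

citibike = invocation_hash("normalize_citibike", "v1", {"year": 2020}, ["raw-a1b2"], "py3.11")
wealth   = invocation_hash("normalize_wealth",   "v2", {"year": 2020}, ["raw-c3d4"], "py3.11")
joined   = invocation_hash("join_citibike_wealth", "v1",
                           {"year": 2020, "tolerance": 0.7}, [citibike, wealth], "py3.11")
print(joined)   # a '99872fbb'-style short name standing in for 2020.csv
```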

137 of 179

Memento Approach

1. Functions, not artifacts

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

2020_

irs_gov_statistics_wealth.xlsx

2020MM-citibike-tripdata.csv.zip

Runtime Env

normalize_citibike(year)

normalize_wealth(year)

Local Store

citibike-2020.csv

wealth-2020.csv

Runtime Env

join_citibike_wealth(year, tolerance)

2020.csv

Local Store

138 of 179

Memento Approach

1. Functions, not artifacts

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(year)

normalize_wealth(year)

Runtime Env

join_citibike_wealth(year, tolerance)

139 of 179

Memento Approach

2. Durably store ingested data

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(year)

normalize_wealth(year)

Runtime Env

join_citibike_wealth(year, tolerance)

Durable Storage

140 of 179

Memento Approach

3. Hash all code and environments

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(year)

normalize_wealth(year)

Runtime Env

join_citibike_wealth(year, tolerance)

Durable Storage

#af023329

#2276ea01

#297bba2f

#17fbccd4

#1288fe63

#1288fe63

#1288fe63

normalize_citibike(year)

#73612fea

join_citibike_wealth(year, tolerance)

#0039dbbf

141 of 179

Memento Approach

4. Hash, memoize all invocations

Runtime Env

Runtime Env

ingest_citibike(2020, 1)

ingest_irs_gov_

statistics(2020)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(2020)

normalize_wealth(2020)

Runtime Env

join_citibike_wealth(2020, 0.7)

Durable Storage

#af023329

#2276ea01

#297bba2f

#17fbccd4

#99872fbb

#1288fe63

#1288fe63

#1288fe63

join_citibike_wealth(2020, 0.7)

#e8817aec

#1b726dda

#6239bb12

#1997ffb2

#200776dd

142 of 179

Memento Approach

5. Record “mementos”

Runtime Env

Runtime Env

ingest_citibike(2020, 1)

ingest_irs_gov_

statistics(2020)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(2020)

normalize_wealth(2020)

Runtime Env

join_citibike_wealth(2020, 0.7)

Durable Storage

#af023329

#2276ea01

#297bba2f

#17fbccd4

#99872fbb

#1288fe63

#1288fe63

#1288fe63

join_citibike_wealth(2020, 0.7)

#e8817aec

#1b726dda

#6239bb12

#1997ffb2

#200776dd

143 of 179

What’s in a Memento?

Standard Memento Metadata

  • public key of publisher
  • time of invocation
  • total runtime
  • running user
  • observed lineage
  • optional extensions

Inputs

  • function version
  • serialized parameter map
  • dependency set
  • runtime env spec
  • invocation hash

Outputs

  • execution metadata
  • result hash

Signature

Chain

Signature
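Restating the fields above as a data structure (a sketch only; field names paraphrase the slide and are not the actual Memento schema):

```python
# Sketch of a memento record mirroring the fields listed above.
# Field names paraphrase the slide; this is not the actual Memento schema.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class MementoRecord:
    # standard memento metadata
    publisher_public_key: str
    invocation_time: str
    total_runtime_s: float
    running_user: str
    observed_lineage: List[str]            # hashes of upstream mementos
    extensions: Dict[str, Any] = field(default_factory=dict)
    # inputs
    function_version: str = ""
    parameter_map: Dict[str, Any] = field(default_factory=dict)
    dependency_set: List[str] = field(default_factory=list)
    runtime_env_spec: str = ""
    invocation_hash: str = ""
    # outputs
    execution_metadata: Dict[str, Any] = field(default_factory=dict)
    result_hash: str = ""
    # signature chain
    signature: Optional[str] = None
```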

144 of 179

What Can Memento Do For Me?

  • Better Provenance
    • Maintain and communicate accurate provenance
  • Better Reproducibility
    • Make research more reproducible and enable continuing research
  • Better Ergonomics
    • Organize research notebooks and name intermediate artifacts
    • Reduce cost and research time by never repeating the same computation twice

145 of 179

Provenance�is about naming

146 of 179

Why is Provenance Hard?

To understand provenance accurately, we want to know:

  • What raw data was used in the computation?
  • At what time was that raw data ingested?
  • Which parties were involved in ingesting and transforming the data?
  • What was the exact methodology used to transform the data?
  • What was the runtime environment used at each stage?
  • How has the provenance changed over time?

147 of 179

Provenance in Memento

  • Which raw data and ingestion methodology accurately named
  • Time of ingestion accurately named in Memento
  • Running user accurately named in Memento, with signature
  • Methodology accurately named (versioned via hashes)
  • Runtime environment accurately named (hashed)
  • As provenance changes, new mementos are recorded for posterity

Mementos provide high fidelity end-to-end provenance

Most frameworks can be adapted to use Memento as a standard

148 of 179

Reproducibility�is about naming

149 of 179

Why is Reproducibility Hard?

To reproduce computation, all of these must be accurately named (identified, recorded, versioned) in a way they can be reconstructed:

  • Exact snapshots of all external data used
  • Exact methodology (source code)
  • All configuration and parameters provided
  • Runtime environment:
    • Hardware (e.g. CPU architecture)
    • Operating System
    • Installed libraries
  • Use of non-deterministic libraries or hardware (e.g. GPUs)�(not handled by Memento)

150 of 179

Reproducibility in Memento

  • Raw Input Snapshots accurately named and kept in durable storage
  • Methodology accurately named (versioned via hashes)
  • All configuration and parameters accurately named (hashed)
  • Runtime environment accurately named (hashed)

Mementos record everything we need to know to reproduce research

The results are signed and can be validated by 3rd parties

151 of 179

Ergonomic�(and Economic)�Advantages

152 of 179

Organization Advantages

  • More efficient use of researcher time
  • Research notebooks and code become easier to organize
    • Just relax and write functions
    • Memento takes care of serialization, storage
  • Naming becomes easier
    • Use simple but descriptive names for functions
    • Memento tracks dependencies, versions, etc.
  • Automatic memoization
    • When notebook kernel restarts, can easily restart from where you left off

153 of 179

Caching Advantages

Cache needs to be invalidated when:

  • Code changes
  • Parameters change
  • Dependencies change
  • Relevant runtime environment changes (e.g. library upgrades)
  • New ingestion data is used

Surprise: Memento already tracks all of these in the invocation hash, making it possible to do high-fidelity automatic caching!

Caches can even be shared with other team members!

154 of 179

Caching in Memento

Memento

Metadata

Result Hash

Memoized Data

Result Hash

Serialized Results

Function( )

arguments

Results

Invocation Hash

Parameter Map (hash)

Function Reference (hash)

Output Cache (result hash → result)

Memento

Federated

Catalog

(invocation hash → Memento)
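Reading the diagram above as code: a call first maps its invocation hash to a memento, then the memento's result hash to the serialized result. A toy sketch of that two-level lookup (not the twosigma.memento API):

```python
# Toy two-level lookup mirroring the diagram: invocation hash -> memento,
# then result hash -> serialized result. Not the twosigma.memento API.
import hashlib, json, pickle

memento_catalog = {}   # invocation hash -> memento (metadata + result hash)
output_cache = {}      # result hash -> serialized result

def cached_call(func, func_version, env, **params):
    inv = hashlib.sha256(json.dumps(
        [func.__name__, func_version, env, params], sort_keys=True).encode()).hexdigest()
    if inv in memento_catalog:                          # cache hit: no recompute
        return pickle.loads(output_cache[memento_catalog[inv]["result_hash"]])
    result = func(**params)
    blob = pickle.dumps(result)
    rh = hashlib.sha256(blob).hexdigest()
    output_cache[rh] = blob
    memento_catalog[inv] = {"result_hash": rh, "function": func.__name__}
    return result

def join_citibike_wealth(year, tolerance):              # stand-in for the real function
    return {"year": year, "tolerance": tolerance}

print(cached_call(join_citibike_wealth, "v1", "py3.11", year=2020, tolerance=0.7))
print(cached_call(join_citibike_wealth, "v1", "py3.11", year=2020, tolerance=0.7))  # served from cache
```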

155 of 179

Memento:�Current State�and Next Steps

156 of 179

Memento Open Source

Two Sigma has decided to open source a version of Memento, in order to help encourage further research

  • https://github.com/twosigma/memento
  • Features:
    • Optimized for Python and works with Pandas
    • Works well in Jupyter notebook environments (and elsewhere)
  • Memento whitepaper coming soon

pip install twosigma.memento

157 of 179

Future Work

  • Catch up Python reference implementation to ideas in white paper
    • Durable storage for ingestion
    • Context parameters
    • Signature chains
  • Distributed compute and storage plugins for major cloud providers
  • Mementos for Data Warehouses
  • Cache replacement policies, garbage collection
  • More Language Bindings (Java, Julia, Rust, …)
  • Memento Server
    • Monitoring, Metrics and Management
    • Containerization of Memento Functions
    • Federated Web of Data

158 of 179

Thank you

Any Questions?

159 of 179

D4N: A Community Cache for an Open Cloud

Matt Benjamin, Engineering Manager at IBM Storage

Amin Mosayyebzadeh, PhD Candidate at Boston University

160 of 179

Brief History of IBM Ceph Object / MOC Collaboration

  • D3N (2018+)
    • Emine Ugur Kaynar, Mania Abdi, Mohammad H. HajKazemi
    • L1/L2 performance evaluation (Red Hat, Ben England)
  • D4N
    • Emine U. Kaynar, Amin Mosayyebzadeh, Mania Abdi
    • Design Collaboration (Red Hat Ceph teams, esp. Object)
  • D4N Upstreaming
    • Amin Mosayyebzadeh, Sumatra Dhimoyee
    • Pritha Srivastava, Samarah Uriarte [et al]
  • D4N + Locality
    • Amin Mosayyebzadeh, Sumatra Dhimoyee, Austin Jamias

161 of 179

Alignment around Fusion of Data and Compute

  • MOC Researchers
    • D3/D4N
      • intelligent, intermediate S3 materialized cache (highlights: Kariz, e.g. integration w/workload planning)
      • general workload acceleration
      • exploratory writeback cache (e.g., ephemeral data, object packing)

  • Ceph Object Engineering (Red Hat, now IBM Storage)
    • focus on analytics workloads, AI
      • S3A (and SwiftA, Wal*Mart)
      • S3-select
      • New Work
        • Arrow Flight
        • FlightSql

162 of 179

D4N Upstreaming Effort

  • D4N cache design based on Zipper API, separated into:
    • Filter driver
    • Policy driver
    • Cache backend driver
  • Support for multiple cache backends: currently SSD-backed and Redis-backed; other backends can be added
  • Flexible replacement policy
    • LFUDA and LRU policies are currently implemented; other policies can be added
  • Boost::ASIO for IO operations
  • Redis server for caching location information, hot objects, transactional state, message bus
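For readers unfamiliar with the LFUDA policy named above (Least Frequently Used with Dynamic Aging), here is a toy sketch of the policy itself; it is illustrative only and not the D4N/RGW implementation.

```python
# Toy LFUDA (Least Frequently Used with Dynamic Aging) cache, illustrative only;
# not the D4N/RGW implementation. Each object's key is frequency + cache age L,
# and L is bumped to the evicted key so long-resident objects eventually age out.
class LFUDACache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.age = 0.0                 # L, the dynamic aging factor
        self.entries = {}              # name -> {"value", "freq", "key"}

    def get(self, name):
        e = self.entries.get(name)
        if e is None:
            return None
        e["freq"] += 1
        e["key"] = e["freq"] + self.age
        return e["value"]

    def put(self, name, value):
        if name not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda n: self.entries[n]["key"])
            self.age = self.entries[victim]["key"]   # dynamic aging step
            del self.entries[victim]
        self.entries[name] = {"value": value, "freq": 1, "key": 1 + self.age}

cache = LFUDACache(capacity=2)
cache.put("a", "block-a"); cache.get("a")
cache.put("b", "block-b")
cache.put("c", "block-c")              # evicts the least valuable of a/b
print(sorted(cache.entries))
```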

163 of 179

D4N Integration (Zipper Filter)

164 of 179

D4N Upstreaming Effort

  • Current Status
    • approaching upstream single-node MVP
      • supports a read cache on a single node, using the SSD-backed cache backend and the LFUDA replacement policy. Blocks of data are cached; the head object is not cached yet, but support for caching it is in place and will be used to cache head objects in the writeback cache.
  • Future work
    • D4N distributed read cache (from research version)
    • Resilient write-back cache (c.f.)
  • Main challenges recently
    • ASIO related bugs
    • bugs in the new Boost::Redis client (a new ASIO-based C++ client driver)

165 of 179

D4N with Higher Locality

RGW

RGW

RW cache

RW cache

Directory

RGW

RGW

Read

cache

Read

cache

Directory

Write back cache

166 of 179

Distributed D4N

  • Connecting D4N instances across data center
  • The vision is sharing
    • S3
    • NVMe-Over-TCP
    • RDMA

167 of 179

D4N and K8s

  • Want pods in any K8s cluster to automatically make requests to the nearest RGW
  • K8s clusters are ephemeral and elastic
  • Use K8s Topology Aware Routing to find the nearest RGW


168 of 179

D4N Use Cases

  • Analytical frameworks
  • ML/AI
    • Significant intermediate data (writes) with a high reuse rate (reads)
    • Snapshotting the progress stages

169 of 179

BDAS: Big Data Analytic Support

(Ceph Object Team Code Name)

  • S3-select
    • Trino, Presto, Spark
    • Complete
      • supports:
        • CSV, Parquet, JSON back-ends
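A hedged client-side sketch of what an S3-select query looks like against an RGW S3 endpoint, using boto3 (the endpoint URL, credentials, bucket, key, and column names are placeholders, not from the talk):

```python
# Sketch: run an S3 Select query against a CSV object stored behind RGW.
# Endpoint, credentials, bucket/key, and column names are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://rgw.example.org",
                  aws_access_key_id="...", aws_secret_access_key="...")

resp = s3.select_object_content(
    Bucket="datasets",
    Key="citibike-2020.csv",
    ExpressionType="SQL",
    Expression="SELECT s.zip, s.usage FROM s3object s WHERE CAST(s.usage AS INT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:            # the response is an event stream
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
```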

170 of 179

BDAS: Big Data Analytic Support

(Ceph Object Team Code Name)

  • Arrow / Arrow Flight
    • Apache Arrow
      • a columnar data format that is optimized for in-memory data and enables efficient, zero-copy computation.
      • The Arrow format and the Parquet format, which is optimized for storage, are both columnar and easily converted to each other.
    • Arrow Flight
      • a high-performance distributed data transfer protocol that moves data in Arrow, Parquet, and other formats without needing to (de-)serialize. Furthermore, it can access data across multiple sources in parallel and transfer data that cannot all fit in memory in batches to be processed.
    • Because both technologies are platform- and language-independent, they offer a standard for interoperability.
    • RGW is being enhanced to act as an Arrow Flight server to feed data to Arrow clients and store intermediate and final results after computation.
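From the client side, consuming such a Flight server could look like the sketch below, using pyarrow's Flight client (the endpoint URI and ticket contents are assumptions, since the RGW Flight server described above is still in development):

```python
# Sketch of an Arrow Flight client pulling a dataset as Arrow record batches.
# The endpoint URI and ticket contents are assumptions; the RGW Flight server
# mentioned above is still in development.
import pyarrow.flight as flight

client = flight.connect("grpc://rgw.example.org:8815")

# Ask the server for a dataset; what goes in the ticket is server-defined.
reader = client.do_get(flight.Ticket(b"datasets/citibike-2020"))

table = reader.read_all()        # Arrow table, no row-by-row deserialization
print(table.num_rows, table.schema)
```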

171 of 179

BDAS: Big Data Analytic Support

(Ceph Object Team Code Name)

  • Arrow / Arrow Flight
    • Computation in WB Cache
    • New Query Modes
      • With additional knowledge of the internal structure of data in Arrow and Parquet formats, RGW can optimize the layout and striping of data to speed up retrieval of specific columns and rows and to respond to Flight SQL queries.

172 of 179

MOC Collaboration Futures

  • IBM and Ceph teams
    • Resilient caching in the RGW/S3 tier is critical to in-progress and planned efforts to combine predicate pushdown with intermediate caching (of data and results)
    • Redis-based directory has formed the basis for more robust state sharing among cooperating RGWs, even in non-caching scenarios

  • MOC
    • K8s integration work facilitates convergence with OpenShift (Red Hat) and Fusion HCI (IBM) workloads
    • Could form the foundation of new caching-enabled S3 storage offerings to Open Cloud consumers

173 of 179

Credits

PhD Students

  • Amin Mosayyebzadeh
  • Sumatra Dhimoyee

Past

  • Mania Abdi
  • Mohammad Hossein Hajkazemi
  • Emine Ugur Kaynar
  • Raja R. Sambasivan
  • Ata Turk

Advisors

  • Brett Niver (Red Hat, IBM Storage)
  • Kyle Bader (Red Hat, IBM Storage)

Past

  • David Cohen (Intel)
  • Larry Rudolph (Two Sigma)

Research Professors

  • Orran Krieger, Boston University
  • Peter Desnoyers, Northeastern University

IBM Ceph Team

  • Casey Bodley (Upstream lead, ASIO)
  • Adam Emerson (Async)
  • Daniel Gryniewicz (Ceph Zipper)
  • Eric Ivancich (BDAS)
  • Ali Maredia (D4N)
  • Gal Salomon (S3-Select)
  • Pritha Srivastava (Cache Integration)
  • Samarah Uriarte (Cache Int., Redis Int)

174 of 179

Thank you

Any Questions?

175 of 179

Michael Daitzman

Director of Product Development,

Mass Open Cloud Alliance

176 of 179

United in the Cloud

177 of 179

The MOC Alliance Team

Emmanuel Cecchet

Software Engineer

Organization: UMass

Projects: OCT, ESI

Harvard University:

Nick Amento, Network Architect

Robin Weber, NESE

Quan Pham

Software Engineer

Organization: Boston University

Projects: NERC, Mass Open Cloud Alliance

James Culbert

Director of IT

Organization: MGHPCC

Projects: NERC, MGHPCC, NEFRC, OSN

Danni Shi

Senior Software Engineer

Organization: Red Hat

Projects: OPE, NERC

Tzu-Mainn Chen

Organization: Red Hat

Projects: ESI, NERC

Isaiah Stapleton

Software Engineer

Organization: Red Hat

Projects: OPE, NERC

Steve Heckman, BU

Surbi Kanthed, Red Hat

Dylan Stewart, Red Hat

178 of 179

Two More Things . . . .

  • OCT Advisory Board Meeting is in 315 - up the stairs in the back

  • Please join us on the 17th Floor of the Center for Computing and Data Sciences in room 1750 at 6:30pm

179 of 179

2024 MOC Alliance Workshop

Reception

6:30 - 9:00 pm

CDS 1750 (17th Floor)

665 Commonwealth Ave