1 of 179

2024 MOC Alliance Workshop

Lunch

2 of 179

The benefits of sharing and collaboration in the MOC

"If I have seen further it is by standing on the shoulders of Giants" Issac Newton

Larry Rudolph, Distinguished Fellow, MOC and Two Sigma, LP

3 of 179

Disclaimer

This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity.

The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa

Important Legal Information

3

4 of 179

Data (DB)

Compute

In the beginning there was

Is Compute the king, or is Data the king? (Both)

Marriage ⇒ data pipelines

Data Pipeline

5 of 179

Imagine a world where

Before you can use data

first download it or make a copy in the cloud

did you get enough storage (can you afford it?)

did you get the latest version

clean and normalize the data yourself

think about checkpoints, saved intermediate results

make sure there is enough storage or available servers / GPUs

re-execute everything for each modification of parameter choice

Repeat until you are exhausted …

For many of us, the world is even messier than this

6 of 179

Imagine a world in which

immediately leverage the related work

every time you run a program,

it supports the easy sharing of data and code

it is reproducible

it automatically recovers from failures

it runs more efficiently

the cost of producing may be offset by those consuming

the marriage is sustainable

Supports community culture

and natural pedagogy

7 of 179

Consider developing AI models

  1. Download or copy raw data
  2. Perform preprocessing steps:
    1. tokenization
    2. cleaning
    3. normalization
  3. Save the transformed data somewhere, call it DC1

Depict a series of gears and levers transforming raw data into neatly organized and standardized inputs. Describe each gear representing a different preprocessing step like tokenization, cleaning, and normalization, emphasizing the transformation process.

This is the Google Gemini prompt (as with all other pictures)

Where to download, how often, when to delete

Write it myself or find code made available by a kind soul or “buy” it

Each time a new error is found,

start from the beginning

Maybe I want to name each version, depending on versions of above steps

Step 1: Data Collection
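One way to keep track of "which version of the above steps produced this output" is to derive the intermediate's name from the raw source and the step versions, so a changed choice never silently overwrites an old result. A minimal sketch (the names and fields are illustrative, not from the talk):

```python
# A minimal sketch (not from the talk) of naming a preprocessed artifact ("DC1")
# after the raw source and the versions of the steps that produced it, so a
# re-run with different choices never silently overwrites an older result.
import hashlib
import json

def artifact_name(raw_source: str, steps: dict) -> str:
    """Derive a stable name for the transformed data from its inputs.

    raw_source -- identifier of the downloaded raw data (e.g. a URL or snapshot id)
    steps      -- mapping of preprocessing step -> version/parameters used
    """
    spec = json.dumps({"raw": raw_source, "steps": steps}, sort_keys=True)
    digest = hashlib.sha256(spec.encode()).hexdigest()[:12]
    return f"DC1-{digest}"

# Example: the same raw data with a new tokenizer version gets a new name.
print(artifact_name("s3://bucket/raw-2024-01-01",
                    {"tokenize": "v2", "clean": "v1", "normalize": {"lowercase": True}}))
```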

8 of 179

Consider developing AI models

For each source of data

Perform all the steps in data collection

Giving each one a different output name

Check for new versions

Make a picture of a scene where a conveyor belt is moving through a diverse landscape, collecting various sources of data such as books, articles, websites, and social media posts. Mention the variety and richness of the data being collected.

Each source may need account/password/api-key

Data collection steps may fail. Decide how much to re-run

Do I run this on many of the versions from data collection? How do I remember what input was used for each output?

Step 2: Preprocessing

9 of 179

Consider developing AI models

Build a model in some code

Rent GPUs

Take all the preprocessed data

Break into batches

Feed them into the GPUs to train the model

Save the model in some other file

Draw a caricature of a commons field with data and computers are grazing in the style of Boston Commons

Lots of decisions

Crashes or bugs mean either paying for GPUs while debugging (and redoing all the earlier steps), or releasing and re-acquiring GPUs

Learn by trial and error

Cloud costs quickly add up

Step 3: Training

10 of 179

You get the idea

A lot of steps in this meta pipeline (recursive pipelines)

  1. Data Collection
  2. Preprocessing
  3. Training
  4. Fine Tuning
  5. Deployment
  6. Inference
  7. Evaluation
  8. Monitoring and Maintenance

Data pipelines are usually DAGs

with nodes and edges crossing domain boundaries
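To make "pipeline as a DAG" concrete, here is a tiny illustrative sketch (the stage names are hypothetical, not from the talk) that orders the stages with a topological sort:

```python
# Tiny illustration of a pipeline as a DAG; stage names are made up for the example.
from graphlib import TopologicalSorter

# mapping: stage -> set of stages it depends on
pipeline = {
    "preprocess": {"collect"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# A valid execution order respects every dependency edge.
print(list(TopologicalSorter(pipeline).static_order()))
# e.g. ['collect', 'preprocess', 'train', 'evaluate', 'deploy']
```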

Not only humans in the loop, but unknown humans

Upstream decisions affect unknown downstream users

11 of 179

No Application is an Island

12 of 179

No Data is an Island

13 of 179

No Server is an Island

14 of 179

Data, Pipeline, HW at Scale

Data represents “some capture of the real world” → complex system

Number and size of data sets @ scale → complex system

Servers, Network, Disks, Power, Heat @ scale → complex system

“feature of complex systems” → unanticipated emergent properties

Solution:

Share

Collaborate

15 of 179

Tenant’s view:

Pipeline (DAG part) is the application

16 of 179

Provider’s view:

The data center is the computer

17 of 179

Goal: Optimize like we do for apps running on a computer

18 of 179

Note

I am mixing two different, but similar ideas

  • Data + Compute
  • Tenant + Provider

Both are about more sharing and more collaboration

(marriage between couples?)

19 of 179

Imagine a world where there is

a data center with

plenty of storage, networking, and compute

ability for any server to access any data

probes measuring network congestion, server load, etc.

multi-level caching throughout

dynamic security partitions, usage accounting

anyone can try out new h/w

no single controlling commercial entity

profit is not the motive (helping the tenant is the goal)

Hmm, this already exists ==> a great platform for sharing

draw a caricature of a data center with storage compute, and networking with probes to measure congestions

20 of 179

Sharing Data

(Dataverse, Norwegian Research Commons)

FAIR (Findable, Accessible, Interoperable, Reusable)

i.e. data without compute is like a day without sunshine

No need for everyone to have their own copy of the data

data immutability w/ security

no need for everyone to pay for its storage copy (while the data center does de-duplication)

No need for everyone to redo data collection, pre-processing, …

saving time, energy

Access to lots of tools that run near the data

Draw a caricature of a commons field with data and computers are grazing in style of Boston Commons

Functional Semantic Types (advertisement)

21 of 179

Sharing DAG-Metadata ⇔ H/W State

Collect metadata provenance of data pipelines / workflows

Learn attributes to do better allocation, placement, execution

Learn which data or code is reused for attribution

Use metadata and runtime info for better object caching

Draw a caricature of the provenance of a data pipeline

Shared Caching

(community caching)

draw several caches with data and code

(Bringing Data Close to Compute, D4N)

22 of 179

Leveraging Shared Memory

Future? (research request)

A pool of disaggregated (DRAM) memory directly accessed by any server in the rack (e.g. CXL 3.0)

Not great for current workloads (Google publication)

Might be great for shared data

Map dataframes, tables to active jobs running in same rack

draw leveraging shared memory

23 of 179

Reproducible

Memento (Two Sigma) & Semantic Cloud (Red Hat)

During data pipeline execution, the results of data transforms may be memoized

In the future, when the same transformation runs on the same data, with the same parameters, and in the same environment,

there is no need to re-execute; just return the memoized value

Sharing of transforms, intermediate values
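A toy sketch of that memoization idea, keyed on the transform, its version, the data, the parameters, and the environment (this is the concept only, not the Memento or Semantic Cloud implementation; all names are made up):

```python
# Toy sketch of the idea above: memoize a transform keyed on the code version,
# the input data, the parameters, and the environment. Concept only -- not the
# actual Memento / Semantic Cloud implementation.
import hashlib, json, platform

_memo = {}  # in practice this would be shared, durable storage

def expensive_compute(data_id, params):
    # placeholder for the real transform
    return f"result({data_id}, {params})"

def run_transform(transform_name, transform_version, data_id, params):
    env = platform.python_version()  # stand-in for a full environment spec
    key = hashlib.sha256(json.dumps(
        [transform_name, transform_version, data_id, params, env],
        sort_keys=True).encode()).hexdigest()
    if key in _memo:                 # same transform, data, params, environment
        return _memo[key]            # -> no need to re-execute
    result = expensive_compute(data_id, params)
    _memo[key] = result
    return result
```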

draw a caricature of mementos

24 of 179

The vision is sharing

The place is the MOC-A

Production-level facility – appetite to share / save $$

Lots of data to share

Researchers want to share with

themselves while debugging

teammates while exploring

collaborators or future ones

all with right attribution, protection, cost sharing model

draw a caricature of a bunch of computers helping out another computer

Thank you!!

Now it’s time for the Data+Compute

engagement party

25 of 179

Thank you

Any Questions?

26 of 179

Emre Keskin, University Research Data Officer

Wayne Gilmore, Executive Director, Research Computing Services

Scott Yockel, University Research Computing Officer

The Building Blocks of Cloud = Research Enablement

27 of 179

HPC TO CLOUD

Why the shift in tech?

BEYOND TECH

Why not let researchers build everything?

RESEARCH COMPUTING

What is this service?

28 of 179

Dawn of the Research Facilitator - 2014 ACI-REF

“Facilitator” - makes an action or process easier

Researchers are faced with:

  • Rapidly changing technology (HPC, specialized processors, data storage, distributed networks, …)
  • Rapidly evolving analytical software
  • Spread of technology to new research disciplines
  • AI is taking over everything

Advanced Cyberinfrastructure - Research and Educational Facilitation: Campus-Based Computational Research Support - NSF Award # 1341935

29 of 179

Research Computing (RC) is at the intersection of providing leading technical solutions & supporting researchers in the scholarly process of discovery and innovation

Co-learn

RC professionals co-learn unique domain-specific problems alongside researchers

Co-create

RC professionals co-create solutions (technical, architectural, pipeline, software, …) alongside researchers

30 of 179

What is HPC and why is it important?

High-performance computing (or Supercomputing from the 90s).

  • Fixed computing environment deployed (and controlled) by Systems Group
  • Large scale computing via batch processing (queuing) system
  • Basic command-line interface
  • Tailored to parallel processing
  • Tailored to large centralized shared scratch storage

Dating back to the Manhattan Project, solving physics, chemistry, and engineering problems has predominantly created data on the fly, at runtime.

In 1966, in his Nobel Prize acceptance speech, Robert Mulliken said: “I would like to emphasize strongly my belief that the era of computing chemists, when hundreds if not thousands of chemists will go to the computing machine instead of the laboratory for increasingly many facets of information, is already at hand.”

31 of 179

Example: Quantum Chemistry Ĥ Ψ = E Ψ

In 1926, physicist Erwin Schrödinger gave us a partial differential equation that describes how the quantum state of a physical system changes with time.

electron-proton

proton-proton

electron-electron
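For reference, the three labels above are the interaction terms of the standard electronic molecular Hamiltonian (Born-Oppenheimer approximation, atomic units); a textbook form, not taken from the slides, is:

```latex
\hat{H} = -\sum_{i} \tfrac{1}{2}\nabla_i^{2}
          - \sum_{i,A} \frac{Z_A}{r_{iA}}        % electron-nucleus (electron-proton) attraction
          + \sum_{i<j} \frac{1}{r_{ij}}          % electron-electron repulsion
          + \sum_{A<B} \frac{Z_A Z_B}{R_{AB}}    % nucleus-nucleus (proton-proton) repulsion
```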

32 of 179

Example: Quantum Chemistry Ĥ Ψ = E Ψ

How do electronic structure programs represent the wavefunction Ψ ?

33 of 179

Example: Quantum Chemistry Ĥ Ψ = E Ψ

How do electronic structure programs represent the wavefunction Ψ ?

10^45 combinations!!!

34 of 179

2013 BU HPC (11M CPU Hours)

968 researchers / 34 departments

107 students / 4 courses / 0 trainings

2023 BU HPC (90M CPU Hours)

3,136 researchers / 95 departments

2,134 students / 53 courses / 60 trainings

35 of 179

The nature of research is changing. RC must adapt!

36 of 179

HPC TO CLOUD

Why the shift in tech?

BEYOND TECH

Why not let researchers build everything?

RESEARCH COMPUTING

What is this service?

37 of 179

From HPC to Cloud

High-performance computing (aka Supercomputing from the 90s).

Fixed computing environment deployed (and controlled) by Systems Group

  • Large scale computing via batch processing (queuing) system
  • Basic command-line interface
  • Tailored to parallel processing
  • Tailored to large centralized shared scratch storage

Cloud native

Flexible (customizable) computing environments controlled by researcher

  • Orchestrated on-demand computing services with a variety of resource types: IaaS, PaaS, SaaS
  • Variety of user interfaces
  • Tailored to many single core tasks
  • Tailored to object or distributed storage
  • Ideal for scale out on-demand computing
  • Basis for modern product deployment

38 of 179

Why was this important?

Strategic Priority

Scale of Need

Data Science Platforms

Collaborative Research

Digital Humanities &

Data Driven Visualization

Restricted Data Analytics

University Product Development

Systems/Cloud Research

Domain Research Platforms/Gateways

Burst Resources

Academics/Training

39 of 179

So this is what we built…

$90M State, BU, Harvard, MIT, NEU, UMass

$5M NSF + Consortium

$2M MTC, $2M BU/HU

These are the investments

$500M RedHat

40 of 179

Who is using NERC?

Institutions: 11

PIs: 67

Total Projects: 121

Users: 771

41 of 179

Data becomes a first class citizen in the cloud

42 of 179

43 of 179

National Studies on Air Pollution and Health

Compute Space

CSV

FST

DAT

RAW Data: Air pollution | Census | Health Data (Medicare & Medicaid)

Open-Source Release: https://github.com/NSAPH-Data-Platform

Michael Bouzinier

User

1. data request

2. extract

3. analysis

Load Database

Data Exploration

44 of 179

HPC TO CLOUD

Why the shift in tech?

BEYOND TECH

Why not let researchers build everything?

RESEARCH COMPUTING

What is this service?

45 of 179

So much data!!!

So many tools!!!

46 of 179

The Problem (as we see it)

  • Lack of data-centric tools throughout the research lifecycle
  • Lack of interoperability between tools

Data lifecycle stages (from the diagram): Planning (creating data); Discovery (acquisition); Storing (data transfer, raw data, reference data); Analysis (data wrangling, data analysis, data sharing); Management (data repositories, data preservation, disposal)

47 of 179

The Problem (as we see it)

Data lifecycle stages (from the diagram): Planning (creating data); Discovery (acquisition); Storing (data transfer, raw data, reference data); Analysis (data wrangling, data analysis, data sharing); Management (data repositories, data preservation, disposal)

  • Easing the organization of large data
  • Facilitating the reproducibility of data
  • Sharing active data internally and externally
  • Supporting data-retention compliance
  • Tracking the provenance of data
  • Providing metrics on data access and use

48 of 179

Harvard Research Data Connect - Vision

An ecosystem of applications, services, and resources, integrated by a standards-based, service-oriented framework and populated by Library Services, University RC and school-based RC services, and Office of the VP for Research support, as well as by researchers themselves working in partnership.

49 of 179

  • Support the entire data lifecycle for conducting research (ease of access to data, publications, etc.) with support for delegation of common activities to automated processes.

  • Support creation, sharing, and curation of digital content and digital scholarship by making it easier to discover and analyze data, as well as providing tools for authoring and publishing scholarship. Resources will include:
    • data access for findability,
    • computation, virtual experimental facilities (data science),
    • virtual observational facilities (computer simulation as observation),
    • publication access for data sharing and improved reproducibility
    • ease of access to data science services.

  • Built on Service Oriented Architecture (SOA): joining independent services, provide integrated capabilities.
  • It will be extendable with new and domain specific tools (from astrophysics to digital humanities).
  • Modes of access available to any researcher (access through web browsers)

50 of 179

51 of 179

Thank you

Any Questions?

52 of 179

Bringing Data Close to Compute at Harvard Dataverse

Stefano M. Iacus

Senior Research Scientist & Director of Data Science and Product Research, IQSS, Harvard University

53 of 179

What is Dataverse?

An open-source platform that provides a generalist repository to publish, cite, and archive research data

Built to support multiple types of data, users, and workflows

Supports FAIR principles and Signposting.

Developed mainly at Harvard’s Institute for Quantitative Social Science (IQSS) since 2006 + key contributors from our large community

Started as a data-sharing platform for the social sciences; it now covers almost all disciplines.

54 of 179

Who is using Dataverse?

  • datasets
    • 90K Harvard DV
    • [427K whole DV network]

  • files
    • 1.6M Harvard DV
    • [5.84M whole DV network]

  • storage
    • ~ 70TB Harvard DV

Contributed by 70K users [Harvard DV]

  • 115 installations (+14 new installations in 2023)

55 of 179

Who is using Dataverse?

recent containerization is giving a boost

56 of 179

The FAIR Guiding Principles (Wilkinson et al. 2016)

FINDABLE

Increases visibility, citations, and impact of research

Supports knowledge discovery and innovation

ACCESSIBLE

Streamlines and maximizes ability to build upon previous research results

Attracts partnerships with researchers and business in allied disciplines

REUSABLE

Promotes use and reuse of data allowing resources to be allocated wisely

Improves reproducibility and reliability of research results

INTEROPERABLE

Supports and promotes inter- and cross- disciplinary data and reuse

FAIR

Who makes DATA fair? Repositories (and researchers)!

  • Assigning persistent identifiers (DOI, ORCID, etc)
  • Structuring metadata according to disciplinary standards or schemas
  • Indexing data as searchable resources
  • Retrieving datasets according to open protocols
  • Preserving data files and metadata
  • Tracking provenance and versions

🥐-ML (Croissant, a metadata format for ML datasets)

57 of 179

Discoverability & Interoperability

58 of 179

Traditional repositories' web pages are not optimized for use by machine agents that navigate the scholarly web.

How can a robot determine which link on a landing page leads to content and which to metadata?

How can a bot distinguish those links from the myriad of other links on the page?

Signposting exposes this information to bots in a standards-based way.

Signposting and Discoverability

https://tinyurl.com/FAIR-Signposting-GREI

FAIR Signposting “Level 1”
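As a concrete illustration of how a bot consumes Signposting, the sketch below issues an HTTP HEAD request to a landing page and reads the typed links from the Link header (the URL is a placeholder; rel values such as cite-as, describedby, and item are the ones FAIR Signposting defines):

```python
# Minimal sketch: read FAIR Signposting typed links from a landing page's
# HTTP Link header. The landing-page URL is a placeholder.
import requests

resp = requests.head("https://example.org/dataset/landing-page", allow_redirects=True)

# requests parses the Link header into a dict keyed by the rel value,
# e.g. "cite-as" (persistent identifier), "describedby" (metadata), "item" (content).
for rel, link in resp.links.items():
    print(rel, "->", link.get("url"))
```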

59 of 179

Harvard Dataverse through the data life cycle

(present and planned integrations)

60 of 179

Large data support at Harvard Dataverse

GB

TB

PB

Upload through Dataverse

Direct upload/download to S3

Globus Transfer to S3

Reference Data in Remote Stores (HTTP -> Globus)

Sensitive

Globus Transfer to File/Tape

Part of the Harvard Data Commons Project

61 of 179

Globus Endpoints

Dataverse

Server

Globus Service

Researcher’s Browser

Dataverse Dataset

Transfer In/Out or Reference

launch

reliable parallel transfer

Dataverse- Globus Transfer App

Managed

Globus Endpoint

(e.g. over tape storage)

Globus Store(s)

S3 Store

File Store

Remote Store

manage ACLs

launch

notify

request transfer

monitor transfers

62 of 179

How to compute on data stored at Harvard Dataverse?

Previewers and AI tools to interrogate data are integrated into the Dataverse UI but work mostly for small data.

So far, large data must be downloaded over the net in order to enable computing.
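For orientation, a client can at least inspect a dataset's file list before committing to a large download; a hedged sketch using the Dataverse native API (the DOI below is a placeholder):

```python
# Sketch: list the files in a Harvard Dataverse dataset via the native API
# before downloading anything. The DOI below is a placeholder.
import requests

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/EXAMPLE"   # placeholder persistent identifier

meta = requests.get(f"{BASE}/api/datasets/:persistentId/",
                    params={"persistentId": DOI}).json()

for f in meta["data"]["latestVersion"]["files"]:
    df = f["dataFile"]
    print(df["filename"], df.get("filesize"), "bytes", "-> id", df["id"])
    # a single file could then be fetched from f"{BASE}/api/access/datafile/{df['id']}"
```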

63 of 179

Dataverse supports Globus in different ways:

  • Managed Globus file store (Globus endpoint: File/Tape) – DV controls access: True; Globus transfer to/from: True; Ingest/previews/HTTP download: False
  • Managed Globus S3 store (Globus endpoint: S3 connector) – DV controls access: True; Globus transfer to/from: True; Ingest/previews/HTTP download: True
  • Remote Globus store (Globus endpoint: any trusted) – DV controls access: False; Globus transfer to/from: N/A (reference only); Ingest/previews/HTTP download: HTTP, possibly at the remote endpoint

64 of 179

The MOC version of Harvard Dataverse (PoC)

65 of 179

The MOC version of Harvard Dataverse (PoC)

happy Harvard Dataverse user

(only NERC PIs will be able to run compute)

tape

disk

66 of 179

The MOC version of Harvard Dataverse (PoC)

67 of 179

Dataverse’s Globus application (Disk/Tape)

68 of 179

How Dataverse manages Globus transfer

  • The “Upload with Globus” button is available on the Upload Files panel after the dataset has been created

  • Transfer is done in the Dataverse Globus application
    • Handles user login to Globus, file selection, and initiating the transfer

    • Coordinates with Dataverse to handle access control and for Dataverse to monitor the transfer/update the dataset

  • Dataverse Globus Transfer API
    • Used by the Dataverse-Globus app
    • Available for use by other tools

69 of 179

Globus - Dataverse Transfer Tool

Globus Directory Connection

Dataverse transfer space

  1. Authenticate to Globus
  2. Connect to Globus storage
  3. Select files for transfer
  4. Submit the transfer to a dataset in Dataverse.

70 of 179

Notification mechanism

71 of 179

Once the process is complete, the data is published

NESE tape storage

Valid DOI

72 of 179

Python notebook

Valid DOI

Computing on data

73 of 179

This will spin up the JupyterLab container with the pre-loaded notebook taken from the dataset.

All files in this collection are seen as local to the Jupyter instance. Python will simply load them into memory for computing purposes.
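A hedged sketch of what that looks like from inside the notebook (the file name is hypothetical): because the dataset's files are exposed as local paths, plain pandas calls are all that is needed.

```python
# Inside the JupyterLab container the dataset's files appear as local paths,
# so ordinary pandas I/O is enough. The file name is hypothetical.
import pandas as pd

df = pd.read_csv("events.csv")      # a tabular file from the Dataverse dataset
print(df.describe())                # ...and normal in-memory analysis follows
```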

74 of 179

NERC endpoint for the containerized storage (which exists on NESE)

75 of 179

Automatic mapping of local file names (local to the python notebook) to Harvard Dataverse file pointers on NESE

76 of 179

Then some nice computation happens

77 of 179

Towards AI integration

This chatbot only sees the tabular data but is clueless about the metadata

78 of 179

Tell me what is this data about

Cool but poor

This chatbot only sees the tabular data but is clueless about the metadata

79 of 179

tell me the range of latitudes and longitudes with the highest number of events

ok-ish

This chatbot only sees the tabular data but is clueless about the metadata

80 of 179

map the range of latitudes and longitudes with the highest number of events to the names of countries

LLM kicks in

This chatbot only sees the tabular data but is clueless about the metadata

81 of 179

Traditional Dataverse search based on Solr (Apache Lucene)

(Keywords) Query: “covid cases in Italy”

82 of 179

Search via embeddings

(NLP) Query: “datasets about covid cases in Italy”

83 of 179

Search via embeddings

(NLP) Query: “datasets su casi di covid in Italia” (“datasets about covid cases in Italy”) [LLM kicks in]
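A minimal sketch of the embedding-based search idea (assuming the sentence-transformers package and a multilingual model; the dataset descriptions are made up, and this is not the production Dataverse indexer). Both the English and the Italian query land near the same dataset because they embed close to its description:

```python
# Toy sketch of search via embeddings. Assumes the sentence-transformers
# package; the model choice and descriptions are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

descriptions = [                      # made-up dataset descriptions
    "Daily COVID-19 cases reported by Italian regions, 2020-2022",
    "Air pollution measurements for US zip codes",
]
doc_emb = model.encode(descriptions, convert_to_tensor=True)

for query in ["datasets about covid cases in Italy",
              "datasets su casi di covid in Italia"]:
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]
    best = int(scores.argmax())
    print(query, "->", descriptions[best])
```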

84 of 179

Roadmap

Challenge: Long term sustainability

  • add support for R language
  • add support for configuring Jupyter VM CPU/GPU/RAM to suit different workflows
  • from “NERC PIs only” to more general user base ($$$)
  • establish a billing mechanism for compute
  • move MOC Dataverse from OpenStack to OpenShift
  • moving to production
  • HDV is essentially cost-free for all academic users, but we are going to build a business model to sustain long term costs (e.g. charging for compute, PT of storage, etc)

  • MOC (NERC/NESE) will be the key to sustainability for Dataverse, both technologically and administratively (e.g. managing external users).

85 of 179

Thanks to:

Any Questions?

L. Andreev, P. Durbin, C. Boyd, G. Durand, S. Barbosa (IQSS-Dataverse), J. Myers (GDCC), O. Bertuch (FZJ)

S. Yockel, M. Munakami, M. Kupcevic, M. Shad, F. Pontiggia (NESE, NERC, HUIT)

D. Shi (Red Hat), O. Krieger (BU)

Harvard Dataverse

Dataverse Project

86 of 179

2024 MOC Alliance Workshop

Break and Networking Time

87 of 179

Norwegian Research Commons and Implications for the MOC

Rory Macneil

Founder and CEO, Research Space

88 of 179

The Norwegian Research Commons

as a model for a 

NERC Research Commons

Rory Macneil, Research Space

MOC Alliance Workshop

February 28, 2024 – Boston

89 of 179

Overview

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

90 of 179

Overview

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

91 of 179

Context

92 of 179

Problem

Lack of data-centric tools

Lack of interoperability between tools

Impeding FAIR

Creating friction for researchers

Siloed data

93 of 179

Solution? Research Commons

”Bring together data with cloud computing infrastructure and commonly used software, services and applications for managing, analyzing and sharing data to create an interoperable resource for a research community”

Scott Yockel, Harvard University

Towards a Data Commons at Harvard

94 of 179

Who will provide Research Commons?

  • Universities

  • Harvard

  • National organizations

  • Canada’s Digital Research Alliance

  • Supra national organizations

  • EOSC: EUDAT Collaborative Data Infrastructure

95 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

96 of 179

Global Open Research Commons International Model

  • Developed by Research Data Alliance working group
  • Released November 2023
  • Model/guideline for Research Commons
  • Three core technical elements and surrounding process/governance wrap
  • Many countries are looking at adopting – Canada, Sweden, Netherlands, Germany, UK, etc.

97 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

98 of 179

Reason: Technical core

  • Core elements are Dataverse, iRODS and RSpace, which are already integrated
  • Includes integration of these and other resources with storage and compute infrastructure
  • Focus of proposed work is on further enhancing existing interoperability between tools
  • Objective: integrated infrastructure that facilitates data and metadata flow throughout the research data lifecycle

99 of 179

”Bring together data with cloud computing infrastructure and commonly used software, services and applications for managing, analyzing and sharing data to create an interoperable resource for a research community”

Scott Yockel, Harvard University

100 of 179

REASON as a model for other Research Commons

  • Comprehensive use of GORC elements
  • Detailed instantiation of Services and Tools Element of the GORC model
  • Built around a group of complementary generalist tools, all designed to enhance FAIRification of data
  • Focus on interoperability of tools, data and metadata
  • Encompasses need to interoperate with existing generalist and domain specific research infrastructure

101 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

102 of 179

The NERC as a

Research Commons

103 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

104 of 179

Dataverse

  • Generalist data repository
  • Harvard + Community
  • Part of GREI
  • Integration with compute – Dataverse work @Harvard

105 of 179

RSpace: Data-centric digital research platform designed to interoperate with and connect research infrastructure

106 of 179

RSpace – Dataverse Integration

107 of 179

NERC Research Commons starting with Dataverse and

RSpace

108 of 179

Dataverse and RSpace as initial core of NERC Research Commons

Active data management + repository / Ecosystem of connected tools / Integration with storage / Integration with compute

109 of 179

RSpace is now being deployed on the NERC!

110 of 179

  • Research Commons: context and concept
  • Global Open Research Commons International Model
  • REASON: Proposed Norwegian Research Commons
  • NERC as a Research Commons
  • Dataverse and RSpace as initial core elements of a NERC Research Commons
  • Federation possibilities

111 of 179

NERC Research Commons as trigger for federation with other Open Clouds and Research Commons

112 of 179

Resources and contact

Research Commons

GORC International Model

  • https://zenodo.org/records/10694490

REASON

  • https://zenodo.org/records/10410202

113 of 179

Institutional example 1: UCL

Connectivity with UCL infrastructure

114 of 179

Institutional example 1: UCL

Powering a FAIR research

data/metadata flow at the institutional level

115 of 179

Institutional example 2: Harvard

116 of 179

National example 1: European Open Science Cloud

EUDAT Collaborative Data Infrastructure

117 of 179

National example 2: Canada

Digital Research Alliance Research Commons

118 of 179

Stage 2: Export data from RSpace to iRODS

  • Suggested by SURF

  • What: Push data from RSpace to iRODS

  • How: Use existing RSpace data export mechanism and interface

  • Why: Enable management of RSpace data in iRODS and association of RSpace data with other data managed in iRODS

  • Benefits: Enhanced FAIRification of data from RSpace and other data in iRODS

119 of 179

Federated Systems

We have seen siloed systems – connected through a central hub or portal, but with data and processes perhaps in walled gardens in proprietary formats or held behind subscription-based services.

This type of system requires continued budget for the subscriptions as well as vigilance to make sure your data is portable.

To plan for truly federated services requires more time and energy, but can result in a more robust ecosystem of interoperable pieces, resilient to shifting budgets and the consistent changing of underlying technologies.

Diplomacy (and interoperability) between sovereign systems is a more mature, slow, iterative process. It is how infrastructure should behave. It is best suited to be powered by open source solutions and well-understood formats and protocols.

120 of 179

Thank you

Any Questions?

121 of 179

Two Sigma Memento: Why Good Artifact Naming Matters

Mark Roth

Managing Director, Data Engineering, Two Sigma Investments, LP

122 of 179

Important Legal Information

This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity.

The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

Disclaimer

123 of 179

“Real names tell you the story of the things they belong to in my language, in the Old Entish as you might say”

  • J.R.R. Tolkien, The Two Towers

Image credit: https://www.etsy.com/listing/715203686/a4-treebeard-poster-lord-of-the-rings

124 of 179

What Can Memento Do For Me?

  • Better Provenance
    • Maintain and communicate accurate provenance
  • Better Reproducibility
    • Make research more reproducible and enable continuing research
  • Better Ergonomics
    • Organize research notebooks and name intermediate artifacts
    • Reduce cost and research time by never repeating the same computation twice

On some level, these are all about naming!

125 of 179

Naming is hard

126 of 179

  1. cache invalidation
  2. naming things
  3. off-by-one errors

Credit: Phil Karlton, Leon Bambrick, https://martinfowler.com/bliki/TwoHardThings.html

There are only two hard things in Computer Science:

Memento helps with 2 of 3!

127 of 179

Hypothesis

Data Needed

Zip codes in NYC with higher median incomes show higher Citi Bike usage

  1. Citi Bike Usage by ZIP
  2. Median Income by ZIP

128 of 179

Final Data

Local Store

Questions

  1. What are the original data sources?
  2. What methodology (code) was used to process this?
  3. Where are the intermediate artifacts?
  4. Can I reproduce this data?
  5. How do I generate data for 2021?

2020.csv

date, zip, income, usage

2020-02-01, 10013, 150675, 113
2020-02-01, 10014, 147267, 107
...

129 of 179

Step 1. Ingest the Data

Runtime Env

?

?

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

?

?

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

2020MM-citibike-tripdata.csv.zip

2020_

irs_gov_statistics_wealth.xlsx

step1_ingest_data(yyyymm)

ingest_s3(url, pathRegEx)

ingest_citibike(year, month)

A

B

C

130 of 179

Step 2. Normalize the Data

Runtime Env

?

?

Local Store

Local Store

2020MM-citibike-tripdata.csv.zip

2020_irs_gov_statistics_wealth.xlsx

?

?

normalize_citibike(year)

normalize_wealth(year)

citibike-2020.csv

wealth-2020.csv

step2_citibike_csv(yyyy)

merge_zip_to_csv(src, dest, **params)

normalize_citibike(year)

B

C

A

131 of 179

Step 3. Join the Data

Runtime Env

join_citibike_wealth(year, tolerance)

Local Store

Local Store

citibike-2020.csv

wealth-2020.csv

2020.csv

132 of 179

The Whole Pipeline

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

2020_

irs_gov_statistics_wealth.xlsx

2020MM-citibike-tripdata.csv.zip

Runtime Env

normalize_citibike(year)

normalize_wealth(year)

Local Store

citibike-2020.csv

wealth-2020.csv

Runtime Env

join_citibike_wealth(year, tolerance)

2020.csv

Local Store

133 of 179

Change Management

Runtime Env

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

2020_

irs_gov_statistics_wealth.xlsx

2020MM-citibike-tripdata.csv.zip

normalize_citibike(year)

normalize_wealth(year)

Local Store

citibike-2020.csv

wealth-2020.csv

join_citibike_wealth(year, tolerance)

2020.csv

Local Store

New Data!

New Tolerance

Fix Normalization

New file format!

New CPU Architecture

134 of 179

Better Naming (Old Entish approach)

2020.csv

cpu_intel-env-gpu_nvidia-env-pandas_2.2.0-env-amazon_s3_tripdata-20230101-asof-2020-1-ingest_citibike-2020-normalize_citibike-irs_gov_statistics-20230101-asof-2020-ingest_irs_gov_statistics-2020-normalize_weath_v2-2020-0.7-join_citibike_wealth.csv

135 of 179

The Ents could have saved themselves a lot of time if they knew about hashing.

136 of 179

Better Naming (Memento Approach)

2020.csv

+

join_citibike_wealth(2020, 0.7)

#99872fbb
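A toy sketch of why the short hash can replace the Old Entish file name (this is illustrative only, not the actual Memento hashing scheme): the hash is computed over the function, its parameters, the environment, and the hashes of everything upstream, so any change anywhere in the lineage yields a new name.

```python
# Toy illustration: a short invocation hash stands in for the full
# lineage-encoding name because it is derived from the function, its parameters,
# the environment, and the hashes of its inputs. Not the actual Memento scheme.
import hashlib, json

def invocation_hash(func_name, func_version, params, input_hashes, env):
    spec = json.dumps([func_name, func_version, params, sorted(input_hashes), env],
                      sort_keys=True)
    return hashlib.sha256(spec.encode()).hexdigest()[:8]

citibike = invocation_hash("normalize_citibike", "v1", {"year": 2020}, ["raw-a1b2"], "py3.11")
wealth   = invocation_hash("normalize_wealth",   "v2", {"year": 2020}, ["raw-c3d4"], "py3.11")
joined   = invocation_hash("join_citibike_wealth", "v1",
                           {"year": 2020, "tolerance": 0.7}, [citibike, wealth], "py3.11")
print(joined)   # a '99872fbb'-style short name standing in for 2020.csv
```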

137 of 179

Memento Approach

1. Functions, not artifacts

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Local Store

Amazon S3 tripdata

irs.gov/statistics

2020_

irs_gov_statistics_wealth.xlsx

2020MM-citibike-tripdata.csv.zip

Runtime Env

normalize_citibike(year)

normalize_wealth(year)

Local Store

citibike-2020.csv

wealth-2020.csv

Runtime Env

join_citibike_wealth(year, tolerance)

2020.csv

Local Store

138 of 179

Memento Approach

1. Functions, not artifacts

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(year)

normalize_wealth(year)

Runtime Env

join_citibike_wealth(year, tolerance)

139 of 179

Memento Approach

2. Durably store ingested data

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(year)

normalize_wealth(year)

Runtime Env

join_citibike_wealth(year, tolerance)

Durable Storage

140 of 179

Memento Approach

3. Hash all code and environments

Runtime Env

Runtime Env

ingest_citibike(year, month)

ingest_irs_gov_

statistics(year)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(year)

normalize_wealth(year)

Runtime Env

join_citibike_wealth(year, tolerance)

Durable Storage

#af023329

#2276ea01

#297bba2f

#17fbccd4

#1288fe63

#1288fe63

#1288fe63

normalize_citibike(year)

#73612fea

join_citibike_wealth(year, tolerance)

#0039dbbf

141 of 179

Memento Approach

4. Hash, memoize all invocations

Runtime Env

Runtime Env

ingest_citibike(2020, 1)

ingest_irs_gov_

statistics(2020)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(2020)

normalize_wealth(2020)

Runtime Env

join_citibike_wealth(2020, 0.7)

Durable Storage

#af023329

#2276ea01

#297bba2f

#17fbccd4

#99872fbb

#1288fe63

#1288fe63

#1288fe63

join_citibike_wealth(2020, 0.7)

#e8817aec

#1b726dda

#6239bb12

#1997ffb2

#200776dd

142 of 179

Memento Approach

5. Record “mementos”

Runtime Env

Runtime Env

ingest_citibike(2020, 1)

ingest_irs_gov_

statistics(2020)

External Source

Amazon S3 tripdata

irs.gov/statistics

normalize_citibike(2020)

normalize_wealth(2020)

Runtime Env

join_citibike_wealth(2020, 0.7)

Durable Storage

#af023329

#2276ea01

#297bba2f

#17fbccd4

#99872fbb

#1288fe63

#1288fe63

#1288fe63

join_citibike_wealth(2020, 0.7)

#e8817aec

#1b726dda

#6239bb12

#1997ffb2

#200776dd

143 of 179

What’s in a Memento?

Standard Memento Metadata

  • public key of publisher
  • time of invocation
  • total runtime
  • running user
  • observed lineage
  • optional extensions

Inputs

  • function version
  • serialized parameter map
  • dependency set
  • runtime env spec
  • invocation hash

Outputs

  • execution metadata
  • result hash

Signature

Chain

Signature
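Restating the fields above as a data structure (a sketch only; field names paraphrase the slide and are not the actual Memento schema):

```python
# Sketch of a memento record mirroring the fields listed above.
# Field names paraphrase the slide; this is not the actual Memento schema.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class MementoRecord:
    # standard memento metadata
    publisher_public_key: str
    invocation_time: str
    total_runtime_s: float
    running_user: str
    observed_lineage: List[str]            # hashes of upstream mementos
    extensions: Dict[str, Any] = field(default_factory=dict)
    # inputs
    function_version: str = ""
    parameter_map: Dict[str, Any] = field(default_factory=dict)
    dependency_set: List[str] = field(default_factory=list)
    runtime_env_spec: str = ""
    invocation_hash: str = ""
    # outputs
    execution_metadata: Dict[str, Any] = field(default_factory=dict)
    result_hash: str = ""
    # signature chain
    signature: Optional[str] = None
```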

144 of 179

What Can Memento Do For Me?

  • Better Provenance
    • Maintain and communicate accurate provenance
  • Better Reproducibility
    • Make research more reproducible and enable continuing research
  • Better Ergonomics
    • Organize research notebooks and name intermediate artifacts
    • Reduce cost and research time by never repeating the same computation twice

145 of 179

Provenance�is about naming

146 of 179

Why is Provenance Hard?

To understand provenance accurately, we want to know:

  • What raw data was used in the computation?
  • At what time was that raw data ingested?
  • Which parties were involved in ingesting and transforming the data?
  • What was the exact methodology used to transform the data?
  • What was the runtime environment used at each stage?
  • How has the provenance changed over time?

147 of 179

Provenance in Memento

  • Which raw data and ingestion methodology accurately named
  • Time of ingestion accurately named in Memento
  • Running user accurately named in Memento, with signature
  • Methodology accurately named (versioned via hashes)
  • Runtime environment accurately named (hashed)
  • As provenance changes, new mementos are recorded for posterity

Mementos provide high fidelity end-to-end provenance

Most frameworks can be adapted to use Memento as a standard

148 of 179

Reproducibility�is about naming

149 of 179

Why is Reproducibility Hard?

To reproduce computation, all of these must be accurately named (identified, recorded, versioned) in a way they can be reconstructed:

  • Exact snapshots of all external data used
  • Exact methodology (source code)
  • All configuration and parameters provided
  • Runtime environment:
    • Hardware (e.g. CPU architecture)
    • Operating System
    • Installed libraries
  • Use of non-deterministic libraries or hardware (e.g. GPUs)�(not handled by Memento)

150 of 179

Reproducibility in Memento

  • Raw Input Snapshots accurately named and kept in durable storage
  • Methodology accurately named (versioned via hashes)
  • All configuration and parameters accurately named (hashed)
  • Runtime environment accurately named (hashed)

Mementos record everything we need to know to reproduce research

The results are signed and can be validated by 3rd parties

151 of 179

Ergonomic�(and Economic)�Advantages

152 of 179

Organization Advantages

  • More efficient use of researcher time
  • Research notebooks and code become easier to organize
    • Just relax and write functions
    • Memento takes care of serialization, storage
  • Naming becomes easier
    • Use simple but descriptive names for functions
    • Memento tracks dependencies, versions, etc.
  • Automatic memoization
    • When notebook kernel restarts, can easily restart from where you left off

153 of 179

Caching Advantages

Cache needs to be invalidated when:

  • Code changes
  • Parameters change
  • Dependencies change
  • Relevant runtime environment changes (e.g. library upgrades)
  • New ingestion data is used

Surprise: Memento already tracks all of these in the invocation hash, making it possible to do high-fidelity automatic caching!

Caches can even be shared with other team members!

154 of 179

Caching in Memento

Memento

Metadata

Result Hash

Memoized Data

Result Hash

Serialized Results

Function( )

arguments

Results

Invocation Hash

Parameter Map (hash)

Function Reference (hash)

Output Cache (result hash → result)

Memento

Federated

Catalog

(invocation hash → Memento)
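Reading the diagram above as code: a call first maps its invocation hash to a memento, then the memento's result hash to the serialized result. A toy sketch of that two-level lookup (not the twosigma.memento API):

```python
# Toy two-level lookup mirroring the diagram: invocation hash -> memento,
# then result hash -> serialized result. Not the twosigma.memento API.
import hashlib, json, pickle

memento_catalog = {}   # invocation hash -> memento (metadata + result hash)
output_cache = {}      # result hash -> serialized result

def cached_call(func, func_version, env, **params):
    inv = hashlib.sha256(json.dumps(
        [func.__name__, func_version, env, params], sort_keys=True).encode()).hexdigest()
    if inv in memento_catalog:                          # cache hit: no recompute
        return pickle.loads(output_cache[memento_catalog[inv]["result_hash"]])
    result = func(**params)
    blob = pickle.dumps(result)
    rh = hashlib.sha256(blob).hexdigest()
    output_cache[rh] = blob
    memento_catalog[inv] = {"result_hash": rh, "function": func.__name__}
    return result

def join_citibike_wealth(year, tolerance):              # stand-in for the real function
    return {"year": year, "tolerance": tolerance}

print(cached_call(join_citibike_wealth, "v1", "py3.11", year=2020, tolerance=0.7))
print(cached_call(join_citibike_wealth, "v1", "py3.11", year=2020, tolerance=0.7))  # served from cache
```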

155 of 179

Memento:�Current State�and Next Steps

156 of 179

Memento Open Source

Two Sigma has decided to open source a version of Memento, in order to help encourage further research

  • https://github.com/twosigma/memento
  • Features:
    • Optimized for Python and works with Pandas
    • Works well in Jupyter notebook environments (and elsewhere)
  • Memento whitepaper coming soon

pip install twosigma.memento

157 of 179

Future Work

  • Catch up Python reference implementation to ideas in white paper
    • Durable storage for ingestion
    • Context parameters
    • Signature chains
  • Distributed compute and storage plugins for major cloud providers
  • Mementos for Data Warehouses
  • Cache replacement policies, garbage collection
  • More Language Bindings (Java, Julia, Rust, …)
  • Memento Server
    • Monitoring, Metrics and Management
    • Containerization of Memento Functions
    • Federated Web of Data

158 of 179

Thank you

Any Questions?

159 of 179

D4N: A Community Cache for an Open Cloud

Matt Benjamin, Engineering Manager at IBM Storage

Amin Mosayyebzadeh, PhD Candidate at Boston University

160 of 179

Brief History of IBM Ceph Object / MOC Collaboration

  • D3N (2018+)
    • Emine Ugur Kaynar, Mania Abdi, Mohammad H. HajKazemi
    • L1/L2 performance evaluation (Red Hat, Ben England)
  • D4N
    • Emine U. Kaynar, Amin Mosayyebzadeh, Mania Abdi
    • Design Collaboration (Red Hat Ceph teams, esp. Object)
  • D4N Upstreaming
    • Amin Mosayyebzadeh, Sumatra Dhimoyee
    • Pritha Srivastava, Samarah Uriarte [et al]
  • D4N + Locality
    • Amin Mosayyebzadeh, Sumatra Dhimoyee, Austin Jamias

161 of 179

Alignment around Fusion of Data and Compute

  • MOC Researchers
    • D3/D4N
      • intelligent, intermediate S3 materialized cache (highlights: Kariz, e.g. integration w/workload planning)
      • general workload acceleration
      • exploratory writeback cache (e.g., ephemeral data, object packing)

  • Ceph Object Engineering (Red Hat, now IBM Storage)
    • focus on analytics workloads, AI
      • S3A (and SwiftA, Wal*Mart)
      • S3-select
      • New Work
        • Arrow Flight
        • FlightSql

162 of 179

D4N Upstreaming Effort

  • D4N cache design based on Zipper API, separated into:
    • Filter driver
    • Policy driver
    • Cache backend driver
  • Support for multiple cache backends: currently SSD-backed and Redis-backed; other backends can be added
  • Flexible replacement policy
    • LFUDA and LRU policies are currently implemented; other policies can be added
  • Boost::ASIO for IO operations
  • Redis server for caching location information, hot objects, transactional state, message bus
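For readers unfamiliar with the LFUDA policy named above (Least Frequently Used with Dynamic Aging), here is a toy sketch of the policy itself; it is illustrative only and not the D4N/RGW implementation.

```python
# Toy LFUDA (Least Frequently Used with Dynamic Aging) cache, illustrative only;
# not the D4N/RGW implementation. Each object's key is frequency + cache age L,
# and L is bumped to the evicted key so long-resident objects eventually age out.
class LFUDACache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.age = 0.0                 # L, the dynamic aging factor
        self.entries = {}              # name -> {"value", "freq", "key"}

    def get(self, name):
        e = self.entries.get(name)
        if e is None:
            return None
        e["freq"] += 1
        e["key"] = e["freq"] + self.age
        return e["value"]

    def put(self, name, value):
        if name not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda n: self.entries[n]["key"])
            self.age = self.entries[victim]["key"]   # dynamic aging step
            del self.entries[victim]
        self.entries[name] = {"value": value, "freq": 1, "key": 1 + self.age}

cache = LFUDACache(capacity=2)
cache.put("a", "block-a"); cache.get("a")
cache.put("b", "block-b")
cache.put("c", "block-c")              # evicts the least valuable of a/b
print(sorted(cache.entries))
```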

163 of 179

D4N Integration (Zipper Filter)

164 of 179

D4N Upstreaming Effort

  • Current Status
    • approaching upstream single-node MVP
      • supports a read cache on a single node, using the SSD-backed cache backend and the LFUDA replacement policy. Blocks of data are cached; the head object is not cached yet, but support for caching it is in place and will be used to cache head objects in the writeback cache.
  • Future work
    • D4N distributed read cache (from research version)
    • Resilient write-back cache (c.f.)
  • Main challenges recently
    • ASIO related bugs
    • bugs in the new Boost::Redis client (a new ASIO-based C++ client driver)

165 of 179

D4N with Higher Locality

RGW

RGW

RW cache

RW cache

Directory

RGW

RGW

Read

cache

Read

cache

Directory

Write back cache

166 of 179

Distributed D4N

  • Connecting D4N instances across data center
  • The vision is sharing
    • S3
    • NVMe-Over-TCP
    • RDMA

167 of 179

D4N and K8s

  • Want pods in any K8s cluster to automatically make requests to the nearest RGW
  • K8s clusters are ephemeral and elastic
  • Use K8s Topology Aware Routing to find the nearest RGW


168 of 179

D4N Use Cases

  • Analytical frameworks
  • ML/AI
    • Significant intermediate data (writes) with a high reuse rate (reads)
    • Snapshotting the progress stages

169 of 179

BDAS: Big Data Analytic Support

(Ceph Object Team Code Name)

  • S3-select
    • Trino, Presto, Spark
    • Complete
      • supports:
        • CSV, Parquet, JSON back-ends
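A hedged client-side sketch of what an S3-select query looks like against an RGW S3 endpoint, using boto3 (the endpoint URL, credentials, bucket, key, and column names are placeholders, not from the talk):

```python
# Sketch: run an S3 Select query against a CSV object stored behind RGW.
# Endpoint, credentials, bucket/key, and column names are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://rgw.example.org",
                  aws_access_key_id="...", aws_secret_access_key="...")

resp = s3.select_object_content(
    Bucket="datasets",
    Key="citibike-2020.csv",
    ExpressionType="SQL",
    Expression="SELECT s.zip, s.usage FROM s3object s WHERE CAST(s.usage AS INT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:            # the response is an event stream
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
```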

170 of 179

BDAS: Big Data Analytic Support

(Ceph Object Team Code Name)

  • Arrow / Arrow Flight
    • Apache Arrow
      • a columnar data format that is optimized for in-memory data and enables efficient, zero-copy computation.
      • The Arrow format and the Parquet format, which is optimized for storage, are both columnar and easily converted to each other.
    • Arrow Flight
      • a high-performance distributed data transfer protocol that moves data in Arrow, Parquet, and other formats without needing to (de-)serialize. Furthermore, it can access data across multiple sources in parallel and transfer data that cannot all fit in memory in batches to be processed.
    • Because both technologies are platform- and language-independent, they offer a standard for interoperability.
    • RGW is being enhanced to act as an Arrow Flight server to feed data to Arrow clients and store intermediate and final results after computation.
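From the client side, consuming such a Flight server could look like the sketch below, using pyarrow's Flight client (the endpoint URI and ticket contents are assumptions, since the RGW Flight server described above is still in development):

```python
# Sketch of an Arrow Flight client pulling a dataset as Arrow record batches.
# The endpoint URI and ticket contents are assumptions; the RGW Flight server
# mentioned above is still in development.
import pyarrow.flight as flight

client = flight.connect("grpc://rgw.example.org:8815")

# Ask the server for a dataset; what goes in the ticket is server-defined.
reader = client.do_get(flight.Ticket(b"datasets/citibike-2020"))

table = reader.read_all()        # Arrow table, no row-by-row deserialization
print(table.num_rows, table.schema)
```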

171 of 179

BDAS: Big Data Analytic Support

(Ceph Object Team Code Name)

  • Arrow / Arrow Flight
    • Computation in WB Cache
    • New Query Modes
      • With additional knowledge of the internal structure of data in Arrow and Parquet formats, RGW can optimize the layout and striping of data to speed up retrieval of specific columns and rows and to respond to Flight SQL queries.

172 of 179

MOC Collaboration Futures

  • IBM and Ceph teams
    • Resilient caching in the RGW/S3 tier is critical to in-progress and planned efforts to combine predicate pushdown with intermediate caching (of data and results)
    • Redis-based directory has formed the basis for more robust state sharing among cooperating RGWs, even in non-caching scenarios

  • MOC
    • K8s integration work facilitates convergence with OpenShift (Red Hat) and Fusion HCI (IBM) workloads
    • Could form the foundation of new caching-enabled S3 storage offerings to Open Cloud consumers

173 of 179

Credits

PhD Students

  • Amin Mosayyebzadeh
  • Sumatra Dhimoyee

Past

  • Mania Abdi
  • Mohammad Hossein Hajkazemi
  • Emine Ugur Kaynar
  • Raja R. Sambasivan
  • Ata Turk

Advisors

  • Brett Niver (Red Hat, IBM Storage)
  • Kyle Bader (Red Hat, IBM Storage)

Past

  • David Cohen (Intel)
  • Larry Rudolph (Two Sigma)

Research Professors

  • Orran Krieger, Boston University
  • Peter Desnoyers, Northeastern University

IBM Ceph Team

  • Casey Bodley (Upstream lead, ASIO)
  • Adam Emerson (Async)
  • Daniel Gryniewicz (Ceph Zipper)
  • Eric Ivancich (BDAS)
  • Ali Maredia (D4N)
  • Gal Salomon (S3-Select)
  • Pritha Srivastava (Cache Integration)
  • Samarah Uriarte (Cache Int., Redis Int)

174 of 179

Thank you

Any Questions?

175 of 179

Michael Daitzman

Director of Product Development,

Mass Open Cloud Alliance

176 of 179

United in the Cloud

177 of 179

The MOC Alliance Team

Emmanuel Cecchet

Software Engineer

Organization: UMass

Projects: OCT, ESI

Harvard University:

Nick Amento, Network Architect

Robin Weber, NESE

Quan Pham

Software Engineer

Organization: Boston University

Projects: NERC, Mass Open Cloud Alliance

James Culbert

Director of IT

Organization: MGHPCC

Projects: NERC, MGHPCC, NEFRC, OSN

Danni Shi

Senior Software Engineer

Organization: Red Hat

Projects: OPE, NERC

Tzu-Mainn Chen

Organization: Red Hat

Projects: ESI, NERC

Isaiah Stapleton

Software Engineer

Organization: Red Hat

Projects: OPE, NERC

Steve Heckman, BU

Surbi Kanthed, Red Hat

Dylan Stewart, Red Hat

178 of 179

Two More Things . . . .

  • OCT Advisory Board Meeting is in 315 - up the stairs in the back

  • Please join us on the 17th Floor of the Center for Computing and Data Sciences in room 1750 at 6:30pm

179 of 179

2024 MOC Alliance Workshop

Reception

6:30 - 9:00 pm

CDS 1750 (17th Floor)

665 Commonwealth Ave