2024 MOC Alliance Workshop
Lunch
The benefits of sharing and collaboration in the MOC
"If I have seen further it is by standing on the shoulders of Giants" Issac Newton
Larry Rudolph, Distinguished Fellow, MOC and Two Sigma, LP
Disclaimer
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Important Legal Information
Data (DB)
Compute
In the beginning there was
Is Compute the King, or
is Data the King? (Both)
Marriage ⇒ data pipelines
Data Pipeline
Imagine a world where
Before you can use data
first download it or make a copy in the cloud
did you get enough storage (can you afford it?)
did you get the latest version
clean and normalize the data yourself
think about checkpoints, saved intermediate results
make sure there is enough storage or available servers / GPUs
re-execute everything for each modification of parameter choice
Repeat until you are exhausted …
For many of us, the world is even messier than this
Imagine a world in which
immediately leverage the related work
every time you run a program,
it supports the easy sharing of data and code
it is reproducible
it automatically recovers from failures
it runs more efficiently
the cost of producing may be offset by those consuming
the marriage is sustainable
Supports community culture
and natural pedagogy
Consider developing AI models
Depict a series of gears and levers transforming raw data into neatly organized and standardized inputs. Describe each gear representing a different preprocessing step like tokenization, cleaning, and normalization, emphasizing the transformation process.
This is the Google Gemini prompt (as with all other pictures)
Where to download, how often, when to delete
Write it myself or find code made available by a kind soul or “buy” it
Each time a new error is found,
start from the beginning
Maybe I want to name each version, depending on versions of above steps
Step 1: Data Collection
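A minimal sketch of the versioned download-and-name step these bullets gesture at; the source URL handling, directory layout, and naming scheme are assumptions for illustration, not a prescribed MOC workflow:

```python
import datetime
import hashlib
import pathlib
import urllib.request

RAW_DIR = pathlib.Path("raw_data")  # hypothetical local staging area

def download_versioned(url: str) -> pathlib.Path:
    """Download `url` and store it under a name that records when it was fetched
    and what its contents were, so later steps can tell versions apart."""
    RAW_DIR.mkdir(exist_ok=True)
    payload = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(payload).hexdigest()[:8]   # content fingerprint
    stamp = datetime.date.today().isoformat()          # fetch date
    out = RAW_DIR / f"{pathlib.Path(url).name}.{stamp}.{digest}"
    out.write_bytes(payload)
    return out
```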
Consider developing AI models
For each source of data
Perform all the steps in data collection
Giving each one a different output name
Check for new versions
Make a picture of a scene where a conveyor belt is moving through a diverse landscape, collecting various sources of data such as books, articles, websites, and social media posts. Mention the variety and richness of the data being collected.
Each source may need account/password/api-key
Data collection steps may fail. Decide how much to re-run
Do I run this on many of the versions in the data collection? How do I remember what input was used for each output?
Step 2: Preprocessing
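A hedged sketch of the per-source loop described above: run the collection steps for each source, give each output a distinct name, and keep a manifest recording which input produced which output, so a failed source can be re-run without redoing the rest. The source list and manifest path are hypothetical:

```python
import json
import pathlib

SOURCES = {  # hypothetical sources, each possibly needing its own account/API key
    "citibike": "https://example.org/citibike-tripdata.csv.zip",
    "irs": "https://example.org/irs_gov_statistics_wealth.xlsx",
}
MANIFEST = pathlib.Path("collection_manifest.json")

def collect_all(download):
    """Run the collection step per source and remember the input -> output mapping."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    for name, url in SOURCES.items():
        try:
            out = download(url)  # e.g. the download_versioned sketch above
            manifest[name] = {"input": url, "output": str(out)}
        except Exception as err:  # one source failing should not lose the others
            manifest[name] = {"input": url, "error": str(err)}
    MANIFEST.write_text(json.dumps(manifest, indent=2))
    return manifest
```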
Consider developing AI models
Build a model in some code
Rent GPUs
Take all the preprocessed data
Break into batches
Feed them into the GPUs to train the model
Save the model in some other file
Draw a caricature of a commons field where data and computers are grazing, in the style of Boston Commons
Lots of decisions
Crashes or bugs mean either paying for GPUs while debugging and redoing all the earlier steps, or releasing and re-acquiring GPUs
Learn by trial and error
Cloud costs quickly add up
Step 3: Training
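A rough, framework-free sketch of the training step described above: break the preprocessed data into batches, feed them to a stand-in update rule, and checkpoint the model to a file so a crash does not force redoing earlier steps. A real job would use a GPU framework; everything here is illustrative:

```python
import pickle
import numpy as np

def train(preprocessed: np.ndarray, batch_size: int = 256, epochs: int = 3,
          checkpoint_path: str = "model.ckpt.pkl") -> dict:
    """Batch the data, run a toy update rule, and checkpoint after every epoch."""
    model = {"weights": np.zeros(preprocessed.shape[1])}  # stand-in for a real model
    for epoch in range(epochs):
        np.random.shuffle(preprocessed)
        for start in range(0, len(preprocessed), batch_size):
            batch = preprocessed[start:start + batch_size]
            model["weights"] += batch.mean(axis=0) * 1e-3  # placeholder "training"
        with open(checkpoint_path, "wb") as f:   # checkpoint so a crash resumes here,
            pickle.dump(model, f)                # not back at data collection
    return model
```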
You get the idea
A lot of steps in this meta pipeline (recursive pipelines)
Data pipelines are usually DAGs
with nodes and edges crossing domain boundaries (see the sketch below)
Not only humans in the loop, unknown humans
Upstream decisions affect unknown downstream users
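For concreteness, a tiny sketch of a pipeline expressed as a DAG using only the Python standard library; the step names are invented:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each node is a pipeline step; its value is the set of steps it depends on.
# In practice the nodes and edges may cross team and domain boundaries.
pipeline = {
    "collect":    set(),
    "preprocess": {"collect"},
    "train":      {"preprocess"},
    "evaluate":   {"train", "collect"},
}

# A scheduler (or a person) only needs a topological order to execute the DAG.
print(list(TopologicalSorter(pipeline).static_order()))
# -> ['collect', 'preprocess', 'train', 'evaluate']
```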
No Application is an Island
No Data is an Island
No Server is an island
Data, Pipeline, HW at Scale
Data represents “some capture of the real world” → complex system
Number and size of data sets @ scale → complex system
Servers, Network, Disks, Power, Heat @ scale → complex system
“feature of complex systems” → unanticipated emergent properties
Solution:
Share
Collaborate
Tenant’s view:
Pipeline (DAG part) is the application
Provider’s view:
The data center is the computer
Goal: Optimize like we do for apps running on a computer
Note
I am mixing two different, but similar ideas
Both are about more sharing and more collaboration
(marriage between couples?)
Imagine a world where there is
a data center with
plenty of storage, networking, and compute
ability for any server to access any data
probes measuring network congestion, server load, etc.
multi-level caching throughout
dynamic security partitions, usage accounting
anyone can try out new h/w
no single controlling commercial entity
profit is not the motive (helping the tenant is the goal)
Hmm, this already exists ==> a great platform for sharing
draw a caricature of a data center with storage, compute, and networking, with probes to measure congestion
Sharing Data
(Dataverse, Norwegian Research Commons)
FAIR (Findable, Accessible, Interoperable, Reusable)
i.e. data without compute is like a day without sunshine
No need for everyone to have their own copy of the data
data immutability w/ security
no need for everyone to pay for its storage copy (while the data center does de-duplication)
No need for everyone to redo data collection, pre-processing, …
saving time, energy
Access to lots of tools that run near the data
Draw a caricature of a commons field where data and computers are grazing, in the style of Boston Commons
Functional Semantic Types (advertisement)
Sharing DAG-Metadata ⇔ H/W State
Collect metadata provenance of data pipelines / workflows
Learn attributes to do better allocation, placement, execution
Learn which data or code is reused for attribution
Use metadata and runtime info for better object caching
Draw a caricature of the provenance of a data pipeline
Shared Caching
(community caching)
draw several caches with data and code
(Bringing Data Close to Compute, D4N)
Leveraging Shared Memory
Future? (research request)
A pool of disaggregated (DRAM) memory directly accessed by any server in the rack (e.g. Intel CXL v3)
Not great for current workloads (Google publication)
Might be great for shared data
Map dataframes and tables to active jobs running in the same rack (sketched below)
draw leveraging shared memory
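A speculative sketch of mapping a shared table into memory using Apache Arrow's IPC format; here /dev/shm stands in for a rack-level disaggregated memory pool, which is an assumption about future hardware rather than something available today:

```python
import pyarrow as pa

PATH = "/dev/shm/shared_table.arrow"  # stand-in for a CXL-backed shared region

table = pa.table({"zip": [10013, 10014], "usage": [113, 107]})

# Producer: write the table once in Arrow IPC file format.
with pa.OSFile(PATH, "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Consumers: map the same bytes and read them zero-copy; no per-job copy is made.
with pa.memory_map(PATH, "r") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.to_pydict())
```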
Reproducible
Memento (Two Sigma) & Semantic Cloud (Red Hat)
During data pipeline execution, the results of data transforms may be memoized
In the future, when the same transformation runs on the same data, with the same parameters, and in the same environment,
there is no need to re-execute; just return the memoized value (see the sketch below)
Sharing of transforms, intermediate values
draw a caricature of mementos
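A minimal sketch of the memoization idea, not the actual Memento or Semantic Cloud implementation: key the cache on the transformation's code, its arguments, and a crude stand-in for the environment:

```python
import hashlib
import json
import pickle
import sys

_MEMO = {}  # in practice a shared, durable store rather than a process-local dict

def memoized(transform):
    """Reuse a prior result when the same code, inputs, parameters, and
    environment have been seen before; otherwise run and remember."""
    def wrapper(*args, **kwargs):
        key = hashlib.sha256(json.dumps({
            "code": transform.__code__.co_code.hex(),  # the transformation itself
            "args": repr(args),
            "kwargs": repr(sorted(kwargs.items())),
            "env": sys.version,                        # stand-in for the environment
        }, sort_keys=True).encode()).hexdigest()
        if key not in _MEMO:
            _MEMO[key] = pickle.dumps(transform(*args, **kwargs))
        return pickle.loads(_MEMO[key])
    return wrapper

@memoized
def normalize(rows, scale=1.0):
    return [r * scale for r in rows]
```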
The vision is sharing
The place is the MOC-A
Production-level facility – appetite to share / save $$
Lots of data to share
Researchers want to share with
themselves while debugging
teammates while exploring
collaborators or future ones
all with right attribution, protection, cost sharing model
draw a caricature of a bunch of computers helping out another computer
Thank you!!
Now it’s time for the Data+Compute
engagement party
Thank you
Any Questions?
Emre Keskin, University Research Data Officer
Wayne Gilmore, Executive Director, Research Computing Services
Scott Yockel, University Research Computing Officer
The Building Blocks of Cloud = Research Enablement
HPC TO CLOUD
Why the shift in tech?
BEYOND TECH
Why not let researchers build everything?
RESEARCH COMPUTING
What is this service?
Dawn of the Research Facilitator - 2014 ACI-REF
“Facilitator” - makes an action or process easier
Researchers are faced with:
Advanced Cyberinfrastructure - Research and Educational Facilitation: Campus-Based Computational Research Support - NSF Award # 1341935
Research Computing (RC) is at the intersection of providing leading technical solutions & supporting researchers in the scholarly process of discovery and innovation
Co-learn
RC professionals co-learn unique domain specific problems alongside of researchers
Co-create
RC professionals co-create solutions (technical, architectural, pipeline, software, …) alongside of researchers
What is HPC and why is it important?
High-performance computing (or Supercomputing from the 90s).
Dating back to the Manhattan Project, solving Physics, Chemistry & Engineering problems has predominantly created data on-the-fly / at runtime.
In 1966, during Robert Mulliken’s Nobel Prize acceptance speech: “I would like to emphasize strongly my belief that the era of computing chemists, when hundreds if not thousands of chemists will go to the computing machine instead of the laboratory for increasingly many facets of information is already at hand.”
Example: Quantum Chemistry Ĥ Ψ = E Ψ
In 1926, physicist Erwin Schrödinger gave us a partial differential equation that describes how the quantum state of a physical system changes with time.
electron-proton
proton-proton
electron-electron
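For reference, in atomic units and with the nuclei held fixed, the electronic-structure Hamiltonian behind Ĥ Ψ = E Ψ contains exactly the interaction terms listed above, plus the electrons' kinetic energy:

```latex
\hat{H}\,\Psi = E\,\Psi,
\qquad
\hat{H} \;=\; -\sum_i \tfrac{1}{2}\nabla_i^2
\;\underbrace{-\sum_{i,A}\frac{Z_A}{r_{iA}}}_{\text{electron--proton}}
\;\underbrace{+\sum_{i<j}\frac{1}{r_{ij}}}_{\text{electron--electron}}
\;\underbrace{+\sum_{A<B}\frac{Z_A Z_B}{R_{AB}}}_{\text{proton--proton}}
```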
Example: Quantum Chemistry Ĥ Ψ = E Ψ
How do electronic structure programs represent the wavefunction Ψ ?
Example: Quantum Chemistry Ĥ Ψ = E Ψ
How do electronic structure programs represent the wavefunction Ψ ?
10⁴⁵ combinations!!!
2013 BU HPC (11M CPU Hours)
968 researchers / 34 departments
107 students / 4 courses / 0 trainings
2023 BU HPC (90M CPU Hours)
3,136 researchers / 95 departments
2,134 students / 53 courses / 60 trainings
The nature of research is changing. RC must adapt!
HPC TO CLOUD
Why the shift in tech?
BEYOND TECH
Why not let researchers build everything?
RESEARCH COMPUTING
What is this service?
From HPC to Cloud
High-performance computing (aka Supercomputing from the 90s).
Fixed computing environment deployed (and controlled) by Systems Group
Cloud native
Flexible (customizable) computing environments controlled by researcher
Why was this important???
Strategic Priority
Scale of Need
Data Science Platforms
Collaborative Research
Digital Humanities &
Data Driven Visualization
Restricted Data Analytics
University Product Development
Systems/Cloud Research
Domain Research Platforms/Gateways
Burst Resources
Academics/Training
So this is what we built…
$90M State, BU, Harvard, MIT, NEU, UMass
$5M NSF + Consortium
$2M MTC, $2M BU/HU
These are the investments
$500M Red Hat
Who is using NERC?
Institutions: 11
PIs: 67
Total Projects: 121
Users: 771
Data becomes a first class citizen in the cloud
National Studies on Air Pollution and Health
Compute Space
CSV
FST
DAT
RAW Data: Air pollution | Census | Health Data (Medicare & Medicaid)
Open-Source Release: https://github.com/NSAPH-Data-Platform
Michael Bouzinier
User
1. data request
2. extract
3. analysis
Load Database
Data Exploration
HPC TO CLOUD
Why the shift in tech?
BEYOND TECH
Why not let researchers build everything?
RESEARCH COMPUTING
What is this service?
So much data!!!
So many tools!!!
The Problem (as we see it)
Planning
Creating data
Discovery
Acquisition
Storing
Data transfer
Raw data
Reference data
Analysis
Data wrangling
Data analysis
Data sharing
Management
Data repositories
Data preservation
Disposal
Harvard Research Data Connect - Vision
An ecosystem of applications, services, and resources, integrated by a standards-based, service-oriented framework, populated by Library Services, University RC and school-based RC services, and Office of the VP for Research support, as well as researchers themselves working in partnership.
Thank you
Any Questions?
Bringing Data Close to Compute at Harvard Dataverse
Stefano M. Iacus
Senior Research Scientist & Director of Data Science and Product Research, IQSS, Harvard University
What is Dataverse?
An open-source platform that provides a generalist repository to publish, cite, and archive research data
Built to support multiple types of data, users, and workflows
Supports FAIR principles and Signposting.
Developed mainly at Harvard’s Institute for Quantitative Social Science (IQSS) since 2006 + key contributors from our large community
Started as a data sharing platform for the social sciences; it now covers almost all disciplines.
Who is using Dataverse?
Contributed by 70K users [Harvard DV]
Who is using Dataverse?
recent containerization is giving a boost
The FAIR Guiding Principles (Wilkinson et al. 2016)
FINDABLE
Increases visibility, citations, and impact of research
Supports knowledge discovery and innovation
ACCESSIBLE
Streamlines and maximizes ability to build upon previous research results
Attracts partnerships with researchers and business in allied disciplines
REUSABLE
Promotes use and reuse of data allowing resources to be allocated wisely
Improves reproducibility and reliability of research results
INTEROPERABLE
Supports and promotes inter- and cross- disciplinary data and reuse
FAIR
Who makes DATA fair? Repositories (and researchers)!
🥐-ML
Discoverability & Interoperability
Traditional repositories' web pages are not optimized for use by machine agents that navigate the scholarly web.
How can a robot determine which link on a landing page leads to content and which to metadata?
How can a bot distinguish those links from the myriad of other links on the page?
Signposting exposes this information to bots in a standards-based way (see the example below).
Signposting and Discoverability
https://tinyurl.com/FAIR-Signposting-GREI
FAIR Signposting “Level 1”
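As a rough illustration of what Level 1 Signposting gives a machine agent: typed links in the HTTP Link header of a dataset landing page. The URLs and DOI below are placeholders, not real Harvard Dataverse records:

```python
import requests

# Hypothetical landing page URL; real Dataverse datasets expose similar typed links.
landing_page = "https://dataverse.example.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XXXXX"

resp = requests.head(landing_page, allow_redirects=True)
link_header = resp.headers.get("Link", "")

# A FAIR Signposting Link header looks roughly like (values illustrative):
#   <https://doi.org/10.7910/DVN/XXXXX>; rel="cite-as",
#   <https://dataverse.example.edu/api/datasets/export?...>; rel="describedby"; type="application/ld+json",
#   <https://dataverse.example.edu/api/access/datafile/1234>; rel="item"
# so a bot can find the citation, the metadata, and the content without scraping
# the human-oriented landing page.
for link in link_header.split(","):
    print(link.strip())
```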
Harvard Dataverse through the data life cycle
(present and planned integrations)
Large data support at Harvard Dataverse
GB
TB
PB
Upload through Dataverse
Direct upload/download to S3
Globus Transfer to S3
Reference Data in Remote Stores (HTTP -> Globus)
Sensitive
Globus Transfer to File/Tape
Part of the Harvard Data Commons Project
Globus Endpoints
Dataverse
Server
Globus Service
Researcher’s Browser
Dataverse Dataset
Transfer In/Out or Reference
launch
reliable parallel transfer
Dataverse-Globus Transfer App
Managed
Globus Endpoint
(e.g. over tape storage)
Globus Store(s)
S3 Store
File Store
Remote Store
manage ACLs
launch
notify
request transfer
monitor transfers
How to compute on data stored at Harvard Dataverse?
Previewers and AI tools to interrogate data are integrated into the Dataverse UI but work mostly for small data.
So far, large data must be downloaded over the net in order to enable computing.
Dataverse supports Globus in different ways
| Store | Globus endpoint | DV controls access | Globus transfer to/from | Ingest/previews/HTTP download |
| --- | --- | --- | --- | --- |
| Managed Globus File Store | File/Tape | True | True | False |
| Managed Globus S3 store | S3 Connector | True | True | True |
| Remote Globus Store | Any Trusted | False | N/A (reference only) | HTTP, possibly at remote endpoint |
The MOC version of Harvard Dataverse (PoC)
happy Harvard Dataverse user
(only NERC PIs will be able to run compute)
tape
disk
The MOC version of Harvard Dataverse (PoC)
Dataverse’s Globus application (Disk/Tape)
How Dataverse manages Globus transfer
Globus - Dataverse Transfer Tool
Globus Directory Connection
Dataverse transfer space
Notification
mechanism
Once the process is complete the data is published
NESE tape storage
Valid DOI
Python notebook
Valid DOI
Computing on data
This will spin up the JupyterLab container with the pre-loaded notebook taken from the dataset.
All files in this collection are seen as local to the Jupyter instance. Python will simply load them into memory for computing purposes.
NERC endpoint for the containerized storage (which exists on NESE)
Automatic mapping of local file names (local to the python notebook) to Harvard Dataverse file pointers on NESE
Then some nice computation happens
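As a small illustration of "files look local" inside the spun-up JupyterLab container (the file name below is hypothetical):

```python
import pandas as pd

# The Dataverse files appear under local paths inside the container, even though
# they are actually backed by NESE storage via the NERC endpoint.
df = pd.read_csv("events.csv")   # hypothetical file from the dataset
print(df.describe())
```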
Towards AI integration
This chatbot only sees the tabular data but is clueless about the metadata
Tell me what is this data about
Cool but poor
This chatbot only sees the tabular data but is clueless about the metadata
tell me the range of latitudes and longitudes with the highest number of events
ok-ish
This chatbot only sees the tabular data but is clueless about the metadata
map the range of latitudes and longitudes with the highest number of events to the names of countries
LLM kicks in
This chatbot only sees the tabular data but is clueless about the metadata
Traditional Dataverse search based on Solr (Apache Lucene)
(Keywords) Query: “covid cases in Italy”
Search via embeddings
(NLP) Query: “datasets about covid cases in Italy”
Search via embeddings
(NLP) Query: “datasets su casi di covid in Italia” (“datasets about covid cases in Italy”) [LLM kicks in]
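A minimal sketch of the embedding-based search contrasted with Solr above, using the open-source sentence-transformers library and an assumed multilingual model; the corpus, model choice, and scoring are illustrative, not Dataverse's actual implementation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model

corpus = [  # hypothetical dataset titles
    "Daily COVID-19 case counts for Italian regions, 2020-2022",
    "Citi Bike trip data for New York City",
    "Air pollution and Medicare health outcomes for US counties",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def search(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q                    # cosine similarity (vectors are normalized)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

print(search("datasets about covid cases in Italy"))
print(search("datasets su casi di covid in Italia"))  # the multilingual model handles this too
```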
Roadmap
Challenge: Long term sustainability
Thanks to:
Any Questions?
L. Andreev, P. Durbin, C. Boyd, G. Durand, S. Barbosa (IQSS-Dataverse), J. Myers (GDCC), O. Bertuch (FZJ)
S. Yockel, M. Munakami, M. Kupcevic, M. Shad, F. Pontiggia (NESE, NERC, HUIT)
D. Shi (Red Hat), O. Krieger (BU)
Harvard Dataverse
Dataverse Project
2024 MOC Alliance Workshop
Break and Networking Time
Norwegian Research Commons and Implications for the MOC
Rory Macneil
Founder and CEO, Research Space
The Norwegian Research Commons
as a model for a
NERC Research Commons
Rory Macneil, Research Space
MOC Alliance Workshop
February 28, 2024 – Boston
Overview
Overview
Context
Problem
Lack of data-centric tools
Lack of interoperability
between tools
Impeding FAIR
Creating friction for researchers
Siloed data
Solution? Research Commons
“Bring together data with cloud computing infrastructure and commonly used software, services and applications for managing, analyzing and sharing data to create an interoperable resource for a research community”
Scott Yockel, Harvard University
Who will provide Research Commons?
Global Open Research Commons International Model
Reason: Technical core
“Bring together data with cloud computing infrastructure and commonly used software, services and applications for managing, analyzing and sharing data to create an interoperable resource for a research community”
Scott Yockel, Harvard University
REASON as a model for other Research Commons
The NERC as a
Research Commons
Dataverse
RSpace: Data-centric digital research platform designed to interoperate with and connect research infrastructure
RSpace – Dataverse Integration
NERC Research Commons starting with Dataverse and
RSpace
Dataverse and RSpace as initial core of NERC Research Commons
Active data management + repository / Ecosystem of connected tools / Integration with storage / Integration with compute
RSpace is now being deployed on the NERC!
NERC Research Commons as trigger for federation with other Open Clouds and Research Commons
Resources and contact
Research Commons
GORC International Model
REASON
Institutional example 1: UCL
Connectivity with UCL infrastructure
Institutional example 1: UCL
Powering a FAIR research
data/metadata flow at the institutional level
Institutional example 2: Harvard
National example 1: European Open
Science Cloud
EUDAT Collaborative Data Infrastructure
National example 2: Canada
Digital Research Alliance Research Commons
Stage 2: Export data from RSpace to iRODS
Federated Systems
We have seen siloed systems – connected through a central hub or portal, but with data and processes perhaps in walled gardens in proprietary formats or held behind subscription-based services.
This type of system requires continued budget for the subscriptions as well as vigilance to make sure your data is portable.
To plan for truly federated services requires more time and energy, but can result in a more robust ecosystem of interoperable pieces, resilient to shifting budgets and the consistent changing of underlying technologies.
Diplomacy (and interoperability) between sovereign systems is a more mature, slow, iterative process. It is how infrastructure should behave. It is best suited to be powered by open source solutions and well-understood formats and protocols.
Thank you
Any Questions?
Two Sigma Memento: Why Good Artifact Naming Matters
Mark Roth
Managing Director, Data Engineering, Two Sigma Investments, LP
Important Legal Information
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Disclaimer
“Real names tell you the story of the things they belong to in my language, in the Old Entish as you might say”
Image credit: https://www.etsy.com/listing/715203686/a4-treebeard-poster-lord-of-the-rings
What Can Memento Do For Me?
On some level, these are all about naming!
Naming�is hard
Credit: Phil Karlton, Leon Bambrick, https://martinfowler.com/bliki/TwoHardThings.html
There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.
Memento helps with 2 of 3!
Hypothesis
Data Needed
Zip codes in NYC with higher median incomes show higher Citi Bike usage
Final Data
Local Store
Questions
2020.csv
date, zip, income, usage
2020-02-01, 10013, 150675, 113
2020-02-01, 10014, 147267, 107
...
Step 1. Ingest the Data
Runtime Env
?
?
External Source
Local Store
Amazon S3 tripdata
irs.gov/statistics
?
?
ingest_citibike(year, month)
ingest_irs_gov_statistics(year)
2020MM-citibike-tripdata.csv.zip
2020_irs_gov_statistics_wealth.xlsx
step1_ingest_data(yyyymm)
ingest_s3(url, pathRegEx)
ingest_citibike(year, month)
A
B
C
Step 2. Normalize the Data
Runtime Env
?
?
Local Store
Local Store
2020MM-citibike-tripdata.csv.zip
2020_irs_gov_statistics_wealth.xlsx
?
?
normalize_citibike(year)
normalize_wealth(year)
citibike-2020.csv
wealth-2020.csv
step2_citibike_csv(yyyy)
merge_zip_to_csv(src, dest, **params)
normalize_citibike(year)
B
C
A
Step 3. Join the Data
Runtime Env
join_citibike_wealth(year, tolerance)
Local Store
Local Store
citibike-2020.csv
wealth-2020.csv
2020.csv
The Whole Pipeline
Runtime Env
ingest_citibike(year, month)
ingest_irs_gov_statistics(year)
External Source
Local Store
Amazon S3 tripdata
irs.gov/statistics
2020_irs_gov_statistics_wealth.xlsx
2020MM-citibike-tripdata.csv.zip
Runtime Env
normalize_citibike(year)
normalize_wealth(year)
Local Store
citibike-2020.csv
wealth-2020.csv
Runtime Env
join_citibike_wealth(year, tolerance)
2020.csv
Local Store
Change Management
Runtime Env
Runtime Env
Runtime Env
ingest_citibike(year, month)
ingest_irs_gov_statistics(year)
External Source
Local Store
Amazon S3 tripdata
irs.gov/statistics
2020_irs_gov_statistics_wealth.xlsx
2020MM-citibike-tripdata.csv.zip
normalize_citibike(year)
normalize_wealth(year)
Local Store
citibike-2020.csv
wealth-2020.csv
join_citibike_wealth(year, tolerance)
2020.csv
Local Store
New Data!
New Tolerance
Fix Normalization
New file format!
New CPU Architecture
Better Naming (Old Entish approach)
2020.csv
cpu_intel-env-gpu_nvidia-env-pandas_2.2.0-env-amazon_s3_tripdata-20230101-asof-2020-1-ingest_citibike-2020-normalize_citibike-irs_gov_statistics-20230101-asof-2020-ingest_irs_gov_statistics-2020-normalize_weath_v2-2020-0.7-join_citibike_wealth.csv
The Ents could have saved themselves a lot of time if they knew about hashing.
Better Naming (Memento Approach)
2020.csv
+
join_citibike_wealth(2020, 0.7)
#99872fbb
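The step from the "Old Entish" name to the short handle is ordinary content hashing; a toy illustration (the digest shown on the slide is Memento's, not what this snippet would print):

```python
import hashlib

# The unwieldy descriptive name from the previous slide, abbreviated here with "...".
long_name = ("cpu_intel-env-gpu_nvidia-env-pandas_2.2.0-env-amazon_s3_tripdata-"
             "20230101-asof-2020-1-ingest_citibike-...-2020-0.7-join_citibike_wealth.csv")

short_name = "#" + hashlib.sha256(long_name.encode()).hexdigest()[:8]
print(short_name)  # a fixed-length handle that still identifies the same story
```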
Memento Approach
1. Functions, not artifacts
Runtime Env
ingest_citibike(year, month)
ingest_irs_gov_statistics(year)
External Source
Local Store
Amazon S3 tripdata
irs.gov/statistics
2020_irs_gov_statistics_wealth.xlsx
2020MM-citibike-tripdata.csv.zip
Runtime Env
normalize_citibike(year)
normalize_wealth(year)
Local Store
citibike-2020.csv
wealth-2020.csv
Runtime Env
join_citibike_wealth(year, tolerance)
2020.csv
Local Store
Memento Approach
1. Functions, not artifacts
Runtime Env
Runtime Env
ingest_citibike(year, month)
ingest_irs_gov_statistics(year)
External Source
Amazon S3 tripdata
irs.gov/statistics
normalize_citibike(year)
normalize_wealth(year)
Runtime Env
join_citibike_wealth(year, tolerance)
Memento Approach
2. Durably store ingested data
Runtime Env
Runtime Env
ingest_citibike(year, month)
ingest_irs_gov_statistics(year)
External Source
Amazon S3 tripdata
irs.gov/statistics
normalize_citibike(year)
normalize_wealth(year)
Runtime Env
join_citibike_wealth(year, tolerance)
Durable Storage
Memento Approach
3. Hash all code and environments
Runtime Env
Runtime Env
ingest_citibike(year, month)
ingest_irs_gov_statistics(year)
External Source
Amazon S3 tripdata
irs.gov/statistics
normalize_citibike(year)
normalize_wealth(year)
Runtime Env
join_citibike_wealth(year, tolerance)
Durable Storage
#af023329
#2276ea01
#297bba2f
#17fbccd4
#1288fe63
#1288fe63
#1288fe63
normalize_citibike(year)
#73612fea
join_citibike_wealth(year, tolerance)
#0039dbbf
Memento Approach
4. Hash, memoize all invocations
Runtime Env
Runtime Env
ingest_citibike(2020, 1)
ingest_irs_gov_statistics(2020)
External Source
Amazon S3 tripdata
irs.gov/statistics
normalize_citibike(2020)
normalize_wealth(2020)
Runtime Env
join_citibike_wealth(2020, 0.7)
Durable Storage
#af023329
#2276ea01
#297bba2f
#17fbccd4
#99872fbb
#1288fe63
#1288fe63
#1288fe63
join_citibike_wealth(2020, 0.7)
#e8817aec
#1b726dda
#6239bb12
#1997ffb2
#200776dd
Memento Approach
5. Record “mementos”
Runtime Env
Runtime Env
ingest_citibike(2020, 1)
ingest_irs_gov_statistics(2020)
External Source
Amazon S3 tripdata
irs.gov/statistics
normalize_citibike(2020)
normalize_wealth(2020)
Runtime Env
join_citibike_wealth(2020, 0.7)
Durable Storage
#af023329
#2276ea01
#297bba2f
#17fbccd4
#99872fbb
#1288fe63
#1288fe63
#1288fe63
join_citibike_wealth(2020, 0.7)
#e8817aec
#1b726dda
#6239bb12
#1997ffb2
#200776dd
What’s in a Memento?
Standard Memento Metadata
Inputs
Outputs
Signature
Chain
Signature
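One rough way to read the slide's fields as a record type; the names are illustrative, not the actual twosigma.memento schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Memento:
    function_hash: str                  # which code ran
    parameter_hash: str                 # with which arguments
    environment_hash: str               # in which runtime environment
    inputs: Dict[str, str]              # hashes of the upstream data it consumed
    outputs: Dict[str, str]             # hashes of the results it produced
    chain: List[str] = field(default_factory=list)  # upstream mementos (provenance)
    signature: str = ""                 # signed so third parties can validate it
```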
What Can Memento Do For Me?
Provenance�is about naming
Why is Provenance Hard?
To understand provenance accurately, we want to know:
Provenance in Memento
Mementos provide high fidelity end-to-end provenance
Most frameworks can be adapted to use Memento as a standard
Reproducibility�is about naming
Why is Reproducibility Hard?
To reproduce computation, all of these must be accurately named (identified, recorded, versioned) in a way they can be reconstructed:
Reproducibility in Memento
Mementos record everything we need to know to reproduce research
The results are signed and can be validated by 3rd parties
Ergonomic�(and Economic)�Advantages
Organization Advantages
Caching Advantages
Cache needs to be invalidated when:
Surprise: Memento already tracks all of these in the invocation hash, making it possible to do high-fidelity automatic caching!
Caches can even be shared with other team members!
Caching in Memento
Memento
Metadata
Result Hash
Memoized Data
Result Hash
Serialized Results
Function( )
arguments
Results
Invocation Hash
Parameter Map (hash)
Function Reference (hash)
Output Cache (result hash → result)
Memento Federated Catalog (invocation hash → Memento)
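A sketch of the lookup path in the diagram above; all of the structures are in-memory stand-ins for the real federated catalog and output cache:

```python
import hashlib

catalog = {}       # federated catalog: invocation hash -> memento (metadata + result hash)
output_cache = {}  # output cache: result hash -> result

def invocation_hash(function_hash: str, parameter_hash: str) -> str:
    return hashlib.sha256((function_hash + parameter_hash).encode()).hexdigest()

def call_or_reuse(function_hash: str, parameter_hash: str, run):
    key = invocation_hash(function_hash, parameter_hash)
    if key in catalog:                                    # someone (maybe a teammate)
        return output_cache[catalog[key]["result_hash"]]  # already ran this invocation
    result = run()
    result_hash = hashlib.sha256(repr(result).encode()).hexdigest()
    output_cache[result_hash] = result
    catalog[key] = {"result_hash": result_hash}
    return result
```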
Memento: Current State and Next Steps
Memento Open Source
Two Sigma has decided to open source a version of Memento, in order to help encourage further research
pip install twosigma.memento
Future Work
Thank you
Any Questions?
D4N: A Community Cache for an Open Cloud
Matt Benjamin, Engineering Manager at IBM Storage
Amin Mosayyebzadeh, PhD Candidate at Boston University
Brief History of IBM Ceph Object / MOC Collaboration
Alignment around Fusion of Data and Compute
D4N Upstreaming Effort
D4N Integration (Zipper Filter)
D4N Upstreaming Effort
D4N with Higher Locality
RGW
RGW
RW cache
RW cache
Directory
RGW
RGW
Read
cache
Read
cache
Directory
Write back cache
Distributed D4N
D4N and K8s
K8s
K8s
K8s
K8s
K8s
D4N Use Cases
BDAS: Big Data Analytic Support
(Ceph Object Team Code Name)
MOC Collaboration Futures
Credits
PhD Students
Past
Advisors
Past
Research Professors
IBM Ceph Team
Thank you
Any Questions?
Michael Daitzman
Director of Product Development,
Mass Open Cloud Alliance
United in the Cloud
The MOC Alliance Team
Emmanuel Cecchet
Software Engineer
Organization: UMass
Projects: OCT, ESI
Harvard University:
Nick Amento, Network Architect
Robin Weber, NESE
Quan Pham
Software Engineer
Organization: Boston University
Projects: NERC, Mass Open Cloud Alliance
James Culbert
Director of IT
Organization: MGHPCC
Projects: NERC, MGHPCC, NEFRC, OSN
Danni Shi
Senior Software Engineer
Organization: Red Hat
Projects: OPE, NERC
Tzu-Mainn Chen
Organization: Red Hat
Projects: ESI, NERC
Isaiah Stapleton
Software Engineer
Organization: Red Hat
Projects: OPE, NERC
Steve Heckman, BU
Surbi Kanthed, Red Hat
Dylan Stewart, Red Hat
Two More Things . . . .
2024 MOC Alliance Workshop
Reception
6:30 - 9:00 pm
CDS 1750 (17th Floor)
665 Commonwealth Ave