1 of 10

MLCommons Science Working Group


Science Working Group contribution to the MLCommons Community Meeting, July 14, 2022

Farzana Yasmin Ahmad, Gregg Barrett, Wahid Bhimji, Cade Brown, Bala Desinghu, Murali Emani, Steve Farrell, Geoffrey Fox, Grigori Fursin, Tony Hey, David Kanter, Christine Kirkpatrick, Piotr Luszczek, Lauren Moos, Hai Ah Nam, Juri Papay, Amit Ruhela, Mallikarjun Shankar, Jeyan Thiyagalingam, Aristeidis Tsaris, Gregor von Laszewski, Feiyi Wang, and Junqi Yin

attended the September 2021 – July 2022 meetings; co-chairs were indicated in purple on the original slide.

Tom Gibbs, Rushil Anirudh, Hyojin Kim, Sergey Samsonau

2 of 10

MLCommons Science Research Working Group

  • Science, like industry, involves edge and data-center issues, end-to-end systems, inference, and training. There are some similarities in the datasets and analytics, as both industry and science involve image data, but there are also differences: science data associated with simulations and particle-physics experiments are quite different from most industry exemplars
  • When fully populated, the benchmark suite will cover (at least) the following domains: material sciences, environmental sciences, life sciences, fusion, particle physics, astronomy, and earthquake and earth sciences, with more than one representative problem from each of these domains


  • https://mlcommons.org/en/groups/research-science/
  • One aim is to provide a mechanism for assessing the capability of different ML models in addressing different scientific problems
  • i.e., one benchmark measure is Scientific Discovery
  • Cover a rich range of problem classes
  • “End-to-end” is one such class
  • Provide a common environment to store and run benchmarks (software)
  • 4 initial benchmarks (2 from DOE labs, 1 from the UK, 1 from UVA)
  • Surrogates included (1 from LLNL in the next round)
  • Lead the use of FAIR metadata for MLCommons

3 of 10

Science-based Metrics

  • Metrics will include those measuring performance on scientific discovery, e.g., one or more of:
    • Accuracy achieved
    • Time to solution (to meet a specific accuracy target)
    • Top-1 or Top-5 score (a minimal sketch of this metric follows this list)
    • The chance your home will suffer a big earthquake ...
  • The goal of our benchmarks is to stimulate the development of new methods relevant to scientific outcomes. We aim to:
    • Offer well-defined “science data” sets
    • Provide a reference implementation, to help others overcome any format/interpretation/usage hurdles
    • Specify target benchmark metrics (to outperform)
    • Require a description of the improved method or code used by respondents
  • The science data should have enough richness to allow experimentation with innovative approaches.
  • Also allow traditional system-performance benchmarks
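
Of these, the top-k score is the most mechanical to pin down. The following minimal NumPy sketch is illustrative only and is not taken from any of the reference implementations:

    import numpy as np

    def top_k_accuracy(logits, labels, k=1):
        # Fraction of samples whose true label appears among the k
        # highest-scoring classes (k=1 gives the usual top-1 accuracy).
        topk = np.argsort(logits, axis=1)[:, -k:]
        return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

    # toy usage: 3 samples, 4 classes
    logits = np.array([[0.10, 0.60, 0.20, 0.10],
                       [0.30, 0.20, 0.40, 0.10],
                       [0.25, 0.25, 0.30, 0.20]])
    labels = np.array([1, 0, 2])
    print(top_k_accuracy(logits, labels, k=1))  # 0.666...
    print(top_k_accuracy(logits, labels, k=2))  # 1.0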

4 of 10

Benchmark | Science | Task | Owner Institute | Specific Benchmark Issues
CloudMask | Climate | Segmentation | RAL | Classify cloud pixels in satellite images
STEMDL | Material science | Classification | ORNL | Classify the space groups of materials from their electron diffraction patterns
CANDLE-UNO | Medicine | Classification | ANL | Cancer prediction at the cellular, molecular, and population levels
TEvolOp Forecasting | Earthquake | Regression | Virginia | Predict earthquake activity from recorded event data
ICF (Inertial Confinement Fusion) | Plasma physics | Simulation surrogate | LLNL | One of a collection of ~10 possible LLNL benchmarks

Each benchmark comprises datasets, science goals, and a reference implementation; the benchmarks are hosted at SDSC or RAL.

Specification of the 4 benchmarks: https://drive.google.com/file/d/1BeefJTj4ZZL4Wa5c3zNz1l5nzQN-ktGR/view?usp=sharing

5 of 10

RAL Cloud Masking – Benchmark Overview

  • The problem: identifying individual cloud pixels in satellite imagery, which is necessary for estimating sea or land surface temperature
  • Relies on Sentinel-3 satellite data, particularly data from the Sea and Land Surface Temperature Radiometer (SLSTR) instrument
  • The first version will have one dataset (200 GB), but we intend to include another (>1 TB)
  • This benchmark uses a U-Net-based deep neural network (a toy-scale sketch of the architecture family follows this list)
  • Metrics: classification accuracy (among others)
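
To make “U-Net-based” concrete, here is a minimal Keras sketch of the architecture family. It is illustrative only, not the reference implementation: the channel count (9 SLSTR-like bands), tile size, and layer widths are assumptions, and the real network is substantially deeper.

    import tensorflow as tf
    from tensorflow.keras import layers

    def tiny_unet(input_shape=(256, 256, 9)):   # 9 input bands: an assumption
        inputs = tf.keras.Input(shape=input_shape)
        # encoder: convolve, then downsample
        c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
        p1 = layers.MaxPooling2D()(c1)
        c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
        p2 = layers.MaxPooling2D()(c2)
        b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)  # bottleneck
        # decoder: upsample and concatenate the encoder's feature maps
        # (these skip connections are the defining U-Net feature)
        u2 = layers.Concatenate()([layers.UpSampling2D()(b), c2])
        c3 = layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
        u1 = layers.Concatenate()([layers.UpSampling2D()(c3), c1])
        c4 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)
        # per-pixel cloud probability
        outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
        return tf.keras.Model(inputs, outputs)

    model = tiny_unet()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

The skip connections carry fine spatial detail from the encoder into the decoder, which is what makes a U-Net suitable for per-pixel cloud masks rather than whole-image labels.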

Sam Jackson, Caroline Cox, Jeyan Thiyagalingam and Tony Hey

6 of 10

ORNL STEMDL – Benchmark Overview

  • Classifying the space groups of materials from their electron diffraction patterns
  • Reference implementation based on a ResNet50 model (a minimal sketch follows the figure note below)
  • Implementation: Python, PyTorch (with Horovod)
  • Metrics: F1 score and per-class accuracy (among others)
  • Data: electron diffraction patterns for over 60,000 materials from the Materials Project database, DOI 10.13139/OLCF/1510313 (~550 GB)

[Figure: an electron microscope performs a 2D (X–Y) CBED scan of a material; the diffracted e-beam patterns are mapped to material properties]
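
A hedged sketch of how a ResNet50 backbone might be set up for this task in PyTorch. This is not the reference implementation: the class count (all 230 crystallographic space groups) and the single-channel input adaptation are assumptions, and the Horovod data-parallel wrapping used for multi-node training is omitted.

    import torch
    import torchvision

    NUM_SPACE_GROUPS = 230   # all space groups; the benchmark may use fewer (assumption)

    # Standard ResNet50 with a classification head sized for space groups.
    model = torchvision.models.resnet50(num_classes=NUM_SPACE_GROUPS)

    # Diffraction patterns are not RGB; adapting the stem to single-channel
    # input is an assumption about the data format, not the reference's choice.
    model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

    x = torch.randn(4, 1, 224, 224)                    # toy batch of CBED patterns
    logits = model(x)                                  # shape (4, NUM_SPACE_GROUPS)
    labels = torch.randint(0, NUM_SPACE_GROUPS, (4,))
    loss = torch.nn.functional.cross_entropy(logits, labels)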

Junqi Yin, Sajal Dash, Aristeidis Tsaris, Feiyi Wang, Mallikarjun Shankar

7 of 10

ANL CANDLE – Benchmark Overview

  • Exascale Deep Learning and Simulation Enabled Precision Medicine for Cancer
  • Implements deep learning architectures relevant to problems in cancer. These architectures address problems at three biological scales: cellular (Pilot 1), molecular (Pilot 2), and population (Pilot 3)
  • This benchmark focuses on Uno from Pilot 1: the high-level goal is to predict drug response based on molecular features of tumor cells across multiple data sources.
  • The goal of Uno is to build neural-network-based models to predict tumor response to single and paired drugs, based on molecular features of tumor cells.
  • It implements a deep learning architecture with 21M parameters in Python, using TensorFlow 2 and Keras (a toy-scale sketch of the multi-input structure follows this list)
  • Metric: quality of prediction (validation loss)
  • 3,070 unique samples and 53,520 unique drugs
  • Has a rich collection of data engineering steps, so it could serve as an end-to-end benchmark
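
A toy-scale Keras sketch of the kind of multi-input network described above. The feature widths and layer sizes are placeholders, not the real Uno configuration; the genuine ~21M-parameter reference model and its multiple data sources live in the benchmark repository.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Placeholder feature widths; the real Uno inputs are defined by the benchmark data.
    N_CELL_FEATURES, N_DRUG_FEATURES = 942, 4392

    def tower(x, units=(1000, 1000, 1000)):
        # a stack of dense layers encoding one feature source
        for u in units:
            x = layers.Dense(u, activation="relu")(x)
        return x

    cell_in = tf.keras.Input(shape=(N_CELL_FEATURES,), name="cell_features")
    drug_in = tf.keras.Input(shape=(N_DRUG_FEATURES,), name="drug_features")

    # encode each source separately, then merge and regress the drug response
    merged = layers.Concatenate()([tower(cell_in), tower(drug_in)])
    response = layers.Dense(1, name="response")(tower(merged))

    model = tf.keras.Model([cell_in, drug_in], response)
    model.compile(optimizer="adam", loss="mse")   # validation loss is the quality metric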

Murali Emani and Venkatram Vishwanath

8 of 10

UVA TEvolOp Benchmark - Overview

  • Time Series Evolution Operator
  • Focuses on extracting the evolution of earthquake time series data
  • Contains three reference models:
    • LSTM
    • Temporal Fusion Transformer (Google/NVIDIA, modified by UVA)
    • Science Transformer (University of Virginia)
  • Earthquake data from 1950 to the present, from USGS
    • California, with faults tagged (typical results shown in the accompanying figure)
  • 1,790 two-week time bins (the input data are daily), 2,400 locations, and ~12 measurements covering magnitude, energy, depth, and multiplicity
  • 5 GB of raw data
  • Metrics: multi-year forecasts of earthquake activity as a function of time (4 years ahead illustrated)
    • Nash-Sutcliffe efficiency (a minimal sketch of this metric follows this list)
  • Related to extreme events in the stock market
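
The Nash-Sutcliffe efficiency has a standard closed form, NSE = 1 - Σ(obs - pred)² / Σ(obs - mean(obs))², which the following minimal NumPy sketch implements (illustrative only; the benchmark may use a normalized variant):

    import numpy as np

    def nash_sutcliffe_efficiency(observed, predicted):
        # NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
        # 1.0 is a perfect forecast; 0.0 is no better than predicting the mean.
        observed = np.asarray(observed, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return 1.0 - np.sum((observed - predicted) ** 2) / np.sum((observed - observed.mean()) ** 2)

    # toy usage on a short activity series
    obs = np.array([1.0, 2.0, 3.0, 2.5, 4.0])
    pred = np.array([0.9, 2.1, 2.8, 2.7, 3.8])
    print(nash_sutcliffe_efficiency(obs, pred))   # close to 1.0 for a good forecast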

9 of 10

Ready to Announce

  • The first 4 benchmarks have been loaded into the MLCommons GitHub: https://github.com/mlcommons/science
  • All of these benchmarks have Apache 2.0 licenses
  • All have been run on multiple platforms
  • The 4 benchmarks are available with datasets, reference implementations, and goals
  • The benchmarks are ready except for the final web site and approvals
  • We propose September 30 as the first submission date, though significant science improvements will likely take longer
    • This date allows the problems to be run with a simple exploration
  • Rules (adapted from HPC) have been proposed
  • The focus is an Open Division with a Scientific Discovery metric
  • A Closed Division, where the metric is system performance, is also supported

10 of 10

Science WG Benchmark Futures

  • ML (currently deep learning) will transform most scientific fields
    • e.g., the new book “AI for Science” has 40 chapters
  • ML approaches can be grouped so that methods cut across science domains:
    • Time series, e.g., earthquakes, tokamak electron density, weather, stocks, particle motion
    • Simulation surrogates
    • Mapping properties to capabilities (chemistry to drugs)
    • Image analysis (note that a time series of images mixes these categories)
    • GAN scenario generation
    • Publication/science text
    • Control systems (networks, accelerators, tokamaks)
  • We could collect a few (~4) examples in each group and explore foundation/giant models
  • Join the working group at https://mlcommons.org/en/groups/research-science/ via https://mlcommons.org/en/get-involved/
  • See the minutes at https://docs.google.com/document/d/167m7FK6-Ud4M5gXta5cIc1hKqaRHkk2B1GyKasdeQLc/edit?usp=sharing