1 of 10

MLCommons Science Working Group


Science Working Group contribution to the MLCommons Community Meeting, July 14, 2022

Farzana Yasmin Ahmad, Gregg Barrett, Wahid Bhimji, Cade Brown, Bala Desinghu, Murali Emani, Steve Farrell, Geoffrey Fox, Grigori Fursin, Tony Hey, David Kanter, Christine Kirkpatrick, Piotr Luszczek, Lauren Moos, Hai Ah Nam, Juri Papay, Amit Ruhela, Mallikarjun Shankar, Jeyan Thiyagalingam, Aristeidis Tsaris, Gregor von Laszewski, Feiyi Wang, and Junqi Yin

attended the September 2021 – July 2022 meetings; co-chairs were indicated in purple on the original slide.

Tom Gibbs, Rushil Anirudh, Hyojin Kim, Sergey Samsonau

2 of 10

MLCommons Science Research Working Group

  • Science, like industry, involves edge and data-center issues, end-to-end systems, inference, and training. There are some similarities in the datasets and analytics, as both industry and science involve image data, but there are also differences: science data associated with simulations and particle-physics experiments are quite different from most industry exemplars
  • When fully populated, the benchmark suite will cover (at least) the following domains: material sciences, environmental sciences, life sciences, fusion, particle physics, astronomy, and earthquake and earth sciences, with more than one representative problem from each of these domains


  • https://mlcommons.org/en/groups/research-science/
  • One aim is to provide a mechanism for assessing the capability of different ML models in addressing different scientific problems
  • i.e., one benchmark measure is Scientific Discovery
  • Cover a rich range of problem classes
  • “End-to-end” is one such class
  • Provide a common environment to store and run benchmarks (software)
  • 4 initial benchmarks (2 from DOE labs, 1 from the UK, 1 from UVA)
  • Surrogates included (1 from LLNL in the next round)
  • Lead the use of FAIR metadata for MLCommons

3 of 10

Science-based Metrics

  • Metrics will include those measuring performance on scientific discovery, e.g., one or more of:
    • Accuracy achieved
    • Time to solution (to meet a specific accuracy target)
    • Top-1 or Top-5 score (a minimal sketch of this metric follows this list)
    • The chance your home will suffer a big earthquake ...
  • The goal of our benchmarks is to stimulate the development of new methods relevant to scientific outcomes. We aim to:
    • Offer well-defined “science data” sets
    • Provide a reference implementation, to help others overcome any format/interpretation/usage hurdles
    • Specify target benchmark metrics (to outperform)
    • Require a description of the improved method or code used by respondents
  • The science data should have enough richness to allow experimentation with innovative approaches.
  • Also allow traditional system-performance benchmarks
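
Of these, the top-k score is the most mechanical to pin down. The following minimal NumPy sketch is illustrative only and is not taken from any of the reference implementations:

    import numpy as np

    def top_k_accuracy(logits, labels, k=1):
        # Fraction of samples whose true label appears among the k
        # highest-scoring classes (k=1 gives the usual top-1 accuracy).
        topk = np.argsort(logits, axis=1)[:, -k:]
        return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

    # toy usage: 3 samples, 4 classes
    logits = np.array([[0.10, 0.60, 0.20, 0.10],
                       [0.30, 0.20, 0.40, 0.10],
                       [0.25, 0.25, 0.30, 0.20]])
    labels = np.array([1, 0, 2])
    print(top_k_accuracy(logits, labels, k=1))  # 0.666...
    print(top_k_accuracy(logits, labels, k=2))  # 1.0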

4 of 10

Benchmark | Science | Task | Owner Institute | Specific Benchmark Issues
CloudMask | Climate | Segmentation | RAL | Classify cloud pixels in satellite images
STEMDL | Material science | Classification | ORNL | Classify the space groups of materials from their electron diffraction patterns
CANDLE-UNO | Medicine | Classification | ANL | Cancer prediction at the cellular, molecular, and population levels
TEvolOp Forecasting | Earthquake | Regression | Virginia | Predict earthquake activity from recorded event data
ICF (Inertial Confinement Fusion) | Plasma physics | Simulation surrogate | LLNL | One of a collection of ~10 possible LLNL benchmarks

Each benchmark comprises datasets, science goals, and a reference implementation; the benchmarks are hosted at SDSC or RAL.

Specification of the 4 benchmarks: https://drive.google.com/file/d/1BeefJTj4ZZL4Wa5c3zNz1l5nzQN-ktGR/view?usp=sharing

5 of 10

RAL Cloud Masking – Benchmark Overview

  • The problem: identifying individual cloud pixels in satellite imagery, which is necessary for estimating sea or land surface temperature
  • Relies on Sentinel-3 satellite data, particularly data from the Sea and Land Surface Temperature Radiometer (SLSTR) instrument
  • The first version will have one dataset (200 GB), but we intend to include another (>1 TB)
  • This benchmark uses a U-Net-based deep neural network (a toy-scale sketch of the architecture family follows this list)
  • Metrics: classification accuracy (among others)
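
To make “U-Net-based” concrete, here is a minimal Keras sketch of the architecture family. It is illustrative only, not the reference implementation: the channel count (9 SLSTR-like bands), tile size, and layer widths are assumptions, and the real network is substantially deeper.

    import tensorflow as tf
    from tensorflow.keras import layers

    def tiny_unet(input_shape=(256, 256, 9)):   # 9 input bands: an assumption
        inputs = tf.keras.Input(shape=input_shape)
        # encoder: convolve, then downsample
        c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
        p1 = layers.MaxPooling2D()(c1)
        c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
        p2 = layers.MaxPooling2D()(c2)
        b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)  # bottleneck
        # decoder: upsample and concatenate the encoder's feature maps
        # (these skip connections are the defining U-Net feature)
        u2 = layers.Concatenate()([layers.UpSampling2D()(b), c2])
        c3 = layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
        u1 = layers.Concatenate()([layers.UpSampling2D()(c3), c1])
        c4 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)
        # per-pixel cloud probability
        outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
        return tf.keras.Model(inputs, outputs)

    model = tiny_unet()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

The skip connections carry fine spatial detail from the encoder into the decoder, which is what makes a U-Net suitable for per-pixel cloud masks rather than whole-image labels.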

Sam Jackson, Caroline Cox, Jeyan Thiyagalingam and Tony Hey

6 of 10

ORNL STEMDL – Benchmark Overview

  • Classifying the space groups of materials from their electron diffraction patterns
  • Reference implementation based on a ResNet50 model (a minimal sketch follows the figure note below)
  • Implementation: Python, PyTorch (with Horovod)
  • Metrics: F1 score and per-class accuracy (among others)
  • Data: electron diffraction patterns for over 60,000 materials from the Materials Project database, DOI 10.13139/OLCF/1510313 (~550 GB)

[Figure: an electron microscope performs a 2D (X–Y) CBED scan of a material; the diffracted e-beam patterns are mapped to material properties]
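
A hedged sketch of how a ResNet50 backbone might be set up for this task in PyTorch. This is not the reference implementation: the class count (all 230 crystallographic space groups) and the single-channel input adaptation are assumptions, and the Horovod data-parallel wrapping used for multi-node training is omitted.

    import torch
    import torchvision

    NUM_SPACE_GROUPS = 230   # all space groups; the benchmark may use fewer (assumption)

    # Standard ResNet50 with a classification head sized for space groups.
    model = torchvision.models.resnet50(num_classes=NUM_SPACE_GROUPS)

    # Diffraction patterns are not RGB; adapting the stem to single-channel
    # input is an assumption about the data format, not the reference's choice.
    model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

    x = torch.randn(4, 1, 224, 224)                    # toy batch of CBED patterns
    logits = model(x)                                  # shape (4, NUM_SPACE_GROUPS)
    labels = torch.randint(0, NUM_SPACE_GROUPS, (4,))
    loss = torch.nn.functional.cross_entropy(logits, labels)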

Junqi Yin, Sajal Dash, Aristeidis Tsaris, Feiyi Wang, Mallikarjun Shankar

7 of 10

ANL CANDLE – Benchmark Overview

  • Exascale Deep Learning and Simulation Enabled Precision Medicine for Cancer
  • Implements deep learning architectures relevant to problems in cancer. These architectures address problems at three biological scales: cellular (Pilot 1), molecular (Pilot 2), and population (Pilot 3)
  • This benchmark focuses on Uno from Pilot 1: the high-level goal is to predict drug response based on molecular features of tumor cells across multiple data sources.
  • The goal of Uno is to build neural-network-based models to predict tumor response to single and paired drugs, based on molecular features of tumor cells.
  • It implements a deep learning architecture with 21M parameters in Python, using TensorFlow 2 and Keras (a toy-scale sketch of the multi-input structure follows this list)
  • Metric: quality of prediction (validation loss)
  • 3,070 unique samples and 53,520 unique drugs
  • Has a rich collection of data engineering steps, so it could serve as an end-to-end benchmark
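
A toy-scale Keras sketch of the kind of multi-input network described above. The feature widths and layer sizes are placeholders, not the real Uno configuration; the genuine ~21M-parameter reference model and its multiple data sources live in the benchmark repository.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Placeholder feature widths; the real Uno inputs are defined by the benchmark data.
    N_CELL_FEATURES, N_DRUG_FEATURES = 942, 4392

    def tower(x, units=(1000, 1000, 1000)):
        # a stack of dense layers encoding one feature source
        for u in units:
            x = layers.Dense(u, activation="relu")(x)
        return x

    cell_in = tf.keras.Input(shape=(N_CELL_FEATURES,), name="cell_features")
    drug_in = tf.keras.Input(shape=(N_DRUG_FEATURES,), name="drug_features")

    # encode each source separately, then merge and regress the drug response
    merged = layers.Concatenate()([tower(cell_in), tower(drug_in)])
    response = layers.Dense(1, name="response")(tower(merged))

    model = tf.keras.Model([cell_in, drug_in], response)
    model.compile(optimizer="adam", loss="mse")   # validation loss is the quality metric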

Murali Emani and Venkatram Vishwanath

8 of 10

UVA TEvolOp Benchmark - Overview

  • Time Series Evolution Operator
  • Focuses on extracting the evolution of earthquake time series data
  • Contains three reference models:
    • LSTM
    • Temporal Fusion Transformer (Google/NVIDIA, modified by UVA)
    • Science Transformer (University of Virginia)
  • Earthquake data from 1950 to the present, from USGS
    • California, with faults tagged (typical results shown in the accompanying figure)
  • 1,790 two-week time bins (the input data are daily), 2,400 locations, and ~12 measurements covering magnitude, energy, depth, and multiplicity
  • 5 GB of raw data
  • Metrics: multi-year forecasts of earthquake activity as a function of time (4 years ahead illustrated)
    • Nash-Sutcliffe efficiency (a minimal sketch of this metric follows this list)
  • Related to extreme events in the stock market
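
The Nash-Sutcliffe efficiency has a standard closed form, NSE = 1 - Σ(obs - pred)² / Σ(obs - mean(obs))², which the following minimal NumPy sketch implements (illustrative only; the benchmark may use a normalized variant):

    import numpy as np

    def nash_sutcliffe_efficiency(observed, predicted):
        # NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
        # 1.0 is a perfect forecast; 0.0 is no better than predicting the mean.
        observed = np.asarray(observed, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return 1.0 - np.sum((observed - predicted) ** 2) / np.sum((observed - observed.mean()) ** 2)

    # toy usage on a short activity series
    obs = np.array([1.0, 2.0, 3.0, 2.5, 4.0])
    pred = np.array([0.9, 2.1, 2.8, 2.7, 3.8])
    print(nash_sutcliffe_efficiency(obs, pred))   # close to 1.0 for a good forecast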

9 of 10

Ready to Announce

  • The first 4 benchmarks have been loaded into the MLCommons GitHub: https://github.com/mlcommons/science
  • All of these benchmarks have Apache 2.0 licenses
  • All have been run on multiple platforms
  • The 4 benchmarks are available with datasets, reference implementations, and goals
  • The benchmarks are ready except for the final web site and approvals
  • We propose September 30 as the first submission date, though significant science improvements will likely take longer
    • This date allows the problems to be run with a simple exploration
  • Rules (adapted from HPC) have been proposed
  • The focus is an Open Division with a Scientific Discovery metric
  • A Closed Division, where the metric is system performance, is also supported

10 of 10

Science WG Benchmark Futures

  • ML (currently deep learning) will transform most scientific fields
    • e.g., the new book “AI for Science” has 40 chapters
  • ML approaches can be grouped so that methods cut across science domains:
    • Time series, e.g., earthquakes, tokamak electron density, weather, stocks, particle motion
    • Simulation surrogates
    • Mapping properties to capabilities (chemistry to drugs)
    • Image analysis (note that a time series of images mixes these categories)
    • GAN scenario generation
    • Publication/science text
    • Control systems (networks, accelerators, tokamaks)
  • We could collect a few (~4) examples in each group and explore foundation/giant models
  • Join the working group at https://mlcommons.org/en/groups/research-science/ via https://mlcommons.org/en/get-involved/
  • See the minutes at https://docs.google.com/document/d/167m7FK6-Ud4M5gXta5cIc1hKqaRHkk2B1GyKasdeQLc/edit?usp=sharing