1 of 141

MLCommons®

Community Meeting 1Q23

April 20, 2023

2 of 141

This community meeting is being recorded and will be shared

2

3 of 141

Schedule

3

9:00 AM

Breakfast

9:30 AM

VMware Welcome: Sujata Banerjee

9:45 AM

MLC Welcome: Peter Mattson

10:00 AM

MLC Update: David Kanter

10:30 AM

Break

10:50 AM

Working Group Update

12:20 PM

Lunch

1:20 PM

Power WG Showcase

1:35 PM

DataPerf WG Showcase

1:50 PM

Group discussions (in person only)

I Benchmark value to enterprise customers: getting involved / moderator: Debojyoti Dutta

II Datasets for model quality benchmarking - e.g. which is the best LLM? / moderator: Kurt Bollacker

III MLCommons research: how do we deliver value for researchers? / moderator: Vijay Janapa Reddi

3:20 PM

Cake Break

3:45 PM

Social hour

4:45 PM

End

4 of 141

Welcome

4

5 of 141

5

What is the MLCommons Association?

6 of 141

6

ML/AI has huge potential to benefit everyone

  • Information access
  • Health
  • Safety
  • Human productivity

7 of 141

MLCommons is building the ML ecosystem

7

Mission

AI / ML Ecosystem

Pillars

Benchmarks

Best practices

Research

Data

AI / ML Ecosystem

Wright brothers: public domain; Planes: Marek Ślusarczyk

Community

8 of 141

Working groups

8

9 of 141

9

Benchmarks

10 of 141

ML needs benchmarks for everything

10

ML component                  | Metrics             | MLCommons WG
Hardware                      | Speed/efficiency    | MLPerf WGs
Software (compiler + runtime) | Speed/efficiency    | MLPerf WGs
Model                         | Accuracy/efficiency | AlgoPerf WG
Training algorithm            | Accuracy/efficiency | AlgoPerf WG
Data                          | Accuracy/efficiency | DataPerf WG
Solution                      | Accuracy/safety     | MedPerf WG, Automotive task force

11 of 141

11

Data

12 of 141

Data is the new code.

Data defines best possible functionality.

The model is a lossy compiler.

12

13 of 141

Modern ML is built on public datasets

13

Public datasets are the language of ML research …

Even for the largest ML-focused companies…

14 of 141

But ML is evolving

14

15 of 141

How do we develop better datasets?

15

[Diagram: developing better datasets requires community, infrastructure (tools, metrics), funding ($ € ¥ …), venues + incentives (NeurIPS, ICML?, …), and people with a shared vision, producing datasets for AGI train and test, industry / tool R&D, and the public good.]

16 of 141

16

Challenges

17 of 141

ML/AI is taking off

17

“AI” search interest over time

18 of 141

We are driving 200mph…while building the road

18

Photos: unsplash

19 of 141

Concretely

Rapid changes

  • ML deployed in verticals
  • LLMs
  • Quality benchmarks
  • Datasets
  • Industrial use at academic pace

Org challenges

  • Member/community growth
  • Staffing/processes maturity
  • Membership model

19

20 of 141

20

Getting involved

21 of 141

We need more smart people!

21

22 of 141

22

Values

23 of 141

Values (https://mlcommons.org/en/philosophy/)

  • Grow ML markets and make the world a better place
  • Act through collaborative engineering
  • Get everyone involved
  • Make fast but consensus-supported decisions
  • Build a community that people want to be part of

23

Photos: unsplash

24 of 141

MLCommons Update

24

25 of 141

MLCommons is Growing our Staff

  • Director of Marketing
  • Product Manager
  • Systems administrator
  • Lead for MedPerf
  • Tech writers
  • New software engineering firm and mobile engineering
  • Tech lead for autonomous driving

Welcome aboard - excited for your contributions!

25

26 of 141

Q1 Accomplishments

26

27 of 141

MLPerf™ Inference v3.0 Results Overview

  • Results: MLPerf Inference v3.0 Results (Embargoed until 4/5/23 @ 10am Pacific)
    • Over 6,700 performance results
    • >2,400 power measurement results

  • Performance: Alibaba, ASUSTeK, Azure, cTuning, Deci, Dell, GIGABYTE, H3C, HPE, Inspur, Intel, Krai, Lenovo, Moffett, Nettrix, Neuchips, Neural Magic, NVIDIA, Qualcomm, Quanta Cloud Technology, rebellions, SiMa, Supermicro, VMware, xFusion

  • Power: Alibaba, cTuning, Dell, HPE, KRAI, Lenovo, NEUCHIPS, NVIDIA, Qualcomm, SiMa

  • Inference over Network: HPE, NVIDIA, Qualcomm

  • New submitters in bold

27

28 of 141

MLPerf Inference Trends

  • Lots of new hardware systems
  • Increasing performance in the datacenter
    • Over 30% in some benchmarks since MLPerf Inference v2.1

  • Increasing emphasis on power efficiency
    • 50% increase in number of submitters measuring power efficiency

  • More interest in Inference over the network (3X increase)

  • Open data center - nearly 3X more submissions
    • Wide variety of techniques: distillation, sparsification, new models

  • New MLPerf Mobile app available for Android and iOS, contact for access

28

29 of 141

Press Coverage

  • 60+ Stories
  • Good mix of coverage across Tech Press, Broad media (Forbes)
  • Local media pickup of press release
  • Full spreadsheet of articles here

“One of the best ways the AI/ML industry has today for measuring performance is with the MLPerf set of testing benchmarks, which have been developed by the multi-stakeholder MLCommons organization.”

Venture Beat

“This round featured even greater participation across the community with a record-breaking 25 submitting organizations, over 6,700 performance results, and more than 2,400 performance and power efficiency measurements.”

Yahoo! Finance

Peter Rutten, VP infrastructure systems, IDC, said: “[MLPerf 3.0] is especially helpful because of the huge differences between all the systems in terms of performance and power consumption [and] the software that each system deploys to optimize the performance. Having the ability to compare all these systems in an objective way that is supported by most of the AI industry is allowing us to see how vendors compare.”

Enterprise AI

30 of 141

1Q23 MLCommons Hero Awards

30

Pablo Gonzalez Mesa:

Heroically landing MLPerf Inference despite many challenges and being awesome

Lilith Bat-Leah: Amazing volunteer spirit, building the DataPerf webpage, outreach, ICML workshop, and tireless organization

Oana Balmau:

Superb leadership, dedication, and enthusiasm for MLPerf Storage

31 of 141

Kelly Berschauer (Marketing)

ROLE: Director of Marketing

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Marketing Director with 20+ years of experience across Microsoft, Meta, and Truveta, a healthcare startup
  • Spent 6 years doing marketing for Microsoft Research and created the Facebook Research brand
  • Have lived and worked in Seattle, London, and Silicon Valley
  • In my spare time you can find me traveling, kayaking on Lake Union, or gardening on Whidbey Island
  • Trivia?: In the early ’80s I occasionally hung out with a now-deceased grunge icon in Grays Harbor County, WA. Who was it?

31

32 of 141

Nathan Wasson (IT)

ROLE: MLCommons Systems Administrator, Auditor, & Video Editor

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Writer and technical services provider; started producing and editing written, audio, and video content at The Tech Report but has moved on since its demise
  • Published in The Tech Report, HotHardware, and RETURN
  • Has troubleshot technical issues from clients’ houses to the offices of the U.S. Congress
  • Internet privacy and security enthusiast
  • Takes weekend respites from the digital world to hit cars with wrenches

32

33 of 141

David Tafur (Product Management)

ROLE: Product Manager

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • 6+ years in Product Management & Business Strategy across corporations and startups (Banking, Cosmetics, B2B)
  • Product Consultant for US & South American companies
  • Global citizen: Born in Peru, lived in USA, Australia, Costa Rica, and Brazil
  • Enthusiast of traveling, surfing, swimming, and ukulele

33

34 of 141

Sally Doherty (Board of Directors)

ROLE: Board Member & Finance Committee Chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Chief Marketing Officer at AI compute company Graphcore, responsible for GTM, performance marketing, product marketing, communications, and brand
  • 30+ years of experience in technology marketing, including roles over the years at Nvidia, Sony Computer Entertainment, and start-ups like the cellular communications firm Icera
  • When I’m not at work or walking my Irish Setter, you’ll find me cooking, learning about gardening, eating out, or at the theatre

34

35 of 141

Weiming Zhao (Board of Directors)

ROLE: Board Member

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • As an architect at Alibaba, my primary responsibility is to oversee the build of AI infrastructure. This includes evaluating and enabling cutting-edge AI hardware and optimizing software to ensure maximum performance and efficiency.
  • 10+ years of experience in system software, including virtualization and compiler optimization.
  • Hobbies include reading, playing badminton, and jogging

35

36 of 141

Kurt Bollacker (Datasets)

ROLE: Datasets WG Chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Digital Research Director at the Long Now Foundation
  • Areas of Research: machine learning, search engines, graph databases, digital archiving, and electro-cardiographic simulation
  • Public Datasets I’ve created or helped start: CiteseerX, Internet Archive, Rosetta Project, Freebase (Google KG), Sleep and Dream Database
  • I bake cakes most every Friday evening. I try to never repeat a recipe.

36

37 of 141

Andreas Prodromou (HPC)

ROLE: HPC WG Co-chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Senior Deep Learning Scientist at NVIDIA
  • Works in NVIDIA’s DL engineering team, focusing on DL accelerator architectures.
  • Highlights:
    • Involved with HPC WG as a participant/contributor for two years.
    • In-depth familiarity with a wide range of DL models, accelerators, and frameworks. Hands-on experience deploying state-of-the-art AI models.
  • Committed to perpetual self-growth, via a semi-random exploration of new interests. Ask for more info if interested.

37

38 of 141

Juri Papay (Science)

ROLE: Science WG Co-chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Senior Data Scientist at STFC-RAL
  • Since my student years I have been interested in measuring the performance of parallel computers.
  • Worked on over twenty EU and UK funded projects, covering a wide range of topics such as benchmarking of HPC, security modelling and semantic research.
  • At STFC-RAL I work on my favourite topic of benchmarking machine learning applications and investigating the performance of large scale GPU systems.

38

39 of 141

Ritika Borkar (Training)

ROLE: Training WG Co-chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Senior Deep Learning Architect at NVIDIA
  • Work in Compute Architecture Team with focus on HW & SW optimizations for High Performance AI Computing on GPUs and datacenter systems
  • Also serve as Board Member at MLCommons
  • Been involved with MLPerf Training for more than 3 years
  • Can’t get enough of the Pacific Northwest. Hit me up if you are ever in the Portland area and need recommendations for good food or hikes!

39

40 of 141

Max Bartolo (Dynabench)

ROLE: Dynabench WG Co-chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Tech Lead for Command models at Cohere and Adjunct Teaching Fellow at University College London (UCL)
  • Works on improving the robustness and overall capabilities of conversational instruction-following large language models
  • Previously spent time at Satalia, Bloomsbury AI, Meta AI and DeepMind
  • One of the original Dynabench contributors and creator of the ShARC and AdversarialQA datasets
  • Enjoys football, martial arts, tennis, hiking, diving & more
  • Trivia?: I appear (for a few seconds) in Game of Thrones. Which episode?

40

41 of 141

Wei Zhao (Mobile)

ROLE: Mobile WG Co-chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Director of Technical Marketing at Zeku
  • Works on technology planning, business development and ecosystem partnerships
  • Ph.D. in ECE from the University of Maryland, College Park
  • 10+ years of industry experience in mobile and AI
  • Held system engineering and product management roles at corporations and startups
  • Enjoy soccer, movies and traveling in spare time

41

42 of 141

Mostafa El-Khamy (Mobile)

ROLE: Mobile WG Co-chair

BACKGROUND AND A LITTLE BIT ABOUT ME:

  • Sr. Principal Engineer at Samsung Device Solutions Research America
  • Leads R&D for AI multimedia systems and AI Benchmarking @Samsung
  • Ph.D. in EE from the California Institute of Technology (Caltech), MS and BSc in EE from Alexandria University, MBA from Edinburgh Business School.
  • Previously worked at Qualcomm CRD and was a faculty member at Alexandria University and Egypt-Japan University of Science and Technology
  • Erdős number is 2 (Paul Erdős -> Robert McEliece -> M. El-Khamy)
  • Enjoys being on the water (fishing/sailing/diving)

42

43 of 141

43

It would not be possible without our members

Founding Members

Academics from educational institutions including:

Harvard University

Polytechnique Montreal

Peng Cheng Laboratory

Stanford University

University of California, Berkeley

University of Toronto

University of Tübingen

University of Virginia

University of York, United Kingdom

Yonsei University

York University, Canada

Members

44 of 141

Break

44

45 of 141

Schedule

45

9:00 AM

Breakfast

9:30 AM

VMware Welcome: Sujata Banerjee

9:45 AM

MLC Welcome: Peter Mattson

10:00 AM

MLC Update: David Kanter

10:30 AM

Break

10:50 AM

Working Group Update

12:20 PM

Lunch

1:20 PM

Power WG Showcase

1:35 PM

DataPerf WG Showcase

1:50 PM

Group discussions (in person only)

I Benchmark value to enterprise customers: getting involved / moderator: Debojyoti Dutta

II Datasets for model quality benchmarking - e.g. which is the best LLM? / moderator: Kurt Bollacker

III MLCommons research: how do we deliver value for researchers? / moderator: Vijay Janapa Reddi

3:20 PM

Cake Break

3:45 PM

Social hour

4:45 PM

End

46 of 141

Working Group Updates

46

47 of 141

Working Group Updates

  • Goal is a brief overview of each working group
  • 5 minutes per working group (WG), slides follow a fairly fixed format
  • Not a lot of time for audio questions
    • Follow up with WG chairs via email, chat, or in-person

  • MLCommons WG Roadmaps and WG OKRs offer a snapshot
    • Feedback on them welcome, both format and contents

47

48 of 141

Mobile

48

EXAMPLE

49 of 141

Mobile Group

WG Purpose:

  • Develop a performance-accuracy benchmark suite for consumer mobile devices (phones & laptops) with different AI stacks

Goal:

  • Allow the general public to examine the AI performance of their devices through the MLPerf Mobile benchmark app

49

EXAMPLE

50 of 141

Updates from Last Quarter (change title!)

  • REMEMBER YOU HAVE 5 MINUTES TOTAL, BE BRIEF
  • Goal is to share with the community what is going on, and get people interested, or able to help

  • David will talk about major things like benchmark releases in the first part of the presentation (most likely)

  • Give us a few bullet points about what you accomplished in the last quarter

  • Keep to

50

EXAMPLE

51 of 141

What’s Next (1Q and 4Q)? (Change title!)

  • REMEMBER YOU HAVE 5 MINUTES TOTAL, BE BRIEF
  • Tell us what is happening in the next 3-12 months

  • You can use the timeline in the next slide if it helps

  • What are your top 2-3 challenges?
    • E.g., need owner for a reference model?

  • What are your top community asks?
    • What can MLC do to help your WG?

51

EXAMPLE

52 of 141

What’s ahead

52

Feb 2022 v2.0

New features and support expansion

  • New segmentation model

Aug 2022 v2.1

Cross Platform Enablement

  • Porting Android implementation to flutter
  • Core ML support

Mar 2023 v3.0

New features and Cross platform support

  • New SR model
  • Windows Flutter support

Aug 2023 v3.1

Increase adoption

  • Internal launch of the score collection website
  • Have the MLPerf app use a default runtime on all mobile devices

EXAMPLE

53 of 141

Mobile

53

54 of 141

Mobile Group

WG Purpose:

  • Develop a performance-accuracy benchmark suite for consumer mobile devices (phones & laptops) with different AI stacks

Goal:

  • Allow the general public to examine the AI performance of their devices through the MLPerf Mobile benchmark app

54

55 of 141

Updates since last Community Event

  • MLPerf Mobile v3.0 submission
    • New Super Resolution Model from Seoul National University
      • Add diversity (newer and larger model)
      • Finding a usable dataset was challenging
    • Submission for Android & Windows platforms

  • MLPerf v3.0 app now available for download
    • Including the new SR model
    • Adding support for MediaTek Dimensity 9200 and Qualcomm Snapdragon 8 Gen 2 & 7+ Gen 2

55

56 of 141

Upcoming Features

  • CI/CD pipeline
    • Helps to expedite the app development process

  • New models for v3.1/v4.0
    • Replacing older models such as MobileNetEdgeTPU

  • Default runtime
    • Ensure the legacy/low-tier platforms not supported by the vendors will still generate benchmark scores - important for press

  • Data collection
    • Collect user-generated scores
      • Great reference data & website traffic generator

56

57 of 141

What’s ahead

57

Aug 2022 v2.1

Cross Platform Enablement

  • Porting Android implementation to Flutter
  • Core ML support

Mar 2023 v3.0

New Features and Cross Platform Support

  • New SR model
  • Expand Windows coverage

Aug 2023 v3.1

Working Towards Default Runtime and Data Collection

  • Model updates
  • First Windows benchmark app
  • Internal launch of the score collection website

Mar 2024 v4.0

Increase Benchmark Coverage and Adoption

  • Official launch of the score collection website
  • Default runtime available for all mobile devices

58 of 141

Autonomous Driving Benchmark

58

59 of 141

Autonomous Driving Benchmark Group

Purpose:

  • Develop a benchmark for a representative automotive task for both training and inference.

Goal:

  • Add a training/inference multimodal 3D object detection benchmark.

59

60 of 141

Updates

  • Dataset, accuracy metrics, and high-level model are settled:
    • Waymo Open Dataset
    • Average Precision Heading metric
    • PointPainting model
  • Working implementation of PointPainting with samples of the dataset.
    • Need help with compute resources for the full dataset.
  • Longer term:
    • Find a co-chair; we are still looking for those interested.
    • Settle details of the model.
    • Train on the full dataset.
    • Determine the inference scenario.
    • Targeting MLPerf Training 4.0 and Inference 4.1.

60

61 of 141

Automotive Benchmarking Task Force

61

62 of 141

Automotive Benchmarking Task Force

Background:

  • MLCommons and AVCC (Autonomous Vehicle Compute Consortium) are collaborating on creating an ML benchmark suite for automotive
    • MLCommons knows ML
    • AVCC knows automotive

Purpose:

  • Solve the current non-alignment in ML compute performance measurement in the automotive supply chain

Goal:

  • Define and develop an automotive industry standard ML benchmark suite to be used in RFIs/RFQs

62

63 of 141

Updates

  • Successful kickoff at the end of February
    • 30+ attendees
  • Joint MLC and AVCC MoU close to being finalized
    • Will be publicly announced
  • Proposed timeline

  • Gathering requirements which will define the specification
    • The “MVP Demo” will implement a subset of the specification

63

64 of 141

Algorithms

64

65 of 141

Algorithms Working Group

WG Purpose:

  • Create a set of rigorous and relevant benchmarks to measure neural network training speedups due to algorithmic improvements.

Specific Goals:

  • AlgoPerf Training Algorithm Benchmark
  • [Future] AlgoPerf Model Benchmark
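
To make the “speedup from algorithmic improvements” idea concrete, below is a minimal, hypothetical sketch of time-to-target measurement: run a submission’s training algorithm on a fixed workload and score the wall-clock time needed to reach a preset validation target. The function names and target value are illustrative; the actual AlgoPerf rules define workloads, tuning, and timing in far more detail.

```python
import time

def time_to_target(train_step, evaluate, target_metric, max_steps=100_000):
    """Illustrative only: run a training algorithm until a validation target is hit.

    train_step -- callable performing one optimizer update (submission-defined)
    evaluate   -- callable returning the current validation metric
    Returns wall-clock seconds to reach target_metric, or None if never reached.
    """
    start = time.monotonic()
    for step in range(1, max_steps + 1):
        train_step()
        if step % 100 == 0 and evaluate() >= target_metric:
            return time.monotonic() - start
    return None
```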

65

66 of 141

Updates from Last Quarter

  • Putting the finishing touches on our codebase.
    • Bugfixing
    • Implementing workload variants
    • Implementing baseline submissions
  • Wrote a draft for the paper introducing the rules of the AlgoPerf Training Algorithms Benchmark.

66

67 of 141

What’s Next?

Short Term:

  • Publish paper introducing the rules for the AlgoPerf Training Algorithms Benchmark.
  • Publish a Call for Submission for the Benchmark.
    • Blog posts
    • Social media posts
    • Provide support for potential submitters

Long Term:

  • Publish results of the AlgoPerf Training Algorithms Benchmark.
  • Plan next iteration of this benchmark.
  • Build the AlgoPerf Model Benchmark.

67

68 of 141

Best Practices

68

69 of 141

Best Practices Working Group

Purpose:

  • Improve portability and reproducibility of ML projects, workloads and benchmarks. The initial starting point is the MLCube™ project that provides specifications and reference implementations to achieve this.

Goal:

  • Develop specifications for packaging ML projects, workloads and benchmarks as OCI (Open Container Initiative) containers.
  • Develop MLCube ecosystem of tools (reference runners for diverse environments, project templates for bootstrapping new MLCubes for various languages, example MLCubes).
  • Package MLPerf benchmarks, support MLCommons competitions, promote this technology in industry and academia.

69

70 of 141

Updates from Last Quarter

  • New version of MLCube: 0.0.9.
  • Training reference models MLCubed (RetinaNet, BERT).
  • Documentation updates
  • Optimizing test environments and project dependencies
    • New environment (OS and Python version) for test workflows.
    • Removing redundant dependencies.
    • Splitting dependencies into production, test, and development dependencies.
    • Upgrading project dependencies to newer versions.
  • Support for `~` and environment variables (e.g., `HOME`) in task parameters (see the sketch below).
  • New CLI arguments for the Docker and Singularity runners (--network, --security, --gpus, --memory, --cpu).
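
As a rough illustration of the new `~`/environment-variable support mentioned above, the expansion behaves like ordinary path normalization; this is a hypothetical sketch in Python, not MLCube’s own code.

```python
import os

def expand_task_parameter(value: str) -> str:
    """Expand '~' and environment variables (e.g. $HOME, ${HOME}) in a task
    parameter value. Behavioral sketch only, not the MLCube implementation."""
    return os.path.expanduser(os.path.expandvars(value))

# Both forms resolve to the same workspace directory for the current user.
print(expand_task_parameter("~/mlcube/workspace"))
print(expand_task_parameter("$HOME/mlcube/workspace"))
```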

70

71 of 141

What’s Next?

Promote MLCube

  • Finalize a paper that surveys approaches enabling portability and reproducibility of ML projects, and that introduces and positions MLCube.
  • Develop MLCube tutorials, meet with select partners in industry and academia.

Support MLPerf benchmarks and MLCommons competitions:

  • Package multiple MLPerf reference training benchmarks.
  • Support MedPerf and DataPerf competitions.

New features:

  • Self-contained MLCube containers.
  • Improved user experience.
  • Python API.

71

72 of 141

Medical

72

https://mlcommons.org/en/groups/research-medical/

73 of 141

Medical Working Group

WG Purpose:

  • Develop and support Medical AI benchmarks in global real-world clinical settings

Goals:

  • Develop MedPerf to support access to federated datasets for secure and privacy-preserving workload execution
  • Develop GaNDLF to support zero/low-code ML workloads
  • Research and develop benchmarks with clinical impact (e.g., bias, health equity, etc.)

73

74 of 141

Updates

  • MedPerf paper accepted for publication at Nature Machine Intelligence
    • 65 co-authors
    • 20+ companies, 20+ academic institutions and 10 hospitals across 13 countries
    • Big thanks to Dana Farber, Intel, Nutanix (Debo Dutta), UPenn and, of course, MLCommons community
  • GaNDLF paper accepted for publication at Nature Communications Engineering
    • 16+ research groups both industry and academia
  • Developed a MedPerf <-> Synapse interface: a) orchestration and b) a private Docker registry and compute (BraTS/FeTS challenge)
  • Better user experience (new commands):
    • To view server assets (#369)
    • To create MedPerf-compatible MLCube templates (#396)
    • To prepare MLCubes for submission to the server (#413)
  • New features:
    • Support offline execution (#400)
    • Flexible file hosting requirements: Support private file hosting on the Synapse platform (#378)
  • Enhanced documentation: New hands-on tutorials (#370, #385)

74

75 of 141

What’s next?

  • Support BraTS/FeTS 2023 (70+ hospitals)
  • Provide an interface to Federated Training frameworks for training models
  • Data Layer of MedPerf:
    • XNAT integration support for better data ingestion
    • More flexible and controlled Data preparation MLCube tasks
  • GaNDLF improvements:
    • Model differential privacy training
    • Support for multiple medical data types (beyond radiology and pathology data)
    • Support for generative models (GANs, diffusion models)
  • Challenges
    • Identify sustainability model
    • Research Grants (NIH, NSF)
        • Subcontract or partner in consortia (hard to identify them, need strategy)
    • Foundations (specific disease)
    • Support Healthcare Stakeholders (e.g., clinical validation of AI )

75

76 of 141

Tiny

76

77 of 141

Tiny Overview

Tiny Working Group

What we are

  • A benchmark suite for ultra-low-power ML systems (TinyML)
  • On-device real-time batch-of-one inference.
  • Measure energy/inference and latency on 4 different models

77

Typical Systems

  • MCUs, some accelerators
  • 10s-100s MHz
  • ≲ MB Flash, SRAM
  • ~mW power
  • Lightweight models (<1M parameters)

Current Benchmarks

Task                 | Model          | Parameters
Keyword Spotting     | DS-CNN         | 52k
Visual Wake Words    | MobileNet v1   | 325k
Anomaly Detection    | FC Autoencoder | 270k
Image Classification | ResNet8        | 96k

78 of 141

Updates

Tiny Working Group

  • Latest Round
    • Published November 9
    • Submitting Organizations: 8
      • Including 3 new submitters
    • Systems Submitted: 17
      • 11 w/ Energy
    • Good variety of hardware represented: Arm, RISC-V, FPGA, custom accelerators, and combinations

78

79 of 141

What’s Next

Tiny Working Group

  • New benchmarks in the works:
    • Streaming audio benchmark: LSTM-based denoiser
      • Sustained inference on a continuous time-series
      • Exercise rapid duty-cycling for energy efficiency
      • Add RNN to the benchmark suite
    • Others under consideration

  • Next Submission Round v1.1
    • May 19 submission / June 21 publication

  • Join us! Mondays at 12:05 ET during winter/spring 2023

(Will revert to normal time of 12:05 ET in June)

79

80 of 141

Datasets

80

Kurt Bollacker

2023 April 20

81 of 141

Datasets Working Group

WG Purpose:

  • Create new datasets to fuel innovation in machine learning

Specific Goals:

  • Create impactful datasets without licensing encumbrances.
  • Host datasets with affordable access for everyone
  • Create tooling to scale the creation of new datasets and improvement of existing ones

81

82 of 141

Recent Release

Speech Wikimedia (March 2023)

    • Compilation of 1,500 hours of multilingual audio files (with transcriptions) extracted from Wikimedia Commons (CC and PD). A larger, unsupervised dataset is in process.

Last Quarter

People’s Speech Update

    • 30K hours of aligned, diverse speech. V1.1 brings faster downloads, higher quality, and better docs (a tutorial!)

Dollar Street Dataset

    • Household item image dataset that contains novel geographic and income information from underrepresented parts of the world.

82

83 of 141

Challenges in Dataset Creation

83

  • Stable versioning vs freshness (datasets go stale and become less relevant)
  • Metadata management is hard and inconsistent (tags? new item labels?)
  • Scaling is a problem (slow/expensive to move large data or have many collaborators)
  • Licensing and Governance are hard
  • Communities need to grow around and nurture datasets

A new project: A Dataset Service for collaboration

84 of 141

Dataset Service: What will it look like?

84

First focus on the infrastructure to build a “Git for Data” service that supports:

  • Scalability to large datasets and many distributed collaborators.
  • Support for structured metadata at dataset, item, and subset granularities
  • Fast versioning and forking of datasets and subsets.
  • Easy distributed, collaborative contribution.
  • Fast discovery and sharing of data (sub)sets.
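
One way to picture the versioning and forking goals above: content-address each file and derive a commit id from the manifest of hashes, so identical data deduplicates and forks are cheap. This is a hypothetical sketch of the concept, not the design of the actual Dataset Service.

```python
import hashlib
import json
import os

def file_digest(path):
    """Content hash of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def commit_dataset(data_dir, metadata=None):
    """Return a content-addressed 'commit' for a dataset directory.

    The commit id is the hash of a manifest mapping file names to content
    hashes, so an unchanged dataset always yields the same id, and a fork
    that shares files can reuse those hashes instead of copying data.
    Conceptual sketch only.
    """
    manifest = {
        name: file_digest(os.path.join(data_dir, name))
        for name in sorted(os.listdir(data_dir))
    }
    payload = json.dumps({"manifest": manifest, "metadata": metadata or {}},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest(), manifest
```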

85 of 141

Join The Datasets Working Group!

https://mlcommons.org/en/groups/datasets/

Google group link: Datasets Google Group

  • Join the group to:
    • 1. Be invited to the weekly meetings (Thursdays 11 AM-12 PM Pacific Time)
    • 2. Receive emails from the email list

  • Interested in helping? Contact the WG chair, Kurt Bollacker

85

86 of 141

Inference

86

87 of 141

Inference Working Group

WG Purpose:

Develop an Inference performance benchmark suite for measuring how fast systems can run models in a variety of deployment scenarios.

Goal:

  • Choose representative workloads for benchmarking and identify scenarios for realistic evaluation.
  • More submissions from a wide range of industry participants

Join Inference WG https://groups.google.com/u/4/a/mlcommons.org/g/inference

Details on Inference benchmarks: https://github.com/mlcommons/inference

87

88 of 141

Updates from Last Quarter

  • V3.0 Inference submission in March 2023
    • Submission on March 3rd and results publication on April 5th
    • 25 submitters (Alibaba, ASUSTeK, Azure, cTuning, Deci, Dell, GIGABYTE, H3C, HPE, Inspur, Intel, Krai, Lenovo, Moffett, Nettrix, Neuchips, Neural Magic, NVIDIA, Qualcomm, Quanta Cloud Technology, rebellions, SiMa, Supermicro, VMware, xFusion)
    • > 6,700 performance results (1.26x increase from the last submission)
    • > 2,400 power results
  • New Inference benchmark task forces
    • DLRM v2 task force
    • LLM task force
  • Open-source CK-MLPerf automation platform to provide a unified CLI and GUI to run MLPerf inference benchmarks on any hardware, visualize and reproduce results, and organize public optimization competitions

88

89 of 141

What’s Next

  • V3.1 submission timelines
    • New model freeze: Apr 28; Code freeze: June 2
    • Submission: Aug 4; Results publication: Aug 30
  • V3.1 new benchmarks
    • DLRM v2
    • LLM (175B, 6B)
  • Inference benchmarks in the pipeline for 2024
    • Stable Diffusion, GNN, Autonomous driving benchmark
  • Top 2-3 challenges
    • Discussion on the benchmark carrying capacity
    • What benchmarks to retire in order to add new benchmarks
  • What are your top community asks?
    • Anyone interested in being Inference WG co-chair, please contact David K or Inference chairs
    • Benchmark ownership
    • Active participation in WG and make submissions

89

90 of 141

Automation and Reproducibility Task Force: Collective Knowledge Playground

90

91 of 141

access.cKnowledge.org

A free, open-source, technology-agnostic and on-prem automation platform for collaborative and reproducible MLPerf inference benchmarking, optimization and comparison across any software, hardware, models and data sets from any vendor: https://github.com/mlcommons/ck/tree/master/platform

Simple GUI to analyze, compare and reproduce MLPerf v3.0, 2.1 and 2.0 results with any derived metric such as Performance/Watt or Performance/$: https://github.com/mlcommons/cm_inference_results
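
For context, a derived metric such as Performance/Watt is simply a submission’s reported throughput divided by its measured average system power. The sketch below shows the arithmetic on made-up numbers; the CK playground computes it from the published result files.

```python
# Hypothetical result rows: (system, samples_per_second, average_system_power_watts)
results = [
    ("system-a", 12000.0, 350.0),
    ("system-b", 20000.0, 700.0),
]

for system, throughput, power_w in results:
    perf_per_watt = throughput / power_w  # samples per second per watt
    print(f"{system}: {perf_per_watt:.1f} samples/s/W")
```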

92 of 141

We thank Neural Magic (Michael Goin), Pablo Gonzalez Mesa, students (Himanshu Dutta, Aditya Kumar Shaw, Sachin Mudaliyar, Thomas Zhu) and other great contributors for helping us validate the MLCommons CK technology (including CM aka CK2, the new version of our portable workflow framework) to unify, automate, and reproduce MLPerf inference submissions:

  • 80% of all results and 98% of power results
  • Diverse CPUs, GPUs and DSPs with PyTorch, ONNX, QAIC, TF/TFLite, TVM and TensorRT
  • Hardware from Nvidia (including 4090 workstation and Jetson AGX Orin edge device), Qualcomm, AMD, Intel and Apple
  • Deep Sparse optimization from Neural Magic and models from the Hugging Face Zoo
  • Cloud submissions on AWS and GCP
  • 1st end-to-end student submissions including on Apple Metal

cKnowledge.org/mlperf-inf-v3.0-forbes

cKnowledge.org/mlperf-inf-v3.0-report

Our 1st MLPerf inf v3.0 community submission

93 of 141

cKnowledge.org/challenges

Contact Grigori and Arjun (automation and reproducibility task force co-chairs) and/or join our Discord server to learn about how to participate in the upcoming 1st reproducible optimization tournament for MLPerf inference v3.1 and suggest your own challenges: discord.gg/JjWNWXKxwT

We will continue working with all MLCommons members and researchers to adapt MLCommons CK/CM to their needs, reduce their benchmarking and optimization costs, and improve MLPerf/MLCommons value:

    • Integrate their software and inference engines into portable CK-MLPerf workflows
    • Improve CK platform to automate their MLPerf experiments and optimization
    • Automatically generate containers for MLPerf benchmarks with CK/CM workflows and unified CLI

Based on your feedback, we plan to enhance the CK playground to generate Pareto-efficient end-to-end AI and ML-based applications using MLPerf results, CK technology and modular CK/CM containers - a prototype is available and will be integrated with the CK playground by Q3 2023!

93

Next: join the 1st public optimization tournament for MLPerf inference v3.1!

94 of 141

Training

94

95 of 141

Training Group

WG Purpose:

  • Define, develop and conduct MLPerf Training benchmarks

Goal:

  • Benchmark training performance of key ML workloads on a variety of platforms and thereby enable HW and SW innovation that speeds up ML

95

96 of 141

Updates from Last Quarter

  • Reference benchmark code complete for 2 new benchmarks:
    • LLM: GPT-3 175B model on the C4 dataset (available in 2 frameworks - PyTorch/Megatron-LM, PAXML)
    • DLRM: DCNv2 with the synthetically generated multi-hot Criteo dataset

  • MLCube integration underway for 2 benchmarks (BERT, Retinanet)

  • Registration survey for participation in Training v3.0 out!

  • 4 task force kick-offs to develop new benchmarks/methodologies:
    • Txt2Image: Stable Diffusion on LAION aesthetics dataset
    • GNN: R-GAT on IGB dataset
    • Automotive Training: PointPillars on Waymo Dataset
    • Power Methodology

96

97 of 141

What’s Next?

  • Training v3.0 submission deadline is May 19, 2023
  • Training v3.0 results publication on June 28, 2023

  • Training v3.1 will work towards landing:
    • Stable Diffusion benchmark
    • Power

  • Finalize benchmark roadmap for 2024:
    • GNN, Automotive are candidates for addition
    • Potentially drop some old benchmarks - discussion on-going

  • Continue reference clean-up & MLCube integration

  • To join Training, use the Google Group link: training@mlcommons.org

97

98 of 141

HPC

98

Chairs

  • Murali Emani, Argonne National Lab <memani@anl.gov>
  • Steven Farrell, Lawrence Berkeley National Lab <sfarrell@lbl.gov> → OUTGOING
  • Andreas Prodromou, NVIDIA <aprodromou@nvidia.com> → NEW

Get involved

  • Join the HPC group: https://mlcommons.org/en/groups/training-hpc/
  • Meetings: Mondays, weekly alternating between 8-9AM PT and 3-4PM PT.
  • Reach out to the chairs

99 of 141

HPC WG overview

Purpose:

  • ML performance benchmarking on supercomputer systems
  • We publish the MLPerf HPC benchmark suite
    • SciML applications relevant for HPC systems
    • Modeled after MLPerf Training with a few adjustments
    • Measure time-to-train and throughput (models/min)
  • We participate in BoFs, tutorial submissions, etc.

Goals:

  • Add more benchmarks, keep things fresh and relevant
  • Add more metrics relevant to HPC+science (e.g. power)
  • Increase interest and participation

99

Top500 supercomputers November 2022

100 of 141

Updates

  • Working towards v3.0 submissions due in October, with results announced at SC23 in November
  • HPC rules proposals to improve popularity and increase participation
  • Github link (https://github.com/mlcommons/training_policies/issues/513)
  • Overview of proposals
    1. Exclude data movement in timing measurement, focus on compute performance
    2. Allow throughput extrapolation to large system size
    3. Rename "weak scaling" to "throughput" and "strong scaling" to "Time To Train" (TTT)
  • Outcomes: arrived at group consensus to accept proposals 1 and 3 and reject proposal 2
  • Adding new protein folding (OpenFold) benchmark
  • CosmoFlow PyTorch reference implementation
  • Working with Power task force

100

101 of 141

Up next

  • Outreach
    • ISC BoF session
    • Reaching out to various HPC facilities and vendors to increase participation
  • Tutorial
    • Planning for a tutorial session to cover topics of DL optimization at scale on supercomputers using the HPC benchmarks
    • Presentations from facilities+vendors
    • Hands-on session on a leadership supercomputer such as Perlmutter (NERSC)
  • Adding Power measurement, targeting upcoming v3.0
  • Finalize OpenFold benchmark
  • MLPerf HPC v3.0
    • Benchmark freeze June 12
    • Submission deadline Oct 6

101

102 of 141

Storage Working Group

102

103 of 141

Purpose and Goals

WG Purpose:

  • Develop a benchmark suite that applies the same workload to a storage system as running MLPerf Training would, for different AI stacks and task types, so AI/ML teams can accurately size the storage required to support their overall AI/ML goals

Subgoals:

  • Simulate the load without “accelerator” hardware or real data (see the sketch below)
    • No GPUs required, no real-world data, no actual training
  • Simulate the load from different task types
    • Unet-3D, DLRM, and NLP, for PyTorch and TensorFlow
  • Correlated w/Training: accelerator & storage perf together
  • Enable AI teams to compare storage vendors
  • Enable storage vendors to innovate & optimize for AI teams
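
A minimal sketch of the idea referenced above: read training batches from storage at the rate a busy accelerator would consume them, with a sleep standing in for the accelerator’s compute time. The directory layout, batch size, and emulated step time are made-up parameters; the real benchmark models the MLPerf Training workloads far more faithfully.

```python
import os
import time

def emulate_training_io(data_dir, batch_files=4, step_time_s=0.05, steps=100):
    """Impose a training-like read load on storage without any accelerator.

    Each step reads a batch of files (the storage load), then sleeps for the
    emulated accelerator compute time. Hypothetical parameters throughout.
    """
    files = sorted(os.path.join(data_dir, name) for name in os.listdir(data_dir))
    start = time.monotonic()
    for step in range(steps):
        offset = (step * batch_files) % len(files)
        for path in files[offset:offset + batch_files]:
            with open(path, "rb") as fh:
                fh.read()              # storage read; contents are discarded
        time.sleep(step_time_s)        # stand-in for accelerator compute
    elapsed = time.monotonic() - start
    print(f"sustained {steps * batch_files / elapsed:.1f} files/sec")
```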

103

PMLDB

DAWNBench

104 of 141

Beta Released, GA is Next!

Released two Betas and incorporated feedback:

  • This is all new so we’re using Beta releases to validate the benchmark and processes before we go GA
  • Beta 1 was released to the WG on February 6th
    • Accurately modelling reading of the dataset
  • Beta 2 was released to the WG and friendly partners April 7th
    • Added writing of periodic model checkpoints

General availability and a formal submission window opening:

  • We expect GA release of the benchmark in 3 to 4 weeks
    • The usual: open window, WG review, publish results
  • Expecting many vendors to participate at launch
    • Intel, NetApp, Samsung, Nutanix, Weka, Micron, NVIDIA
    • Still lining up more vendors

104

105 of 141

Short Term Next Steps

Accept and process submissions:

  • We have lots of work to smooth out the submission process
  • Lots of lessons will be learned from the submissions
    • Benchmark fixes and rules tightening

Add support for multi-host training to benchmark:

  • Crawl, walk, run – now crawling, will walk after the 1st submissions
  • Next “feature” to be added is simulating distributed training (see the sketch below)
    • For a single dataset, imposing a coordinated storage load across multiple hosts (distributing batches and using a single MPI barrier between iterations for weight exchange)
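
A rough sketch of what the coordinated multi-host load could look like with mpi4py: every rank reads its shard of the global batch, then all ranks wait at a single barrier that stands in for the weight exchange between iterations. Hypothetical code under stated assumptions, not the benchmark itself.

```python
# Launch with an MPI runner, e.g.: mpirun -np 4 python multihost_io_sketch.py
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

for iteration in range(100):
    # Each host would read only its share of the global batch here
    # (a hypothetical read_my_shard(iteration, rank, size) call).
    time.sleep(0.05)   # stand-in for per-host storage read + compute time
    comm.Barrier()     # single barrier emulating the weight exchange

if rank == 0:
    print(f"completed 100 coordinated iterations across {size} hosts")
```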

105

106 of 141

Long Term + Issues and Asks

Long Term:

  • O(solutions) == O(practitioners)
    • Need all Training workloads on PyTorch AND TensorFlow?

Issues and Asks:

  • Data cleaning is 50% of Watts consumed, impact on storage?
    • Need consensus on some form of analytical framework to represent cleaning; it’s too variable today to build a benchmark around, yet it is at least half the workload and an entirely different access pattern

106

[Diagram: data cleaning & pre-processing feeding into training]

107 of 141

Benchmark Infra

107

108 of 141

Benchmark Infra Group

WG Purpose:

  • Develop infrastructure and tooling to support benchmark submissions and facilitate compliance

Goals:

  • Develop and operate MLPerf submission service, and automate submission process
    • Now supporting Training, Inference, HPC; more in the works
  • Maintain benchmark compliance tools
    • logging lib, compliance checker, result summarizer, etc.

108

109 of 141

Updates

  • Submission service: completed initial migration to AppEngine
    • Moved to a new infrastructure
    • Improved scalability, availability and security
  • Inference 3.0 submission support
    • Fully supported with the new infrastructure
  • Training 3.0 logging, compliance, and submission support
    • Work in progress

109

110 of 141

What’s Next

  • Auth system for benchmark submission service
    • Update submitter authentication/authorization process
    • Helps improve security and submitter experience
  • Automate other parts of the submission process.
    • e.g. helper tools for review committee
  • DevOps improvements for availability and scalability
    • Internal tooling for deployment/management of submission infrastructure
    • Documentation, CI/CD, releasing process, etc.

Help us to help you

  • WGs: can we help with your submission needs?
  • Contribute to the benchmark infra work

110

111 of 141

Science

111

Working Group Chairs: Geoffrey Fox, Juri Papay, Jeyan Thiyagalingam

Co-founder Tony Hey steps down, with the WG’s thanks!

112 of 141

Science Working Group

WG Purpose:

  • Enhance AI for Science and Engineering research covering domains such as: energy, environmental, earthquake and earth sciences, material sciences, life sciences, fusion, particle physics and astronomy with training and inference applications

112

Goals:

  • Support scientific discovery as the primary metric
  • Provide exemplars across a range of scientific domains
  • Encourage use of FAIR metadata and reproducible results
  • Enable educational use of our resources by students with rich documentation and experience records

113 of 141

Updates (past quarter)

  • Completed new GitHub and web resources for the Open Division only, with rolling submissions and science discovery as the primary metric for four initial benchmarks
  • Designed a blog-like interface for informal submissions with a variety of contributions
  • Several new benchmarks in two classes discussed
    • Science simulation surrogates, starting with a Virtual Tissue Digital Twin and Computational Fluid Dynamics (OSMIBench)
    • Particle Physics (FastML) and collaboration with large NSF HDR projects
  • Paper at ISC published
  • Started three white papers (see next page)

113

Benchmark           | Domain     | Task           | Institution | Model
CloudMask           | Climate    | Segmentation   | RAL         | CNN
STEMDL              | Materials  | Classification | ORNL        | CNN
CANDLE-UNO          | Medicine   | Classification | ANL         | MLP
TEvolOp Forecasting | Earthquake | Regression     | Virginia    | LSTM, Transformer

114 of 141

What’s next?

  • Complete submission Interface
  • Continue study of new benchmarks
  • Complete three white papers
    • AI Readiness of MLCommons Science (focus on FAIR issues and reproducibility)
    • Using Benchmarking Data to Inform Decisions Related to Machine Learning Resource Efficiency
    • Benchmark Carpentry for Science and Engineering
  • Documents open for new authors!

114

115 of 141

Research

115

116 of 141

MLCommons Research Overview

116

117 of 141

Updates

  • Develop benchmarks (e.g. Medical/Data/Tiny/Storage)
  • Disseminate knowledge (e.g., Training/Inference/Tiny/Mobile/People’s Speech/MSWC/Dollar Street/Storage)
  • Organize recurring workshops (e.g. MLBench)
  • Run tutorials at conferences (e.g. ASPLOS)
  • Create journals (e.g. DMLR)
  • Engage the community (e.g. SC Competition using MLPerf)
  • Give awards (e.g. Rising Stars)
  • Create internship opportunities
  • Raise external funding (e.g. NSF)

117

118 of 141

What’s Next? Exciting Goals for 2023.

  1. Help existing research groups succeed
  2. Kick off the rising stars program and make it successful
  3. Launch new research threads to foster new ideas

118

119 of 141

Rising Stars

  • It’s official!
  • Check out the website: https://mlcommons.org/en/rising-stars-2023/
  • Applications are coming in; the deadline is tomorrow
  • Domestic and international representation

119

120 of 141

Organizers

  • Udit Gupta (Incoming Assistant Professor at Cornell Tech)
  • Abdulrahman Mahmoud (Postdoctoral Fellow at Harvard University)
  • Lillian Pentecost (Assistant Professor of Computer Science at Amherst College)

120

121 of 141

Rising Stars: Objectives

Provide support, career development, and job search skills for emerging researchers at the intersection of machine learning and systems.

Over the last ~6 years, SysML/MLSys has grown into a vibrant research community with strong academic and industry collaborations.

Connect researchers across different career stages and institutions.

Build community across MLSys.

121

122 of 141

How to get involved?

  • Contact us if you want to support the rising stars program
    • Hosting
      • We need your support for hosting the event!
    • Funding
      • Sponsorship for either 2023 or 2024 rising stars program
    • Internships
      • Leverage the rising stars program as a source of future talent
  • Have new ideas that you think are worth exploring

  • Contact research@mlcommons.org

122

123 of 141

Lunch Break

Welcome back at

1:20 PM Pacific time

123

124 of 141

Schedule

124

9:00 AM

Breakfast

9:30 AM

VMware Welcome: Sujata Banerjee

9:45 AM

MLC Welcome: Peter Mattson

10:00 AM

MLC Update: David Kanter

10:30 AM

Break

10:50 AM

Working Group Update

12:20 PM

Lunch

1:20 PM

Power WG Showcase

1:35 PM

DataPerf WG Showcase

1:50 PM

Group discussions (in person only)

I Benchmark value to enterprise customers: getting involved / moderator: Debojyoti Dutta

II Datasets for model quality benchmarking - e.g. which is the best LLM? / moderator: Kurt Bollacker

III MLCommons research: how do we deliver value for researchers? / moderator: Vijay Janapa Reddi

3:20 PM

Cake Break

3:45 PM

Social hour

4:45 PM

End

125 of 141

Power showcase

125

126 of 141

Scaling of Machine Learning Models and Cost of Compute

  • Scaling of compute (FLOPS) in models is outpacing Moore’s law
  • Energy scaling in technology nodes is (close to) stagnant
  • Power consumption of ML models is as important a metric as performance

126

Source: Riselab, UC Berkeley

127 of 141

Power Working Group - Objective and Goals

  • MLPerf Power: A Best Practices Working Group
    • Objective (O): Make energy efficiency of ML benchmarks a first-class metric (like performance)
    • How (KR): Provide a methodology to measure the energy consumed by submissions across benchmark categories

  • What:
    • Measured metric: system power (and energy consumed) - see the sketch below
    • Submissions enabled for Inference (Datacenter, Edge) and TinyML
    • Key OKR for 2023: Enable Power in MLPerf HPC and MLPerf Training in the October submissions

  • Key goals (next several months):
    • Improve the current measurement methodology for existing Inference Power (v3.1)
    • Enable adoption in distributed systems - HPC, Training
    • Increase/encourage Inference submissions with Power
      • We had 1.2K+ power-related submissions in 2021 and 4.6K+ in 2022
      • Yet, < 50% of total submissions
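
A toy illustration of the “system power and energy consumed” metric referenced above: sample wall power during the run, average the samples for power, and multiply by the run time for energy. The power-reading function is a made-up stand-in for a real power analyzer; the actual MLPerf Power methodology (built around SPEC PTDaemon instrumentation) specifies the measurement rules in detail.

```python
import random
import time

def read_power_watts():
    """Stand-in for a real power analyzer reading (hypothetical values)."""
    return 300.0 + random.uniform(-10.0, 10.0)

samples = []
run_seconds = 10.0             # made-up benchmark duration
sample_interval = 0.5
start = time.monotonic()
while time.monotonic() - start < run_seconds:
    samples.append(read_power_watts())   # sample system power during the run
    time.sleep(sample_interval)

duration = time.monotonic() - start
avg_power_w = sum(samples) / len(samples)
energy_joules = avg_power_w * duration   # energy = average power x run time
print(f"average power: {avg_power_w:.1f} W, energy consumed: {energy_joules:.0f} J")
```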

127

Demonstrate that we can move the needle of energy efficiency over time

128 of 141

Inference Power submissions

  • Submitters: SiMa, NeuChips, Dell, Fujitsu, H3C, Krai, cTuning, NVIDIA, Qualcomm, HPE, Inspur, Gigabyte, Lenovo and many others.

  • Latest Results (v3.0):
    • 2809 submissions on 42 unique systems
    • 13% more power results compared to v2.1

  • Overall, ~7,000 power submissions across 100+ systems in 2 years

  • Good trajectory, but need more adoption

128

129 of 141

Power measurement for Distributed Systems

129

The Task Force for MLPerf Power in HPC and Training has been meeting since February 28th

Objective: Deliver a measurement and/or estimation methodology to help evaluate the energy efficiency of systems running MLPerf Training and MLPerf HPC benchmarks for the October submissions

Link: MLPerf Power Measurement HPC/Training

Progress

Defined the system scope for which power needs to be measured or estimated:

    • Node measurement: agreement on methodology
    • Interconnect estimation: currently being worked on
    • Storage estimation: agreement to drop from measurement
    • Cooling: to be discussed

Meets every Wednesday at 8:30 AM. Please write to power@mlcommons.org to participate.

Date to lock methodology: June 30, 2023

130 of 141

MLPerf Power WG meetings - Call for Action

  • Attendees must join “power” alias

  • Call for all MLPerf community members to actively participate to expand to different verticals and take part in feature testing/development efforts

  • Currently the MLPerf Power WG meets weekly
    • Tuesdays, 3 PM Pacific Time
      • Moved to accommodate attendees from Asia

  • Please reach out if you have any questions

For additional information: https://mlcommons.org/en/groups/best-practices-power/

130

131 of 141

DataPerf showcase

131

www.dataperf.org

132 of 141

DataPerf Working Group

WG Purpose

  • Create a leaderboard for data and data-centric algorithms

Specific Goals

  • Build community: create a canonical place to build data-centric challenges
  • Build standards: establish DataPerf as an independent standards entity providing a badge of quality for datasets

132


133 of 141

Data is the new bottleneck

[Diagram: the shift from an ML-centric paradigm to a data-centric paradigm - DataPerf drives this paradigm shift]

Source:

Kiela, Douwe, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen et al. "Dynabench: Rethinking benchmarking in NLP." arXiv preprint arXiv:2104.14337 (2021).

134 of 141

What is the Data Bottleneck?

Data Quality Bottleneck

  • Poor distribution
  • Bias
  • False information

Data Quantity Bottleneck

  • Data stocks grow at a much slower pace than dataset sizes
  • Language data will be exhausted by 2030-2040
  • High-quality language data will be exhausted by 2026
  • Vision data will be exhausted by 2030-2060

Source:

Villalobos, Pablo, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho. "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning." arXiv preprint arXiv:2211.04325 (2022).

135 of 141

Recent Release: DataPerf v0.5

135

Data-Centric Tasks

  • Data Selection
  • Data Cleaning
  • Data Creation (Adversarial)
  • Data Valuation

136 of 141

Recent Release: DataPerf v0.5

136

Domains

  • Vision (image classification)
  • Speech (keyword identification)
  • NLP (sentiment analysis)
  • Multimodal (text-to-image)

137 of 141

Recent Release: DataPerf v0.5

137

[Diagram: DataPerf v0.5 challenges map the domains (Vision - image classification, Speech - keyword identification, NLP - sentiment analysis, Multimodal - text-to-image) onto the data-centric tasks (Data Selection, Data Cleaning, Data Creation (Adversarial), Data Valuation).]

138 of 141

DataPerf v0.5 Timeline

138

  • Open: March 30th
  • Close: May 26th
  • Winners announcement: July 28th at ICML

139 of 141

Community Engagement (21 days since launch)

139

Dynabench.org

DataPerf.org

#Submissions

  • Vision Selection: 3
  • Debugging: 4
  • Speech Selection: 0
  • Acquisition: 0

#Visits

  • Average: 40/day
  • Peak: 250/day

140 of 141

What’s next?

  • Research Goals
    • Extend current challenges for the next iterations
      • Fast iterations, quarterly
    • Diversify the portfolio of data-centric challenges
      • Slower, targeted, well-funded challenges
    • Publish results and announce winners at ICML

  • Community Expansion Goals
    • Outreach to startup/industry involvement

140

141 of 141

Call for Action

141

Join the Working Group and help us design and develop DataPerf

Participate in DataPerf v0.5 Competitions.

Join our Discord channel to stay updated