1 of 50

MLPerf 2Q23

Briefing

Embargoed until 6/27/2023 @ 9am Pacific

David Kanter

Executive Director

2 of 50

Agenda

  • Overview of MLCommons®
  • MLPerf™ Training v3.0 Benchmark Suite
    • DLRM-dcnv2
    • GPT3
  • MLPerf Tiny v1.1 Benchmark Suite
  • Results
  • Q&A Session


3 of 50

Objectives

  • Goal
    • Help press, analysts, and public understand MLPerf results
    • Ask us any questions - we are here to help!

  • Ground Rules
    • Statements by member companies are not endorsed by MLCommons
    • No comparison to competitors (e.g., A is 10X better than B)
    • No implicit comparisons (e.g., C is the best)


4 of 50

MLPerf briefing materials index

Documents and recording of this session will be available at: https://drive.google.com/drive/folders/1DEktafsEpTBwcE94SWPBHeawqA9LUzGA

New version of MLPerf Mobile app available for Android and iOS, contact mobile-app@mlcommons.org for access and support

Documents:

  • MLPerf Training 3.0 Results
  • MLPerf Training 3.0 Supplemental Discussion
  • MLPerf Tiny 1.1 Results
  • MLPerf Tiny 1.1 Supplemental Discussion
  • Press Release

5 of 50

MLCommons is a global community

[Logos: Founding Members and Member organizations]

Academic Institutions

  • Harvard University
  • Polytechnique Montreal
  • Peng Cheng Laboratory
  • Stanford University
  • University of California, Berkeley
  • University of Toronto
  • University of Tübingen
  • University of Virginia
  • University of York, UK
  • Yonsei University
  • York University, Canada

6 of 50

Open engineering organizations

AI/ML organizations

We need an open engineering organization to create better ML for everyone

7 of 50

MLCommons: Better ML for Everyone

  • Benchmarks
  • Datasets
  • Best practices
  • ML innovation
  • Research

8 of 50

“What gets measured, gets improved.” — Peter Drucker

Benchmarks drive progress and transparency

Benchmarking aligns the entire community in pursuit of the same clear objective.

9 of 50

MLPerf Training - ahead of Moore’s Law


*indicates benchmark increased accuracy target in MLPerf Training v0.6

10 of 50

Benchmarking ML systems


Used to

  • Compare solutions
  • Inform selection
  • Measure and track progress
  • Raise the bar, advance the field

Requires

  • Methodology that is both fair and rigorous
  • Community support and consensus

Provides

  • Standardization of use cases and workloads
  • Comparability across heterogeneous HW/SW systems
  • Characterization of complex system trade-offs
  • Verifiable and reproducible results

11 of 50

ML is a full system problem

[Diagram: Data, Algorithms, Software, Architecture, Silicon, and Scale together determine ML system performance]

12 of 50

MLPerf Goals

  • Enforce performance result replicability to ensure reliable results
  • Use representative workloads reflecting production use-cases
  • Encourage innovation to improve the state-of-the-art of ML
  • Accelerate progress in ML via fair and useful measurement
  • Serve both the commercial and research communities
  • Keep benchmarking affordable so that all can participate

13 of 50

MLPerf Breadth: µWatts to MegaWatts

Evolution over time - improving technical maturity:

  • Standardized methodology for Training
  • Power measurements for Inference, Tiny
  • Mobile App on Android, iOS, Windows
  • Many new workloads
  • Compatibility, historical comparisons

[Timeline, 2018-2023, ordered by scale: Training - HPC, Training, Inference - Datacenter, Inference - Edge, Inference - Mobile, Inference - Tiny (IoT), Storage]

14 of 50

MLPerf Training Benchmark

Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia

https://arxiv.org/abs/1910.01500

15 of 50

MLPerf Training benchmark definition

[Diagram: a benchmark is defined by a Dataset (e.g., ImageNet), a Model, and a Target Quality (e.g., 75.9%)]

16 of 50

Two divisions with different model restrictions

[Diagram: same Dataset (e.g., ImageNet) and Target Quality (e.g., 75.9%); the divisions differ in what Model is allowed]

Closed division: a specific model, e.g., ResNet-50 v1.5 → direct comparisons

Open division: any model → innovation

17 of 50

Metric: time-to-train

Alternative: throughput

  • Easy / cheap to measure
  • Higher throughput: lower precision, higher batch size
  • Fewer epochs: higher precision, lower batch size
  • But you can increase throughput at the cost of total time to train!

Chosen metric: time-to-train (end-to-end)

  • Time to solution!
  • Computationally expensive
  • High variance
  • The least bad choice (a minimal measurement sketch follows below)
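
To make the metric concrete, here is a minimal, hypothetical sketch of the time-to-train idea in Python; train_one_epoch and evaluate are stand-ins for a real training loop, not MLPerf harness APIs. Training runs until the quality target is first reached, and the score is elapsed wall-clock time rather than throughput.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Return wall-clock seconds until evaluate() first meets target_quality."""
    start = time.perf_counter()            # timer starts after system/model init
    for _ in range(max_epochs):
        train_one_epoch()                  # one pass over the training data
        if evaluate() >= target_quality:   # e.g., 75.9% top-1 for ResNet-50
            return time.perf_counter() - start
    raise RuntimeError("did not reach the target quality")
```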

18 of 50

Time-to-train excludes:

  • System initialization: depends on cluster configuration and state
  • Model initialization: disproportionate for big systems with small benchmarking datasets
  • Data reformatting: mandating a single format would give an advantage to some systems

19 of 50

MLPerf categories and divisions

  • Two Divisions
    • Closed: Mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field
      • Example changes: batch size, numerics, padding, framework, data layout
      • Cannot change: # of layers, # of weights / pruning
    • Open: any model that is not mathematically equivalent to the reference
      • The change could be large or small; submitters should describe it

  • Three Categories
    • Available: Commercially available at submission
    • Preview: Commercially available soon (~6 months from submission)
    • RDI: Not commercially available, e.g. research, prototype, or internal systems


20 of 50

Listening to the results

  • Every result says something interesting, but it may not be obvious
    • Look at submissions that are similar across some dimensions, e.g., same vendor, same scale, same processor, best performance...but different in other dimensions
  • Scaling over time
  • Tuning software over time
  • New software stacks
  • Systems progress from RDI/Preview to Available
  • New processors
  • Larger systems better amortize overhead (e.g., fans) for power


21 of 50

MLPerf Training v3.0 Suite

Task | Real World Application Examples
Recommendation (*NEW*) | Content or shopping recommendation, e.g., search, social media, ads
Speech recognition | Speech-to-text for smartphones, hands-free driver assistance
NLP | Search, translation, chatbots, summarization
LLM (*NEW*) | Search, translation, chatbots, summarization
Image classification | Image labeling, general vision
Object detection | Pedestrian detection, manufacturing defect detection, red-eye reduction
3D segmentation | Medical image analysis, e.g., tumor identification
Reinforcement learning | Robotics, circuit design and optimization

22 of 50

MLPerf Training v3.0 Suite detail

Task | Dataset | Model | Quality Target
Recommendation (*NEW*) | Criteo 4TB multi-hot | DLRM-dcnv2 | 0.8032 AUC
Speech recognition | LibriSpeech | RNN-T | 0.058 Word Error Rate
NLP | Wikipedia 2020-01-01 | BERT-large | 0.72 Mask-LM accuracy
LLM (*NEW*) | C4 | GPT-3 | 2.69 log perplexity
Image classification | ImageNet 2012 | ResNet-50 v1.5 | 75.9% top-1
Object detection | Open Images (800x800) | RetinaNet | 0.34 mAP
Object detection (heavy) | COCO 2017 | Mask R-CNN | 0.377 Box min AP and 0.339 Mask min AP
3D segmentation | 2019 KiTS | 3D U-Net | 0.908 Mean DICE score

23 of 50

MLPerf Training v3.0 overview

  • Results: MLPerf™ Training v3.0 Results (Embargoed until 6/27/2023 9am PT)
  • 16 Submitters: ASUSTeK, Azure, Dell, Fujitsu, GIGABYTE, H3C, IEI, Intel, Intel-Habana Labs, Krai, Lenovo, NVIDIA, NVIDIA+CoreWeave, Quanta Cloud Technology, Supermicro, xFusion
  • >259 performance results, 30% more than last round
    • 1.04-1.54X better performance in closed available
    • Geometric mean of 1.3X better performance over the MLPerf Training suite (see the sketch after this list for how a geometric-mean speedup is computed)
    • 3 new submitters (highlighted in green above)

  • New recommendation and LLM benchmarks
    • 3 GPT3 submitters
    • 5 DLRM-dcnv2 submitters
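
As an illustration of the arithmetic behind a suite-level speedup, here is a tiny sketch, assuming Python 3.8+, of how a geometric mean over per-benchmark speedups is computed. The benchmark names and ratios below are made up for illustration, not actual v3.0 results.

```python
import math

# Hypothetical per-benchmark speedups vs. the prior round (illustrative only).
speedups = {"resnet": 1.2, "bert": 1.5, "dlrm_dcnv2": 1.1, "unet3d": 1.4}

# Geometric mean: the n-th root of the product of the per-benchmark ratios.
geomean = math.prod(speedups.values()) ** (1 / len(speedups))
print(f"geometric-mean speedup: {geomean:.2f}x")
```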


24 of 50

New Recommendation: Criteo 4TB Multi-hot with DLRM-dcnv2

25 of 50

Updating MLPerf recommendation

  • Recommendation: Pick the best item from a collection
    • E.g., Pick the best news article to read next
    • Search, shopping, photo highlights, etc.
  • Extremely valuable for large content libraries
    • ~10K Netflix titles, ~1.1B websites
  • Necessary for good user experience and commerce


26 of 50

Why do we need to update?

  • Production recommendation models are increasing in scale - in size, compute, and memory operations.
  • The original DLRM is becoming obsolete.
  • We want a benchmark that is more representative of the industry:
    • Converge with larger batch size
    • Achieve higher convergence AUC
    • Use more compute
    • Use bigger (and multi-hot) embedding tables


27 of 50

Update to DLRM-dcnv2 - Summary

  • Built with a focus on being representative of industry recommender systems
  • Embedding tables: multi-hot lookups (>5x more memory operations)
  • Architecture changes: Feature interaction becomes deep-cross network v2 (5x more compute)
  • Optimizer: Adagrad instead of SGD (enable larger batch size)
  • Codebase: Adopt a widely used, production-grade recommender library, TorchRec.


28 of 50

Criteo 4TB Multi-hot dataset

  • In recommender models, embedding tables are very important. We synthetically extend the Criteo 1TB dataset to be multi-hot, mirroring the embedding lookup patterns of production models.
  • We build on top of a canonical recommendation dataset, Criteo 1TB, and uniformly sample new indices to make it multi-hot. This increases the size of the dataset by 4x to 4TB and transforms its structure to allow multi-hot lookups.
  • Multi-hot lookup example (see the sketch after this list):
    • Consider a user’s history over the last month; say a column is purchases.
    • Look up the last 20 purchases → an embedding for each row. We then average all of these rows together (the pooling factor) to get a single embedding representing this user’s purchase history.
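
A minimal sketch of the multi-hot lookup-and-pool pattern, assuming PyTorch; the table size, embedding dimension, and the pooling size of 20 are illustrative, not the benchmark's exact configuration.

```python
import torch
import torch.nn as nn

# One embedding table; mode="mean" averages the looked-up rows (the pooling step).
table = nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=128, mode="mean")

batch_size, hotness = 32, 20                       # e.g., a user's last 20 purchases
indices = torch.randint(0, 100_000, (batch_size, hotness))

pooled = table(indices)                            # one pooled embedding per example
print(pooled.shape)                                # torch.Size([32, 128])
```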


29 of 50

DLRM-dcnv2 reference model

  • DLRM-dcnv2 improves on DLRM:
    • Feature interaction uses 3 stacked DCNv2 layers (commonly used in recommender systems at many large companies)
    • Embeddings are multi-hot (more memory lookups)
    • With Adagrad, the model converges at batch sizes up to 200K vs. 70K previously
    • Model AUC improves over the original DLRM: 0.8025 → 0.8032


30 of 50

DCNv2 details

  • The circle-with-dot symbol in the diagram is the Hadamard product (element-wise multiplication)
  • For the W · x_i matrix multiplication, the 3456×3456 weight matrix W is split into two low-rank factors of size 3456×512 and 512×3456 (sketched below)
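
A minimal PyTorch sketch of one low-rank DCNv2 cross layer using the dimensions above; this is an illustrative reading of the slide, not the MLPerf/TorchRec reference code.

```python
import torch
import torch.nn as nn

class LowRankCrossLayer(nn.Module):
    """DCNv2 cross layer: x_out = x0 * (W @ x + b) + x, with W factored as V @ U."""
    def __init__(self, dim=3456, rank=512):
        super().__init__()
        self.U = nn.Linear(dim, rank, bias=False)   # 3456 -> 512
        self.V = nn.Linear(rank, dim, bias=True)    # 512 -> 3456 (its bias plays the role of b)

    def forward(self, x0, x):
        return x0 * self.V(self.U(x)) + x           # Hadamard product with the input features

x0 = torch.randn(8, 3456)                            # concatenated dense + pooled sparse features
x = x0
for layer in [LowRankCrossLayer() for _ in range(3)]: # 3 stacked cross layers
    x = layer(x0, x)
print(x.shape)                                        # torch.Size([8, 3456])
```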

31 of 50

New Large Language Model (LLM): Colossal Cleaned Common Crawl (C4) with GPT-3

32 of 50

New Large Language Model (LLM) Benchmark

  • Understanding language is at the core of Generative AI applications such as ChatGPT
  • Language models take word(s) as input and predict the probability of the next word(s)
    • E.g., Input: “Why did the chicken cross the ____”; Output: “Road.”
  • LLM Pre-Training teaches a model the relationship between words
  • LLMs are used for essay generation, code development, language translation, summarization, and even understanding genetic sequences
  • MLPerf LLM is based on pre-training the GPT-3 175B model, originally described by OpenAI (see the sketch after this list)
    • The LLM is left-to-right (causal) and decoder-only
    • MLPerf BERT is a bi-directional, encoder-only model with 0.34B parameters
    • MLPerf Transformer (retired) is a sequence-to-sequence, encoder-decoder model with 0.21B parameters
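
A minimal sketch, assuming PyTorch, of what "left-to-right (causal), decoder-only" means at the attention level: each position may attend only to itself and earlier positions, which is what lets the model be trained to predict the next token. Shapes are illustrative.

```python
import torch

seq_len = 8
scores = torch.randn(seq_len, seq_len)   # raw attention scores for one head

# Causal mask: position i may attend only to positions j <= i (upper triangle blocked).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(weights[0])   # the first token attends only to itself
```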


33 of 50

C4 Dataset

MLPerf LLM Benchmark

  • c4/en/3.0.1 - Colossal, Cleaned, Common Crawl corpus dataset hosted on HuggingFace
    • 305GB and ~174B tokens

  • Benchmark trains on a portion of the training set
    • Start from an initial checkpoint that is trained on 12.5B tokens
    • Benchmark measures training on ~1.3B tokens
  • Model Accuracy evaluation happens on ~1/20th (11.5M tokens) of the validation dataset

Note:

  • The benchmark covers only a small portion of LLM pre-training: the ~1.3B tokens trained here are roughly 0.4% of the ~300B tokens used for the full GPT-3 training run
  • This keeps the run time reasonable and allows wider participation

34 of 50

GPT-3 175B Reference Model

MLPerf LLM Benchmark

  • 96 Transformer Layers (decoder only)

  • SentencePiece with BPE Tokenizer (not in diagram)

  • Sequence Length of 2048 Tokens
    • Input tokens used to make a prediction

  • Adam Optimizer: Adaptive learning rate optimized for large problems

  • Starts from initial checkpoint and reaches target accuracy (2.69 log perplexity) by training over ~1.3B tokens

[Diagram: one Transformer Layer, stacked 96x; see the configuration sketch below]
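
A hedged configuration sketch of the GPT-3 175B architecture as a Python dataclass. The layer count (96) and sequence length (2048) come from this slide; the hidden size, head count, and vocabulary size are values from the original GPT-3 paper and are assumptions here, not MLPerf-specified numbers.

```python
from dataclasses import dataclass

@dataclass
class GPT3Config:
    n_layers: int = 96       # decoder-only transformer layers (from the slide)
    seq_len: int = 2048      # context length in tokens (from the slide)
    d_model: int = 12288     # hidden size (GPT-3 paper value, assumed)
    n_heads: int = 96        # attention heads (GPT-3 paper value, assumed)
    vocab_size: int = 50257  # assumed; the MLPerf reference uses a SentencePiece BPE tokenizer

cfg = GPT3Config()
print(cfg.d_model // cfg.n_heads)   # per-head dimension: 128
```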

35 of 50

Transformer Layer

MLPerf LLM Benchmark

  • 96 stacked Transformer Layers:
    • Multi-Head Self-Attention (memory-intensive)
    • MLP (compute-intensive)

  • GPT-3 has 175B parameters (~350GB in BF16); a rough parameter-count estimate is sketched after this list
  • Plus additional memory for optimizer states and activations

  • Must split across processors for training (e.g., reference requires at least 64 accelerators)

  • Inference needs multiple processors too!
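
As a rough sanity check on the 175B-parameter and ~350GB figures, here is a back-of-the-envelope estimate. The 12·L·d² rule of thumb (attention plus MLP weights, ignoring embeddings and biases) and the hidden size of 12288 are assumptions based on the standard GPT-3 architecture, not numbers from this slide.

```python
n_layers, d_model = 96, 12288        # GPT-3 175B (hidden size assumed from the paper)

# Per layer: ~4*d^2 for attention (Q, K, V, output) plus ~8*d^2 for the 4x-wide MLP.
params = 12 * n_layers * d_model**2  # ~1.74e11, close to the quoted 175B
bytes_bf16 = 2 * params              # BF16 stores 2 bytes per parameter

print(f"~{params / 1e9:.0f}B parameters, ~{bytes_bf16 / 1e9:.0f} GB in BF16")
```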

36 of 50

MLPerf Tiny v1.1 Benchmark

Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David Kanter, Sebastian Ahmed, Danilo Pau, Urmish Thakker, Antonio Torrini, Peter Warden, Jay Cordaro, Giuseppe Di Guglielmo, Javier Duarte, Stephen Gibellini, Videet Parekh, Honson Tran, Nhan Tran, Niu Wenxu, Xu Xuesong

https://arxiv.org/abs/2106.07597

37 of 50

MLPerf Tiny v1.1


Purpose:

  • Develop and evolve a benchmark suite for ultra-low-power ML systems (TinyML)
  • On-device real-time batch-of-one inference.

Typical Systems

  • 10s-100s MHz processors
  • ≲ MB of Flash and SRAM
  • ~mW power
  • Lightweight models (<1M parameters)

38 of 50

MLPerf Tiny v1.1 reference models - single stream only

Task | Dataset | Reference Model (parameters)
Keyword Spotting | Google Speech Commands | DS-CNN (52K params)
Visual Wake Words | Visual Wake Words dataset | MobileNetV1 0.25x (325K params)
Anomaly Detection | DCASE2020-Task2 / ToyADMOS | FC-AutoEncoder (270K params)
Image Classification | CIFAR-10 | ResNet-8 (96K params)

References:

  • Warden, Pete. "Speech Commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209 (2018).
  • Chowdhery, Aakanksha, et al. "Visual Wake Words Dataset." arXiv preprint arXiv:1906.05721 (2019).
  • Koizumi, Yuma, Shoichiro Saito, Noboru Harada, Hisashi Uematsu, and Keisuke Imoto. "ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection." In Proc. of WASPAA, 2019.
  • Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).

39 of 50

MLPerf Tiny v1.1 workloads

  • All workloads use the single stream scenario (a minimal latency-measurement sketch appears after the table below)
  • Optional energy measurement in collaboration with EEMBC

Task | Dataset | Reference Network | Target Quality
Keyword Spotting | Speech Commands v2 | DS-CNN | 90% top-1
Visual Wake Words | COCO 2014 | MobileNetV1 0.25x | 80% top-1
Image Classification | CIFAR-10 | ResNet-v1 | 85% top-1
Anomaly Detection | ToyADMOS | Deep Autoencoder | 0.85 AUC
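
A minimal, hypothetical sketch of single-stream (batch-of-one) latency measurement, the scenario used by all Tiny workloads; run_inference is a stand-in for the device-side invocation, not a LoadGen or EnergyRunner API.

```python
import time
import statistics

def measure_single_stream(run_inference, samples, warmup=10):
    """Issue one query at a time and record per-inference latency."""
    for s in samples[:warmup]:
        run_inference(s)                        # warm-up runs are not scored
    latencies = []
    for s in samples:
        t0 = time.perf_counter()
        run_inference(s)                        # batch-of-one, on-device inference
        latencies.append(time.perf_counter() - t0)
    return statistics.median(latencies)         # e.g., report a median latency
```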

40 of 50

EEMBC EnergyRunner™ Framework

[Diagram: the EEMBC EnergyRunner™ framework, used for both performance and energy measurement]

41 of 50

MLPerf Tiny v1.1 overview

  • Results: MLPerf™ Tiny v1.1 Results (Embargoed until 6/27/2023 9AM Pacific)
  • Submitters: Bosch, cTuning, fpgaConvNet, Kai Jiang, Krai, Nuvoton, Plumerai, Skymizer, STMicroelectronics, and Syntiant:
    • 159 results (+63% vs prior round), 10 submitters
    • 7 new submitters (highlighted in green above)
    • More power measurements (41 vs. 38 in prior round)
  • Much wider participation and growing adoption


42 of 50

Looking back at 4 releases


43 of 50

Thank you!

44 of 50

Backup Slides

45 of 50

New Object Detection: Open Images with RetinaNet

46 of 50

New Object Detection benchmark

  • Find bounding boxes around objects and label with correct class

  • Extremely common in computer vision, e.g., find pedestrians, count items on shelves, etc.

  • Updated to newer dataset and more accurate reference model


47 of 50

Open Images dataset for Object Detection

  • Open Images Dataset V6 + Extensions for MLPerf
    • Good licensing (CC-BY)
    • Downsampled to 800x800, 264 object classes
    • Hope to adopt Open Images for other benchmarks in the future
    • MLCommons will archive dataset and licenses

Dataset | Max resolution | Classes | Train size | Validation size
COCO (previous benchmark) | 640×480 | 80 | 116,277 | 5,000
Open Images v6.0 (full) | >1200×1600 | 601 | 1,743,042 | 41,620
Open Images v6.0-MLPerf-1000 | >1200×1600 | 264 | 1,170,301 | 24,781

Open Images v6.0-MLPerf-1000 uses a subset of the full Open Images dataset: it filters out all “super” classes and all classes with fewer than 1000 samples.

48 of 50

RetinaNet reference model

  • ResNeXt vs. ResNet (discussed in next slides)
  • Deeper backbone network (50 layers vs. 34) → Better accuracy
  • FPN improves detection at different scales by adding context → Better accuracy
  • Focal loss vs. SSD loss → Better accuracy, especially for ‘hard’ or ‘rare’ cases
  • Frozen batch norm breaks dependencies → Better scaling to large systems

[Diagram: RetinaNet architecture: ResNeXt-50 backbone, Feature Pyramid Network (FPN), class subnet (CNN+FCN), and box subnet (CNN+FCN); an illustrative code sketch follows]

The new reference model achieves much better accuracy: 0.34 mAP vs. 0.23 mAP.
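
As an illustration of this architecture family, here is a minimal sketch using torchvision's built-in RetinaNet. Note the assumptions: torchvision's model uses a ResNet-50-FPN backbone rather than the ResNeXt-50 backbone of the MLPerf reference, and a recent torchvision (0.13+) is assumed for the weights-style constructor arguments.

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# RetinaNet = backbone + FPN + class/box subnets (the MLPerf reference swaps in ResNeXt-50).
model = retinanet_resnet50_fpn(weights=None, weights_backbone=None, num_classes=264)
model.eval()

images = [torch.rand(3, 800, 800)]     # one 800x800 image with values in [0, 1]
with torch.no_grad():
    detections = model(images)         # list of dicts with "boxes", "labels", "scores"
print(detections[0]["boxes"].shape)
```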

49 of 50

Dense Convolutions

Grouped Convolutions

  • Reduce the parameters and compute operations (FLOPs) vs. regular convolution
  • Originally used in AlexNet to fit a convolution in small memory
  • Groups tend to learn different aspects, e.g., AlexNet 2 groups learn B/W + color
  • ResNeXt uses grouped convolutions extensively

[Diagram: dense convolution vs. grouped convolution (2 groups), with depth = channels (CH); filters are split into groups and each group operates on a separate subset of channels. A code sketch follows.]
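
A minimal PyTorch sketch of the parameter savings from grouped convolutions; the channel counts and group count are illustrative, not values from the slide.

```python
import torch.nn as nn

in_ch, out_ch, k = 128, 128, 3

dense = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)               # ordinary convolution
grouped = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1, groups=32)  # 32 groups, ResNeXt-style

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(grouped))   # grouped uses roughly 1/32 of the weight parameters
```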

50 of 50

ResNet

ResNeXt

  • Simple design like ResNet
  • Separate groups like Inception
  • ResNeXt has best of both!

  • Tune # of groups (cardinality)
  • Similar compute
  • Better accuracy in some settings

[Diagram: ResNet vs. ResNeXt blocks, annotated with input channels, filter sizes, and output channels; an illustrative block sketch follows]
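
A minimal sketch of a ResNeXt-style bottleneck block in PyTorch, built around a grouped 3x3 convolution. The channel widths and cardinality (32 groups) follow the common ResNeXt-50 32x4d configuration and are assumptions, not values given on the slide.

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """1x1 reduce -> grouped 3x3 -> 1x1 expand, with an identity shortcut."""
    def __init__(self, channels=256, width=128, groups=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False),  # grouped convolution
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))   # residual (identity) connection

x = torch.randn(1, 256, 56, 56)
print(ResNeXtBottleneck()(x).shape)           # torch.Size([1, 256, 56, 56])
```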