1 of 50

MLPerf 2Q23

Briefing

Embargoed until 6/27/2023 @ 9am Pacific

David Kanter

Executive Director

2 of 50

Agenda

  • Overview of MLCommons®
  • MLPerf™ Training v3.0 Benchmark Suite
    • DLRM-dcnv2
    • GPT3
  • MLPerf Tiny v1.1 Benchmark Suite
  • Results
  • Q&A Session


3 of 50

Objectives

  • Goal
    • Help press, analysts, and public understand MLPerf results
    • Ask us any questions - we are here to help!

  • Ground Rules
    • Statements by member companies are not endorsed by MLCommons
    • No comparison to competitors (e.g., A is 10X better than B)
    • No implicit comparisons (e.g., C is the best)


4 of 50

MLPerf briefing materials index

Documents and recording of this session will be available at: https://drive.google.com/drive/folders/1DEktafsEpTBwcE94SWPBHeawqA9LUzGA

New version of MLPerf Mobile app available for Android and iOS, contact mobile-app@mlcommons.org for access and support

Documents:

  • MLPerf Training 3.0 Results
  • MLPerf Training 3.0 Supplemental Discussion
  • MLPerf Tiny 1.1 Results
  • MLPerf Tiny 1.1 Supplemental Discussion
  • Press Release

5 of 50

MLCommons is a global community

[Logos: Founding Members and Member organizations]

Academic Institutions

  • Harvard University
  • Polytechnique Montreal
  • Peng Cheng Laboratory
  • Stanford University
  • University of California, Berkeley
  • University of Toronto
  • University of Tübingen
  • University of Virginia
  • University of York, UK
  • Yonsei University
  • York University, Canada

6 of 50

Open engineering organizations

AI/ML organizations

We need an open engineering organization to create better ML for everyone

7 of 50

MLCommons: Better ML for Everyone

  • Benchmarks
  • Datasets
  • Best practices
  • ML innovation
  • Research

8 of 50

“What gets measured, gets improved.” — Peter Drucker

Benchmarks drive progress and transparency

Benchmarking aligns the entire community in pursuit of the same clear objective.

9 of 50

MLPerf Training - ahead of Moore’s Law


*indicates benchmark increased accuracy target in MLPerf Training v0.6

10 of 50

Benchmarking ML systems


Used to

  • Compare solutions
  • Inform selection
  • Measure and track progress
  • Raise the bar, advance the field

Requires

  • Methodology that is both fair and rigorous
  • Community support and consensus

Provides

  • Standardization of use cases and workloads
  • Comparability across heterogeneous HW/SW systems
  • Characterization of complex system trade-offs
  • Verifiable and reproducible results

11 of 50

ML is a full system problem

[Diagram: Data, Algorithms, Software, Architecture, Silicon, and Scale together determine ML system performance]

12 of 50

MLPerf Goals

  • Enforce performance result replicability to ensure reliable results
  • Use representative workloads reflecting production use-cases
  • Encourage innovation to improve the state-of-the-art of ML
  • Accelerate progress in ML via fair and useful measurement
  • Serve both the commercial and research communities
  • Keep benchmarking affordable so that all can participate

13 of 50

MLPerf Breadth: µWatts to MegaWatts

Evolution over time - improving technical maturity:

  • Standardized methodology for Training
  • Power measurements for Inference, Tiny
  • Mobile App on Android, iOS, Windows
  • Many new workloads
  • Compatibility, historical comparisons

[Timeline, 2018-2023, ordered by scale: Training - HPC, Training, Inference - Datacenter, Inference - Edge, Inference - Mobile, Inference - Tiny (IoT), Storage]

14 of 50

MLPerf Training Benchmark

Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia

https://arxiv.org/abs/1910.01500

15 of 50

MLPerf Training benchmark definition

[Diagram: a benchmark is defined by a Dataset (e.g., ImageNet), a Model, and a Target Quality (e.g., 75.9%)]

16 of 50

Two divisions with different model restrictions

[Diagram: same Dataset (e.g., ImageNet) and Target Quality (e.g., 75.9%); the divisions differ in what Model is allowed]

Closed division: a specific model, e.g., ResNet-50 v1.5 → direct comparisons

Open division: any model → innovation

17 of 50

Metric: time-to-train

Alternative: throughput

  • Easy / cheap to measure
  • Higher throughput: lower precision, higher batch size
  • Fewer epochs: higher precision, lower batch size
  • But you can increase throughput at the cost of total time to train!

Chosen metric: time-to-train (end-to-end)

  • Time to solution!
  • Computationally expensive
  • High variance
  • The least bad choice (a minimal measurement sketch follows below)
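
To make the metric concrete, here is a minimal, hypothetical sketch of the time-to-train idea in Python; train_one_epoch and evaluate are stand-ins for a real training loop, not MLPerf harness APIs. Training runs until the quality target is first reached, and the score is elapsed wall-clock time rather than throughput.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Return wall-clock seconds until evaluate() first meets target_quality."""
    start = time.perf_counter()            # timer starts after system/model init
    for _ in range(max_epochs):
        train_one_epoch()                  # one pass over the training data
        if evaluate() >= target_quality:   # e.g., 75.9% top-1 for ResNet-50
            return time.perf_counter() - start
    raise RuntimeError("did not reach the target quality")
```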

18 of 50

Time-to-train excludes:

  • System initialization: depends on cluster configuration and state
  • Model initialization: disproportionate for big systems with small benchmarking datasets
  • Data reformatting: mandating a single format would give an advantage to some systems

19 of 50

MLPerf categories and divisions

  • Two Divisions
    • Closed: Mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field
      • Example changes: batch size, numerics, padding, framework, data layout
      • Cannot change: # of layers, # of weights / pruning
    • Open: any model that is not mathematically equivalent to the reference
      • The change could be large or small; submitters should describe it

  • Three Categories
    • Available: Commercially available at submission
    • Preview: Commercially available soon (~6 months from submission)
    • RDI: Not commercially available, e.g. research, prototype, or internal systems


20 of 50

Listening to the results

  • Every result says something interesting, but it may not be obvious
    • Look at submissions that are similar across some dimensions, e.g., same vendor, same scale, same processor, best performance...but different in other dimensions
  • Scaling over time
  • Tuning software over time
  • New software stacks
  • Systems progress from RDI/Preview to Available
  • New processors
  • Larger systems better amortize overhead (e.g., fans) for power


21 of 50

MLPerf Training v3.0 Suite

Task | Real World Application Examples
Recommendation (*NEW*) | Content or shopping recommendation, e.g., search, social media, ads
Speech recognition | Speech-to-text for smartphones, hands-free driver assistance
NLP | Search, translation, chatbots, summarization
LLM (*NEW*) | Search, translation, chatbots, summarization
Image classification | Image labeling, general vision
Object detection | Pedestrian detection, manufacturing defect detection, red-eye reduction
3D segmentation | Medical image analysis, e.g., tumor identification
Reinforcement learning | Robotics, circuit design and optimization

22 of 50

MLPerf Training v3.0 Suite detail

Task | Dataset | Model | Quality Target
Recommendation (*NEW*) | Criteo 4TB multi-hot | DLRM-dcnv2 | 0.8032 AUC
Speech recognition | LibriSpeech | RNN-T | 0.058 Word Error Rate
NLP | Wikipedia 2020-01-01 | BERT-large | 0.72 Mask-LM accuracy
LLM (*NEW*) | C4 | GPT-3 | 2.69 log perplexity
Image classification | ImageNet 2012 | ResNet-50 v1.5 | 75.9% top-1
Object detection | Open Images (800x800) | RetinaNet | 0.34 mAP
Object detection (heavy) | COCO 2017 | Mask R-CNN | 0.377 Box min AP and 0.339 Mask min AP
3D segmentation | 2019 KiTS | 3D U-Net | 0.908 Mean DICE score

23 of 50

MLPerf Training v3.0 overview

  • Results: MLPerf™ Training v3.0 Results (Embargoed until 6/27/2023 9am PT)
  • 16 Submitters: ASUSTeK, Azure, Dell, Fujitsu, GIGABYTE, H3C, IEI, Intel, Intel-Habana Labs, Krai, Lenovo, NVIDIA, NVIDIA+CoreWeave, Quanta Cloud Technology, Supermicro, xFusion
  • >259 performance results, 30% more than last round
    • 1.04-1.54X better performance in closed available
    • Geometric mean of 1.3X better performance over the MLPerf Training suite (see the sketch after this list for how a geometric-mean speedup is computed)
    • 3 new submitters (highlighted in green above)

  • New recommendation and LLM benchmarks
    • 3 GPT3 submitters
    • 5 DLRM-dcnv2 submitters
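
As an illustration of the arithmetic behind a suite-level speedup, here is a tiny sketch, assuming Python 3.8+, of how a geometric mean over per-benchmark speedups is computed. The benchmark names and ratios below are made up for illustration, not actual v3.0 results.

```python
import math

# Hypothetical per-benchmark speedups vs. the prior round (illustrative only).
speedups = {"resnet": 1.2, "bert": 1.5, "dlrm_dcnv2": 1.1, "unet3d": 1.4}

# Geometric mean: the n-th root of the product of the per-benchmark ratios.
geomean = math.prod(speedups.values()) ** (1 / len(speedups))
print(f"geometric-mean speedup: {geomean:.2f}x")
```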


24 of 50

New Recommendation: Criteo 4TB Multi-hot with DLRM-dcnv2

25 of 50

Updating MLPerf recommendation

  • Recommendation: Pick the best item from a collection
    • E.g., Pick the best news article to read next
    • Search, shopping, photo highlights, etc.
  • Extremely valuable for large content libraries
    • ~10K Netflix titles, ~1.1B websites
  • Necessary for good user experience and commerce


26 of 50

Why do we need to update?

  • Production recommendation models are increasing in scale - in size, compute, and memory operations.
  • The original DLRM is becoming obsolete.
  • We want a benchmark that is more representative of the industry:
    • Converge with larger batch size
    • Achieve higher convergence AUC
    • Use more compute
    • Use bigger (and multi-hot) embedding tables


27 of 50

Update to DLRM-dcnv2 - Summary

  • Built with a focus on being representative of industry recommender systems
  • Embedding tables: multi-hot lookups (>5x more memory operations)
  • Architecture changes: Feature interaction becomes deep-cross network v2 (5x more compute)
  • Optimizer: Adagrad instead of SGD (enable larger batch size)
  • Codebase: Adopt a widely used, production-grade recommender library, TorchRec.


28 of 50

Criteo 4TB Multi-hot dataset

  • In recommender models, embedding tables are very important. We synthetically extend the Criteo 1TB dataset to be multi-hot, mirroring the embedding lookup patterns of production models.
  • We build on top of a canonical recommendation dataset, Criteo 1TB, and uniformly sample new indices to make it multi-hot. This increases the size of the dataset by 4x to 4TB and transforms its structure to allow multi-hot lookups.
  • Multi-hot lookup example (see the sketch after this list):
    • Consider a user’s history over the last month; say a column is purchases.
    • Look up the last 20 purchases → an embedding for each row. We then average all of these rows together (the pooling factor) to get a single embedding representing this user’s purchase history.
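
A minimal sketch of the multi-hot lookup-and-pool pattern, assuming PyTorch; the table size, embedding dimension, and the pooling size of 20 are illustrative, not the benchmark's exact configuration.

```python
import torch
import torch.nn as nn

# One embedding table; mode="mean" averages the looked-up rows (the pooling step).
table = nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=128, mode="mean")

batch_size, hotness = 32, 20                       # e.g., a user's last 20 purchases
indices = torch.randint(0, 100_000, (batch_size, hotness))

pooled = table(indices)                            # one pooled embedding per example
print(pooled.shape)                                # torch.Size([32, 128])
```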


29 of 50

DLRM-dcnv2 reference model

  • DLRM-dcnv2 improves on DLRM:
    • Feature interaction uses 3 stacked DCNv2 layers (commonly used in recommender systems at many large companies)
    • Embeddings are multi-hot (more memory lookups)
    • With Adagrad, the model converges at batch sizes up to 200K vs. 70K previously
    • Model AUC improves over the original DLRM: 0.8025 → 0.8032


30 of 50

DCNv2 details

  • The circle-with-dot symbol in the diagram is the Hadamard product (element-wise multiplication)
  • For the W · x_i matrix multiplication, the 3456×3456 weight matrix W is split into two low-rank factors of size 3456×512 and 512×3456 (sketched below)
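
A minimal PyTorch sketch of one low-rank DCNv2 cross layer using the dimensions above; this is an illustrative reading of the slide, not the MLPerf/TorchRec reference code.

```python
import torch
import torch.nn as nn

class LowRankCrossLayer(nn.Module):
    """DCNv2 cross layer: x_out = x0 * (W @ x + b) + x, with W factored as V @ U."""
    def __init__(self, dim=3456, rank=512):
        super().__init__()
        self.U = nn.Linear(dim, rank, bias=False)   # 3456 -> 512
        self.V = nn.Linear(rank, dim, bias=True)    # 512 -> 3456 (its bias plays the role of b)

    def forward(self, x0, x):
        return x0 * self.V(self.U(x)) + x           # Hadamard product with the input features

x0 = torch.randn(8, 3456)                            # concatenated dense + pooled sparse features
x = x0
for layer in [LowRankCrossLayer() for _ in range(3)]: # 3 stacked cross layers
    x = layer(x0, x)
print(x.shape)                                        # torch.Size([8, 3456])
```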

31 of 50

New Large Language Model (LLM): Colossal Cleaned Common Crawl (C4) with GPT-3

32 of 50

New Large Language Model (LLM) Benchmark

  • Understanding language is at the core of Generative AI applications such as ChatGPT
  • Language models take word(s) as input and predict the probability of the next word(s)
    • E.g., Input: “Why did the chicken cross the ____”; Output: “Road.”
  • LLM Pre-Training teaches a model the relationship between words
  • LLMs are used for essay generation, code development, language translation, summarization, and even understanding genetic sequences
  • MLPerf LLM is based on pre-training the GPT-3 175B model, originally described by OpenAI (see the sketch after this list)
    • The LLM is left-to-right (causal) and decoder-only
    • MLPerf BERT is a bi-directional, encoder-only model with 0.34B parameters
    • MLPerf Transformer (retired) is a sequence-to-sequence, encoder-decoder model with 0.21B parameters
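
A minimal sketch, assuming PyTorch, of what "left-to-right (causal), decoder-only" means at the attention level: each position may attend only to itself and earlier positions, which is what lets the model be trained to predict the next token. Shapes are illustrative.

```python
import torch

seq_len = 8
scores = torch.randn(seq_len, seq_len)   # raw attention scores for one head

# Causal mask: position i may attend only to positions j <= i (upper triangle blocked).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(weights[0])   # the first token attends only to itself
```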


33 of 50

C4 Dataset

MLPerf LLM Benchmark

  • c4/en/3.0.1 - Colossal, Cleaned, Common Crawl corpus dataset hosted on HuggingFace
    • 305GB and ~174B tokens

  • Benchmark trains on a portion of the training set
    • Start from an initial checkpoint that is trained on 12.5B tokens
    • Benchmark measures training on ~1.3B tokens
  • Model Accuracy evaluation happens on ~1/20th (11.5M tokens) of the validation dataset

Note:

  • The benchmark covers only a small portion of LLM pre-training: the ~1.3B tokens trained here are roughly 0.4% of the ~300B tokens used for the full GPT-3 training run
  • This keeps the run time reasonable and allows wider participation

34 of 50

GPT-3 175B Reference Model

MLPerf LLM Benchmark

  • 96 Transformer Layers (decoder only)

  • SentencePiece with BPE Tokenizer (not in diagram)

  • Sequence Length of 2048 Tokens
    • Input tokens used to make a prediction

  • Adam Optimizer: Adaptive learning rate optimized for large problems

  • Starts from initial checkpoint and reaches target accuracy (2.69 log perplexity) by training over ~1.3B tokens

[Diagram: one Transformer Layer, stacked 96x; see the configuration sketch below]
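
A hedged configuration sketch of the GPT-3 175B architecture as a Python dataclass. The layer count (96) and sequence length (2048) come from this slide; the hidden size, head count, and vocabulary size are values from the original GPT-3 paper and are assumptions here, not MLPerf-specified numbers.

```python
from dataclasses import dataclass

@dataclass
class GPT3Config:
    n_layers: int = 96       # decoder-only transformer layers (from the slide)
    seq_len: int = 2048      # context length in tokens (from the slide)
    d_model: int = 12288     # hidden size (GPT-3 paper value, assumed)
    n_heads: int = 96        # attention heads (GPT-3 paper value, assumed)
    vocab_size: int = 50257  # assumed; the MLPerf reference uses a SentencePiece BPE tokenizer

cfg = GPT3Config()
print(cfg.d_model // cfg.n_heads)   # per-head dimension: 128
```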

35 of 50

Transformer Layer

MLPerf LLM Benchmark

  • 96 stacked Transformer Layers:
    • Multi-Head Self-Attention (memory-intensive)
    • MLP (compute-intensive)

  • GPT-3 has 175B parameters (~350GB in BF16); a rough parameter-count estimate is sketched after this list
  • Plus additional memory for optimizer states and activations

  • Must split across processors for training (e.g., reference requires at least 64 accelerators)

  • Inference needs multiple processors too!
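
As a rough sanity check on the 175B-parameter and ~350GB figures, here is a back-of-the-envelope estimate. The 12·L·d² rule of thumb (attention plus MLP weights, ignoring embeddings and biases) and the hidden size of 12288 are assumptions based on the standard GPT-3 architecture, not numbers from this slide.

```python
n_layers, d_model = 96, 12288        # GPT-3 175B (hidden size assumed from the paper)

# Per layer: ~4*d^2 for attention (Q, K, V, output) plus ~8*d^2 for the 4x-wide MLP.
params = 12 * n_layers * d_model**2  # ~1.74e11, close to the quoted 175B
bytes_bf16 = 2 * params              # BF16 stores 2 bytes per parameter

print(f"~{params / 1e9:.0f}B parameters, ~{bytes_bf16 / 1e9:.0f} GB in BF16")
```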

36 of 50

MLPerf Tiny v1.1 Benchmark

Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David Kanter, Sebastian Ahmed, Danilo Pau, Urmish Thakker, Antonio Torrini, Peter Warden, Jay Cordaro, Giuseppe Di Guglielmo, Javier Duarte, Stephen Gibellini, Videet Parekh, Honson Tran, Nhan Tran, Niu Wenxu, Xu Xuesong

https://arxiv.org/abs/2106.07597

37 of 50

MLPerf Tiny v1.1


Purpose:

  • Develop and evolve a benchmark suite for ultra-low-power ML systems (TinyML)
  • On-device real-time batch-of-one inference.

Typical Systems

  • 10s-100s MHz processors
  • ≲ MB of Flash and SRAM
  • ~mW power
  • Lightweight models (<1M parameters)

38 of 50

MLPerf Tiny v1.1 reference models - single stream only

Task | Dataset | Reference Model (parameters)
Keyword Spotting | Google Speech Commands | DS-CNN (52K params)
Visual Wake Words | Visual Wake Words dataset | MobileNetV1 0.25x (325K params)
Anomaly Detection | DCASE2020-Task2 / ToyADMOS | FC-AutoEncoder (270K params)
Image Classification | CIFAR-10 | ResNet-8 (96K params)

References:

  • Warden, Pete. "Speech Commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209 (2018).
  • Chowdhery, Aakanksha, et al. "Visual Wake Words Dataset." arXiv preprint arXiv:1906.05721 (2019).
  • Koizumi, Yuma, Shoichiro Saito, Noboru Harada, Hisashi Uematsu, and Keisuke Imoto. "ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection." In Proc. of WASPAA, 2019.
  • Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).

39 of 50

MLPerf Tiny v1.1 workloads

  • All workloads use the single stream scenario (a minimal latency-measurement sketch appears after the table below)
  • Optional energy measurement in collaboration with EEMBC

Task | Dataset | Reference Network | Target Quality
Keyword Spotting | Speech Commands v2 | DS-CNN | 90% top-1
Visual Wake Words | COCO 2014 | MobileNetV1 0.25x | 80% top-1
Image Classification | CIFAR-10 | ResNet-v1 | 85% top-1
Anomaly Detection | ToyADMOS | Deep Autoencoder | 0.85 AUC
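
A minimal, hypothetical sketch of single-stream (batch-of-one) latency measurement, the scenario used by all Tiny workloads; run_inference is a stand-in for the device-side invocation, not a LoadGen or EnergyRunner API.

```python
import time
import statistics

def measure_single_stream(run_inference, samples, warmup=10):
    """Issue one query at a time and record per-inference latency."""
    for s in samples[:warmup]:
        run_inference(s)                        # warm-up runs are not scored
    latencies = []
    for s in samples:
        t0 = time.perf_counter()
        run_inference(s)                        # batch-of-one, on-device inference
        latencies.append(time.perf_counter() - t0)
    return statistics.median(latencies)         # e.g., report a median latency
```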

40 of 50

EEMBC EnergyRunner™ Framework

[Diagram: the EEMBC EnergyRunner™ framework, used for both performance and energy measurement]

41 of 50

MLPerf Tiny v1.1 overview

  • Results: MLPerf™ Tiny v1.1 Results (Embargoed until 6/27/2023 9AM Pacific)
  • Submitters: Bosch, cTuning, fpgaConvNet, Kai Jiang, Krai, Nuvoton, Plumerai, Skymizer, STMicroelectronics, and Syntiant:
    • 159 results (+63% vs prior round), 10 submitters
    • 7 new submitters (highlighted in green above)
    • More power measurements (41 vs. 38 in prior round)
  • Much wider participation and growing adoption


42 of 50

Looking back at 4 releases


43 of 50

Thank you!

44 of 50

Backup Slides

45 of 50

New Object Detection: Open Images with RetinaNet

46 of 50

New Object Detection benchmark

  • Find bounding boxes around objects and label with correct class

  • Extremely common in computer vision, e.g., find pedestrians, count items on shelves, etc.

  • Updated to newer dataset and more accurate reference model


47 of 50

Open Images dataset for Object Detection

  • Open Images Dataset V6 + Extensions for MLPerf
    • Good licensing (CC-BY)
    • Downsampled to 800x800, 264 object classes
    • Hope to adopt Open Images for other benchmarks in the future
    • MLCommons will archive dataset and licenses

Dataset | Max resolution | Classes | Train size | Validation size
COCO (previous benchmark) | 640×480 | 80 | 116,277 | 5,000
Open Images v6.0 (full) | >1200×1600 | 601 | 1,743,042 | 41,620
Open Images v6.0-MLPerf-1000 | >1200×1600 | 264 | 1,170,301 | 24,781

Open Images v6.0-MLPerf-1000 uses a subset of the full Open Images dataset: it filters out all “super” classes and all classes with fewer than 1000 samples.

48 of 50

RetinaNet reference model

  • ResNeXt vs. ResNet (discussed in next slides)
  • Deeper backbone network (50 layers vs. 34) → Better accuracy
  • FPN improves detection at different scales by adding context → Better accuracy
  • Focal loss vs. SSD loss → Better accuracy, especially for ‘hard’ or ‘rare’ cases
  • Frozen batch norm breaks dependencies → Better scaling to large systems

[Diagram: RetinaNet architecture: ResNeXt-50 backbone, Feature Pyramid Network (FPN), class subnet (CNN+FCN), and box subnet (CNN+FCN); an illustrative code sketch follows]

The new reference model achieves much better accuracy: 0.34 mAP vs. 0.23 mAP.
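
As an illustration of this architecture family, here is a minimal sketch using torchvision's built-in RetinaNet. Note the assumptions: torchvision's model uses a ResNet-50-FPN backbone rather than the ResNeXt-50 backbone of the MLPerf reference, and a recent torchvision (0.13+) is assumed for the weights-style constructor arguments.

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# RetinaNet = backbone + FPN + class/box subnets (the MLPerf reference swaps in ResNeXt-50).
model = retinanet_resnet50_fpn(weights=None, weights_backbone=None, num_classes=264)
model.eval()

images = [torch.rand(3, 800, 800)]     # one 800x800 image with values in [0, 1]
with torch.no_grad():
    detections = model(images)         # list of dicts with "boxes", "labels", "scores"
print(detections[0]["boxes"].shape)
```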

49 of 50

Dense Convolutions

Grouped Convolutions

  • Reduce the parameters and compute operations (FLOPs) vs. regular convolution
  • Originally used in AlexNet to fit a convolution in small memory
  • Groups tend to learn different aspects, e.g., AlexNet 2 groups learn B/W + color
  • ResNeXt uses grouped convolutions extensively

[Diagram: dense convolution vs. grouped convolution (2 groups), with depth = channels (CH); filters are split into groups and each group operates on a separate subset of channels. A code sketch follows.]
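
A minimal PyTorch sketch of the parameter savings from grouped convolutions; the channel counts and group count are illustrative, not values from the slide.

```python
import torch.nn as nn

in_ch, out_ch, k = 128, 128, 3

dense = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)               # ordinary convolution
grouped = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1, groups=32)  # 32 groups, ResNeXt-style

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(grouped))   # grouped uses roughly 1/32 of the weight parameters
```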

50 of 50

ResNet

ResNeXt

  • Simple design like ResNet
  • Separate groups like Inception
  • ResNeXt has best of both!

  • Tune # of groups (cardinality)
  • Similar compute
  • Better accuracy in some settings

[Diagram: ResNet vs. ResNeXt blocks, annotated with input channels, filter sizes, and output channels; an illustrative block sketch follows]
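
A minimal sketch of a ResNeXt-style bottleneck block in PyTorch, built around a grouped 3x3 convolution. The channel widths and cardinality (32 groups) follow the common ResNeXt-50 32x4d configuration and are assumptions, not values given on the slide.

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """1x1 reduce -> grouped 3x3 -> 1x1 expand, with an identity shortcut."""
    def __init__(self, channels=256, width=128, groups=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False),  # grouped convolution
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))   # residual (identity) connection

x = torch.randn(1, 256, 56, 56)
print(ResNeXtBottleneck()(x).shape)           # torch.Size([1, 256, 56, 56])
```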