MLPerf 2Q23
Briefing
Embargoed until 6/27/2023 @ 9am Pacific
David Kanter
Executive Director
Agenda
2
Objectives
3
MLPerf briefing materials index
Documents and recording of this session will be available at: https://drive.google.com/drive/folders/1DEktafsEpTBwcE94SWPBHeawqA9LUzGA
New version of MLPerf Mobile app available for Android and iOS, contact mobile-app@mlcommons.org for access and support
4
Document | Link |
MLPerf Training 3.0 Results | |
MLPerf Training 3.0 Supplemental Discussion | |
MLPerf Tiny 1.1 Results | |
MLPerf Tiny 1.1 Supplemental Discussion | |
Press Release | |
5
Founding Members
MLCommons is a global community
Academic Institutions
Members
6
Open engineering organizations
AI/ML organizations
We need an open engineering organization to create better ML for everyone
MLCommons: Better ML for Everyone
7
Benchmarks
Datasets
Best practices
ML innovation
Research
8
“What gets measured, gets improved.” — Peter Drucker
Benchmarks drive progress and transparency
Benchmarking aligns the entire community in pursuit of the same clear objective.
MLPerf Training - ahead of Moore’s Law
9
* indicates the benchmark's accuracy target was increased in MLPerf Training v0.6
Benchmarking ML systems
10
[Diagram: what ML system benchmarks are used to do, what they require, and what they provide]
11
ML is a full system problem
Data, algorithms, software, architecture, silicon, and scale all interact
MLPerf Goals
12
Enforce performance result replicability to ensure reliable results
Use representative workloads reflecting production use-cases
Encourage innovation to improve the state-of-the-art of ML
Accelerate progress in ML via a fair and useful measurement
Serve both the commercial and research communities
Keep benchmarking affordable so that all can participate
13
MLPerf Breadth: µWatts to MegaWatts
Evolution over time
Improving technical maturity
Standardized methodology for Training
Power measurements for Inference, Tiny
Mobile App on Android, iOS, Windows
Many new workloads
Compatibility, historical comparisons
[Timeline chart, 2018-2023, of MLPerf benchmark suites ordered by scale from megawatts to microwatts: Training - HPC, Training, Inference - Datacenter, Inference - Edge, Inference - Mobile, Inference - Tiny (IoT), Storage]
MLPerf Training Benchmark
Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia
https://arxiv.org/abs/1910.01500
15
MLPerf Training benchmark definition
A benchmark is defined by a dataset (e.g., ImageNet), a model, and a target quality (e.g., 75.9%).
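In other words, a submission trains the model on the given dataset until it reaches the target quality, and the result is how long that takes. Below is a minimal sketch of such a harness, assuming caller-supplied training and evaluation callables; these are placeholders, not MLPerf reference code.

```python
import time

def run_benchmark(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Train until evaluate() reaches target_quality; return time-to-train in seconds.

    train_one_epoch and evaluate are hypothetical callables standing in for a real
    benchmark's training loop and quality metric (e.g., ImageNet top-1 accuracy).
    """
    start = time.time()
    for _ in range(max_epochs):
        train_one_epoch()                    # one pass over the training set
        quality = evaluate()                 # e.g., top-1 accuracy on the validation set
        if quality >= target_quality:        # e.g., 0.759 for ResNet-50 v1.5
            return time.time() - start
    raise RuntimeError(f"target quality {target_quality} not reached in {max_epochs} epochs")
```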
16
Two divisions with different model restrictions
Both divisions use the same dataset (e.g., ImageNet) and target quality (e.g., 75.9%); they differ in which models may be used.
Closed division: a specified model (e.g., ResNet-50 v1.5) → direct comparisons
Open division: any model → innovation
17
Metric: time-to-train
Alternative metric: throughput
Easy / cheap to measure
But higher throughput does not imply fewer epochs to converge
Throughput can be increased at the cost of total time to train!
Time-to-train (end-to-end)
Time to solution!
Computationally expensive
High variance
Least bad choice
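A toy numeric illustration of why throughput alone can mislead; the throughputs and epoch counts below are made up for illustration and are not MLPerf results.

```python
# Hypothetical example: a larger batch size raises measured throughput, but the
# model then needs more epochs to reach the quality target, so the end-to-end
# time-to-train actually gets worse.
dataset_size = 1_281_167  # ImageNet-1k training images

configs = {
    "small batch":  {"images_per_sec": 10_000, "epochs_to_target": 60},
    "larger batch": {"images_per_sec": 14_000, "epochs_to_target": 100},
}

for name, c in configs.items():
    hours = dataset_size * c["epochs_to_target"] / c["images_per_sec"] / 3600
    print(f"{name}: {c['images_per_sec']} img/s, "
          f"{c['epochs_to_target']} epochs -> {hours:.1f} h time-to-train")
```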
18
Time-to-train excludes (see the timing sketch below):
System initialization (depends on cluster configuration and state)
Model initialization (disproportionate for big systems with small benchmarking datasets)
Data reformatting (mandating a format would give an advantage to some systems)
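A sketch of where the benchmark clock starts and stops under these rules; all function names are hypothetical placeholders rather than the reference implementation.

```python
import time

def timed_run(setup_system, reformat_data, init_model, train_to_target_quality):
    """Illustrative timing boundary: only training to the target quality is timed."""
    setup_system()              # cluster bring-up: excluded from the clock
    reformat_data()             # converting data to a submitter-chosen format: excluded
    init_model()                # model/weight initialization: excluded
    start = time.time()         # the benchmark clock starts here
    train_to_target_quality()   # run until the quality target is reached
    return time.time() - start  # reported time-to-train
```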
MLPerf categories and divisions
19
Listening to the results
20
MLPerf Training v3.0 Suite
21
Task | Real World Application Examples |
Recommendation (*NEW*) | Content or shopping recommendation, e.g., search, social media, ads |
Speech recognition | Speech-to-text for smartphones, hands-free driver assistance |
NLP | Search, translation, chatbots, summarization |
LLM (*NEW*) | Search, translation, chatbots, summarization |
Image classification | Image labeling, general vision |
Object detection | Pedestrian detection, manufacturing defect detection, red-eye reduction |
3D segmentation | Medical image analysis, e.g., tumor identification |
Reinforcement learning | Robotics, circuit design and optimization |
MLPerf Training v3.0 Suite detail
22
Task | Dataset | Model | Quality Target |
Recommendation (*NEW*) | Criteo 4TB multi-hot | DLRM-dcnv2 | 0.8032 AUC |
Speech recognition | LibriSpeech | RNN-T | 0.058 Word Error Rate |
NLP | Wikipedia 2020-01-01 | BERT-large | 0.72 Mask-LM accuracy |
LLM (*NEW*) | C4 | GPT-3 | 2.69 log perplexity |
Image classification | ImageNet 2012 | ResNet-50 v1.5 | 75.9% top-1 |
Object detection | Open Images (800x800) | RetinaNet | 0.34 mAP |
Object detection (heavy) | COCO 2017 | Mask R-CNN | 0.377 Box min AP and 0.339 Mask min AP |
3D segmentation | 2019 KiTS | 3D U-Net | 0.908 Mean DICE score |
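For the LLM row in the table above, the target is expressed as a log perplexity, i.e., the mean per-token cross-entropy (in nats) on the validation set. A small worked example of how that number is interpreted; the per-token values are made up.

```python
import math

target_log_perplexity = 2.69
print(f"equivalent perplexity: {math.exp(target_log_perplexity):.1f}")  # about 14.7

# Hypothetical per-token negative log-likelihoods from a validation pass:
token_nlls = [2.1, 3.3, 2.8, 2.5]
log_ppl = sum(token_nlls) / len(token_nlls)          # mean cross-entropy per token
print(f"log perplexity = {log_ppl:.3f}, "
      f"target met: {log_ppl <= target_log_perplexity}")   # 2.675, True
```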
MLPerf Training v3.0 overview
23
New Recommendation: Criteo 4TB Multi-hot with DLRM-dcnv2
Updating MLPerf recommendation
25
Why do we need to update?
26
Update to DLRM-dcnv2 - Summary
27
Criteo 4TB Multi-hot dataset
28
DLRM-dcnv2 reference model
29
DCNv2 details
30
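For context on the DCNv2 slides above: the defining feature of DCNv2 is its cross layer, which models explicit feature interactions as x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l alongside the usual MLP towers. Below is a minimal PyTorch sketch of that layer; the dimensions and the three-layer stack are illustrative, not the MLPerf reference implementation.

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCNv2-style cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)    # full-rank W and bias b

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl     # element-wise interaction plus residual

# Illustrative use over a concatenated dense-feature + embedding vector.
dim, batch = 128, 4                          # made-up sizes
x0 = torch.randn(batch, dim)
x = x0
for layer in [CrossLayer(dim) for _ in range(3)]:
    x = layer(x0, x)
print(x.shape)                               # torch.Size([4, 128])
```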
New Large Language Model (LLM): Colossal Clean Crawled Corpus (C4) with GPT-3
New Large Language Model (LLM) Benchmark
32
C4 (Colossal Clean Crawled Corpus) dataset
GPT-3 175B reference model
[Diagram: the reference model is a stack of 96 transformer layers]
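The reference model follows the published GPT-3 175B configuration (96 transformer layers, model dimension 12288, 96 attention heads) stacked over token and position embeddings. The sketch below shows the stacking pattern with schematic blocks and a deliberately tiny configuration so it runs anywhere; it is not the MLPerf reference code.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Schematic decoder block: causal self-attention + MLP, each with a residual."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        return x + self.mlp(self.ln2(x))

# GPT-3 175B stacks 96 such layers (d_model=12288, 96 heads); here we build a
# tiny toy configuration so the sketch runs on ordinary hardware.
n_layers, d_model, n_heads = 4, 64, 4
stack = nn.Sequential(*[Block(d_model, n_heads) for _ in range(n_layers)])
x = torch.randn(2, 16, d_model)            # (batch, sequence, embedding)
print(stack(x).shape)                      # torch.Size([2, 16, 64])
```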
MLPerf Tiny v1.1 Benchmark
Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David Kanter, Sebastian Ahmed, Danilo Pau, Urmish Thakker, Antonio Torrini, Peter Warden, Jay Cordaro, Giuseppe Di Guglielmo, Javier Duarte, Stephen Gibellini, Videet Parekh, Honson Tran, Nhan Tran, Niu Wenxu, Xu Xuesong
MLPerf Tiny v1.1
37
Purpose:
Typical Systems
38
MLPerf Tiny v1.1 reference models - single stream only
Keyword Spotting: Google Speech Commands dataset, DS-CNN (52K parameters)
Visual Wake Words: Visual Wake Words dataset, MobileNetV1 0.25x (325K parameters)
Anomaly Detection: DCASE2020-Task2 / ToyADMOS dataset, FC-AutoEncoder (270K parameters)
Image Classification: CIFAR-10 dataset, ResNet-8 (96K parameters)
Dataset references:
Warden, Pete. "Speech Commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209 (2018).
Chowdhery, Aakanksha, et al. "Visual Wake Words dataset." arXiv preprint arXiv:1906.05721 (2019).
Koizumi, Yuma, Shoichiro Saito, Noboru Harada, Hisashi Uematsu, and Keisuke Imoto. "ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection." In Proc. of WASPAA, 2019.
Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).
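To give a sense of scale, here is a sketch of a small fully connected autoencoder in the spirit of the anomaly-detection reference model, with a parameter count. The layer sizes follow the commonly published 640 → 128×4 → 8 → 128×4 → 640 shape, but the authoritative definition lives in the MLPerf Tiny reference repository.

```python
import torch.nn as nn

def dense_block(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU())

# Sketch of an FC autoencoder over 640-dimensional log-mel feature vectors.
autoencoder = nn.Sequential(
    dense_block(640, 128), dense_block(128, 128), dense_block(128, 128),
    dense_block(128, 128), dense_block(128, 8),                         # encoder
    dense_block(8, 128), dense_block(128, 128), dense_block(128, 128),
    dense_block(128, 128), nn.Linear(128, 640),                         # decoder
)
n_params = sum(p.numel() for p in autoencoder.parameters())
print(f"{n_params / 1e3:.0f}K parameters")   # ~268K, in line with the table above
```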
MLPerf Tiny v1.1 workloads
39
Task | Dataset | Reference Network | Target Quality |
Keyword Spotting | Speech Commands v2 | DS-CNN | 90% top-1 |
Visual Wake Words | COCO 2014 | MobileNetV1 0.25X | 80% top-1 |
Image Classification | CIFAR-10 | ResNet-v1 | 85% top-1 |
Anomaly Detection | ToyADMOS | Deep Autoencoder | 0.85 AUC |
EEMBC EnergyRunner™ Framework
40
Performance
Energy
MLPerf Tiny v1.1 overview
41
Looking back at 4 releases
42
Thank you!
Backup Slides
New Object Detection: Open Images with RetinaNet
New Object Detection benchmark
46
Open Images dataset for Object Detection
47
Dataset | Max resolution | Classes | Train size | Validation size |
COCO 2017 | 640×480 | 80 | 116,277 | 5,000 |
OpenImages v6.0 (full) | >1200×1600 | 601 | 1,743,042 | 41,620 |
OpenImages v6.0-MLPerf-1000 | >1200×1600 | 264 | 1,170,301 | 24,781 |
OpenImages v6.0-MLPerf-1000 is a subset of the full OpenImages dataset: it filters out all “super” classes and any class with fewer than 1,000 samples.
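A sketch of that filtering rule, assuming a simplified annotation format (a flat list of per-box class labels); the actual subset is produced by the MLPerf reference data scripts.

```python
from collections import Counter

def select_classes(box_labels, super_classes, min_samples=1000):
    """Keep classes that are not 'super' classes and have at least min_samples boxes."""
    counts = Counter(box_labels)
    return {cls for cls, n in counts.items()
            if n >= min_samples and cls not in super_classes}

# Toy usage with made-up labels.
labels = ["Cat"] * 1200 + ["Mammal"] * 5000 + ["Kazoo"] * 12
print(select_classes(labels, super_classes={"Mammal"}))   # {'Cat'}
```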
RetinaNet reference model
48
ResNeXt-50
Feature Pyramid Network (FPN)
Class subnet: CNN+FCN
Box subnet: CNN+FCN
New reference model achieves much better accuracy: 0.34 mAP vs. 0.23 mAP
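A structural sketch of how those pieces fit together: the ResNeXt-50 backbone and FPN produce multi-scale feature maps, and two small convolutional subnets (class and box) are applied to every pyramid level. This is a simplified schematic with made-up feature-map sizes, not the MLPerf reference implementation.

```python
import torch
import torch.nn as nn

def subnet(out_per_anchor, feat_channels=256, anchors=9):
    """RetinaNet-style head: four 3x3 convs followed by a prediction conv,
    shared across all FPN levels."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(feat_channels, anchors * out_per_anchor, 3, padding=1))
    return nn.Sequential(*layers)

num_classes = 264                    # classes in the OpenImages-MLPerf subset
class_head = subnet(num_classes)     # per-anchor class scores
box_head = subnet(4)                 # per-anchor box regression deltas

# Stand-ins for FPN outputs: one 256-channel feature map per pyramid level.
fpn_levels = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8, 4)]
for feat in fpn_levels:
    scores, boxes = class_head(feat), box_head(feat)
    print(scores.shape[1] // num_classes, boxes.shape[1] // 4)   # anchors per location: 9 9
```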
Dense vs. Grouped Convolutions
49
Dense convolution: depth = channels (CH); each filter spans all input channels
Grouped convolution (2 groups): filters are split into groups, and each group operates on a separate subset of the input channels
ResNet uses dense convolutions; ResNeXt uses grouped convolutions
[Diagram: filters drawn as input channels × filter size × output channels, with filter parameters highlighted]
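A quick way to see the difference is to compare parameter counts of a dense and a grouped convolution with the same channel counts (toy sizes, 2 groups as in the figure).

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch, k = 64, 64, 3
dense = nn.Conv2d(in_ch, out_ch, k, padding=1)              # every filter sees all 64 input channels
grouped = nn.Conv2d(in_ch, out_ch, k, padding=1, groups=2)  # each filter sees only 32 input channels

print(n_params(dense))    # 64*64*3*3 + 64 bias = 36,928
print(n_params(grouped))  # 64*32*3*3 + 64 bias = 18,496
```

Halving the parameters per filter this way is what lets ResNeXt pack more parallel paths into roughly the same parameter budget as a ResNet block.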
50