Competitions and Benchmarks
Tutorial
Isabelle Guyon
ChaLearn and Google
Credit: Dall-E
http://codabench.org
Hands-on Tutorial
http://codabench.org
Slides
Data Science at the Singularity
A data-science-driven phase transition, crossing a kind of singularity, is happening now.
Not the ‘AI singularity’ — the moment when AGI is achieved — this hasn’t happened. Rather, some fields in computation-driven research are approaching a Reproducibility Singularity, the moment when all the ingredients of frictionless reproducibility (FR) of computational research are assembled in a package that makes it, essentially, immediate to build on the work of other researchers on an open, global stage.
The emergence of FR follows from the maturation of three data science principles that came together after decades of work by many technologists and numerous research communities.
David Donoho
Principal Obstacles to Reproducible Science
Credit: Dream-studio
Competitions and Benchmarks:
the Science behind the Contests
Book in preparation, Pavao, Guyon, Viegas, Eds.
Fundamentals
The best of challenges and benchmarks
Practical issues and open problems
Challenge Design Roadmap
Hugo-Jair Escalante, Isabelle Guyon, Addison Howard, Walter Reade, Sébastien Treguer
Source: Dall-E
Gathering Your Tools
Source: Dall-E
Chapter 2 preprint
Assembling your team
Complementary skills needed:
Mission objectives
Different types of challenges:
Knowing the locals
Determine your target audience:
They differ in:
Source: Dall-E
Discovering the Terrain
Experiment to find the right competition protocol:
Find the right level of difficulty (use inverted competitions)
Source: Dall-E
Key to the treasure
Focus the challenge on one main scientific question (more is less):
Source: Dall-E
Charting the course
Peer review is important:
Source: insight.org
Avoiding common pitfalls
Source: Teamly.com
Enough Data?
This is a skill-based contest in which chance plays no role.
Source: Dall-E
Filtering participants improves generalization in competitions and benchmarks. A. Pavao, Z. Liu, I. Guyon, 2022.
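As a quick sanity check of whether a test set is large enough to separate competitors, one can look at the 1-sigma error bar of the estimated error rate, roughly sqrt(p(1-p)/n); if two competitors' bars overlap, the test set cannot reliably rank them. The snippet below is my own illustration, not taken from the slides or the cited paper.

```python
import math

def error_bar(p: float, n: int, k: float = 1.0) -> float:
    """k-sigma error bar of an error rate p estimated from n test examples."""
    return k * math.sqrt(p * (1.0 - p) / n)

# Two competitors at 10.0% and 9.5% error on a 1,000-example test set:
for p in (0.100, 0.095):
    print(f"error rate {p:.3f} +/- {error_bar(p, 1000):.3f}")
# Both 1-sigma bars are about +/-0.009, so the two scores are statistically
# indistinguishable on this test set.
```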
Bias in Data / Data Leakage
Image source: Dall-E
More on this: Kaggle data leakage tutorial
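For a concrete illustration (my own sketch, independent of the Kaggle tutorial), here is a classic leakage pitfall: selecting features on the full dataset before cross-validation. On purely random data, the leaky pipeline appears to learn something, while the honest pipeline correctly scores at chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # random features: there is nothing to learn
y = rng.integers(0, 2, size=100)   # random binary labels

# LEAKY: feature selection sees the labels of the future test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# CORRECT: feature selection happens inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # optimistically high, an artifact of leakage
print(f"honest CV accuracy: {honest:.2f}")  # close to 0.5, i.e., chance level
```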
Reaping the rewards
For the organizers, the rewards come from
Harvesting the Challenge Results:
Image source: Dall-E
Image source: Dall-E
Hands-on Tutorial
How to implement your challenge on http://codabench.org
http://codabench.org
Slides
What is Codabench?
Free and open-source platform for hosting competitions and benchmarks.
Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, Isabelle Guyon. Patterns, Volume 3, Issue 7, 8 July 2022. https://arxiv.org/abs/2110.05802
History of Codabench
2013: Microsoft open-sources Codalab. Competition bundles. Medical data; result submission.
2014: Codalab starts. Code submission, blind testing. Computer vision, speech, NLP, IR. MSCOCO: 361 participants.
2016: Use in education. AutoML: 687 participants. Hackathons, coopetitions.
2017: Compute workers, Dockers; scalability, reusability. Codalab Competitions moves to Université Paris-Saclay: 480 challenges, 10,000 users. See.4c EU prize: 2 million Euros.
2019: Reinforcement learning. Google sponsors the AutoDL series. Codabench development starts. RTE organizes the L2RPN challenge.
2020/21: New infrastructure. 50,000 users, 1,000 competitions, 600 submissions per day!
2022/23: Codabench used in the HADACA and COMETH EU projects. 4 physical servers, each with 12 disks of 16 TB, spread over 2 buildings; 20 GPUs; 100 competitions/month.
Solving AI challenges
Scientific
Industrial
Societal
Ethical
A challenge or benchmark task can be evaluated in two ways: result submission (participants run their own code and upload their predictions) or code submission (participants upload code, which the platform runs to produce the predictions).
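For a result submission, participants typically just zip their predictions file and upload it on the competition page. A minimal sketch follows; the expected file name and format are competition-specific, so "predictions.csv" below is only a placeholder.

```python
import zipfile

# Package the predictions for upload on the competition's submission page.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("predictions.csv", arcname="predictions.csv")
print("submission.zip is ready to upload")
```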
Create your own competition or benchmark
Two routes: use the wizard/editor, OR upload a competition bundle (under Benchmarks > Management).
TA DAH!
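For the bundle route, a competition bundle is a zip archive containing the competition configuration, documentation pages, data, and the scoring program. The layout below is only an illustration of what such a bundle typically packages; the authoritative schema is in the Codabench documentation.

```python
import zipfile
from pathlib import Path

bundle_dir = Path("my_competition")
# Illustrative contents (adapt to the documented bundle schema):
#   competition.yaml              title, phases, tasks, leaderboard definition
#   pages/overview.md             documentation shown to participants
#   scoring_program/              code that computes the leaderboard scores
#   input_data/, reference_data/  public data and hidden ground truth

with zipfile.ZipFile("bundle.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in bundle_dir.rglob("*"):
        if path.is_file():
            zf.write(path, arcname=path.relative_to(bundle_dir))
print("bundle.zip ready: upload it under Benchmarks > Management")
```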
Editor
1: select phases
2: edit test phase
3: change end date
4: save once
5: save twice
Participate
1: upload your submission
2: select it to show on the leaderboard
Leaderboard
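The leaderboard displays whatever scores the competition's scoring program outputs. Below is a minimal sketch of such a program; the exact input/output directory layout and scores file name are set by your competition configuration, so treat the paths as placeholders.

```python
import json
import sys
from pathlib import Path

def main(input_dir: str, output_dir: str) -> None:
    # Placeholder convention: submission under input/res, ground truth under input/ref.
    y_pred = (Path(input_dir) / "res" / "predictions.txt").read_text().split()
    y_true = (Path(input_dir) / "ref" / "labels.txt").read_text().split()

    accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

    # Scores written here are parsed by the platform and shown on the leaderboard.
    (Path(output_dir) / "scores.json").write_text(json.dumps({"accuracy": accuracy}))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```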
Admin Functions
Conclusion
Upcoming open-access book
Croissant format for ML datasets
An open format for ML datasets, based on Web standards, that represents ML data and metadata and supports Responsible AI.
Croissant serves both dataset consumers and dataset creators.
Find out more at: mlcommons.org/croissant
A Croissant description has three main parts:
- Dataset metadata
- Resources: FileObject(s) (single files) and FileSet(s) (directories / sets of homogeneous files)
- RecordSets: a tabular structure over structured and unstructured resources, with Fields (schema); supports joining and flattening in preparation for ML loading (e.g., via TFDS). Example RecordSet:
  field1 | field2 | field3
  a      | 1      | img1
  a      | 2      | img2
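As a usage sketch, a Croissant dataset can be consumed from Python with the mlcroissant library; the dataset URL and record-set name below are placeholders to replace with those published with the dataset you want to load.

```python
import mlcroissant as mlc  # pip install mlcroissant

# Load a dataset from its Croissant JSON-LD description (placeholder URL).
dataset = mlc.Dataset(jsonld="https://example.org/my_dataset/croissant.json")

# Iterate over one of its RecordSets; each record is keyed by the Fields above.
for record in dataset.records(record_set="default"):
    print(record)
    break
```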
Advertising
New Journal
Advertising
AI for Education AAAI’24
Workshop and Competition
Advertising
Test set size formula (bonus)
Imagine that you want to calculate the test set size allowing you to obtain an error bar with v significant digits (e.g., 1 ± 0.1 or 10 ± 1 would be 1 significant digit). The error bar is a symmetric confidence interval (CI) measured as a number k of sigmas away from the mean; for example, if the noise is normally distributed, a 1-sigma error bar (k = 1) is a 68% CI and a 2-sigma error bar (k = 2) is a 95% CI.
Imagine that you can evaluate, on some sample data with some baseline method, what you anticipate the mean μ of your loss function will be, as well as its standard deviation σ (for your top-ranking competitors). Then the number of test examples needed to get a k-sigma error bar with v significant digits is n = (k · 10^v · σ/μ)².
For the special case of classification problems, μ = p, the probability of error, which follows a Bernoulli distribution, and σ² = p(1 − p). For p << 1, we have σ² ≈ p, thus σ²/μ² ≈ 1/p. So for k = 1 (a 1-sigma error bar) and v = 1 (1 significant digit, the least demanding), we get the rule of thumb n·p = 100, i.e., about 100 errors on the test set.
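The rule above translates directly into a small helper; the sketch below simply restates the formula in Python, with the classification special case as a sanity check.

```python
import math

def test_set_size(mu: float, sigma: float, k: float = 1.0, v: int = 1) -> int:
    """Examples needed for a k-sigma error bar with v significant digits:
    n = (k * 10**v * sigma / mu) ** 2."""
    return math.ceil((k * 10 ** v * sigma / mu) ** 2)

# Classification with error rate p: mu = p and sigma = sqrt(p * (1 - p)).
p = 0.05
n = test_set_size(mu=p, sigma=math.sqrt(p * (1 - p)), k=1, v=1)
print(n, round(n * p))  # ~1900 examples, i.e., about 100 test errors (the n*p = 100 rule)
```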
Thanks for sharing the journey!
Upcoming open-access book
Chapter 2 preprint