1 of 43

Competitions and Benchmarks

Tutorial

Isabelle Guyon

ChaLearn and Google


Credit: Dall-E

http://codabench.org

2 of 43

Hands-on Tutorial


http://codabench.org

Slides

3 of 43

Data Science at the Singularity

A data-science-driven phase transition, crossing a kind of singularity, is happening now.

Not the ‘AI singularity’ — the moment when AGI is achieved — this hasn’t happened. Rather, some fields in computation-driven research are approaching a Reproducibility Singularity, the moment when all the ingredients of frictionless reproducibility (FR) of computational research are assembled in a package that makes it, essentially, immediate to build on the work of other researchers on an open, global stage.

The emergence of FR follows from the maturation of 3 data science principles that came together after decades of work by many technologists and numerous research communities:

  • data sharing,
  • code sharing, and
  • competitive challenges, however implemented in the particularly strong form of frictionless open services.


4 of 43

Principal Obstacles to Reproducible Science

  • Private code
  • Private data
  • Biased data
  • Cost of experiments
  • Restricted access to trained models
  • Leaderboard overfitting
  • No reporting of negative results


Credit: Dream-studio

5 of 43

Competitions and Benchmarks:

the Science behind the Contests

Book in preparation, Pavao, Guyon, Viegas, Eds.

  1. The life cycle of challenges and benchmarks

Fundamentals

  • Challenge design roadmap [preview]
  • Dataset development
  • How to judge a competition
  • How to ensure a long-lasting impact of a challenge

The best of challenges and benchmarks

  • Academic competitions
  • Industry/hiring competitions and benchmarks
  • Competitions and challenges in education
  • Benchmarks

Practical issues and open problems

  • Competition platforms
  • Hands-on tutorial on how to create your own challenge or benchmark [blog]
  • Special designs and competition protocols
  • Practical issues


6 of 43

Challenge Design Roadmap

Hugo-Jair Escalante, Isabelle Guyon, Addison Howard, Walter Reade, Sébastien Treguer


Source: Dall-E

7 of 43

Gathering Your Tools


Source: Dall-E

Chapter 2 preprint


12 of 43

Assembling your team

Complementary skills needed:

  • Coordination
  • Challenge design
  • Domain expertise
  • Methods (ML, AI)
  • Data quality
  • Beta-testing
  • Fundraising
  • Communication
  • Analysis and report writing


13 of 43

Mission objectives

Different types of challenges:

  • Recruiting
  • Research and Development
  • Academic
  • Public Relations
  • Branding


14 of 43

Knowing the locals

Determine your target audience:

  • Novices
  • Professional data scientists
  • Researchers and academics
  • Industry engineers
  • ML enthusiasts/hobbyists
  • Students
  • Competitive coders
  • Innovators and entrepreneurs

They differ in:

  • Level of expertise (domain and ML)
  • Motivation
  • Time commitment


Source: Dall-E

15 of 43

Discovering the Terrain

Experiment to find the right competition protocol:

  • Type of data
  • Metrics
  • Baseline methods
  • Information available

Find the right level of difficulty (use inverted competitions)


Source: Dall-E

16 of 43

Key to the treasure

Focus the challenge on one main scientific question (more is less):

  • Whether (Comparative)
  • What (Discovery)
  • Why (Causality)
  • How (Prescriptive)
  • What For (Purpose-driven)


Source: Dall-E

17 of 43

Charting the course

Peer review is important:

  • Identify a second-tier landing conference and run a first “draft” of your competition there
  • Write a NeurIPS competition track proposal (NeurIPS template) and take the reviews into account.


Source: insight.org

18 of 43

Avoiding common pitfalls

  • Clarity:
    • ONE primary question
  • Difficulty calibration:
    • Match difficulty to audience
  • Barrier to entry:
    • Starting kit
  • Beta testing:
    • Test EVERYTHING!
  • Incentives and communication:
    • Monitor participation
  • Cheating:
    • Prefer code submission
  • Quantity and quality of data …


Source: Teamly.com

19 of 43

Enough Data?

  • Bottom line: get a stable ranking of participants

This is a skill-based contest in which chance plays no role.

  • Rule-of-thumb (classification): np = 100 (n = test set size; p = anticipated error rate of winner)

  • Importance of dry runs: Check the stability of the ranking (see the sketch below)
  • Use 2-phase challenges
    • Eliminatory phase: Keep only participants above baseline
    • Final phase: Single submission


Source: Dall-E

20 of 43

Bias in Data / Data Leakage


Image source: Dall-E

  • Sampling bias:
    • Some groups under/over represented
    • Example: face classifier trained predominantly on white male subjects

  • Spurious dependency bias:
    • Some variables are “shortcuts” to the target
    • Example: indoor/outdoor pet classification

  • Information bias (data preparation):
    • Target information leaks into the features
    • Example: Using the treatment to predict the diagnosis (see the leakage check sketched below)
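A quick, generic way to probe for such leakage before releasing the data is to check whether any single feature already predicts the target suspiciously well; near-perfect accuracy from one column (e.g., a treatment column predicting the diagnosis) is a red flag. This is an illustrative sketch, not a procedure prescribed by the book:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def single_feature_leakage_scan(X, y, cv=5):
    """Rank features by how well each one ALONE predicts the target.

    X: array (n_samples, n_features); y: array (n_samples,).
    Returns (feature_index, cross-validated accuracy) sorted from most
    to least predictive; scores close to 1.0 suggest target leakage.
    """
    results = []
    for j in range(X.shape[1]):
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        acc = cross_val_score(clf, X[:, [j]], y, cv=cv).mean()
        results.append((j, acc))
    return sorted(results, key=lambda t: -t[1])
```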

21 of 43

Reaping the rewards

For the organizers, the rewards come from harvesting the challenge results:

  • Create “Fact Sheets”
  • Ask participants to open-source their code
  • Organize a workshop
  • Edit proceedings
  • Co-author a paper with the winners
  • Organize a follow-up challenge


Image source: Dall-E

22 of 43


Image source: Dall-E

How to implement your challenge on:

http://codabench.org

23 of 43

Hands-on Tutorial


http://codabench.org

Slides

24 of 43

What is Codabench?

Free and open-source platform for hosting competitions and benchmarks.


Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, Isabelle Guyon. Patterns, Volume 3, Issue 7, 8 July 2022. https://arxiv.org/abs/2110.05802

25 of 43

History of Codabench


  • 2013 (competition bundles): Microsoft open-sources CodaLab; CodaLab starts with medical data and result submission.
  • 2014 (code submission, blind testing): competitions in computer vision, speech, NLP, and IR; MSCOCO: 361 participants; AutoML: 687 participants.
  • 2016 (use in education): hackathons; coopetitions.
  • 2017 (compute workers, Docker containers; scalability, reusability): CodaLab Competitions moves to U. Paris-Saclay; 480 challenges, 10,000 users; See.4C EU prize of 2 million Euros.
  • 2019 (reinforcement learning): Google sponsors the AutoDL series; RTE organizes the L2RPN challenge; Codabench development starts.
  • 2020/21 (new infrastructure): 50,000 users, 1,000 competitions, 600 submissions per day!
  • 2022/23: Codabench used in the HADACA and COMETH EU projects; 4 physical servers, each with 12 disks of 16 TB, spread over 2 buildings; 20 GPUs; 100 competitions/month.

26 of 43

Solving AI challenges


Scientific

Industrial

Societal

Ethical

27 of 43


challenge or benchmark task

28 of 43

Result submission


29 of 43

Code submission


predictions

30 of 43


Create your own competition or benchmark

Wizard / editor OR Bundles

31 of 43


32 of 43

Upload bundle

Benchmarks > Management

33 of 43


TA DAH!

34 of 43

Editor

1: select phases

2: edit test phase

3: change end date

4: save once

5: save twice

35 of 43


Participate

1: Upload

2: Select to show on leaderboard

  • Secret key
  • View execution and logs

36 of 43


Leaderboard

  • Competition or benchmark mode
  • Multiple tasks
  • Multiple metrics
  • Detailed results

37 of 43


Admin Functions

  • Manage participant list
  • Monitor submissions
  • Make hot changes
  • Add compute workers

38 of 43


Conclusion

  • Codabench = versatile open-source platform
  • Functionality:
    • Result, code, and dataset submissions.
    • Challenges or benchmarks.
  • Hosting:
    • Free public access to UP-Saclay instance.
    • Can supply own compute workers.
    • Local deployment possible.
  • Current Limitations:
    • No double-entry leaderboards (code+data).
    • No hardware or human-in-the-loop benchmarks.
  • Future Enhancements:
    • Competition template library.
    • Templates for fact sheets.
    • Support for coopetitions.

Upcoming open-access book

39 of 43

Croissant format for ML datasets

An open format for ML datasets, based on Web standards, that represents ML data and metadata and supports Responsible AI, designed to:

  • Reduce friction for using datasets across ML tools and platforms
  • Make it easy to publish, discover and reuse ML datasets.

Croissant enables dataset consumers to load and use datasets across ML tools and platforms with minimal friction.

Croissant supports dataset creators with:

  • A visual editor to create and modify datasets, automate the description of the data (e.g., CSV columns), and get recommendations to improve metadata
  • A Python library to validate, manipulate and convert datasets
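The Python library mentioned above is distributed as the mlcroissant package; the sketch below shows a minimal loading workflow. It is illustrative only: the metadata URL and the record-set name are placeholders, and the exact API is documented at mlcommons.org/croissant.

```python
# pip install mlcroissant
import mlcroissant as mlc

# Point to a Croissant JSON-LD metadata file (placeholder URL).
ds = mlc.Dataset(jsonld="https://example.org/my_dataset/croissant.json")

# Dataset-level metadata: name, description, license, ...
print(ds.metadata.name)

# Iterate over the records of one RecordSet (the name is dataset-specific).
for record in ds.records(record_set="default"):
    print(record)
```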

Find out more at: mlcommons.org/croissant

Croissant data model (diagram overview):

  • Dataset metadata:
    • name, description, license, …
    • ML-specific attributes: splits, features, labels, …
    • Responsible AI attributes
  • Resources:
    • FileObject(s): single files (CSV, JSON, Zip, …)
    • FileSet(s): directories / sets of homogeneous files (images, text, …)
  • RecordSets: tabular structure over structured and unstructured resources; supports joining and flattening in preparation for ML loading (e.g., into TFDS).
    • Fields (schema): name, type, references, nesting, ...
    • Example records: (field1=a, field2=1, field3=img1), (field1=a, field2=2, field3=img2)

Advertising

40 of 43

New Journal

Advertising

41 of 43

AI for Education AAAI’24

Workshop and Competition

Advertising

42 of 43

Bonus: Test set size formula

Imagine that you want to calculate the test set size allowing you to obtain an error bar with ν significant digits (e.g., 1 ± 0.1 or 10 ± 1 would be 1 significant digit). The error bar is a symmetric confidence interval (CI) measured as a number k of sigmas away from the mean; for example, if the noise is normally distributed, a 1-sigma error bar (k = 1) is a 68% CI and a 2-sigma error bar (k = 2) is a 95% CI.

Imagine that you can estimate, on some sample data and with some baseline method, the mean μ of your loss function as well as its standard deviation σ (for your top-ranking competitors). Requiring the half-width of the CI, k·σ/√n, to be at most μ·10^(−ν), the number of test examples needed to get a k-sigma error bar with ν significant digits is:

n = 10^(2ν) · k² · σ² / μ²

For the special case of classification problems, the per-example loss follows a Bernoulli distribution with mean μ = p (the probability of error) and variance σ² = p(1 − p). For p ≪ 1, we have σ² ≈ p, thus σ²/μ² ≈ 1/p. So for k = 1 (a 1-sigma error bar) and ν = 1 (1 significant digit; least demanding), we get the rule-of-thumb: np = 100.
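The formula can be checked numerically with a few lines of Python; this is a minimal illustrative sketch (not from the book), using the classification special case to recover the np = 100 rule-of-thumb:

```python
import math

def test_set_size(mu, sigma, k=1, nu=1):
    """Test examples needed for a k-sigma error bar on the mean loss
    with nu significant digits: n = 10**(2*nu) * k**2 * sigma**2 / mu**2."""
    return math.ceil(10 ** (2 * nu) * (k * sigma / mu) ** 2)

# Classification special case: mu = p, sigma**2 = p * (1 - p).
p = 0.01  # anticipated error rate of the winner
n = test_set_size(p, math.sqrt(p * (1 - p)))
print(n, n * p)  # ~9,900 examples, i.e. n*p is roughly 100
```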

 

43 of 43


Thanks for sharing the journey!

Upcoming open-access book

Chapter 2 preprint