1 of 43

Competitions and Benchmarks

Tutorial

Isabelle Guyon

ChaLearn and Google


Credit: Dall-E

http://codabench.org

2 of 43

Hands-on Tutorial


http://codabench.org

Slides

3 of 43

Data Science at the Singularity

A data-science-driven phase transition, crossing a kind of singularity, is happening now.

Not the ‘AI singularity’ — the moment when AGI is achieved — this hasn’t happened. Rather, some fields in computation-driven research are approaching a Reproducibility Singularity, the moment when all the ingredients of frictionless reproducibility (FR) of computational research are assembled in a package that makes it, essentially, immediate to build on the work of other researchers on an open, global stage.

The emergence of FR follows from the maturation of 3 data science principles that came together after decades of work by many technologists and numerous research communities:

  • data sharing,
  • code sharing, and
  • competitive challenges, however implemented in the particularly strong form of frictionless open services.


4 of 43

Principal Obstacles to Reproducible Science

  • Private code
  • Private data
  • Biased data
  • Cost of experiments
  • Restricted access to trained models
  • Leaderboard overfitting
  • No reporting of negative results


Credit: Dream-studio

5 of 43

Competitions and Benchmarks:

the Science behind the Contests

Book in preparation, Pavao, Guyon, Viegas, Eds.

  1. The life cycle of challenges and benchmarks

Fundamentals

  • Challenge design roadmap [preview]
  • Dataset development
  • How to judge a competition
  • How to ensure a long-lasting impact of a challenge

The best of challenges and benchmarks

  • Academic competitions
  • Industry/hiring competitions and benchmarks
  • Competitions and challenges in education
  • Benchmarks

Practical issues and open problems

  • Competition platforms
  • Hands-on tutorial on how to create your own challenge or benchmark [blog]
  • Special designs and competition protocols
  • Practical issues


6 of 43

Challenge Design Roadmap

Hugo-Jair Escalante, Isabelle Guyon, Addison Howard, Walter Reade, Sébastien Treguer


Source: Dall-E

7 of 43

Gathering Your Tools


Source: Dall-E

Chapter 2 preprint


12 of 43

Assembling your team

Complementary skills needed:

  • Coordination
  • Challenge design
  • Domain expertise
  • Methods (ML, AI)
  • Data quality
  • Beta-testing
  • Fundraising
  • Communication
  • Analysis and report writing


13 of 43

Mission objectives

Different types of challenges:

  • Recruiting
  • Research and Development
  • Academic
  • Public Relations
  • Branding


14 of 43

Knowing the locals

Determine your target audience:

  • Novices
  • Professional data scientists
  • Researchers and academics
  • Industry engineers
  • ML enthusiasts/hobbyists
  • Students
  • Competitive coders
  • Innovators and entrepreneurs

They differ in:

  • Level of expertise (domain and ML)
  • Motivation
  • Time commitment


Source: Dall-E

15 of 43

Discovering the Terrain

Experiment to find the right competition protocol:

  • Type of data
  • Metrics
  • Baseline methods
  • Information available

Find the right level of difficulty (use inverted competitions)


Source: Dall-E

16 of 43

Key to the treasure

Focus the challenge on one main scientific question (more is less):

  • Whether (Comparative)
  • What (Discovery)
  • Why (Causality)
  • How (Prescriptive)
  • What For (Purpose-driven)


Source: Dall-E

17 of 43

Charting the course

Peer review is important:

  • Identify a second-tier landing conference and run a first “draft” of your competition there
  • Write a NeurIPS competition track proposal (NeurIPS template) and take the reviews into account.


Source: insight.org

18 of 43

Avoiding common pitfalls

  • Clarity:
    • ONE primary question
  • Difficulty calibration:
    • Match difficulty to audience
  • Barrier to entry:
    • Starting kit
  • Beta testing:
    • Test EVERYTHING!
  • Incentives and communication:
    • Monitor participation
  • Cheating:
    • Prefer code submission
  • Quantity and quality of data …


Source: Teamly.com

19 of 43

Enough Data?

  • Bottom line: get a stable ranking of participants

This is a skill-based contest in which chance plays no role.

  • Rule-of-thumb (classification): np = 100 (n = test set size; p = anticipated error rate of winner)

  • Importance of dry runs: Check the stability of the ranking (see the sketch below)
  • Use 2-phase challenges
    • Eliminatory phase: Keep only participants above baseline
    • Final phase: Single submission


Source: Dall-E

20 of 43

Bias in Data / Data Leakage


Image source: Dall-E

  • Sampling bias:
    • Some groups under/over represented
    • Example: face classifier trained predominantly on white male subjects

  • Spurious dependency bias:
    • Some variables are “shortcuts” to the target
    • Example: indoor/outdoor pet classification

  • Information bias (data preparation):
    • Target information leaks into the features
    • Example: Using the treatment to predict the diagnosis (see the leakage check sketched below)
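A quick, generic way to probe for such leakage before releasing the data is to check whether any single feature already predicts the target suspiciously well; near-perfect accuracy from one column (e.g., a treatment column predicting the diagnosis) is a red flag. This is an illustrative sketch, not a procedure prescribed by the book:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def single_feature_leakage_scan(X, y, cv=5):
    """Rank features by how well each one ALONE predicts the target.

    X: array (n_samples, n_features); y: array (n_samples,).
    Returns (feature_index, cross-validated accuracy) sorted from most
    to least predictive; scores close to 1.0 suggest target leakage.
    """
    results = []
    for j in range(X.shape[1]):
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        acc = cross_val_score(clf, X[:, [j]], y, cv=cv).mean()
        results.append((j, acc))
    return sorted(results, key=lambda t: -t[1])
```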

21 of 43

Reaping the rewards

For the organizers, the rewards come from harvesting the challenge results:

  • Create “Fact Sheets”
  • Ask participants to open-source their code
  • Organize a workshop
  • Edit proceedings
  • Co-author a paper with the winners
  • Organize a follow-up challenge


Image source: Dall-E

22 of 43


Image source: Dall-E

How to implement your challenge on:

http://codabench.org

23 of 43

Hands-on Tutorial


http://codabench.org

Slides

24 of 43

What is Codabench?

Free and open-source platform for hosting competitions and benchmarks.


Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, Isabelle Guyon. Patterns, Volume 3, Issue 7, 8 July 2022. https://arxiv.org/abs/2110.05802

25 of 43

History of Codabench


  • 2013 (competition bundles): Microsoft open-sources CodaLab; CodaLab starts with medical data and result submission.
  • 2014 (code submission, blind testing): competitions in computer vision, speech, NLP, and IR; MSCOCO: 361 participants; AutoML: 687 participants.
  • 2016 (use in education): hackathons; coopetitions.
  • 2017 (compute workers, Docker containers; scalability, reusability): CodaLab Competitions moves to U. Paris-Saclay; 480 challenges, 10,000 users; See.4C EU prize of 2 million Euros.
  • 2019 (reinforcement learning): Google sponsors the AutoDL series; RTE organizes the L2RPN challenge; Codabench development starts.
  • 2020/21 (new infrastructure): 50,000 users, 1,000 competitions, 600 submissions per day!
  • 2022/23: Codabench used in the HADACA and COMETH EU projects; 4 physical servers, each with 12 disks of 16 TB, spread over 2 buildings; 20 GPUs; 100 competitions/month.

26 of 43

Solving AI challenges


Scientific

Industrial

Societal

Ethical

27 of 43


challenge or benchmark task

28 of 43

Result submission


29 of 43

Code submission


predictions

30 of 43


Create your own competition or benchmark

Wizard / editor OR Bundles

31 of 43


32 of 43

Upload bundle

Benchmarks > Management

33 of 43


TA DAH!

34 of 43

Editor

1: select phases

2: edit test phase

3: change end date

4: save once

5: save twice

35 of 43


Participate

1: Upload

2: Select to show on leaderboard

  • Secret key
  • View execution and logs

36 of 43


Leaderboard

  • Competition or benchmark mode
  • Multiple tasks
  • Multiple metrics
  • Detailed results

37 of 43


Admin Functions

  • Manage participant list
  • Monitor submissions
  • Make hot changes
  • Add compute workers

38 of 43


Conclusion

  • Codabench = versatile open-source platform
  • Functionality:
    • Result, code, and dataset submissions.
    • Challenges or benchmarks.
  • Hosting:
    • Free public access to UP-Saclay instance.
    • Can supply own compute workers.
    • Local deployment possible.
  • Current Limitations:
    • No double-entry leaderboards (code+data).
    • No hardware or human-in-the-loop benchmarks.
  • Future Enhancements:
    • Competition template library.
    • Templates for fact sheets.
    • Support for coopetitions.

Upcoming open-access book

39 of 43

Croissant format for ML datasets

An open format for ML datasets, based on Web standards, that represents ML data and metadata and supports Responsible AI, designed to:

  • Reduce friction for using datasets across ML tools and platforms
  • Make it easy to publish, discover and reuse ML datasets.

Croissant enables dataset consumers to load and use datasets across ML tools and platforms with minimal friction.

Croissant supports dataset creators with:

  • A visual editor to create and modify datasets, automate the description of the data (e.g., CSV columns), and get recommendations to improve metadata
  • A Python library to validate, manipulate and convert datasets
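The Python library mentioned above is distributed as the mlcroissant package; the sketch below shows a minimal loading workflow. It is illustrative only: the metadata URL and the record-set name are placeholders, and the exact API is documented at mlcommons.org/croissant.

```python
# pip install mlcroissant
import mlcroissant as mlc

# Point to a Croissant JSON-LD metadata file (placeholder URL).
ds = mlc.Dataset(jsonld="https://example.org/my_dataset/croissant.json")

# Dataset-level metadata: name, description, license, ...
print(ds.metadata.name)

# Iterate over the records of one RecordSet (the name is dataset-specific).
for record in ds.records(record_set="default"):
    print(record)
```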

Find out more at: mlcommons.org/croissant

Croissant data model (diagram overview):

  • Dataset metadata:
    • name, description, license, …
    • ML-specific attributes: splits, features, labels, …
    • Responsible AI attributes
  • Resources:
    • FileObject(s): single files (CSV, JSON, Zip, …)
    • FileSet(s): directories / sets of homogeneous files (images, text, …)
  • RecordSets: tabular structure over structured and unstructured resources; supports joining and flattening in preparation for ML loading (e.g., into TFDS).
    • Fields (schema): name, type, references, nesting, ...
    • Example records: (field1=a, field2=1, field3=img1), (field1=a, field2=2, field3=img2)

Advertising

40 of 43

New Journal

Advertising

41 of 43

AI for Education AAAI’24

Workshop and Competition

Advertising

42 of 43

Bonus: Test set size formula

Imagine that you want to calculate the test set size allowing you to obtain an error bar with ν significant digits (e.g., 1 ± 0.1 or 10 ± 1 would be 1 significant digit). The error bar is a symmetric confidence interval (CI) measured as a number k of sigmas away from the mean; for example, if the noise is normally distributed, a 1-sigma error bar (k = 1) is a 68% CI and a 2-sigma error bar (k = 2) is a 95% CI.

Imagine that you can estimate, on some sample data and with some baseline method, the mean μ of your loss function as well as its standard deviation σ (for your top-ranking competitors). Requiring the half-width of the CI, k·σ/√n, to be at most μ·10^(−ν), the number of test examples needed to get a k-sigma error bar with ν significant digits is:

n = 10^(2ν) · k² · σ² / μ²

For the special case of classification problems, the per-example loss follows a Bernoulli distribution with mean μ = p (the probability of error) and variance σ² = p(1 − p). For p ≪ 1, we have σ² ≈ p, thus σ²/μ² ≈ 1/p. So for k = 1 (a 1-sigma error bar) and ν = 1 (1 significant digit; least demanding), we get the rule-of-thumb: np = 100.
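The formula can be checked numerically with a few lines of Python; this is a minimal illustrative sketch (not from the book), using the classification special case to recover the np = 100 rule-of-thumb:

```python
import math

def test_set_size(mu, sigma, k=1, nu=1):
    """Test examples needed for a k-sigma error bar on the mean loss
    with nu significant digits: n = 10**(2*nu) * k**2 * sigma**2 / mu**2."""
    return math.ceil(10 ** (2 * nu) * (k * sigma / mu) ** 2)

# Classification special case: mu = p, sigma**2 = p * (1 - p).
p = 0.01  # anticipated error rate of the winner
n = test_set_size(p, math.sqrt(p * (1 - p)))
print(n, n * p)  # ~9,900 examples, i.e. n*p is roughly 100
```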

 

43 of 43


Thanks for sharing the journey!

Upcoming open-access book

Chapter 2 preprint