
WiDS Datathon 2023: Team zn_k

Zeyneb N. Kaya


Agenda

Kaggle Winner Presentation Template


  1. Background
  2. Summary
  3. Feature selection & engineering
  4. Training methods
  5. Important findings
  6. Simple model



Background



Zeyneb N. Kaya

  • Junior, Saratoga High School (CA, USA)
  • WiDS Student Ambassador
  • National Winner, NCWIT Aspirations in Computing Award
  • 2022 Datathon competitor
  • Data Science Certifications
    • Data Science and Machine Learning Certificate
    • NLP Specialization Certificate
  • Machine Learning Research
    • NLP, data analysis (published at EACL, UCB TextXD…)



Overview

  • Models: Gradient Boosting
    • CatBoost, LightGBM
  • Key Features: Forecasts
    • nmme0-tmp2m-34w___
  • Runtime: ~1 hour
  • Key Method: “Iterative Pseudolabeling”

Summary



Feature Selection & Engineering



  • Most important feature types
    • Forecast features
    • Wind
    • Location
  • Feature Engineering
    • loc_group: number each lat-lon location
    • Label-encoded climate region
    • year/month/day extracted from startdate
  • Feature Selection
    • Drop highly correlated features
    • Note: categorical features not marked as categorical in CatBoost
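The engineering and selection steps above can be sketched in pandas. This is a minimal illustration on a toy frame; the column names (lat, lon, climateregions__climateregion, startdate) and the 0.95 correlation cutoff are assumptions standing in for the competition data, not the exact pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (assumed column names)
df = pd.DataFrame({
    "lat": [34.0, 34.0, 41.5],
    "lon": [-118.2, -118.2, -87.6],
    "climateregions__climateregion": ["BSk", "BSk", "Dfa"],
    "startdate": ["2015-01-01", "2015-01-08", "2015-01-01"],
})

# loc_group: number each unique lat-lon location
df["loc_group"] = df.groupby(["lat", "lon"], sort=False).ngroup()

# Label-encode the climate region
df["climate_code"] = df["climateregions__climateregion"].astype("category").cat.codes

# year/month/day from startdate
dates = pd.to_datetime(df["startdate"])
df["year"], df["month"], df["day"] = dates.dt.year, dates.dt.month, dates.dt.day

# Feature selection: flag one feature from each highly correlated pair
num = df[["lat", "lon", "loc_group", "climate_code"]]
corr = num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
```

On the real data the correlation scan would run over all numeric columns, not this hand-picked subset.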


[Figure: Variable Importance plot]



Training Methods



  • Climate Region Experts
    • One set of ensembled models are “experts” in their particular climate region: each trains on all the data, but points from its own “expert” region get a higher sample weight. At prediction time, each point uses the prediction from its region’s expert model.
  • “Iterative Pseudolabeling”
    • CatBoost + LightGBM are ensembled, then predictions are generated on the test set and used as pseudolabels for the next round.
    • A threshold controls the ensembling: the new predictions are only averaged with the previous round’s when their absolute difference is small; where they differ a lot, the new model’s prediction is used. The idea is to favor the most recent model.
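The iterative pseudolabeling loop can be sketched as follows. This is a toy illustration: weighted least squares stands in for the CatBoost/LightGBM ensemble, and the data shapes and 0.1 threshold are made-up values, not the competition settings. The `weights` parameter shows where the climate-region experts would upweight their own region's rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: a noisy linear target
X_train = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 3))

def fit_predict(X, y, X_new, weights=None):
    """Weighted least squares standing in for a CatBoost/LightGBM fit;
    `weights` is where an expert model would upweight its own region."""
    w = np.ones(len(y)) if weights is None else weights
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(X * sw, y * sw[:, 0], rcond=None)
    return X_new @ coef

THRESHOLD = 0.1  # hypothetical: how close two rounds must be to ensemble
prev = None
X_fit, y_fit = X_train, y_train

for _ in range(3):
    pred = fit_predict(X_fit, y_fit, X_test)
    if prev is not None:
        close = np.abs(pred - prev) < THRESHOLD
        # average with the previous round only where they agree;
        # otherwise keep the new model's prediction
        pred = np.where(close, 0.5 * (pred + prev), pred)
    prev = pred
    # pseudolabel the test set and fold it into the next round's training data
    X_fit = np.vstack([X_train, X_test])
    y_fit = np.concatenate([y_train, pred])
```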



Important and Interesting Findings



What sets me apart?

  • An interest in both learning and applying!
  • Building on existing ideas while staying creative enough to push them further ☺

Interesting thing found while exploring the data?

  • (drumroll, please)… my guess is that the anonymized region is the mid + western USA
  • The climate-region feature shows a very peculiar pattern, very similar to that of the mid + western US
  • Scaling lat + lon to this area and plotting the true climate regions…
  • …they seemed to match? Hmmm
  • P.S. this was not used in the solution at all; just a bonus observation!



Other Experiments

  • TabNet, RNNs
  • Data augmentation (GANs, noise injection)
  • Predicting the forecast error


Simple Model


  • CatBoost + Feature Engineering + Tuned Hyperparameters
    • RMSE ~0.8



Questions & Answers
