
Kaggle Winner Presentation

Team: TUIASI_FML_AI_Synergy

Maria-Gabriela Fodor

Teodor Gorghe


Agenda


  1. Background
  2. Summary
  3. Feature selection & engineering
  4. Training methods
  5. Important findings
  6. Simple model


Background


  • We are first-year master's students in Artificial Intelligence at the Faculty of Automatic Control and Computer Engineering, “Gheorghe Asachi” Technical University of Iași.

  • We graduated in Information Technology in 2024 from the same faculty.

Summary

  • Training methods used: Random Forest, Decision Tree, KNN, Feed-Forward Neural Network variants and stacking techniques.

  • Feature engineering: handling features with missing data, data standardization and encoding.

  • Tools used: Python, scikit-learn, PyTorch, Matplotlib.


Feature Engineering


Data preprocessing steps for metadata features:

          • dealing with outliers
          • scaling numerical features
          • handling features with missing values
          • encoding categorical features (one-hot encoding)


Dealing with outliers:

  • We used the interquartile range (IQR) with [0.1, 0.9] quantile thresholds to check whether a feature has outliers.

  • The only feature with outlier values is “bmi” (a sketch of the check follows below).
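A minimal sketch of the check, assuming the data lives in a pandas DataFrame with a numeric “bmi” column; the exact fence rule (quantiles at 0.1 and 0.9, widened by 1.5 * IQR) is our reading of the slide, and train.csv is a placeholder file name:

    import pandas as pd

    def outlier_mask(series: pd.Series, q_low: float = 0.1, q_high: float = 0.9) -> pd.Series:
        """Flag values outside the [q_low, q_high] quantile fences widened by 1.5 * IQR."""
        q1, q3 = series.quantile(q_low), series.quantile(q_high)
        iqr = q3 - q1
        return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

    df = pd.read_csv("train.csv")  # placeholder file name
    mask = outlier_mask(df["bmi"])
    print(f"'bmi' has {mask.sum()} outlier values")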


[Figure: box plot of “bmi” before outlier treatment]


[Figure: box plot of “bmi” after outlier treatment]


Scaling numerical features:

  • We standardized the numerical features because their distributions vary widely.

  • This is crucial because some machine learning methods are not scale-invariant (a minimal sketch follows below).
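A minimal standardization sketch with scikit-learn's StandardScaler; the toy arrays stand in for the competition's numeric columns:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Toy numeric matrices standing in for the real numeric features.
    X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
    X_test = np.array([[2.5, 500.0]])

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # fit the statistics on training data only
    X_test_std = scaler.transform(X_test)        # reuse the training mean/std (no leakage)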



Handling features with missing values

  • We removed the “race” and “parent_1_education” attributes.

  • We filled in the remaining attributes with missing data using KNN imputation (n = 5), applied separately to numeric and categorical features (see the sketch below).
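A sketch of the imputation step with scikit-learn's KNNImputer (recent scikit-learn). KNNImputer only accepts numeric arrays, so the categorical path below (ordinal-encode, impute, round back to valid codes) is our assumption about how “separately on categorical features” could be implemented; the toy DataFrame and its columns are placeholders:

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({                                  # placeholder data
        "age": [15, 16, 15, 17, 16, 15],
        "bmi": [21.0, np.nan, 30.5, 25.2, 27.8, 22.1],
        "gender": ["F", "M", np.nan, "F", "M", "F"],
    })

    # Numeric columns: impute directly from the 5 nearest neighbours.
    num_cols = ["age", "bmi"]
    df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])

    # Categorical columns: encode to numeric codes, impute, round back to valid codes.
    cat_cols = ["gender"]
    enc = OrdinalEncoder()
    codes = KNNImputer(n_neighbors=5).fit_transform(enc.fit_transform(df[cat_cols]))
    df[cat_cols] = enc.inverse_transform(codes.round())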


Training Methods


Random Forest

  • Trained using default scikit-learn hyper-parameters.
  • Slow to train on the full dataset.
  • Achieves 0.739 RMSE and 0.451 R² on 5-fold cross-validation over the full training data.

Gradient Boosting Regressor

  • Trained using default scikit-learn hyper-parameters.
  • Faster to train than Random Forest, but slower at prediction.
  • Achieves 0.688 RMSE and 0.52 R² on 5-fold cross-validation over the full training data (a CV sketch for both models follows below).
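A minimal 5-fold cross-validation sketch for both models; make_regression stands in for the real training data:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.model_selection import cross_validate

    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    for model in (RandomForestRegressor(random_state=0),
                  GradientBoostingRegressor(random_state=0)):
        scores = cross_validate(model, X, y, cv=5,
                                scoring=("neg_root_mean_squared_error", "r2"))
        rmse = -scores["test_neg_root_mean_squared_error"].mean()
        print(f"{type(model).__name__}: RMSE={rmse:.3f}, R2={scores['test_r2'].mean():.3f}")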


Feed-forward neural network (without residual blocks)

  • Hyper-parameters were selected with grid-search cross-validation.
  • Trains faster than both Random Forest and the Gradient Boosting Regressor.
  • Achieves 0.739 RMSE and 0.451 R² on 5-fold cross-validation over the full training data.

Hyper-parameters found by the grid search:

  • Learning rate: 0.0001
  • Batch size: 32
  • Number of hidden layers: 3
  • Hidden layer sizes: 256, 128, 64
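A PyTorch sketch of a network matching the table (hidden layers of 256, 128, and 64 units, learning rate 0.0001, batches of 32). The ReLU activations, the Adam optimizer, and the input width are our assumptions; the slide does not state them:

    import torch
    import torch.nn as nn

    class MLPRegressor(nn.Module):
        """Plain feed-forward regressor with the hidden sizes from the table."""
        def __init__(self, n_features: int, hidden=(256, 128, 64)):
            super().__init__()
            layers, width = [], n_features
            for h in hidden:
                layers += [nn.Linear(width, h), nn.ReLU()]  # ReLU is an assumption
                width = h
            layers.append(nn.Linear(width, 1))              # single regression output
            self.net = nn.Sequential(*layers)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x).squeeze(-1)

    model = MLPRegressor(n_features=20)                     # input width is a placeholder
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Mini-batches of 32 would come from a DataLoader(batch_size=32).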


Feed-forward neural network with residual blocks

  • Hyper-parameters were selected with grid-search cross-validation.
  • Similar training time to the plain MLP, but slightly better predictions.
  • Achieves 0.616 RMSE and 0.618 R² on 5-fold cross-validation over the full training data.

Hyper-parameters found by the grid search:

  • Learning rate: 0.0001
  • Batch size: 32
  • Number of residual blocks: 2
  • Block size: 1024
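A PyTorch sketch of one plausible layout (two residual blocks of width 1024, per the table). The internal structure of a block, two linear layers with a skip connection and ReLU, is our assumption:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Two linear layers plus a skip connection; the exact layout is an assumption."""
        def __init__(self, width: int):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                      nn.Linear(width, width))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.relu(x + self.body(x))  # skip connection around the block

    class ResidualMLPRegressor(nn.Module):
        def __init__(self, n_features: int, width: int = 1024, n_blocks: int = 2):
            super().__init__()
            self.stem = nn.Linear(n_features, width)
            self.blocks = nn.Sequential(*(ResidualBlock(width) for _ in range(n_blocks)))
            self.head = nn.Linear(width, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.head(self.blocks(torch.relu(self.stem(x)))).squeeze(-1)

    model = ResidualMLPRegressor(n_features=20)  # input width is a placeholder
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)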


  • We also used a stacking technique, which helped us achieve a better prediction score.

  • The base estimators are:
      • Gradient Boosting Regressor
      • Random Forest
      • Feed-forward neural network without residual blocks
      • Feed-forward neural network with residual blocks


  • Hyper-parameters for the base estimators are the same as for the individually trained models.
  • The final estimator is a RidgeCV model with scikit-learn's default hyper-parameters.
  • Training and prediction time is the longest of all our methods, but it yields the best prediction score (a sketch follows below).
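A minimal sketch of the stacking setup with scikit-learn's StackingRegressor and a RidgeCV final estimator. Only the two tree ensembles appear so the snippet stays runnable; in practice the two PyTorch networks would be added behind an sklearn-compatible wrapper (e.g. skorch's NeuralNetRegressor), and make_regression stands in for the real data:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                                  StackingRegressor)
    from sklearn.linear_model import RidgeCV

    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    base_estimators = [
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(random_state=0)),
        # The two feed-forward networks would be listed here via an sklearn wrapper.
    ]
    stack = StackingRegressor(estimators=base_estimators, final_estimator=RidgeCV())
    stack.fit(X, y)
    print(stack.predict(X[:5]))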


Important and Interesting Findings


  • Sometimes missing values cannot be imputed effectively, so a strategy such as removing attributes with more than 14% missing values is recommended (see the sketch after this list).

  • While a single model may provide most of the accuracy, ensembling further mitigates overfitting.

  • The best individual models don’t always form the best ensemble.
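A sketch of the 14% rule in pandas; the toy DataFrame is a placeholder:

    import numpy as np
    import pandas as pd

    def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.14) -> pd.DataFrame:
        """Drop every column whose fraction of missing values exceeds max_missing."""
        return df.loc[:, df.isna().mean() <= max_missing]

    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, np.nan, 3, 4]})  # toy data
    print(drop_sparse_columns(df).columns.tolist())  # 'b' is 50% missing -> dropped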


[Figure: correlation matrix highlighting the weak relationships among the numeric attributes]
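A sketch of how such a plot can be produced with pandas and Matplotlib; the column names are placeholders, not the competition's actual attributes:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["bmi", "age", "score"])

    corr = df.corr()                                   # pairwise Pearson correlations
    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr)), labels=corr.columns, rotation=45)
    ax.set_yticks(range(len(corr)), labels=corr.columns)
    fig.colorbar(im, ax=ax, label="Pearson r")
    fig.tight_layout()
    plt.show()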


Simple Model


  • The ResidualMLPRegressorModel achieved a cross-validation score of 1.93401.

  • The full ensemble (4 regressors + meta-model) reached 1.87112.


Solution Overview


Overview & Insights

  • Explored neural networks & variations.

  • Focused on low-resource methods.

  • Tried diverse preprocessing techniques.

  • Well-tuned lightweight models can rival heavy ones.


Advantages & Fun Aspects

  • Teamwork: parallel experimentation and fast idea sharing.

  • Working with real-world data.

  • Fine-tuning models and seeing direct performance improvements.

Questions & Answers
