
Kaggle Winner Presentation

Team: TUIASI_FML_AI_Synergy

Maria-Gabriela Fodor

Teodor Gorghe


Agenda


  1. Background
  2. Summary
  3. Feature selection & engineering
  4. Training methods
  5. Important findings
  6. Simple model


Background


  • We are first-year master's students in Artificial Intelligence at the Faculty of Automatic Control and Computer Engineering, “Gheorghe Asachi” Technical University of Iași.

  • We graduated in Information Technology in 2024 from the same faculty.

Summary

  • Training methods used: Random Forest, Decision Tree, KNN, Feed-Forward Neural Network variants and stacking techniques.

  • Feature engineering: handling features with missing data, data standardization and encoding.

  • Tools used: Python, scikit-learn, PyTorch, Matplotlib.


Feature Engineering


Data preprocessing steps for metadata features:

          • dealing with outliers
          • scaling numerical features
          • handling features with missing values
          • encoding categorical features (one-hot encoding)


Dealing with outliers:

  • We used the interquartile range (IQR) with [0.1, 0.9] quantile thresholds to check whether a feature has outliers.

  • The only feature with outlier values is “bmi” (a sketch of the check follows below).
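A minimal sketch of the check, assuming the data lives in a pandas DataFrame with a numeric “bmi” column; the exact fence rule (quantiles at 0.1 and 0.9, widened by 1.5 * IQR) is our reading of the slide, and train.csv is a placeholder file name:

    import pandas as pd

    def outlier_mask(series: pd.Series, q_low: float = 0.1, q_high: float = 0.9) -> pd.Series:
        """Flag values outside the [q_low, q_high] quantile fences widened by 1.5 * IQR."""
        q1, q3 = series.quantile(q_low), series.quantile(q_high)
        iqr = q3 - q1
        return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

    df = pd.read_csv("train.csv")  # placeholder file name
    mask = outlier_mask(df["bmi"])
    print(f"'bmi' has {mask.sum()} outlier values")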


[Figure: box plot of “bmi” before outlier treatment]


[Figure: box plot of “bmi” after outlier treatment]


Scaling numerical features:

  • We standardized the numerical features because their distributions vary widely.

  • This is crucial because some machine learning methods are not scale-invariant (a minimal sketch follows below).
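A minimal standardization sketch with scikit-learn's StandardScaler; the toy arrays stand in for the competition's numeric columns:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Toy numeric matrices standing in for the real numeric features.
    X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
    X_test = np.array([[2.5, 500.0]])

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # fit the statistics on training data only
    X_test_std = scaler.transform(X_test)        # reuse the training mean/std (no leakage)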



Handling features with missing values

  • We removed the “race” and “parent_1_education” attributes.

  • We filled in the remaining attributes with missing data using KNN imputation (n = 5), applied separately to numeric and categorical features (see the sketch below).
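A sketch of the imputation step with scikit-learn's KNNImputer (recent scikit-learn). KNNImputer only accepts numeric arrays, so the categorical path below (ordinal-encode, impute, round back to valid codes) is our assumption about how “separately on categorical features” could be implemented; the toy DataFrame and its columns are placeholders:

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({                                  # placeholder data
        "age": [15, 16, 15, 17, 16, 15],
        "bmi": [21.0, np.nan, 30.5, 25.2, 27.8, 22.1],
        "gender": ["F", "M", np.nan, "F", "M", "F"],
    })

    # Numeric columns: impute directly from the 5 nearest neighbours.
    num_cols = ["age", "bmi"]
    df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])

    # Categorical columns: encode to numeric codes, impute, round back to valid codes.
    cat_cols = ["gender"]
    enc = OrdinalEncoder()
    codes = KNNImputer(n_neighbors=5).fit_transform(enc.fit_transform(df[cat_cols]))
    df[cat_cols] = enc.inverse_transform(codes.round())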


Training Methods


Random Forest

  • Trained using default scikit-learn hyper-parameters.
  • Slow to train on the full dataset.
  • Achieves 0.739 RMSE and 0.451 R² on 5-fold cross-validation over the full training data.

Gradient Boosting Regressor

  • Trained using default scikit-learn hyper-parameters.
  • Faster to train than Random Forest, but slower at prediction.
  • Achieves 0.688 RMSE and 0.52 R² on 5-fold cross-validation over the full training data (a CV sketch for both models follows below).
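A minimal 5-fold cross-validation sketch for both models; make_regression stands in for the real training data:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.model_selection import cross_validate

    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    for model in (RandomForestRegressor(random_state=0),
                  GradientBoostingRegressor(random_state=0)):
        scores = cross_validate(model, X, y, cv=5,
                                scoring=("neg_root_mean_squared_error", "r2"))
        rmse = -scores["test_neg_root_mean_squared_error"].mean()
        print(f"{type(model).__name__}: RMSE={rmse:.3f}, R2={scores['test_r2'].mean():.3f}")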


Feed-forward neural network (without residual blocks)

  • Hyper-parameters were selected with grid-search cross-validation.
  • Trains faster than both Random Forest and the Gradient Boosting Regressor.
  • Achieves 0.739 RMSE and 0.451 R² on 5-fold cross-validation over the full training data.

Hyper-parameters found by the grid search:

  • Learning rate: 0.0001
  • Batch size: 32
  • Number of hidden layers: 3
  • Hidden layer sizes: 256, 128, 64
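A PyTorch sketch of a network matching the table (hidden layers of 256, 128, and 64 units, learning rate 0.0001, batches of 32). The ReLU activations, the Adam optimizer, and the input width are our assumptions; the slide does not state them:

    import torch
    import torch.nn as nn

    class MLPRegressor(nn.Module):
        """Plain feed-forward regressor with the hidden sizes from the table."""
        def __init__(self, n_features: int, hidden=(256, 128, 64)):
            super().__init__()
            layers, width = [], n_features
            for h in hidden:
                layers += [nn.Linear(width, h), nn.ReLU()]  # ReLU is an assumption
                width = h
            layers.append(nn.Linear(width, 1))              # single regression output
            self.net = nn.Sequential(*layers)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x).squeeze(-1)

    model = MLPRegressor(n_features=20)                     # input width is a placeholder
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Mini-batches of 32 would come from a DataLoader(batch_size=32).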


Feed-forward neural network with residual blocks

  • Hyper-parameters were selected with grid-search cross-validation.
  • Similar training time to the plain MLP, but slightly better predictions.
  • Achieves 0.616 RMSE and 0.618 R² on 5-fold cross-validation over the full training data.

Hyper-parameters found by the grid search:

  • Learning rate: 0.0001
  • Batch size: 32
  • Number of residual blocks: 2
  • Block size: 1024
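A PyTorch sketch of one plausible layout (two residual blocks of width 1024, per the table). The internal structure of a block, two linear layers with a skip connection and ReLU, is our assumption:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Two linear layers plus a skip connection; the exact layout is an assumption."""
        def __init__(self, width: int):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                      nn.Linear(width, width))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.relu(x + self.body(x))  # skip connection around the block

    class ResidualMLPRegressor(nn.Module):
        def __init__(self, n_features: int, width: int = 1024, n_blocks: int = 2):
            super().__init__()
            self.stem = nn.Linear(n_features, width)
            self.blocks = nn.Sequential(*(ResidualBlock(width) for _ in range(n_blocks)))
            self.head = nn.Linear(width, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.head(self.blocks(torch.relu(self.stem(x)))).squeeze(-1)

    model = ResidualMLPRegressor(n_features=20)  # input width is a placeholder
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)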


  • We also used a stacking technique, which helped us achieve a better prediction score.

  • The base estimators are:
      • Gradient Boosting Regressor
      • Random Forest
      • Feed-forward neural network without residual blocks
      • Feed-forward neural network with residual blocks


  • Hyper-parameters for the base estimators are the same as for the individually trained models.
  • The final estimator is a RidgeCV model with scikit-learn's default hyper-parameters.
  • Training and prediction time is the longest of all our methods, but it yields the best prediction score (a sketch follows below).
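A minimal sketch of the stacking setup with scikit-learn's StackingRegressor and a RidgeCV final estimator. Only the two tree ensembles appear so the snippet stays runnable; in practice the two PyTorch networks would be added behind an sklearn-compatible wrapper (e.g. skorch's NeuralNetRegressor), and make_regression stands in for the real data:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                                  StackingRegressor)
    from sklearn.linear_model import RidgeCV

    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    base_estimators = [
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(random_state=0)),
        # The two feed-forward networks would be listed here via an sklearn wrapper.
    ]
    stack = StackingRegressor(estimators=base_estimators, final_estimator=RidgeCV())
    stack.fit(X, y)
    print(stack.predict(X[:5]))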


Important and Interesting Findings


  • Sometimes missing values cannot be imputed effectively, so a strategy such as removing attributes with more than 14% missing values is recommended (see the sketch after this list).

  • While a single model may provide most of the accuracy, ensembling further mitigates overfitting.

  • The best individual models don’t always form the best ensemble.
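A sketch of the 14% rule in pandas; the toy DataFrame is a placeholder:

    import numpy as np
    import pandas as pd

    def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.14) -> pd.DataFrame:
        """Drop every column whose fraction of missing values exceeds max_missing."""
        return df.loc[:, df.isna().mean() <= max_missing]

    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, np.nan, 3, 4]})  # toy data
    print(drop_sparse_columns(df).columns.tolist())  # 'b' is 50% missing -> dropped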


[Figure: correlation matrix highlighting the weak relationships among the numeric attributes]
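A sketch of how such a plot can be produced with pandas and Matplotlib; the column names are placeholders, not the competition's actual attributes:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["bmi", "age", "score"])

    corr = df.corr()                                   # pairwise Pearson correlations
    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr)), labels=corr.columns, rotation=45)
    ax.set_yticks(range(len(corr)), labels=corr.columns)
    fig.colorbar(im, ax=ax, label="Pearson r")
    fig.tight_layout()
    plt.show()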


Simple Model


  • The ResidualMLPRegressorModel achieved a cross-validation score of 1.93401.

  • The full ensemble (4 regressors + meta-model) reached 1.87112.


Solution Overview


Overview & Insights

  • Explored neural networks & variations.

  • Focused on low-resource methods.

  • Tried diverse preprocessing techniques.

  • Well-tuned lightweight models can rival heavy ones.


Advantages & Fun Aspects

  • Teamwork: parallel experimentation and fast idea sharing.

  • Working with real-world data.

  • Fine-tuning models and seeing direct performance improvements.

Questions & Answers
