1 of 37

Neural Networks

and Deep Learning

DATA 442 & 621

Cristiano Fanelli

02/04/2025 - Lecture 4

2 of 37

Outline


  • Building Training Datasets
    • Missing data: eliminating, imputing
    • Handling categorical data: mapping ordinal features, one-hot encoding for nominal features
    • Partitioning into training and test datasets
    • Feature scaling
    • Feature importance
      • Regularization
      • Strategies to assess feature importance

References:

Raschka et al., Chapter 4

3 of 37


  • We typically see missing values as blank spaces in data tables, as placeholder strings such as “NaN”, or as “NULL” in relational databases
    • isnull() (Pandas) - returns a boolean DataFrame or Series indicating which values are NaN or None

Missing Data

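A minimal sketch of how these calls behave (the DataFrame below is a made-up example, not necessarily the one shown on the slide):

import numpy as np
import pandas as pd

# toy DataFrame with a few missing entries
df = pd.DataFrame({'A': [1.0, 2.0, np.nan],
                   'B': [4.0, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

df.isnull()        # boolean DataFrame: True where a value is NaN/None
df.isnull().sum()  # number of missing values per column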

4 of 37


Missing Data

    • isnan() (NumPy) - works only on numeric (float) arrays; returns a boolean array indicating which entries are NaN (it does not accept None or non-numeric dtypes)
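
A short sketch, reusing the toy df above; note that np.isnan() only accepts numeric (float) arrays:

import numpy as np

# df.values is a float ndarray here, so np.isnan works;
# it would raise a TypeError on object/string columns
np.isnan(df.values)         # boolean array, True where NaN
np.isnan(df.values).sum()   # total number of missing entries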

5 of 37


  • dropna() (Pandas) - removes rows or columns that contain missing information

Eliminating examples or features

# only drop rows where ‘all’ columns are NaN
df.dropna(how='all')

# drop rows that have fewer than 4 real values
df.dropna(thresh=4)
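
A few more dropna() variants, as a sketch on the toy df above:

df.dropna(axis=0)        # drop rows that contain any NaN
df.dropna(axis=1)        # drop columns that contain any NaN
df.dropna(subset=['C'])  # drop rows only if NaN appears in column 'C'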

6 of 37


  • Imputing is the process of replacing missing or incomplete data in a dataset with substituted values, often using techniques like mean, median, mode, or more advanced methods like regression or machine learning models.

Imputing

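One common implementation (a sketch, not necessarily the one shown on the slide) uses scikit-learn's SimpleImputer:

import numpy as np
from sklearn.impute import SimpleImputer

# replace each NaN with the mean of its column
# (strategy could also be 'median' or 'most_frequent')
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)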

7 of 37


  • An even simpler imputation method is available directly in Pandas
    • fillna() (Pandas)

Imputing
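
A minimal sketch of fillna() on the toy df above (column-mean filling is just one option; the slide's example may differ):

df.fillna(df.mean())   # replace NaNs with the per-column mean
df.fillna(0)           # or replace NaNs with a constant value
df.ffill()             # or forward-fill from the previous row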

8 of 37


Categorical Data

  • Ordinal or Nominal
    • Ordinal: categorical features that can be sorted
      • E.g., “S, M, L, XL”
    • Nominal: Cannot be sorted*
      • E.g., “green”, “blue”, “yellow”, etc.

*Depending on the context and applications

9 of 37


Ordinal Categorical Data
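
A typical sketch of mapping an ordinal feature to integers with a dictionary (the DataFrame below is a hypothetical example, not necessarily the slide's):

import pandas as pd

# toy dataset with a nominal 'color' and an ordinal 'size' feature
df = pd.DataFrame([['green', 'M', 10.1],
                   ['red', 'L', 13.5],
                   ['blue', 'XL', 15.3]],
                  columns=['color', 'size', 'price'])

# define the order explicitly and map it onto integers
size_mapping = {'M': 1, 'L': 2, 'XL': 3}
df['size'] = df['size'].map(size_mapping)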

10 of 37


  • LabelEncoder (sklearn)

Ordinal Categorical Data
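
A minimal sketch, reusing the toy df above (LabelEncoder is typically meant for class labels; here it just illustrates the integers assigned to the colors):

from sklearn.preprocessing import LabelEncoder

color_le = LabelEncoder()
# assigns an arbitrary integer (0, 1, 2, ...) to each color
y = color_le.fit_transform(df['color'].values)
color_le.inverse_transform(y)   # recover the original labels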

11 of 37


  • In the previous example, labeling the colors, e.g., ‘blue’, ‘green’, ‘red’ with values like 0, 1, 2 would assign ordinal values to nominal categorical features with no inherent order
  • This can ‘mislead’ machine learning algorithms
  • A common workaround is one-hot encoding (for other encoders, see the sklearn preprocessing documentation)

Nominal Data: One-hot Encoding

The example proceeds in three steps: the initial dataset, the effect of one-hot encoding on the column to encode, and the overall transformation of the initial dataset, which leaves the other columns untouched (see the sketch below).
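
One way to do this in scikit-learn (a sketch on the toy df above; the slide's exact code may differ) combines OneHotEncoder with ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values

# one-hot encode only the 'color' column (index 0),
# pass the remaining columns through untouched
c_transf = ColumnTransformer([('onehot', OneHotEncoder(), [0]),
                              ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)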

12 of 37


One-hot Encoding

  • An even simpler way to perform one-hot encoding (see the sketch below)
  • Issue with OHE:
    • introduces multicollinearity - occurs when two or more features are highly linearly dependent (e.g., if a sample is neither blue nor green, it is necessarily red)

  • Solution: remove one (redundant) feature column
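
A common shortcut (not necessarily the one on the slide) is pandas' get_dummies, which encodes only the string columns:

import pandas as pd

# drop_first=True removes one dummy column to avoid multicollinearity
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)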

13 of 37


  • In order to obtain a reliable estimate of the generalization error and to detect overfitting, we need to divide the dataset into training and test datasets

Partitioning a Dataset

  • train_test_split (sklearn)

The stratify parameter in train_test_split ensures that the class proportions in the y labels are maintained in both the training and testing sets. This is important when dealing with imbalanced datasets where certain classes are underrepresented.

E.g., if 70% of your data belongs to class A and 30% to class B, the train_test_split will ensure that this ratio is maintained in both the training and testing sets.
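
A minimal sketch (X and y stand for a generic feature matrix and label vector, assumed to be already built):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 70:30 split
    random_state=0,    # reproducibility
    stratify=y)        # preserve the class proportions in both sets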

14 of 37


Partitioning a Dataset

  • train_test_split (sklearn)

Some Recipes:

    • The smaller the test set, the less accurate the estimate of the generalization error
    • The most common splits are 60:40, 70:30, 80:20. If you have more statistics, say 100k samples, withholding 10k for testing is okay (90:10). With even more data, even 99:1 can be fine
    • If you are tuning hyperparameters, you may want to further split into training/validation/test datasets, e.g., 60/20/20 or 80/10/10 depending on the statistics (see the sketch below)
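
As a sketch, a 60/20/20 training/validation/test split can be obtained by calling train_test_split twice:

from sklearn.model_selection import train_test_split

# first carve out the 20% test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# then split the remaining 80% into 60/20 (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25,
    random_state=0, stratify=y_trainval)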

15 of 37


  • This is a crucial step in preprocessing for most ML algorithms (with the exception of, e.g., Decision trees and Random Forests, which are scale invariant)
  • Normalization: refers to the rescaling of the features to a range [0,1]. Useful when we need values in a bounded interval

  • Standardization: more practical for many ML and optimization algorithms (such as gradient descent). Some ML algorithms initialize weights to small random values close to 0; with standardization we center each feature at 0 with standard deviation 1, which makes it easier to learn the weights

Scaling Features
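
A sketch of both approaches in scikit-learn, reusing X_train/X_test from the split above (fit the scaler on the training set only, then reuse it on the test set):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: rescale each feature to [0, 1]
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

# Standardization: zero mean, unit standard deviation per feature
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)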

16 of 37


Selecting Meaningful Features

  • L2 regularization

MSE is “spherical”

The larger the value of the regularization parameter λ, the faster the penalty term grows, which leads to a narrower ball
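
In scikit-learn the regularization strength is usually exposed through C, the inverse of λ; a sketch, assuming the standardized data (X_train_std, y_train) from the scaling example above:

from sklearn.linear_model import LogisticRegression

# smaller C  <=>  larger lambda  <=>  stronger L2 penalty
lr_l2 = LogisticRegression(penalty='l2', C=0.1)
lr_l2.fit(X_train_std, y_train)
lr_l2.coef_   # weights are shrunk towards zero, but rarely exactly zero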

17 of 37


Selecting Meaningful Features

  • L1 regularization

L1 provides sharp, diamond-shaped contours and encourages sparsity

L1 inherently serves as a method for feature selection

18 of 37


Implementations in sklearn

liblinear (an open-source library for large-scale linear classification; it uses coordinate descent and other methods)

One vs Rest: splits a multi-class classification into one binary classification problem per class.

During inference, the model calculates the probability for each class, and the class with the highest probability is selected.
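
A sketch of an L1-penalized logistic regression (liblinear supports the L1 penalty; multi-class defaults vary across sklearn versions), again assuming the standardized data from above:

from sklearn.linear_model import LogisticRegression

# the L1 penalty requires a solver that supports it, e.g. 'liblinear'
lr_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr_l1.fit(X_train_std, y_train)
lr_l1.coef_        # many weights are driven exactly to zero
lr_l1.intercept_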

19 of 37


Impact of Regularization on weights

Using L1 regularization effectively performs feature selection: as the regularization strength increases, the weights of less relevant features are driven exactly to zero (see the sketch below)
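
This behavior can be observed, as a sketch, by sweeping C and counting the surviving weights (same hypothetical X_train_std/y_train as above):

import numpy as np
from sklearn.linear_model import LogisticRegression

for c in [10.0**p for p in range(-4, 5)]:
    lr = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    lr.fit(X_train_std, y_train)
    n_nonzero = np.sum(lr.coef_ != 0)
    print(f"C={c:g}: {n_nonzero} non-zero weights")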

20 of 37


  • Sequential Backward Selection (SBS)
    • Sequentially removes features from the full feature set, at each step eliminating the feature whose removal causes the least performance loss
    • Notice that SBS - a greedy algorithm - becomes impractical for deep learning architectures with high-dimensional datasets

Other methods for feature selection
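
A closely related built-in alternative to a hand-written SBS (a sketch, not the lecture's own implementation) is scikit-learn's SequentialFeatureSelector with direction='backward':

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
# greedily remove features until 3 remain, scoring by cross-validation
sbs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction='backward', cv=5)
sbs.fit(X_train_std, y_train)
sbs.get_support()   # boolean mask of the selected features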

21 of 37


  • Feature importance derived from a Random Forest can be used for regression problems as well as for classification problems.
  • E.g.,

Feature Importance with Random Forest

Impurity Reduction (Classification): For classification, the importance of a feature is measured by the decrease in Gini impurity or entropy when a feature is used to split a node. The more a feature decreases the impurity, the more important it is.
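
A sketch (feature_names is a hypothetical list holding the column names of the training matrix):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
# rank features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]:<20} {importances[idx]:.4f}")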

22 of 37


Feature Importance with Random Forest

  • N.B.: if two or more features are highly correlated, one feature may be ranked very highly while the information from the other feature(s) may not be fully captured

Variance Reduction (Regression): For regression, feature importance is assessed based on how much a feature decreases the variance of the split. A significant reduction in variance implies a higher importance of the feature.

23 of 37


Intro to

Optimization Techniques

24 of 37

Methods potentially covered in this course


  • Bayesian Optimization:
    • A probabilistic model-based approach for global optimization.
    • Ideal for optimizing expensive functions with unknown derivatives.
  • Evolutionary Optimization:
    • Inspired by natural selection principles.
    • Utilizes mechanisms like mutation, crossover, and selection to iteratively improve solutions.

Many other methods/approaches - not covered.

25 of 37

Acquisition Functions

[Figure: a surrogate model of f(x), marking the best value found so far and the point x we are sampling next; regions favoring exploitation (high μ) and exploration (high σ) are indicated]

  • “Exploitation”: search where μ is high
  • “Exploration”: search where σ is high
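
As a sketch, the Upper Confidence Bound (UCB) acquisition makes this trade-off explicit through a single knob κ (mu and sigma below stand for the surrogate's posterior mean and standard deviation at the candidate points):

import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound acquisition (for maximization).
    kappa ~ 0   -> pure exploitation (trust the mean)
    large kappa -> more exploration (reward high uncertainty)"""
    return mu + kappa * sigma

# pick the candidate that maximizes the acquisition, e.g.
# next_x = x_candidates[np.argmax(ucb(mu, sigma))]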

26 of 37

Acquisition Functions


  • Many acquisition functions, e.g., Probability of Improvement, Expected Improvement, Upper (Lower) confidence bound, etc

  • In most cases, acquisition functions provide knobs for controlling the exploration-exploitation tradeoff

  • When the optimization problem is more complex (more dimensions), a random acquisition strategy might perform poorly

[Figure: convergence comparison of random search (RS) vs Expected Improvement (EI) as a function of the number of calls N]

See also Kriging, a geostatistical interpolation technique: https://gisgeography.com/kriging-interpolation-prediction/
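
A hedged sketch with scikit-optimize (assuming the skopt package is available; not necessarily the tool used in the lecture):

from skopt import gp_minimize

def f(x):
    # toy 1D objective to minimize
    return (x[0] - 2.0) ** 2

res = gp_minimize(f,                # objective
                  [(-5.0, 5.0)],    # search space for x
                  acq_func="EI",    # Expected Improvement
                  n_calls=30,       # evaluation budget
                  random_state=0)
print(res.x, res.fun)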

27 of 37

Gaussian Processes in a nutshell


  • Naively, a Gaussian Process is a probability distribution over possible functions
  • A GP describes a probability distribution over functions; Bayes’ theorem allows us to update this distribution as we collect more data / observations

What kind of problems are we talking about?

  • Suppose your data follow a function y=f(x). Given x, you have a response y through a function f.
  • Suppose now that you do not know the function f and you want to “learn” it.
  • In the figure, we use a GP to approximate this function f. Intuitively, the observed points constrain the model: the more points, the more accurate the approximation (see the sketch below).
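
A sketch with scikit-learn's GP regressor (the observations below come from a made-up f, here a sine, just for illustration):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# a handful of observations of the unknown function f(x)
X_obs = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])
y_obs = np.sin(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=1e-6, normalize_y=True)
gp.fit(X_obs, y_obs)

# posterior mean and uncertainty on a dense grid
X_grid = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_grid, return_std=True)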

28 of 37


GA Optimization Strategies

  • There is actually a variety of crossover types: single-point crossover, linear crossover, blend crossover, simulated binary crossover (SBX).

  • SBX is an efficient crossover for real-valued variables, which mimics the crossover of binary-encoded variables. It uses a probability density function that simulates the single-point crossover of binary-coded GAs (see the sketch below).
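
A minimal sketch of SBX for a single real-valued gene (eta is the distribution index; this is the textbook unbounded form, without per-variable crossover probabilities):

import numpy as np

def sbx_crossover(p1, p2, eta=15.0, rng=None):
    """Simulated Binary Crossover for two real-valued parents."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return c1, c2

children = sbx_crossover(1.0, 3.0)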

29 of 37


MOO Pipelines: e.g., MOBO

  • As the project evolved, so did our understanding of the design space and the possible ranges for each design parameter.
  • With MOBO, we aim to determine a more accurate approximation of the Pareto front

Multi-Objective Bayesian Optimization

(e.g., Ax: adaptive experimentation platform supported by Meta AI)

See 2nd AI4EIC workshop, https://indico.bnl.gov/e/AI4EIC

  • Using Ax/BoTorch and the novel qNEHVI acquisition function, with improved computational performance (arXiv:2105.08195)
  • Currently in the process of generalizing the design problem and increasing its complexity

An extension of the BO approach we already discussed

30 of 37


K-Fold Cross-Validation
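
A minimal scikit-learn sketch of (stratified) k-fold cross-validation, assuming the standardized training data from earlier:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = LogisticRegression()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(clf, X_train_std, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")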

31 of 37


Learning Curves
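
A sketch using scikit-learn's learning_curve utility (the plotting step is omitted; same hypothetical data as above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X_train_std, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

train_mean = train_scores.mean(axis=1)   # accuracy on the training folds
test_mean = test_scores.mean(axis=1)     # cross-validated accuracy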

32 of 37


Confusion Matrix
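
A minimal sketch (X_train_std, X_test_std, y_train, y_test as in the earlier examples):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

clf = LogisticRegression().fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)
# rows: true classes, columns: predicted classes
print(confusion_matrix(y_true=y_test, y_pred=y_pred))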

33 of 37


Classification Metrics

34 of 37


Classification Metrics

  • You want to maximize REC if you want to minimize the chance of not detecting a malignant tumor
  • You want to maximize PRE (precision) if you want to emphasize the correctness of the predictions that a tumor is malignant
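
A sketch of the corresponding scikit-learn metrics, assuming binary labels where the positive class is the malignant one and y_pred from the confusion-matrix example:

from sklearn.metrics import precision_score, recall_score, f1_score

print(f"PRE: {precision_score(y_test, y_pred):.3f}")
print(f"REC: {recall_score(y_test, y_pred):.3f}")
print(f"F1 : {f1_score(y_test, y_pred):.3f}")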

35 of 37


Classification Metrics

  • MCC useful with class imbalance
    • MCC = +1 Perfect prediction
    • MCC = 0 No better than random prediction
    • MCC = -1 Complete disagreement

Matthews Correlation Coefficient

  • The F1 score, by contrast, ranges from 0 (worst possible) to 1 (best possible)
  • The F1 score does not explicitly include TN (it is built from REC and PRE)
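
Both metrics are available in scikit-learn; a sketch with the same hypothetical y_test/y_pred as above:

from sklearn.metrics import matthews_corrcoef, f1_score

print(f"MCC: {matthews_corrcoef(y_test, y_pred):.3f}")   # in [-1, 1]
print(f"F1 : {f1_score(y_test, y_pred):.3f}")            # in [0, 1]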

36 of 37


Receiver Operating Characteristic
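
A sketch computing the ROC curve and the area under it (it needs predicted scores or probabilities rather than hard labels; clf is the fitted classifier from the earlier examples):

from sklearn.metrics import roc_curve, auc

# probability of the positive class
y_score = clf.predict_proba(X_test_std)[:, 1]
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score)
print(f"ROC AUC: {auc(fpr, tpr):.3f}")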

37 of 37


Spares