1 of 37

Neural Networks

and Deep Learning

DATA 442 & 621

Cristiano Fanelli

02/04/2025 - Lecture 4

2 of 37

Outline


  • Building Training Datasets
    • Missing data: eliminating, imputing
    • Handling categorical data: mapping ordinal features, one-hot encoding for nominal features
    • Partitioning into training and test datasets
    • Feature scaling
    • Feature importance
      • Regularization
      • Strategies to assess feature importance

References:

Raschka et al., Chapter 4

3 of 37


  • We typically see missing values as blank spaces in data tables, as placeholder strings such as “NaN”, or as “NULL” in relational databases
    • isnull() (Pandas) - returns a boolean DataFrame or Series indicating which values are NaN or None

Missing Data

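A minimal sketch of how these calls behave (the DataFrame below is a made-up example, not necessarily the one shown on the slide):

import numpy as np
import pandas as pd

# toy DataFrame with a few missing entries
df = pd.DataFrame({'A': [1.0, 2.0, np.nan],
                   'B': [4.0, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

df.isnull()        # boolean DataFrame: True where a value is NaN/None
df.isnull().sum()  # number of missing values per column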

4 of 37


Missing Data

    • isnan() (NumPy) - works only on numeric (float) arrays; returns a boolean array indicating which entries are NaN (it does not accept None or non-numeric dtypes)
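
A short sketch, reusing the toy df above; note that np.isnan() only accepts numeric (float) arrays:

import numpy as np

# df.values is a float ndarray here, so np.isnan works;
# it would raise a TypeError on object/string columns
np.isnan(df.values)         # boolean array, True where NaN
np.isnan(df.values).sum()   # total number of missing entries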

5 of 37


  • dropna() (Pandas) - removes rows or columns that contain missing information

Eliminating examples or features

# only drop rows where ‘all’ columns are NaN
df.dropna(how='all')

# drop rows that have fewer than 4 real values
df.dropna(thresh=4)
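
A few more dropna() variants, as a sketch on the toy df above:

df.dropna(axis=0)        # drop rows that contain any NaN
df.dropna(axis=1)        # drop columns that contain any NaN
df.dropna(subset=['C'])  # drop rows only if NaN appears in column 'C'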

6 of 37


  • Imputing is the process of replacing missing or incomplete data in a dataset with substituted values, often using techniques like mean, median, mode, or more advanced methods like regression or machine learning models.

Imputing

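One common implementation (a sketch, not necessarily the one shown on the slide) uses scikit-learn's SimpleImputer:

import numpy as np
from sklearn.impute import SimpleImputer

# replace each NaN with the mean of its column
# (strategy could also be 'median' or 'most_frequent')
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)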

7 of 37


  • An even simpler imputation method is available directly in Pandas
    • fillna() (Pandas)

Imputing
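
A minimal sketch of fillna() on the toy df above (column-mean filling is just one option; the slide's example may differ):

df.fillna(df.mean())   # replace NaNs with the per-column mean
df.fillna(0)           # or replace NaNs with a constant value
df.ffill()             # or forward-fill from the previous row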

8 of 37


Categorical Data

  • Ordinal or Nominal
    • Ordinal: categorical features that can be sorted
      • E.g., “S, M, L, XL”
    • Nominal: Cannot be sorted*
      • E.g., “green”, “blue”, “yellow”, etc.

*Depending on the context and applications

9 of 37


Ordinal Categorical Data
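
A typical sketch of mapping an ordinal feature to integers with a dictionary (the DataFrame below is a hypothetical example, not necessarily the slide's):

import pandas as pd

# toy dataset with a nominal 'color' and an ordinal 'size' feature
df = pd.DataFrame([['green', 'M', 10.1],
                   ['red', 'L', 13.5],
                   ['blue', 'XL', 15.3]],
                  columns=['color', 'size', 'price'])

# define the order explicitly and map it onto integers
size_mapping = {'M': 1, 'L': 2, 'XL': 3}
df['size'] = df['size'].map(size_mapping)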

10 of 37


  • LabelEncoder (sklearn)

Ordinal Categorical Data
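
A minimal sketch, reusing the toy df above (LabelEncoder is typically meant for class labels; here it just illustrates the integers assigned to the colors):

from sklearn.preprocessing import LabelEncoder

color_le = LabelEncoder()
# assigns an arbitrary integer (0, 1, 2, ...) to each color
y = color_le.fit_transform(df['color'].values)
color_le.inverse_transform(y)   # recover the original labels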

11 of 37


  • In the previous example, labeling the colors, e.g., ‘blue’, ‘green’, ‘red’ with values like 0, 1, 2 would assign ordinal values to nominal categorical features with no inherent order
  • This can ‘mislead’ machine learning algorithms
  • A common workaround is one-hot encoding (for other encoders, see the sklearn preprocessing documentation)

Nominal Data: One-hot Encoding

The example proceeds in three steps: the initial dataset, the effect of one-hot encoding on the column to encode, and the overall transformation of the initial dataset, which leaves the other columns untouched (see the sketch below).
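
One way to do this in scikit-learn (a sketch on the toy df above; the slide's exact code may differ) combines OneHotEncoder with ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values

# one-hot encode only the 'color' column (index 0),
# pass the remaining columns through untouched
c_transf = ColumnTransformer([('onehot', OneHotEncoder(), [0]),
                              ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)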

12 of 37


One-hot Encoding

  • An even simpler way to perform one-hot encoding (see the sketch below)
  • Issue with OHE:
    • introduces multicollinearity - occurs when two or more features are highly linearly dependent (e.g., if a sample is neither blue nor green, it is necessarily red)

  • Solution: remove one (redundant) feature column
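
A common shortcut (not necessarily the one on the slide) is pandas' get_dummies, which encodes only the string columns:

import pandas as pd

# drop_first=True removes one dummy column to avoid multicollinearity
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)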

13 of 37


  • In order to obtain a reliable estimate of the generalization error and to detect overfitting, we need to divide the dataset into training and test datasets

Partitioning a Dataset

  • train_test_split (sklearn)

The stratify parameter in train_test_split ensures that the class proportions in the y labels are maintained in both the training and testing sets. This is important when dealing with imbalanced datasets where certain classes are underrepresented.

E.g., if 70% of your data belongs to class A and 30% to class B, the train_test_split will ensure that this ratio is maintained in both the training and testing sets.
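
A minimal sketch (X and y stand for a generic feature matrix and label vector, assumed to be already built):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 70:30 split
    random_state=0,    # reproducibility
    stratify=y)        # preserve the class proportions in both sets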

14 of 37


Partitioning a Dataset

  • train_test_split (sklearn)

Some Recipes:

    • The smaller the test set, the less accurate the estimate of the generalization error
    • The most common splits are 60:40, 70:30, 80:20. If you have more statistics, say 100k samples, withholding 10k for testing is okay (90:10). With even more data, even 99:1 can be fine
    • If you are tuning hyperparameters, you may want to further split into training/validation/test datasets, e.g., 60/20/20 or 80/10/10 depending on the statistics (see the sketch below)
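
As a sketch, a 60/20/20 training/validation/test split can be obtained by calling train_test_split twice:

from sklearn.model_selection import train_test_split

# first carve out the 20% test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# then split the remaining 80% into 60/20 (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25,
    random_state=0, stratify=y_trainval)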

15 of 37


  • This is a crucial step in preprocessing for most ML algorithms (with the exception of, e.g., Decision trees and Random Forests, which are scale invariant)
  • Normalization: refers to the rescaling of the features to a range [0,1]. Useful when we need values in a bounded interval

  • Standardization: more practical for many ML and optimization algorithms (such as gradient descent). Some ML algorithms initialize weights to small random values close to 0; with standardization we center each feature at 0 with standard deviation 1, which makes it easier to learn the weights

Scaling Features
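
A sketch of both approaches in scikit-learn, reusing X_train/X_test from the split above (fit the scaler on the training set only, then reuse it on the test set):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: rescale each feature to [0, 1]
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

# Standardization: zero mean, unit standard deviation per feature
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)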

16 of 37


Selecting Meaningful Features

  • L2 regularization

MSE is “spherical”

The larger the value of the regularization parameter λ, the faster the penalty term grows, which leads to a narrower ball
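
In scikit-learn the regularization strength is usually exposed through C, the inverse of λ; a sketch, assuming the standardized data (X_train_std, y_train) from the scaling example above:

from sklearn.linear_model import LogisticRegression

# smaller C  <=>  larger lambda  <=>  stronger L2 penalty
lr_l2 = LogisticRegression(penalty='l2', C=0.1)
lr_l2.fit(X_train_std, y_train)
lr_l2.coef_   # weights are shrunk towards zero, but rarely exactly zero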

17 of 37


Selecting Meaningful Features

  • L1 regularization

L1 provides sharp, diamond-shaped contours and encourages sparsity

L1 inherently serves as a method for feature selection

18 of 37


Implementations in sklearn

liblinear (an open-source library for large-scale linear classification; it uses coordinate descent and other methods)

One vs Rest: splits a multi-class classification into one binary classification problem per class.

During inference, the model calculates the probability for each class, and the class with the highest probability is selected.
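
A sketch of an L1-penalized logistic regression (liblinear supports the L1 penalty; multi-class defaults vary across sklearn versions), again assuming the standardized data from above:

from sklearn.linear_model import LogisticRegression

# the L1 penalty requires a solver that supports it, e.g. 'liblinear'
lr_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr_l1.fit(X_train_std, y_train)
lr_l1.coef_        # many weights are driven exactly to zero
lr_l1.intercept_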

19 of 37


Impact of Regularization on weights

Using L1 regularization effectively performs feature selection: as the regularization strength increases, the weights of less relevant features are driven exactly to zero (see the sketch below)
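
This behavior can be observed, as a sketch, by sweeping C and counting the surviving weights (same hypothetical X_train_std/y_train as above):

import numpy as np
from sklearn.linear_model import LogisticRegression

for c in [10.0**p for p in range(-4, 5)]:
    lr = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    lr.fit(X_train_std, y_train)
    n_nonzero = np.sum(lr.coef_ != 0)
    print(f"C={c:g}: {n_nonzero} non-zero weights")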

20 of 37


  • Sequential Backward Selection (SBS)
    • Sequentially removes features from the full feature set, at each step eliminating the feature whose removal causes the least performance loss
    • Notice that SBS - a greedy algorithm - becomes impractical for deep learning architectures with high-dimensional datasets

Other methods for feature selection
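
A closely related built-in alternative to a hand-written SBS (a sketch, not the lecture's own implementation) is scikit-learn's SequentialFeatureSelector with direction='backward':

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
# greedily remove features until 3 remain, scoring by cross-validation
sbs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction='backward', cv=5)
sbs.fit(X_train_std, y_train)
sbs.get_support()   # boolean mask of the selected features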

21 of 37


  • Feature importance derived from a Random Forest can be used for regression problems as well as for classification problems.
  • E.g.,

Feature Importance with Random Forest

Impurity Reduction (Classification): For classification, the importance of a feature is measured by the decrease in Gini impurity or entropy when a feature is used to split a node. The more a feature decreases the impurity, the more important it is.
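
A sketch (feature_names is a hypothetical list holding the column names of the training matrix):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
# rank features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]:<20} {importances[idx]:.4f}")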

22 of 37


Feature Importance with Random Forest

  • N.B.: if two or more features are highly correlated, one feature may be ranked very highly while the information from the other feature(s) may not be fully captured

Variance Reduction (Regression): For regression, feature importance is assessed based on how much a feature decreases the variance of the split. A significant reduction in variance implies a higher importance of the feature.

23 of 37


Intro to

Optimization Techniques

24 of 37

Methods potentially covered in this course


  • Bayesian Optimization:
    • A probabilistic model-based approach for global optimization.
    • Ideal for optimizing expensive functions with unknown derivatives.
  • Evolutionary Optimization:
    • Inspired by natural selection principles.
    • Utilizes mechanisms like mutation, crossover, and selection to iteratively improve solutions.

Many other methods/approaches - not covered.

25 of 37

Acquisition Functions

[Figure: a surrogate model of f(x), marking the best value found so far and the point x we are sampling next; regions favoring exploitation (high μ) and exploration (high σ) are indicated]

  • “Exploitation”: search where μ is high
  • “Exploration”: search where σ is high
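
As a sketch, the Upper Confidence Bound (UCB) acquisition makes this trade-off explicit through a single knob κ (mu and sigma below stand for the surrogate's posterior mean and standard deviation at the candidate points):

import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound acquisition (for maximization).
    kappa ~ 0   -> pure exploitation (trust the mean)
    large kappa -> more exploration (reward high uncertainty)"""
    return mu + kappa * sigma

# pick the candidate that maximizes the acquisition, e.g.
# next_x = x_candidates[np.argmax(ucb(mu, sigma))]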

26 of 37

Acquisition Functions


  • Many acquisition functions, e.g., Probability of Improvement, Expected Improvement, Upper (Lower) confidence bound, etc

  • In most cases, acquisition functions provide knobs for controlling the exploration-exploitation tradeoff

  • When the optimization problem is more complex (more dimensions), a random acquisition strategy might perform poorly

[Figure: convergence comparison of random search (RS) vs Expected Improvement (EI) as a function of the number of calls N]

See also Kriging, a geostatistical interpolation technique: https://gisgeography.com/kriging-interpolation-prediction/
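
A hedged sketch with scikit-optimize (assuming the skopt package is available; not necessarily the tool used in the lecture):

from skopt import gp_minimize

def f(x):
    # toy 1D objective to minimize
    return (x[0] - 2.0) ** 2

res = gp_minimize(f,                # objective
                  [(-5.0, 5.0)],    # search space for x
                  acq_func="EI",    # Expected Improvement
                  n_calls=30,       # evaluation budget
                  random_state=0)
print(res.x, res.fun)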

27 of 37

Gaussian Processes in a nutshell


  • Naively, a Gaussian Process is a probability distribution over possible functions
  • A GP describes a probability distribution over functions; Bayes’ theorem allows us to update this distribution as we collect more data / observations

What kind of problems are we talking about?

  • Suppose your data follow a function y=f(x). Given x, you have a response y through a function f.
  • Suppose now that you do not know the function f and you want to “learn” it.
  • In the figure, we use a GP to approximate this function f. Intuitively, the observed points constrain the model: the more points, the more accurate the approximation (see the sketch below).
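
A sketch with scikit-learn's GP regressor (the observations below come from a made-up f, here a sine, just for illustration):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# a handful of observations of the unknown function f(x)
X_obs = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])
y_obs = np.sin(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=1e-6, normalize_y=True)
gp.fit(X_obs, y_obs)

# posterior mean and uncertainty on a dense grid
X_grid = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_grid, return_std=True)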

28 of 37


GA Optimization Strategies

  • There is actually a variety of crossover types: single-point crossover, linear crossover, blend crossover, simulated binary crossover (SBX).

  • SBX is an efficient crossover for real-valued variables, which mimics the crossover of binary-encoded variables. It uses a probability density function that simulates the single-point crossover of binary-coded GAs (see the sketch below).
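
A minimal sketch of SBX for a single real-valued gene (eta is the distribution index; this is the textbook unbounded form, without per-variable crossover probabilities):

import numpy as np

def sbx_crossover(p1, p2, eta=15.0, rng=None):
    """Simulated Binary Crossover for two real-valued parents."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return c1, c2

children = sbx_crossover(1.0, 3.0)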

29 of 37


MOO Pipelines: e.g., MOBO

  • As the project evolved, so did our understanding of the design space and the possible ranges for each design parameter.
  • With MOBO, we aim to determine a more accurate approximation of the Pareto front

Multi-Objective Bayesian Optimization

(e.g., Ax: adaptive experimentation platform supported by Meta AI)

See 2nd AI4EIC workshop, https://indico.bnl.gov/e/AI4EIC

  • Using Ax/BoTorch and the novel qNEHVI acquisition function, with improved computational performance (arXiv:2105.08195)
  • Currently in the process of generalizing the design problem and increasing its complexity

An extension of the BO approach we already discussed

30 of 37


K-Fold Cross-Validation
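
A minimal scikit-learn sketch of (stratified) k-fold cross-validation, assuming the standardized training data from earlier:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = LogisticRegression()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(clf, X_train_std, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")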

31 of 37


Learning Curves
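
A sketch using scikit-learn's learning_curve utility (the plotting step is omitted; same hypothetical data as above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X_train_std, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

train_mean = train_scores.mean(axis=1)   # accuracy on the training folds
test_mean = test_scores.mean(axis=1)     # cross-validated accuracy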

32 of 37


Confusion Matrix
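
A minimal sketch (X_train_std, X_test_std, y_train, y_test as in the earlier examples):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

clf = LogisticRegression().fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)
# rows: true classes, columns: predicted classes
print(confusion_matrix(y_true=y_test, y_pred=y_pred))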

33 of 37


Classification Metrics

34 of 37


Classification Metrics

  • You want to maximize REC if you want to minimize the chance of not detecting a malignant tumor
  • You want to maximize PRE (precision) if you want to emphasize the correctness of the predictions that a tumor is malignant
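
A sketch of the corresponding scikit-learn metrics, assuming binary labels where the positive class is the malignant one and y_pred from the confusion-matrix example:

from sklearn.metrics import precision_score, recall_score, f1_score

print(f"PRE: {precision_score(y_test, y_pred):.3f}")
print(f"REC: {recall_score(y_test, y_pred):.3f}")
print(f"F1 : {f1_score(y_test, y_pred):.3f}")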

35 of 37


Classification Metrics

  • MCC useful with class imbalance
    • MCC = +1 Perfect prediction
    • MCC = 0 No better than random prediction
    • MCC = -1 Complete disagreement

Matthews Correlation Coefficient

  • The F1 score, by contrast, ranges from 0 (worst possible) to 1 (best possible)
  • The F1 score does not explicitly include TN (it is built from REC and PRE)
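
Both metrics are available in scikit-learn; a sketch with the same hypothetical y_test/y_pred as above:

from sklearn.metrics import matthews_corrcoef, f1_score

print(f"MCC: {matthews_corrcoef(y_test, y_pred):.3f}")   # in [-1, 1]
print(f"F1 : {f1_score(y_test, y_pred):.3f}")            # in [0, 1]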

36 of 37


Receiver Operating Characteristic
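
A sketch computing the ROC curve and the area under it (it needs predicted scores or probabilities rather than hard labels; clf is the fitted classifier from the earlier examples):

from sklearn.metrics import roc_curve, auc

# probability of the positive class
y_score = clf.predict_proba(X_test_std)[:, 1]
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score)
print(f"ROC AUC: {auc(fpr, tpr):.3f}")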

37 of 37


Spares