Data Analysis and Machine Learning

Amit Renanim – Alternative to Unit 3

מוטי בזר

mottibz@gmail.com

https://www.commridge.com

Exploratory Data Analysis (EDA) and Feature Engineering


Introduction to Exploratory Data Analysis (EDA)

What is EDA?

The process of analyzing and investigating a dataset to uncover its structure, identify patterns, and gain insights
A crucial first step in the data science workflow, as it helps understand the data and supports the subsequent steps (feature engineering, model building)

Importance of EDA in the data science workflow

Helps identify data quality issues (e.g., missing values, outliers, data distribution irregularities)

Addressing these issues early can significantly improve the performance of your machine learning models

Can also reveal relationships between features and target variables, which can guide your feature engineering efforts


Exploratory Data Analysis Techniques

Univariate analysis

Numerical features

Measures of central tendency (mean, median, mode)
Measures of dispersion (standard deviation, variance, range)
Histograms (show the distribution of a numerical feature) and box plots (display the median, quartiles, and outliers)
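The measures listed above can be sketched with Python's standard library. This is a minimal illustration, not the notebook's code: the `ages` values are toy numbers standing in for a numerical column, and `summarize` is a hypothetical helper name.

```python
import statistics

# Toy values standing in for a numerical column such as passenger age.
# These are illustrative numbers, not taken from the real dataset.
ages = [22, 38, 26, 35, 35, 54, 2, 27]

def summarize(values):
    """Return central-tendency and dispersion measures for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": statistics.stdev(values),          # sample standard deviation (n - 1)
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "range": max(values) - min(values),
    }

summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean that sits well away from the median, as it can when a few extreme values are present, is an early hint that the distribution is skewed and worth inspecting with a histogram or box plot.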

Categorical features

Frequency distributions (show how often each category appears in the dataset)
Bar plots (display the frequencies of each category), pie charts (show the relative proportions of each category)
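A frequency distribution for a categorical feature can be sketched as follows. The `embarked` list holds toy category codes chosen for illustration, and the text bar is only a stand-in for a real bar plot.

```python
from collections import Counter

# Toy category codes for illustration only (not real data).
embarked = ["S", "C", "S", "S", "Q", "C", "S", "S"]

counts = Counter(embarked)  # how often each category appears
total = len(embarked)

# Print categories from most to least frequent, with a crude text bar.
for category, count in counts.most_common():
    share = count / total
    print(f"{category}: {count} ({share:.0%}) " + "#" * count)
```

In a pandas-based notebook, `Series.value_counts()` would give the same counts.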

Bivariate analysis

Correlation analysis

Examines the linear relationship between two numerical features. The correlation coefficient ranges from -1 to 1: a positive value indicates a direct relationship, a negative value indicates an inverse relationship, and a value near 0 indicates little or no linear relationship

Scatter plots (visualize the relationship between two numerical features), heat maps (display the correlation coefficients between all pairs of features in a dataset)
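The slides do not say which correlation coefficient is meant. Assuming the Pearson coefficient, which is the most common choice for two numerical features, it can be written for $n$ paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ as:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two features tend to be above or below their means together, and negative when one tends to be above its mean while the other is below. The denominator rescales the result so that it always lies between -1 and 1.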

Multivariate analysis

Correlation matrices (show the pairwise correlation coefficients between all numerical features in a dataset)
Pair plots (display a matrix of scatter plots, showing the relationships between all pairs of numerical features)
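A correlation matrix can be sketched in plain Python as shown here, assuming the Pearson coefficient. The column names and values are toy data chosen for illustration, and `pearson` is a hypothetical helper, not a function from the notebook.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (assumes neither is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy numerical columns for illustration only (not real data).
columns = {
    "age": [22, 38, 26, 35, 54],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86],
    "pclass": [3, 1, 3, 1, 1],
}

names = list(columns)
# Nested dict: matrix[a][b] is the correlation between columns a and b.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Print the matrix as a small text table.
print(" " * 8 + "".join(f"{n:>8}" for n in names))
for a in names:
    print(f"{a:>8}" + "".join(f"{matrix[a][b]:8.2f}" for b in names))
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information. A heat map is a colored rendering of this same table.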


Introduction to Feature Engineering

Feature engineering is the process of creating, selecting, and transforming features from raw data to improve the performance of machine learning models
Effective feature engineering can:

Greatly improve a model's ability to capture the underlying patterns in the data, leading to better predictive performance
Help address issues like data imbalance, overfitting, and model interpretability

Common feature engineering techniques

Feature scaling

Standardization transforms features to have a mean of 0 and a standard deviation of 1
Normalization scales features to a common range, typically between 0 and 1
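Both scaling methods can be sketched in a few lines. The `fares` values are toy numbers, and `standardize` and `min_max_normalize` are hypothetical helper names; both assume the feature is not constant, since otherwise the denominator would be zero.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

def min_max_normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy values for illustration only (not real data).
fares = [7.25, 71.28, 7.93, 53.10, 8.05]

z_scores = standardize(fares)
scaled = min_max_normalize(fares)

print([round(z, 2) for z in z_scores])
print([round(s, 2) for s in scaled])
```

When a model is evaluated on held-out data, the mean, standard deviation, minimum, and maximum should be computed on the training set only and then reused to transform the test set.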

Handling missing values

Usually handled using imputation (replacing with values such as the mean or median) or dropping rows/columns with missing data
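The two options, imputing or dropping, can be sketched as follows. Missing entries are represented here by `None`, and the `ages` list is toy data chosen for illustration.

```python
import statistics

# Toy column with missing entries marked as None (not real data).
ages = [22, None, 26, 35, None, 54]

# Option 1: drop the missing entries.
observed = [a for a in ages if a is not None]

# Option 2: impute, here with the median of the observed values.
median_age = statistics.median(observed)
imputed = [median_age if a is None else a for a in ages]

print("dropped:", observed)
print("imputed:", imputed)
```

The median is less sensitive to outliers than the mean, which often makes it the safer default for skewed features. Dropping keeps only genuine values but shrinks the dataset.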

Encoding categorical features

One-hot encoding creates a binary column for each unique category; label encoding assigns an integer label to each category, which implies an ordering that may not exist in the data
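The difference between the two encodings is easiest to see side by side. This sketch uses toy category codes, and the alphabetical ordering of the labels is an arbitrary choice made for the example.

```python
# Toy categorical column for illustration only (not real data).
embarked = ["S", "C", "Q", "S"]

# Fix a category order; alphabetical here, which is an arbitrary choice.
categories = sorted(set(embarked))  # ['C', 'Q', 'S']

# Label encoding: one integer per category.
label_of = {category: index for index, category in enumerate(categories)}
label_encoded = [label_of[value] for value in embarked]

# One-hot encoding: one binary column per category, exactly one 1 per row.
one_hot = [[1 if value == category else 0 for category in categories]
           for value in embarked]

print(label_encoded)
for row in one_hot:
    print(row)
```

Label encoding makes "S" (2) look larger than "C" (0), which some models will treat as meaningful. One-hot encoding avoids that at the cost of one extra column per category.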

Feature transformation

Transforming features, such as applying logarithmic or polynomial functions, can help capture non-linear relationships in the data
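A logarithmic transform can be sketched as follows, using `math.log1p` (which computes log(1 + x)) so that zero values are handled. The `fares` values are toy numbers chosen to mimic a right-skewed feature.

```python
import math

# Toy right-skewed values for illustration only (not real data).
fares = [0.0, 7.25, 8.05, 71.28, 512.33]

# log1p(x) = log(1 + x): defined at zero and compresses large values.
log_fares = [math.log1p(f) for f in fares]

print("raw spread:", max(fares) - min(fares))
print("log spread:", round(max(log_fares) - min(log_fares), 2))
```

The transform preserves the order of the values while pulling extreme ones closer to the rest, which can help models that are sensitive to scale or to outliers.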

Feature selection

Identify the most important features in a dataset, reducing the dimensionality and improving model performance
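The slides do not name a specific selection method. As one simple illustration among many possible criteria, the sketch below drops features whose variance does not exceed a threshold, since a constant feature cannot help a model distinguish between rows. The feature names, values, and the `select_by_variance` helper are all hypothetical.

```python
import statistics

# Toy feature columns for illustration only (not real data).
features = {
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "is_passenger": [1, 1, 1, 1],  # constant column: carries no information
}

def select_by_variance(columns, threshold=0.0):
    """Keep the names of columns whose population variance exceeds the threshold."""
    return [name for name, values in columns.items()
            if statistics.pvariance(values) > threshold]

selected = select_by_variance(features)
print(selected)
```

Variance depends on the scale of each feature, so a non-zero threshold is only meaningful after the features have been brought to a comparable scale.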


The dataset

We will use the Titanic dataset again and present the different approaches to EDA and Feature Engineering.

Let’s move to the notebook.
