Data Analysis and Machine Learning

AMIT Renanim: Alternative to Unit 3

מוטי בזר

mottibz@gmail.com

https://www.commridge.com

Exploratory Data Analysis (EDA) and Feature Engineering

Introduction to Exploratory Data Analysis (EDA)

  • What is EDA?
    • The process of analyzing and investigating a dataset to uncover its structure, identify patterns, and gain insights
    • A crucial first step in the data science workflow, as it helps understand the data and supports the subsequent steps (feature engineering, model building)
  • Importance of EDA in the data science workflow
    • Helps identify data quality issues (e.g., missing values, outliers, data distribution irregularities)
      • Addressing these issues early can significantly improve the performance of your machine learning models
    • Can also reveal relationships between features and target variables, which can guide your feature engineering efforts
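
The data quality checks mentioned above can be sketched in a few lines of pandas. This is a minimal illustration, not the course notebook: the DataFrame is toy data invented for the example (its column names only resemble the Titanic dataset), and the 1.5 * IQR rule is one common outlier heuristic chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; these are NOT rows from the real Titanic dataset.
df = pd.DataFrame({
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, np.nan, 27.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 8.46, 512.33],
    "sex":  ["male", "female", "female", "female", "male", "male", "male", "female"],
})

# Missing values: count the NaN entries in each column.
missing_counts = df.isna().sum()

# Outliers: flag fares outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1 = df["fare"].quantile(0.25)
q3 = df["fare"].quantile(0.75)
iqr = q3 - q1
outlier_mask = (df["fare"] < q1 - 1.5 * iqr) | (df["fare"] > q3 + 1.5 * iqr)
fare_outliers = df.loc[outlier_mask, "fare"]

print(missing_counts)
print(fare_outliers)
```

On this toy data the check reports two missing ages and flags the single very large fare as an outlier. Whether a flagged value is an error or a genuine extreme still needs a human decision.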

Exploratory Data Analysis Techniques

  • Univariate analysis
    • Numerical features
      • Measures of central tendency (mean, median, mode)
      • Measures of dispersion (standard deviation, variance, range)
      • Histograms (show the distribution of a numerical feature), box plots (display the median, quartiles, and outliers)
    • Categorical features
      • Frequency distributions (show how often each category appears in the dataset)
      • Bar plots (display the frequencies of each category), pie charts (show the relative proportions of each category)
  • Bivariate analysis
    • Correlation analysis
      • Examines the linear relationship between two numerical features. The correlation coefficient ranges from -1 to 1: a positive value indicates a direct relationship, a negative value indicates an inverse relationship, and a value near 0 indicates little or no linear relationship
    • Scatter plots (visualize the relationship between two numerical features), heat maps (display the correlation coefficients between all pairs of features in a dataset)
  • Multivariate analysis
    • Correlation matrices (show the pairwise correlation coefficients between all numerical features in a dataset)
    • Pair plots (display a matrix of scatter plots, showing the relationships between all pairs of numerical features)
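
The univariate, bivariate, and multivariate summaries listed above can be computed directly with pandas. The sketch below is an illustration under stated assumptions: the DataFrame is toy data invented for the example (not actual Titanic rows), and the correlation is the Pearson coefficient, which is the pandas default.

```python
import pandas as pd

# Toy data invented for illustration; NOT rows from the real Titanic dataset.
df = pd.DataFrame({
    "age":      [22, 38, 26, 35, 54, 2, 27, 14],
    "fare":     [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "pclass":   [3, 1, 3, 1, 1, 3, 3, 2],
    "embarked": ["S", "C", "S", "S", "S", "S", "S", "C"],
})

# Univariate, numerical: central tendency and dispersion of one feature.
age_summary = {
    "mean": df["age"].mean(),
    "median": df["age"].median(),
    "std": df["age"].std(),        # sample standard deviation (ddof=1)
    "variance": df["age"].var(),   # sample variance (ddof=1)
    "range": df["age"].max() - df["age"].min(),
}

# Univariate, categorical: frequency distribution, proportions, and mode.
embarked_counts = df["embarked"].value_counts()
embarked_share = df["embarked"].value_counts(normalize=True)
embarked_mode = df["embarked"].mode()[0]

# Bivariate: Pearson correlation between two numerical features.
fare_pclass_corr = df["fare"].corr(df["pclass"])

# Multivariate: pairwise correlation matrix of all numerical features.
corr_matrix = df[["age", "fare", "pclass"]].corr()

print(age_summary)
print(embarked_counts)
print(corr_matrix)
```

In a notebook, and assuming matplotlib is installed, `df["age"].hist()`, `df.boxplot(column="fare")`, and `pd.plotting.scatter_matrix(df[["age", "fare", "pclass"]])` draw the corresponding histogram, box plot, and pair plot.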

Introduction to Feature Engineering

  • Feature engineering is the process of creating, selecting, and transforming features from raw data to improve the performance of machine learning models
  • Effective feature engineering can:
    • Greatly improve a model's ability to capture the underlying patterns in the data, leading to better predictive performance
    • Help mitigate data imbalance and overfitting, and make models easier to interpret
  • Common feature engineering techniques
    • Feature scaling
      • Standardization transforms features to have a mean of 0 and a standard deviation of 1
      • Normalization scales features to a common range, typically between 0 and 1
    • Handling missing values
      • Usually handled using imputation (replacing with values such as the mean or median) or dropping rows/columns with missing data
    • Encoding categorical features
      • One-hot encoding creates a binary column for each unique category; label encoding assigns an integer label to each category, which implies an ordering that may not exist in the data
    • Feature transformation
      • Transforming features, such as applying logarithmic or polynomial functions, can help capture non-linear relationships in the data
    • Feature selection
      • Identifies the most important features in a dataset, reducing dimensionality and often improving model performance
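
Several of the techniques above (imputation, scaling, transformation, and encoding) can be sketched with plain pandas and NumPy. This is a minimal illustration, not the course notebook: the DataFrame is toy data invented for the example, the new column names are hypothetical, and median imputation and `log1p` are choices made here as assumptions. Feature selection is not covered.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; NOT rows from the real Titanic dataset.
df = pd.DataFrame({
    "age":      [22.0, 38.0, np.nan, 35.0, 54.0],
    "fare":     [7.25, 71.28, 7.92, 53.10, 512.33],
    "embarked": ["S", "C", "Q", "S", "C"],
})

# Handling missing values: replace missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Standardization: rescale to mean 0 and standard deviation 1.
# ddof=0 uses the population standard deviation.
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)

# Normalization (min-max scaling): rescale to the range [0, 1].
fare_min, fare_max = df["fare"].min(), df["fare"].max()
df["fare_minmax"] = (df["fare"] - fare_min) / (fare_max - fare_min)

# Feature transformation: log(1 + x) compresses the long right tail of fares.
df["fare_log"] = np.log1p(df["fare"])

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["embarked"], prefix="embarked")

# Label encoding: one integer per category (categories sorted alphabetically).
df["embarked_label"] = df["embarked"].astype("category").cat.codes

print(df)
print(one_hot)
```

Here the label codes follow alphabetical order (C, Q, S), which is arbitrary for embarkation ports. That is why one-hot encoding is often the safer choice for unordered categories.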

The dataset

We will use the Titanic dataset again and present the different approaches to EDA and Feature Engineering.

Let’s move to the notebook.
