1 of 12

MACHINE LEARNING · DATA PREPARATION

Sales Forecasting

ML Data Preparation

End-to-end data cleaning, feature engineering & scaling for an e-commerce revenue prediction model

200 Records · 10→17 Features · 0 Missing Values · 32 Outliers Fixed

Adham Amr Morgan · Machine Learning Engineer

2 of 12

PROJECT OVERVIEW

Client Brief

Platform: Kafiil Freelance Platform

Service: ML Data Preparation

Rating: ⭐ 5.0 / 5.0 (All metrics)

Dataset Size: 200 rows × 10 features

Objective

Prepare, clean, and transform raw e-commerce transaction data to train a high-accuracy Sales Forecasting model for Revenue prediction.

Deliverables

✓ Raw Data (reference)

✓ Cleaned & Engineered Data

✓ ML-Ready Scaled Data

✓ Quality Report Sheet

E-Commerce Sales Forecasting · ML Data Preparation

3 of 12

DATASET DESCRIPTION

Column | Type | Description | Sample Value
Order_Date | Date | Transaction date | 2026-01-01
Customer_ID | String | Unique customer ID | C-1001
Category | Categorical | Product category | Electronics
Product | Categorical | Product name | iPhone 15
Qty | Integer | Quantity ordered | 3
Price | Float | Unit price (EGP) | 1200
Discount | Float | Discount rate (0–15%) | 0.05
City | Categorical | Customer city | Cairo
Rating | Integer | Customer rating (2–5) | 5
Return | Binary | Item returned? | No

200 records · 5 Egyptian cities · 3 product categories · Jan–Jul 2026

4 of 12

WORKFLOW & PIPELINE

01 · Data Ingestion · Load raw CSV/Excel data · 200 records × 10 columns

02 · Data Cleaning · Fix outliers (IQR capping) · Handle missing values

03 · Feature Engineering · Revenue formula · Date & time features

04 · Encoding · Category, City, Product · Label & Frequency Encoding

05 · Scaling · StandardScaler on 9 numeric columns

🏁 OUTPUT: ML-Ready Dataset (17 features) — Revenue as Target Variable

5 of 12

DATA CLEANING PROCESS

🔍 Outlier Detection & Capping

Applied IQR method to cap extreme values:

• Qty: 17 values capped

• Price: 4 values capped

• Discount: 11 values capped

Total: 32 outliers treated
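The capping step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `cap_iqr`, the conventional 1.5 × IQR fence multiplier, and the toy `Qty` values are all assumptions, since the slides do not state them.

```python
import pandas as pd


def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values.

    k = 1.5 is the conventional multiplier (assumed; not stated in the slides).
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # clip() replaces out-of-range values with the bound, so no rows are dropped
    return series.clip(lower=lower, upper=upper)


# Toy data (hypothetical): one obviously extreme quantity
df = pd.DataFrame({"Qty": [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]})
capped = cap_iqr(df["Qty"])

# Count how many values were changed by the capping
n_capped = int((capped != df["Qty"]).sum())
print(n_capped, capped.max())
```

Because values are clipped rather than removed, the row count is unchanged, which matches the "All 200 rows retained" figure reported on this slide.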

✅ Missing Values

Full dataset audit conducted:

• 0 missing values found

• All 200 rows retained

• No imputation needed

Data quality: 100%
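A missing-value audit of this kind might look like the sketch below. The toy rows are hypothetical, and defining "data quality" as the share of non-missing cells is an assumption about how the 100% figure was derived.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the 200-row dataset
df = pd.DataFrame({
    "Qty": [3, 1, 2],
    "Price": [1200.0, 350.0, 80.0],
    "City": ["Cairo", "Alexandria", "Cairo"],
})

# Missing cells per column, then overall
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())

# Completeness as the share of non-missing cells (assumed definition)
completeness = 1.0 - total_missing / df.size
print(missing_per_column.to_dict(), f"{completeness:.0%}")
```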

📐 Data Type Standardization

Ensured correct dtypes:

• Dates → datetime64

• Prices/Qty → float64

• Categoricals → object

Consistent schema enforced
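The dtype mapping listed above could be enforced roughly as follows. This is a sketch under assumptions: the raw values and the fact that `Price` arrives as text are invented for illustration, and the second category name is hypothetical.

```python
import pandas as pd

# Toy raw data (hypothetical): dates and prices arrive as strings
raw = pd.DataFrame({
    "Order_Date": ["2026-01-01", "2026-01-02"],
    "Qty": [3, 1],
    "Price": ["1200", "350.5"],
    "Category": ["Electronics", "Fashion"],
})

typed = raw.assign(
    Order_Date=pd.to_datetime(raw["Order_Date"]),         # Dates -> datetime64
    Qty=raw["Qty"].astype("float64"),                     # Qty -> float64
    Price=pd.to_numeric(raw["Price"]).astype("float64"),  # Prices -> float64
    Category=raw["Category"].astype("object"),            # Categoricals -> object
)
print(typed.dtypes)
```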

IQR Capping ensures model robustness without data loss

6 of 12

FEATURE ENGINEERING

New Features Created (10 → 19)

Tag | Feature | Definition
TARGET | Revenue | Qty × Price × (1 – Discount)
DATE | Year | Extracted from Order_Date
DATE | Month | Extracted from Order_Date
DATE | DayOfWeek | 0 = Monday … 6 = Sunday
DATE | DayOfMonth | Day number in month (1–31)
ENC | Category_Enc | Label Encoding (3 classes)
ENC | City_Enc | Label Encoding (5 cities)
ENC | Return_Enc | Binary: No=0, Yes=1
ENC | Product_FreqEnc | Frequency Encoding by product
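As an illustration of the target and date features listed above, here is a minimal pandas sketch. The two toy rows are hypothetical; the column names follow the slides.

```python
import pandas as pd

# Toy rows (hypothetical values)
df = pd.DataFrame({
    "Order_Date": pd.to_datetime(["2026-01-01", "2026-03-15"]),
    "Qty": [3, 2],
    "Price": [100.0, 250.0],
    "Discount": [0.05, 0.10],
})

# Target: Revenue = Qty × Price × (1 − Discount)
df["Revenue"] = df["Qty"] * df["Price"] * (1 - df["Discount"])

# Date features via the .dt accessor
df["Year"] = df["Order_Date"].dt.year
df["Month"] = df["Order_Date"].dt.month
df["DayOfWeek"] = df["Order_Date"].dt.dayofweek  # 0 = Monday ... 6 = Sunday
df["DayOfMonth"] = df["Order_Date"].dt.day

print(df[["Revenue", "Year", "Month", "DayOfWeek", "DayOfMonth"]])
```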

Encoding Summary

🏷️ Label Encoding

• Category → {0,1,2}

• City → {0,1,2,3,4}

• Return → {0,1}

📊 Frequency Encoding

• Product mapped to occurrence count in dataset

• Preserves popularity signal
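The two encoding schemes summarized above can be sketched as follows. Using scikit-learn's `LabelEncoder` is an assumption (the slides only say "Label Encoding"), and the category and product names other than "Electronics" and "iPhone 15" are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows (hypothetical values)
df = pd.DataFrame({
    "Category": ["Electronics", "Fashion", "Home", "Electronics"],
    "Return": ["No", "Yes", "No", "No"],
    "Product": ["iPhone 15", "T-Shirt", "Lamp", "iPhone 15"],
})

# Label encoding: classes are sorted alphabetically, then numbered from 0
df["Category_Enc"] = LabelEncoder().fit_transform(df["Category"])

# Binary encoding with an explicit mapping, so No=0 and Yes=1 is guaranteed
df["Return_Enc"] = df["Return"].map({"No": 0, "Yes": 1})

# Frequency encoding: each product is replaced by its occurrence count
df["Product_FreqEnc"] = df["Product"].map(df["Product"].value_counts())

print(df)
```

Label encoding assigns arbitrary integers, so the numeric order carries no meaning; tree-based models tolerate this, while linear models may read it as an ordering.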

7 of 12

BEFORE vs AFTER COMPARISON

BEFORE: Raw Data

✗ 10 raw columns only

✗ Mixed text & numeric data

✗ Outliers present in Qty, Price, Discount

✗ No Revenue column (target missing)

✗ No time features extracted

✗ Category/City as text strings

✗ No standardization applied

AFTER: ML-Ready Data

✓ 17 engineered features

✓ All numeric, model-ready

✓ 32 outliers capped via IQR

✓ Revenue column = Target Variable

✓ Month, DayOfWeek, DayOfMonth added

✓ Encoded: Category/City/Return/Product

✓ 9 columns StandardScaled (μ=0, σ=1)

8 of 12

DATA VISUALIZATIONS & INSIGHTS

Mean Revenue: EGP 849 · Median: EGP 570 · Max: EGP 2,633

9 of 12

DATA SCALING & STANDARDIZATION

StandardScaler Formula:

z = (x − μ) / σ → Mean = 0, Standard Deviation = 1

Feature | Original Mean | Scaled Mean | Result
Qty_Scaled | 4.59 | 0.0 | ✓ Normalized
Price_Scaled | 482.0 | 0.0 | ✓ Normalized
Discount_Scaled | 0.035 | 0.0 | ✓ Normalized
Revenue_Scaled | 848.8 | 0.0 | ✓ Normalized
Rating_Scaled | 4.265 | 0.0 | ✓ Normalized
Month_Scaled | 4.0 | 0.0 | ✓ Normalized
DayOfWeek_Scaled | 3.05 | 0.0 | ✓ Normalized
DayOfMonth_Scaled | 15.8 | 0.0 | ✓ Normalized
Product_FreqEnc_Scaled | 6.05 | 0.0 | ✓ Normalized

All 9 numeric features standardized to zero mean and unit variance for model training
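A minimal sketch of the scaling step with scikit-learn's `StandardScaler`, shown on two of the nine columns. The toy values are hypothetical; the `_Scaled` suffix follows the table above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) for two of the nine scaled columns
df = pd.DataFrame({
    "Qty": [1.0, 3.0, 5.0, 7.0],
    "Price": [100.0, 200.0, 400.0, 900.0],
})
cols = ["Qty", "Price"]

# z = (x - mean) / std, computed per column with the population std (ddof=0)
scaled = StandardScaler().fit_transform(df[cols])
for i, col in enumerate(cols):
    df[f"{col}_Scaled"] = scaled[:, i]

print(df[["Qty_Scaled", "Price_Scaled"]].mean().round(6).to_dict())
```

When the data is later split for training and evaluation, a common practice is to fit the scaler on the training portion only and reuse it to transform the rest.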

10 of 12

TOOLS & TECHNOLOGIES

Language | Python 3.x | Core programming language
Library | Pandas | Data manipulation & analysis
Library | NumPy | Numerical computing & arrays
ML | Scikit-learn | StandardScaler & ML utilities
Viz | Matplotlib | Data visualization & EDA
Viz | Seaborn | Statistical visualizations
Export | OpenPyXL | Excel file generation
IDE | Jupyter Notebook | Interactive development & docs

11 of 12

KEY INSIGHTS & RESULTS

100% · Data Completeness · Zero missing values

32 · Outliers Treated · IQR capping method

17 · Final Features · From 10 original

μ=0 · Post-Scale Mean · All numeric columns

Key Data Insights

🏆 Electronics leads revenue with highest average order value (AOV)

📍 Cairo dominates with 30% of total orders — key market segment

↩️ Return rate is only 8.5% — indicating high customer satisfaction

💡 Revenue (Target) ranges EGP 150–2,633; right-skewed distribution

📅 Month and Category_Enc are the strongest predictors for the model

⚡ Frequency Encoding captures product popularity without target leakage, since it uses only product counts and never the Revenue values

Dataset fully validated — ready for Regression / Gradient Boosting models

12 of 12

SKILLS DEMONSTRATED

✓ Data Cleaning & Quality Assurance

✓ Outlier Detection (IQR Method)

✓ Feature Engineering & Extraction

✓ Label & Frequency Encoding

✓ StandardScaler Normalization

✓ Time-series Feature Extraction

✓ Python · Pandas · Scikit-learn

✓ Excel / OpenPyXL Reporting

✓ ML Pipeline Design

⭐⭐⭐⭐⭐ Client Rating: 5.0/5.0 — Work Quality · Communication · On-Time Delivery

Available for ML Data Preparation, Cleaning & Pipeline Projects

Adham Amr Said Morgan · ML Engineer