MACHINE LEARNING · DATA PREPARATION
Sales Forecasting
ML Data Preparation
End-to-end data cleaning, feature engineering & scaling for an e-commerce revenue prediction model
200 Records · 10→17 Features · 0 Missing Values · 32 Outliers Fixed
Adham Amr Morgan · Machine Learning Engineer
PROJECT OVERVIEW
Client Brief
Platform: Kafiil Freelance Platform
Service: ML Data Preparation
Rating: ⭐ 5.0 / 5.0 (All metrics)
Dataset Size: 200 rows × 10 features
Objective
Prepare, clean, and transform raw e-commerce transaction data to train a high-accuracy Sales Forecasting model for Revenue prediction.
Deliverables
✓ Raw Data (reference)
✓ Cleaned & Engineered Data
✓ ML-Ready Scaled Data
✓ Quality Report Sheet
E-Commerce Sales Forecasting · ML Data Preparation
DATASET DESCRIPTION
Column | Type | Description | Sample Value |
Order_Date | Date | Transaction date | 2026-01-01 |
Customer_ID | String | Unique customer ID | C-1001 |
Category | Categorical | Product category | Electronics |
Product | Categorical | Product name | iPhone 15 |
Qty | Integer | Quantity ordered | 3 |
Price | Float | Unit price (EGP) | 1200 |
Discount | Float | Discount rate (0–15%) | 0.05 |
City | Categorical | Customer city | Cairo |
Rating | Integer | Customer rating (2–5) | 5 |
Return | Binary | Item returned? | No |
200 records · 5 Egyptian cities · 3 product categories · Jan–Jul 2026
WORKFLOW & PIPELINE
01 Data Ingestion: Load raw CSV/Excel data (200 records × 10 columns)
02 Data Cleaning: Fix outliers (IQR capping); handle missing values
03 Feature Engineering: Revenue formula; date & time features
04 Encoding: Category, City, Product (Label & Frequency Encoding)
05 Scaling: StandardScaler on 9 numeric columns
🏁 OUTPUT: ML-Ready Dataset (17 features) — Revenue as Target Variable
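The ingestion step can be sketched as follows. This is a minimal illustration, assuming the raw data arrives as a CSV with the ten columns listed in the dataset description. The `load_raw` helper, the `EXPECTED_COLUMNS` list, and the inline sample rows are hypothetical stand-ins, not the project's actual code or data.

```python
import io

import pandas as pd

# Column names taken from the dataset description table.
EXPECTED_COLUMNS = [
    "Order_Date", "Customer_ID", "Category", "Product", "Qty",
    "Price", "Discount", "City", "Rating", "Return",
]

# Hypothetical sample rows standing in for the real 200-row file.
SAMPLE_CSV = """Order_Date,Customer_ID,Category,Product,Qty,Price,Discount,City,Rating,Return
2026-01-01,C-1001,Electronics,iPhone 15,3,1200,0.05,Cairo,5,No
2026-01-02,C-1002,Electronics,iPhone 15,1,1200,0.00,Cairo,4,Yes
"""


def load_raw(source) -> pd.DataFrame:
    """Read the raw CSV and fail fast if an expected column is absent."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Return the columns in a fixed, documented order.
    return df[EXPECTED_COLUMNS]


raw = load_raw(io.StringIO(SAMPLE_CSV))
print(raw.shape)
```

Checking the schema at load time means a renamed or dropped column surfaces immediately instead of during feature engineering.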
DATA CLEANING PROCESS
🔍 Outlier Detection & Capping
Applied IQR method to cap extreme values:
• Qty: 17 values capped
• Price: 4 values capped
• Discount: 11 values capped
Total: 32 outliers treated
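A minimal sketch of IQR capping on a single column is shown here. The 1.5 × IQR fence is the conventional choice and is an assumption, since the multiplier used in the project is not stated. The `cap_iqr` helper and the sample quantities are hypothetical.

```python
import pandas as pd


def cap_iqr(series: pd.Series, k: float = 1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]; return (capped, n_capped)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Count the values that fall outside the fences before clipping them.
    n_capped = int(((series < lower) | (series > upper)).sum())
    return series.clip(lower=lower, upper=upper), n_capped


# Hypothetical quantities with one extreme value (40).
qty = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 40.0], name="Qty")
capped, n = cap_iqr(qty)
print(n, capped.max())
```

Because values are clipped to the fence rather than dropped, the row count stays the same, which matches the "all 200 rows retained" result reported for this dataset.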
✅ Missing Values
Full dataset audit conducted:
• 0 missing values found
• All 200 rows retained
• No imputation needed
Data completeness: 100%
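One way such an audit might look in pandas is sketched below. The `missing_report` helper and the three sample rows are hypothetical; the real audit covered all 200 rows and 10 columns.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and percentage of missing values."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "missing": counts,
        "pct": (counts / len(df) * 100).round(2),
    })


# Hypothetical sample with no gaps.
df = pd.DataFrame({"Qty": [3, 1, 2], "Price": [1200.0, 1200.0, 1200.0]})
report = missing_report(df)

total_missing = int(report["missing"].sum())
# Completeness = share of cells that are populated.
completeness = 100.0 * (1 - total_missing / df.size)
print(report)
print(f"Completeness: {completeness:.1f}%")
```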
📐 Data Type Standardization
Ensured correct dtypes:
• Dates → datetime64
• Prices/Qty → float64
• Categoricals → object
Consistent schema enforced
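The dtype conversions listed above can be sketched as follows, assuming dates arrive as ISO-formatted strings and some numeric columns were read as text. The sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical raw rows: dates as strings, Price read as text.
raw = pd.DataFrame({
    "Order_Date": ["2026-01-01", "2026-01-02"],
    "Qty": [3, 1],
    "Price": ["1200", "1200"],
    "Category": ["Electronics", "Electronics"],
})

typed = raw.copy()
# Dates -> datetime64
typed["Order_Date"] = pd.to_datetime(typed["Order_Date"], format="%Y-%m-%d")
# Prices/Qty -> float64
typed["Qty"] = typed["Qty"].astype("float64")
typed["Price"] = pd.to_numeric(typed["Price"]).astype("float64")
# Categoricals -> object
typed["Category"] = typed["Category"].astype("object")

print(typed.dtypes)
```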
IQR capping limits the influence of extreme values while keeping all 200 rows
FEATURE ENGINEERING
New Features Created (9 engineered columns)
Group | Feature | Definition |
TARGET | Revenue | = Qty × Price × (1 – Discount) |
DATE | Year | Extracted from Order_Date |
DATE | Month | Extracted from Order_Date |
DATE | DayOfWeek | 0 = Monday … 6 = Sunday |
DATE | DayOfMonth | Day number in month (1–31) |
ENC | Category_Enc | Label Encoding (3 classes) |
ENC | City_Enc | Label Encoding (5 cities) |
ENC | Return_Enc | Binary: No=0, Yes=1 |
ENC | Product_FreqEnc | Frequency Encoding by product |
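The Revenue target and the date features can be derived in a few lines of pandas. This is a sketch on two hypothetical rows, assuming `Order_Date` is already a datetime column; it is not the project's actual notebook code.

```python
import pandas as pd

# Hypothetical rows; values chosen for easy arithmetic.
df = pd.DataFrame({
    "Order_Date": pd.to_datetime(["2026-01-01", "2026-01-05"]),
    "Qty": [1, 2],
    "Price": [1200.0, 500.0],
    "Discount": [0.05, 0.10],
})

# Target: Revenue = Qty × Price × (1 − Discount)
df["Revenue"] = df["Qty"] * df["Price"] * (1 - df["Discount"])

# Date features extracted from Order_Date
df["Year"] = df["Order_Date"].dt.year
df["Month"] = df["Order_Date"].dt.month
df["DayOfWeek"] = df["Order_Date"].dt.dayofweek  # 0 = Monday … 6 = Sunday
df["DayOfMonth"] = df["Order_Date"].dt.day

print(df[["Revenue", "Year", "Month", "DayOfWeek", "DayOfMonth"]])
```

For the first row, 1 × 1200 × (1 − 0.05) gives a revenue of 1140, and 2026-01-01 falls on a Thursday, so `DayOfWeek` is 3.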
Encoding Summary
🏷️ Label Encoding
• Category → {0,1,2}
• City → {0,1,2,3,4}
• Return → {0,1}
📊 Frequency Encoding
• Product mapped to its occurrence count in the dataset
• Preserves popularity signal
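Both encodings can be sketched with plain pandas. The approach below uses category codes for label encoding, which is one possible implementation and an assumption, since the source does not say which encoder was used. Every sample value other than "Electronics", "Cairo", and "iPhone 15" is hypothetical.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Category": ["Electronics", "Fashion", "Electronics", "Home"],
    "City": ["Cairo", "Giza", "Cairo", "Cairo"],
    "Return": ["No", "Yes", "No", "No"],
    "Product": ["iPhone 15", "T-Shirt", "iPhone 15", "Lamp"],
})

# Label encoding: each distinct value gets an integer code
# (codes follow the alphabetical order of the categories).
df["Category_Enc"] = df["Category"].astype("category").cat.codes
df["City_Enc"] = df["City"].astype("category").cat.codes

# Binary encoding with an explicit mapping: No=0, Yes=1.
df["Return_Enc"] = df["Return"].map({"No": 0, "Yes": 1})

# Frequency encoding: replace each product with its occurrence count.
product_counts = df["Product"].value_counts()
df["Product_FreqEnc"] = df["Product"].map(product_counts)

print(df[["Category_Enc", "City_Enc", "Return_Enc", "Product_FreqEnc"]])
```

When the data is later split for training and evaluation, computing `product_counts` on the training portion only keeps information from the held-out rows out of the feature.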
BEFORE vs AFTER COMPARISON
BEFORE — Raw Data
✗ 10 raw columns only
✗ Mixed text & numeric data
✗ Outliers present in Qty, Price, Discount
✗ No Revenue column (target missing)
✗ No time features extracted
✗ Category/City as text strings
✗ No standardization applied
AFTER — ML-Ready Data
✓ 17 engineered features
✓ All numeric — model-ready
✓ 32 outliers capped via IQR
✓ Revenue column = Target Variable
✓ Month, DayOfWeek, DayOfMonth added
✓ Encoded: Category/City/Return/Product
✓ 9 columns StandardScaled (μ=0, σ=1)
DATA VISUALIZATIONS & INSIGHTS
Mean Revenue: EGP 849 · Median: EGP 570 · Max: EGP 2,633
DATA SCALING & STANDARDIZATION
StandardScaler Formula:
z = (x − μ) / σ → Mean = 0, Standard Deviation = 1
Feature | Original Mean | Scaled Mean | Result |
Qty_Scaled | 4.59 | 0.0 | ✓ Standardized |
Price_Scaled | 482.0 | 0.0 | ✓ Standardized |
Discount_Scaled | 0.035 | 0.0 | ✓ Standardized |
Revenue_Scaled | 848.8 | 0.0 | ✓ Standardized |
Rating_Scaled | 4.265 | 0.0 | ✓ Standardized |
Month_Scaled | 4.0 | 0.0 | ✓ Standardized |
DayOfWeek_Scaled | 3.05 | 0.0 | ✓ Standardized |
DayOfMonth_Scaled | 15.8 | 0.0 | ✓ Standardized |
Product_FreqEnc_Scaled | 6.05 | 0.0 | ✓ Standardized |
All 9 numeric features standardized to zero mean and unit variance
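The scaling step can be sketched with scikit-learn's `StandardScaler` on two hypothetical columns. The manual calculation alongside it confirms that the scaler applies z = (x − μ) / σ using the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; the real pipeline scales 9 numeric columns.
df = pd.DataFrame({
    "Qty": [1.0, 2.0, 3.0, 4.0],
    "Price": [100.0, 200.0, 300.0, 400.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["Qty", "Price"]])
df["Qty_Scaled"] = scaled[:, 0]
df["Price_Scaled"] = scaled[:, 1]

# Manual z-score for comparison: z = (x − μ) / σ, with population σ.
manual_qty = (df["Qty"] - df["Qty"].mean()) / df["Qty"].std(ddof=0)

print(df[["Qty_Scaled", "Price_Scaled"]].mean().round(6))
print(np.allclose(df["Qty_Scaled"], manual_qty))
```

If the data is split before modeling, fitting the scaler on the training rows and calling `transform` on the rest keeps test-set statistics out of the features.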
TOOLS & TECHNOLOGIES
Type | Tool | Role |
Language | Python 3.x | Core programming language |
Library | Pandas | Data manipulation & analysis |
Library | NumPy | Numerical computing & arrays |
ML | Scikit-learn | StandardScaler & ML utilities |
Viz | Matplotlib | Data visualization & EDA |
Viz | Seaborn | Statistical visualizations |
Export | OpenPyXL | Excel file generation |
IDE | Jupyter Notebook | Interactive development & docs |
KEY INSIGHTS & RESULTS
100% Data Completeness: zero missing values
32 Outliers Treated: IQR capping method
17 Final Features: from 10 original
μ=0 Post-Scale Mean: all numeric columns
Key Data Insights
🏆 Electronics leads revenue with highest average order value (AOV)
📍 Cairo dominates with 30% of total orders — key market segment
↩️ Return rate is only 8.5% — indicating high customer satisfaction
💡 Revenue (Target) ranges EGP 150–2,633; right-skewed distribution
📅 Month and Category_Enc are the strongest predictors for the model
⚡ Frequency Encoding captures product popularity without target leakage
Dataset fully validated — ready for Regression / Gradient Boosting models
SKILLS DEMONSTRATED
✓ Data Cleaning & Quality Assurance
✓ Outlier Detection (IQR Method)
✓ Feature Engineering & Extraction
✓ Label & Frequency Encoding
✓ Feature Standardization (StandardScaler)
✓ Time-series Feature Extraction
✓ Python · Pandas · Scikit-learn
✓ Excel / OpenPyXL Reporting
✓ ML Pipeline Design
⭐⭐⭐⭐⭐ Client Rating: 5.0/5.0 — Work Quality · Communication · On-Time Delivery
Available for ML Data Preparation, Cleaning & Pipeline Projects
Adham Amr Said Morgan · ML Engineer