MACHINE LEARNING · AUTOMOTIVE DATA
Used Car Price
Prediction
ML Data Preparation & Feature Engineering
500
Raw Records
7→18
Features
44→0
Missing Values
18
Outliers Removed
Adham Amr Morgan · Machine Learning Engineer · Kafiil Platform ⭐ 5.0
PROJECT OVERVIEW
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
Client Brief
Platform:
Kafiil Freelance Platform
Service:
ML Data Preparation
Rating:
⭐ 5.0 / 5.0 (All metrics)
Dataset:
500 rows × 7 columns (CSV)
Domain:
Automotive — Used Cars
Objective
Prepare, clean and engineer a used car dataset (500 records) to train a Machine Learning model capable of accurately predicting used car prices.
Client Requirements
→ Remove incorrect / outlier prices
→ Handle missing Year & Mileage values
→ Standardize Mileage format (224k→224000)
→ Encode text columns (Transmission, Fuel)
→ Prepare ML-ready numeric dataset
DATASET DESCRIPTION
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
Column | Type | Issue / Notes | Sample Value |
Car_Brand | Categorical | 8 brands — needs One-Hot Encoding | Mercedes, Ford, Toyota… |
Model | Categorical | Reference only — not used in ML | C-Class, Camry, Accord |
Year | Float | 26 missing values → imputed by Median | 2016.0 |
Mileage | String | Mixed format: '224k', '115,468' | 224k / 150471 |
Transmission | Categorical | Manual / Automatic → Label Encoding | Manual, Automatic |
Fuel_Type | Categorical | Petrol / Diesel / Hybrid → Encoding | Petrol, Diesel, Hybrid |
Price | Integer | 8 zeros + 10 extreme outliers > 2M | 63898 / 4232513 |
Dataset: 500 rows · 8 car brands · Jan 2000 – 2024 · Mileage: mixed text+numeric · Price: 0 to 4.97M (outliers present)
WORKFLOW & PIPELINE
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
01
Load & Inspect
Read CSV
Audit shape,
missing & dtypes
02
Mileage Clean
Convert '224k'
to 224000
Remove commas
03
Price Cleaning
Remove 8 zeros
& 10 outliers
>2,000,000
04
Imputation
Year: 26 nulls
Mileage: 18 nulls
→ Median fill
05
Encoding
Transmission
Fuel_Type
One-Hot Brands
06
Feature Eng.
Car_Age =
2024 - Year
New ML feature
🏁 OUTPUT: 482 clean records · 18 features · Revenue-ready for Random Forest / XGBoost
DATA CLEANING PROCESS
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
🔢 Mileage Standardization
Problem: Mixed format strings
'224k' → 224,000
'115,468' → 115468
'196k' → 196,000
Result: All numeric float64
Affected rows: 500 (full column)
💰 Price Outlier Removal
Problem: Corrupt & extreme prices
Removed 8 rows with Price = 0
Removed 10 rows with Price > 2M
Example: Nissan Patrol @ 4.23M
Example: S-Class @ 4.97M
Total removed: 18 rows
🩹 Missing Value Imputation
Year: 26 missing → Median (2012)
Mileage: 18 missing → Median
Strategy: Median (robust to skew)
Zero data loss from imputation
Final missing values: 0 / 0
Data completeness: 100%
FEATURE ENGINEERING & ENCODING
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
New Features (7 → 18 columns)
NEW
Car_Age
= 2024 – Year (stronger predictor)
ENC
Transmission_Encoded
Manual=0 / Automatic=1
ENC
Fuel_Type_Encoded
Petrol=0 / Diesel=1 / Hybrid=2
1-HOT
Brand_BMW
1 if BMW, else 0
1-HOT
Brand_Ford
1 if Ford, else 0
1-HOT
Brand_Honda
1 if Honda, else 0
1-HOT
Brand_Hyundai
1 if Hyundai, else 0
1-HOT
Brand_Mercedes
1 if Mercedes, else 0
1-HOT
Brand_Nissan
1 if Nissan, else 0
1-HOT
Brand_Toyota
1 if Toyota, else 0
Why Car_Age?
Year (e.g. 2012) is less informative for an ML model than
Car_Age = 12 years because the model
learns depreciation curves more naturally from age (a linear, continuous variable) than from year labels.
Encoding Strategy
Label Encoding
Cols: Transmission, Fuel_Type
Why: Ordinal / binary columns
One-Hot Encoding
Cols: Car_Brand (8 brands)
Why: Nominal — avoids ordinal bias
BEFORE vs AFTER COMPARISON
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
BEFORE — Raw Data
AFTER — ML-Ready Data
✗ 500 rows (18 corrupt/outlier rows)
✗ 7 columns only
✗ 44 missing values (Year + Mileage)
✗ Mileage: '224k', '115,468' — text
✗ Price includes 0s and 4.97M values
✗ Text columns: Manual/Automatic
✗ No Car_Age feature
✗ Model not usable directly
✓ 482 rows (clean, validated)
✓ 18 feature columns
✓ 0 missing values
✓ Mileage: all float64 numeric
✓ Price range: EGP 27 – 149,886
✓ Encoded: 0/1/2 numerics
✓ Car_Age derived (stronger predictor)
✓ 8 brand One-Hot columns added
VS
DATA VISUALIZATIONS & INSIGHTS
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
TOOLS & TECHNOLOGIES
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
Language
Python 3.x
Core language for all processing
Library
Pandas
Data loading, cleaning & manipulation
Library
NumPy
Numerical operations & array handling
ML
Scikit-learn
Encoding, imputation & ML utilities
Viz
Matplotlib
Visualizations & EDA plots
Viz
Seaborn
Statistical distribution charts
IDE
Jupyter Notebook
Interactive dev & documentation
QA
Pandas Profiling
Automated data quality reporting
KEY RESULTS & KPIs
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
96.4%
Data Retained
482 of 500 rows kept
100%
Completeness
0 missing after cleaning
18
Features Ready
From 7 original columns
3.6%
Outliers Removed
18 corrupt price rows
Key Data Insights
🚗 Ford leads with 72 cars (14.4%) — most represented brand in dataset
⚖️ Near-even split: Manual 252 vs Automatic 230 — balanced for training
⛽ Fuel types almost equal: Diesel 35% · Petrol 33% · Hybrid 33%
📉 Car_Age feature (avg 11.8 yrs) improves model interpretability vs Year
💡 18 outlier rows removed keeps price max at 149,886 (realistic range)
🎯 Recommended models: Random Forest · XGBoost for tabular regression
CHALLENGES & SOLUTIONS
Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan
⚡ Inconsistent Mileage Format
Solution: Built a custom parser to detect and convert '224k' → 224000 and strip comma-formatted numbers like '115,468'
Impact: Entire Mileage column converted to float64 — usable in model
⚡ Extreme Price Outliers
Solution: Applied domain knowledge to set a ceiling. Prices > 2M flagged as data entry errors, not real market prices for this dataset
Impact: Price max dropped from 4.97M to 149,886 (realistic range)
⚡ Missing Year & Mileage
Solution: Used Median imputation (not Mean) because both columns are right-skewed. Median is robust to the outliers that existed
Impact: 44 missing cells resolved — zero data loss from rows
⚡ Categorical Brand Encoding
Solution: Chose One-Hot over Label Encoding for Car_Brand because brands have no natural order — Label would create false hierarchy (BMW=7 > Ford=1)
Impact: 8 clean binary columns — each brand tested independently
SKILLS DEMONSTRATED
✓ Data Cleaning & Validation
✓ Outlier Detection & Removal
✓ Missing Value Imputation
✓ Mileage Format Parsing
✓ Label & One-Hot Encoding
✓ Feature Engineering (Car_Age)
✓ Python · Pandas · NumPy
✓ Scikit-learn Preprocessing
✓ ML Pipeline Design
⭐⭐⭐⭐⭐ Client Rating: 5.0 / 5.0 — Work Quality · Communication · On-Time Delivery
Recommended Model: Random Forest · XGBoost | Target: Price | Features: Car_Age, Mileage, Transmission_Encoded, Fuel_Type_Encoded, Brand_*
Adham Amr Said Morgan · ML Engineer