1 of 12

MACHINE LEARNING · AUTOMOTIVE DATA

Used Car Price

Prediction

ML Data Preparation & Feature Engineering

500

Raw Records

7→18

Features

44→0

Missing Values

Outliers Removed

Adham Amr Morgan · Machine Learning Engineer · Kafiil Platform ⭐ 5.0

2 of 12

PROJECT OVERVIEW

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

Client Brief

Platform:

Kafiil Freelance Platform

Service:

ML Data Preparation

Rating:

⭐ 5.0 / 5.0 (All metrics)

Dataset:

500 rows × 7 columns (CSV)

Domain:

Automotive — Used Cars

Objective

Prepare, clean and engineer a used car dataset (500 records) to train a Machine Learning model capable of accurately predicting used car prices.

Client Requirements

→ Remove incorrect / outlier prices

→ Handle missing Year & Mileage values

→ Standardize Mileage format (224k→224000)

→ Encode text columns (Transmission, Fuel)

→ Prepare ML-ready numeric dataset

3 of 12

DATASET DESCRIPTION

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

Column	Type	Issue / Notes	Sample Value
Car_Brand	Categorical	8 brands — needs One-Hot Encoding	Mercedes, Ford, Toyota…
Model	Categorical	Reference only — not used in ML	C-Class, Camry, Accord
Year	Float	26 missing values → imputed by Median	2016.0
Mileage	String	Mixed format: '224k', '115,468'	224k / 150471
Transmission	Categorical	Manual / Automatic → Label Encoding	Manual, Automatic
Fuel_Type	Categorical	Petrol / Diesel / Hybrid → Encoding	Petrol, Diesel, Hybrid
Price	Integer	8 zeros + 10 extreme outliers > 2M	63898 / 4232513

Dataset: 500 rows · 8 car brands · Jan 2000 – 2024 · Mileage: mixed text+numeric · Price: 0 to 4.97M (outliers present)

4 of 12

WORKFLOW & PIPELINE

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

Load & Inspect

Read CSV

Audit shape,

missing & dtypes

Mileage Clean

Convert '224k'

to 224000

Remove commas

Price Cleaning

Remove 8 zeros

& 10 outliers

>2,000,000

Imputation

Year: 26 nulls

Mileage: 18 nulls

→ Median fill

Encoding

Transmission

Fuel_Type

One-Hot Brands

Feature Eng.

Car_Age =

2024 - Year

New ML feature

🏁 OUTPUT: 482 clean records · 18 features · Revenue-ready for Random Forest / XGBoost

5 of 12

DATA CLEANING PROCESS

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

🔢 Mileage Standardization

Problem: Mixed format strings

'224k' → 224,000

'115,468' → 115468

'196k' → 196,000

Result: All numeric float64

Affected rows: 500 (full column)

💰 Price Outlier Removal

Problem: Corrupt & extreme prices

Removed 8 rows with Price = 0

Removed 10 rows with Price > 2M

Example: Nissan Patrol @ 4.23M

Example: S-Class @ 4.97M

Total removed: 18 rows

🩹 Missing Value Imputation

Year: 26 missing → Median (2012)

Mileage: 18 missing → Median

Strategy: Median (robust to skew)

Zero data loss from imputation

Final missing values: 0 / 0

Data completeness: 100%

6 of 12

FEATURE ENGINEERING & ENCODING

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

New Features (7 → 18 columns)

NEW

Car_Age

= 2024 – Year (stronger predictor)

ENC

Transmission_Encoded

Manual=0 / Automatic=1

ENC

Fuel_Type_Encoded

Petrol=0 / Diesel=1 / Hybrid=2

1-HOT

Brand_BMW

1 if BMW, else 0

1-HOT

Brand_Ford

1 if Ford, else 0

1-HOT

Brand_Honda

1 if Honda, else 0

1-HOT

Brand_Hyundai

1 if Hyundai, else 0

1-HOT

Brand_Mercedes

1 if Mercedes, else 0

1-HOT

Brand_Nissan

1 if Nissan, else 0

1-HOT

Brand_Toyota

1 if Toyota, else 0

Why Car_Age?

Year (e.g. 2012) is less informative for an ML model than

Car_Age = 12 years because the model

learns depreciation curves more naturally from age (a linear, continuous variable) than from year labels.

Encoding Strategy

Label Encoding

Cols: Transmission, Fuel_Type

Why: Ordinal / binary columns

One-Hot Encoding

Cols: Car_Brand (8 brands)

Why: Nominal — avoids ordinal bias

7 of 12

BEFORE vs AFTER COMPARISON

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

BEFORE — Raw Data

AFTER — ML-Ready Data

✗ 500 rows (18 corrupt/outlier rows)

✗ 7 columns only

✗ 44 missing values (Year + Mileage)

✗ Mileage: '224k', '115,468' — text

✗ Price includes 0s and 4.97M values

✗ Text columns: Manual/Automatic

✗ No Car_Age feature

✗ Model not usable directly

✓ 482 rows (clean, validated)

✓ 18 feature columns

✓ 0 missing values

✓ Mileage: all float64 numeric

✓ Price range: EGP 27 – 149,886

✓ Encoded: 0/1/2 numerics

✓ Car_Age derived (stronger predictor)

✓ 8 brand One-Hot columns added

8 of 12

DATA VISUALIZATIONS & INSIGHTS

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

9 of 12

TOOLS & TECHNOLOGIES

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

Language

Python 3.x

Core language for all processing

Library

Pandas

Data loading, cleaning & manipulation

Library

NumPy

Numerical operations & array handling

Scikit-learn

Encoding, imputation & ML utilities

Viz

Matplotlib

Visualizations & EDA plots

Viz

Seaborn

Statistical distribution charts

IDE

Jupyter Notebook

Interactive dev & documentation

Pandas Profiling

Automated data quality reporting

10 of 12

KEY RESULTS & KPIs

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

96.4%

Data Retained

482 of 500 rows kept

100%

Completeness

0 missing after cleaning

Features Ready

From 7 original columns

3.6%

Outliers Removed

18 corrupt price rows

Key Data Insights

🚗 Ford leads with 72 cars (14.4%) — most represented brand in dataset

⚖️ Near-even split: Manual 252 vs Automatic 230 — balanced for training

⛽ Fuel types almost equal: Diesel 35% · Petrol 33% · Hybrid 33%

📉 Car_Age feature (avg 11.8 yrs) improves model interpretability vs Year

💡 18 outlier rows removed keeps price max at 149,886 (realistic range)

🎯 Recommended models: Random Forest · XGBoost for tabular regression

11 of 12

CHALLENGES & SOLUTIONS

Used Car Price Prediction · ML Data Preparation | Adham Amr Said Morgan

⚡ Inconsistent Mileage Format

Solution: Built a custom parser to detect and convert '224k' → 224000 and strip comma-formatted numbers like '115,468'

Impact: Entire Mileage column converted to float64 — usable in model

⚡ Extreme Price Outliers

Solution: Applied domain knowledge to set a ceiling. Prices > 2M flagged as data entry errors, not real market prices for this dataset

Impact: Price max dropped from 4.97M to 149,886 (realistic range)

⚡ Missing Year & Mileage

Solution: Used Median imputation (not Mean) because both columns are right-skewed. Median is robust to the outliers that existed

Impact: 44 missing cells resolved — zero data loss from rows

⚡ Categorical Brand Encoding

Solution: Chose One-Hot over Label Encoding for Car_Brand because brands have no natural order — Label would create false hierarchy (BMW=7 > Ford=1)

Impact: 8 clean binary columns — each brand tested independently

12 of 12

SKILLS DEMONSTRATED

✓ Data Cleaning & Validation

✓ Outlier Detection & Removal

✓ Missing Value Imputation

✓ Mileage Format Parsing

✓ Label & One-Hot Encoding

✓ Feature Engineering (Car_Age)

✓ Python · Pandas · NumPy

✓ Scikit-learn Preprocessing

✓ ML Pipeline Design

⭐⭐⭐⭐⭐ Client Rating: 5.0 / 5.0 — Work Quality · Communication · On-Time Delivery

Recommended Model: Random Forest · XGBoost | Target: Price | Features: Car_Age, Mileage, Transmission_Encoded, Fuel_Type_Encoded, Brand_*

Adham Amr Said Morgan · ML Engineer