NUS IS5126 AY 2025-2026 Sem 1, Group 11
Movie Box ROI Prediction
Cham Jin Wei (A0318721H) | He Zean (A0297000N) | Lian Jie Nicholas (A0108553H) | Liang Junyi (A0314698R) | Yang Yue (A0194569J)
Content
1. Problem & Research
Define goals & review literature
2. Data Exploration
Analyze distributions & relationships
3. Clean Data
Handle missing values & outliers
4. Feature Engineering
Transform & create new variables
5. Method Development
Build & compare ML models
6. Result/Business Analysis
Interpret results & business impact
7. Web Demo
Interactive profit prediction tool
Our project's workflow follows the same sequence
Problem & Gap
Current Challenge
Pre-release uncertainty
Prior work: macro-level trends or binary classification
Limited actionable insights
Industry struggles with pre-release forecasting
The Gap
Need film-level ROI regression
Granular insights required
Time-aware analysis
Specific ROI targets needed
Bridge: Film-level time-aware forecasting
ROI Definition & Formula
Return on Investment
ROI = (Revenue - Budget) / Budget
Expressed as a percentage for easier interpretation
ROI Interpretation
Loss
ROI < 0: Investment lost
-20%
Breakeven
ROI = 0: No gain, no loss
0%
Profit
ROI > 0: Investment gained
+50%
ROI enables decision thresholds for investment strategies
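The ROI formula and the three interpretation bands above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names `roi` and `interpret` and the sample budget and revenue figures are hypothetical.

```python
def roi(revenue: float, budget: float) -> float:
    """Return ROI as a fraction: (revenue - budget) / budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (revenue - budget) / budget


def interpret(roi_value: float) -> str:
    """Map an ROI value onto the Loss / Breakeven / Profit bands."""
    if roi_value < 0:
        return "Loss"
    if roi_value == 0:
        return "Breakeven"
    return "Profit"


# Hypothetical (budget, revenue) pairs, chosen to hit each band once.
for budget, revenue in [(100.0, 80.0), (100.0, 100.0), (100.0, 150.0)]:
    value = roi(revenue, budget)
    print(f"{value:+.0%} -> {interpret(value)}")
```

Multiplying the fraction by 100 (or formatting with `%`, as above) gives the percentage form used on the slide.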
Initial data cleaning
Initial Feature Exclusion
Budget, revenue, release dates
Cast, crew, genres
High-Missing Text Handled
Language imputation
Genre classification
Duplicates Removed
Exact duplicates
Data Leakage
profit
revenue
roi
Post-release
popularity
vote_count
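One way to enforce the leakage rules above is to strip the listed columns from every record before it reaches a model, keeping `roi` only as the label. The sketch below uses plain dictionaries; the helper name `split_features_target` and the sample record are hypothetical, and the project's actual pipeline may differ.

```python
# Columns named on the slide: derived from the outcome, or only known after release.
TARGET_DERIVED = {"profit", "revenue", "roi"}
POST_RELEASE = {"popularity", "vote_count"}
BLOCKED = TARGET_DERIVED | POST_RELEASE


def split_features_target(record: dict, target: str = "roi"):
    """Return (features, label) with every leaky column removed from features."""
    label = record[target]
    features = {k: v for k, v in record.items() if k.lower() not in BLOCKED}
    return features, label


# Hypothetical record for illustration only.
film = {
    "budget": 100.0,
    "revenue": 150.0,
    "profit": 50.0,
    "roi": 0.5,
    "popularity": 12.3,
    "vote_count": 840,
    "genres": ["Action", "Adventure"],
}
features, label = split_features_target(film)
print(sorted(features), label)
```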
Distributions & Log Transformation
Right-skewed Distributions
Budgets & revenues show right skew
Heavy tail on right side
Apply Log
Multiplicative → Additive
Log-Transformed Distributions
More symmetric distribution
Stabilizes variance
Benefits of Log Transformation
Makes models more robust
Stabilizes variance
Simplifies interpretation
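A small numeric check makes the effect of the log transform concrete. The budgets below are invented, and `log1p` (log of 1 + x) is an assumption; the slides do not say which log variant was used. The mean-to-median ratio serves as a rough skew indicator.

```python
import math
import statistics

# Hypothetical budgets in USD; one blockbuster creates the heavy right tail.
budgets = [1e5, 5e5, 1e6, 5e6, 2e8]

# log1p stays defined at zero, which is why it is assumed here.
log_budgets = [math.log1p(b) for b in budgets]


def mean_to_median(values):
    """Ratio well above 1 suggests right skew; near 1 suggests symmetry."""
    return statistics.fmean(values) / statistics.median(values)


raw_ratio = mean_to_median(budgets)
log_ratio = mean_to_median(log_budgets)
print(f"raw mean/median: {raw_ratio:.1f}, log mean/median: {log_ratio:.2f}")

# Multiplicative -> additive: log(a * b) equals log(a) + log(b).
a, b = 2.0e6, 3.5
is_additive = math.isclose(math.log(a * b), math.log(a) + math.log(b))
print("log turns products into sums:", is_additive)
```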
Temporal Trend & Leakage
Revenue Trends
Revenues trend upward
Temporal dependence
Data Leakage
Avoid serial correlation
Use time-based splits
Mean & Median Revenue Over Years
Mean
Median
Timeline Split Diagram
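A time-based split can be sketched as follows: every training film is released before every validation film, which in turn precedes every test film. The cut-off years and the `year` field name are hypothetical; the slides do not state the actual boundaries.

```python
def time_split(rows, train_end_year, valid_end_year):
    """Split rows chronologically so no future film leaks into training."""
    train = [r for r in rows if r["year"] <= train_end_year]
    valid = [r for r in rows if train_end_year < r["year"] <= valid_end_year]
    test = [r for r in rows if r["year"] > valid_end_year]
    return train, valid, test


# Hypothetical films, one per year.
rows = [{"title": f"film_{y}", "year": y} for y in range(2010, 2022)]
train, valid, test = time_split(rows, train_end_year=2016, valid_end_year=2018)
print(len(train), len(valid), len(test))
```

Unlike a random shuffle, this keeps the upward revenue trend from giving the model a view of later years during training.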
Feature Correlation
Correlation Analysis
Correlation Pruning
Remove features with |r| ≥ 0.8
Keep higher variance one
Keep more unique one
Information Score = (unique_count, variance)
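The pruning rule above can be sketched in plain Python. One assumption is made loudly here: the information score tuple is compared lexicographically, so `unique_count` decides first and `variance` breaks ties. The feature names and values are invented for illustration.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def info_score(values):
    """Information Score = (unique_count, variance), compared as a tuple."""
    return (len(set(values)), statistics.pvariance(values))


def prune(features, threshold=0.8):
    """Drop the lower-scoring feature of every pair with |r| >= threshold."""
    kept = dict(features)
    names = list(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in kept or b not in kept:
                continue  # one of the pair was already removed
            if abs(pearson(features[a], features[b])) >= threshold:
                loser = min((a, b), key=lambda n: info_score(features[n]))
                del kept[loser]
    return kept


# Hypothetical features: the first two are perfectly correlated.
features = {
    "budget": [1, 2, 3, 4, 5],
    "budget_x2": [2, 4, 6, 8, 10],
    "runtime": [90, 120, 95, 130, 100],
}
kept = prune(features)
print(sorted(kept))
```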
Advanced Feature Engineering
📝 Text Processing - overview
🏢 Production Company
⭐ Cast
📅 Number of movie releases in the same month
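The same-month release count could be derived as below. Two details are assumptions, since the slides do not specify them: the count is per calendar year and month (not month-of-year across all years), and it includes the film itself. The function name and dates are hypothetical.

```python
from collections import Counter


def same_month_release_counts(release_dates):
    """For each 'YYYY-MM-DD' date, count releases sharing its year and month."""
    keys = [d[:7] for d in release_dates]  # 'YYYY-MM'
    counts = Counter(keys)
    return [counts[k] for k in keys]


# Hypothetical release dates.
dates = ["2019-05-03", "2019-05-24", "2019-06-07", "2020-05-01"]
print(same_month_release_counts(dates))
```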
Advanced Feature Engineering
Positive Effect | Some or Negative Effect |
Romance x Comedy | Horror x Comedy |
Fantasy x Adventure | Horror x Sci-Fi |
Sci-Fi x Action | Action x Thriller |
Drama x Comedy | Action x Comedy |
| Action x Adventure |
🔀 Feature Interaction
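A genre interaction feature can be encoded as a binary indicator that is 1 only when both genres are present on a film. This is a sketch under that assumption; the slides do not say how the interactions were encoded, and the function and feature names are hypothetical. The pairs are a subset of those in the table.

```python
# A few genre pairs taken from the interaction table.
PAIRS = [
    ("Romance", "Comedy"),
    ("Fantasy", "Adventure"),
    ("Horror", "Comedy"),
    ("Action", "Adventure"),
]


def genre_pair_features(genres, pairs=PAIRS):
    """Return {'A_x_B': 0 or 1}, where 1 means the film has both genres."""
    present = set(genres)
    return {f"{a}_x_{b}": int(a in present and b in present) for a, b in pairs}


# Hypothetical film tagged with three genres.
print(genre_pair_features(["Action", "Adventure", "Comedy"]))
```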
Basic Machine Learning Models
Random Forest (Tuned)
XGBoost (Tuned)
Transformer
Enriched dataset: LLM-based overview summaries, text length indicators, multilingual signals, and temporal roll-ups.
ROI Decision Making
Quick DEMO
English fuzzy match: confidence 75%
English AI fix + fuzzy match: confidence 100%
English fuzzy match: failed with confidence 63%
English AI fix + fuzzy match: confidence 100%
Chinese fuzzy match: failed
Chinese (even with wrong name) AI fix + fuzzy match: confidence 100%
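The fuzzy-match step in the demo can be approximated with the standard library's `difflib`. This is only a sketch: the demo's actual matching library, its confidence cut-off, and the AI correction step are not described in the slides. The threshold of 70 and the title list are placeholders, and the AI fix is not implemented here.

```python
from difflib import SequenceMatcher


def best_match(query, titles):
    """Return (closest title, confidence in percent) by character similarity."""
    def score(title):
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()

    best = max(titles, key=score)
    return best, round(score(best) * 100)


def lookup(query, titles, threshold=70):
    """Accept the best match only if its confidence reaches the threshold."""
    title, confidence = best_match(query, titles)
    if confidence >= threshold:
        return title, confidence
    return None, confidence  # treated as a failed match


# Hypothetical catalogue of known titles.
titles = ["Inception", "Interstellar"]
print(lookup("Inception", titles))
print(lookup("Inceptoin", titles))
print(lookup("zzzz", titles))
```

A correction step (such as the AI fix shown in the demo) would run on the query before `lookup`, so that misspelled or translated names reach the matcher in a form close to a catalogue title.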
Thank you for your attention!