1 of 19

NUS IS5126 AY 2025-2026 Sem 1, Group 11

Movie Box ROI Prediction

Cham Jin Wei (A0318721H)

He Zean (A0297000N)

Lian Jie Nicholas (A0108553H)

Liang Junyi (A0314698R)

Yang Yue (A0194569J)

2 of 19

Content

1. Problem & Research: Define goals & review literature
2. Data Exploration: Analyze distributions & relationships
3. Clean Data: Handle missing values & outliers
4. Feature Engineering: Transform & create new variables
5. Method Development: Build & compare ML models
6. Result/Business Analysis: Interpret results & business impact
7. Web Demo: Interactive profit prediction tool

Our project workflow follows the same sequence

3 of 19

Problem & Gap

Current Challenge

Pre-release uncertainty

Prior work: macro-level forecasts or binary classification

Limited actionable insights

Industry struggles with pre-release forecasting

The Gap

Need film-level ROI regression

Granular insights required

Time-aware analysis

Specific ROI targets needed

Bridge: Film-level time-aware forecasting

4 of 19

ROI Definition & Formula

Return on Investment

ROI = (Revenue - Budget) / Budget

Express as a percentage for easier interpretation

ROI Interpretation

Loss: ROI < 0, investment lost (e.g. -20%)

Breakeven: ROI = 0, no gain, no loss (0%)

Profit: ROI > 0, investment gained (e.g. +50%)

ROI enables decision thresholds for investment strategies
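The ROI formula and the three interpretation bands can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names `roi_percent` and `interpret_roi` and the sample budgets and revenues are assumptions made up for the example.

```python
def roi_percent(revenue: float, budget: float) -> float:
    """Return ROI as a percentage: (revenue - budget) / budget * 100."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (revenue - budget) / budget * 100.0


def interpret_roi(roi: float) -> str:
    """Map an ROI percentage to the loss / breakeven / profit bands."""
    if roi < 0:
        return "loss"
    if roi == 0:
        return "breakeven"
    return "profit"


# Hypothetical films: (budget, revenue) in the same currency unit.
examples = [(100.0, 80.0), (100.0, 100.0), (100.0, 150.0)]
for budget, revenue in examples:
    roi = roi_percent(revenue, budget)
    print(f"budget={budget:.0f} revenue={revenue:.0f} "
          f"ROI={roi:+.0f}% -> {interpret_roi(roi)}")
```

The three sample films reproduce the -20%, 0%, and +50% cases shown on the slide.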

5 of 19

Initial data cleaning

Initial Feature Exclusion

Budget, revenue, release dates

Cast, crew, genres

High-Missing Text Handled

Language imputation

Genre classification

Duplicates Removed

Exact duplicates

Data Leakage

Profit

revenue

roi

Post-release

popularity

vote_count
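A minimal sketch of the two cleaning steps above: removing exact duplicates, then keeping leakage-prone columns out of the feature set. The column names come from the slide (with `profit` lower-cased); the row layout, helper names, and sample values are assumptions for illustration only, and `roi` is kept aside here as the prediction target.

```python
# Columns that reveal the outcome or are only known after release.
LEAKAGE_COLUMNS = {"profit", "revenue", "roi", "popularity", "vote_count"}


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # assumes hashable values
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_features_target(rows, target="roi"):
    """Separate the target and strip leakage columns from the features."""
    features = [
        {k: v for k, v in row.items() if k not in LEAKAGE_COLUMNS}
        for row in rows
    ]
    targets = [row[target] for row in rows]
    return features, targets


# Hypothetical sample rows (values are made up).
raw_rows = [
    {"title": "Film A", "budget": 100, "revenue": 150, "profit": 50,
     "roi": 0.5, "popularity": 12.3, "vote_count": 400},
    {"title": "Film A", "budget": 100, "revenue": 150, "profit": 50,
     "roi": 0.5, "popularity": 12.3, "vote_count": 400},
    {"title": "Film B", "budget": 200, "revenue": 160, "profit": -40,
     "roi": -0.2, "popularity": 5.1, "vote_count": 90},
]

unique_rows = drop_exact_duplicates(raw_rows)
features, targets = split_features_target(unique_rows)
print(features)
print(targets)
```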

6 of 19

Distributions & Log Transformation

Right-skewed Distributions

Budgets & revenues show right skew

Heavy tail on right side

Apply Log

Multiplicative → Additive

Log-Transformed Distributions

More symmetric distribution

Stabilizes variance

Benefits of Log Transformation

Makes models more robust

Stabilizes variance

Simplifies interpretation
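To make the effect concrete, the sketch below compares skewness before and after a log transform, and shows how a multiplicative relation (revenue / budget) becomes an additive one in log space. The budget values are made up, and the choice of `log1p` (which also tolerates zeros) is an assumption rather than a confirmed detail of the project.

```python
import math


def skewness(values):
    """Population skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical budgets with a heavy right tail.
budgets = [1e5, 5e5, 1e6, 2e6, 5e6, 1e7, 2e7, 5e7, 2e8]
log_budgets = [math.log1p(b) for b in budgets]

print(f"raw skewness: {skewness(budgets):.2f}")
print(f"log skewness: {skewness(log_budgets):.2f}")

# Multiplicative -> additive: log(revenue / budget) = log(revenue) - log(budget)
revenue, budget = 150.0, 100.0
log_ratio = math.log(revenue / budget)
log_difference = math.log(revenue) - math.log(budget)
print(math.isclose(log_ratio, log_difference))
```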

7 of 19

Temporal Trend & Leakage

Revenue Trends

Revenues trend upward

Temporal dependence

Data Leakage

Avoid serial correlation

Use time-based splits

[Figure: Mean & Median Revenue Over Years (series: mean, median)]

[Figure: Timeline Split Diagram]
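A time-based split can be sketched as follows: every training film is released before the cutoff, and every test film on or after it, so the model never sees the future. The field name `release_year`, the cutoff, and the sample films are assumptions for illustration.

```python
def time_based_split(rows, cutoff_year):
    """Train on films released before cutoff_year, test on the rest."""
    train = [r for r in rows if r["release_year"] < cutoff_year]
    test = [r for r in rows if r["release_year"] >= cutoff_year]
    return train, test


# Hypothetical films (values are made up).
films = [
    {"title": "Film A", "release_year": 2009},
    {"title": "Film B", "release_year": 2012},
    {"title": "Film C", "release_year": 2015},
    {"title": "Film D", "release_year": 2018},
]

train_rows, test_rows = time_based_split(films, cutoff_year=2015)
print([r["title"] for r in train_rows])
print([r["title"] for r in test_rows])
```

A random shuffle would instead mix later films into the training set, which is the leakage this slide warns about.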

8 of 19

Feature Correlation

Correlation Analysis

Correlation Pruning

For each feature pair with |r| ≥ 0.8, remove one feature

Keep the feature with higher variance

Keep the feature with more unique values

Information Score = (unique_count, variance)
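One possible reading of this pruning rule is sketched below. It assumes the information score is compared as a tuple, so unique count is checked first and variance breaks ties; that ordering, the helper names, and the sample columns are assumptions, not confirmed details of the project.

```python
import math


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either column is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def info_score(values):
    """Information Score = (unique_count, variance), compared as a tuple."""
    return (len(set(values)), variance(values))


def prune_correlated(columns, threshold=0.8):
    """From each pair with |r| >= threshold, drop the lower-scoring feature."""
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) >= threshold:
                a_loses = info_score(columns[a]) < info_score(columns[b])
                dropped.add(a if a_loses else b)
    return [n for n in names if n not in dropped]


# Hypothetical columns: budget_bucket is a coarse copy of budget.
columns = {
    "budget": [1, 2, 3, 4, 5, 6],
    "budget_bucket": [1, 1, 2, 2, 3, 3],
    "runtime": [10, 2, 7, 3, 9, 4],
}
kept = prune_correlated(columns)
print(kept)
```

Here `budget_bucket` is removed because it correlates strongly with `budget` and has fewer unique values.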

9 of 19

Advanced Feature Engineering

📝 Text Processing - overview

  • TF-IDF vectorisation
  • GPT summarisation
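For readers unfamiliar with TF-IDF, here is a minimal textbook version (term frequency times log of inverse document frequency) applied to overview-like strings. It is a sketch only: the whitespace tokenizer and sample overviews are assumptions, and library implementations typically add smoothing and normalisation that this version omits.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical overview texts.
overviews = [
    "a heist thriller in paris",
    "a romantic comedy in paris",
    "a space heist",
]
vectors = tfidf(overviews)
for text, vec in zip(overviews, vectors):
    top_term = max(vec, key=vec.get)
    print(f"{text!r}: top term = {top_term}")
```

Terms that occur in every overview (here "a") get weight zero, while rarer terms such as "thriller" are weighted up.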

🏢 Production Company

Cast

  • Lead Actor and Actress
    • Star power? Based on number of past movies
  • Top 4 billed Cast Members
    • Star power? Based on average number of past movies
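The star-power idea can be sketched as a running count of each person's earlier films. This sketch makes several assumptions: the first-billed cast member stands in for the lead, films are processed in release order so only past films are counted, and the field names and sample data are hypothetical.

```python
from collections import defaultdict


def add_star_power(films):
    """Attach lead and top-4 star power based on earlier films only."""
    past_counts = defaultdict(int)
    result = []
    # ISO dates (YYYY-MM-DD) sort correctly as strings.
    for film in sorted(films, key=lambda f: f["release_date"]):
        cast = film["cast"]  # assumed to be in billing order
        top4 = cast[:4]
        lead_power = past_counts[cast[0]] if cast else 0
        top4_power = (
            sum(past_counts[name] for name in top4) / len(top4) if top4 else 0.0
        )
        result.append({
            **film,
            "lead_star_power": lead_power,
            "top4_star_power": top4_power,
        })
        # Update counts only after the features are computed.
        for name in cast:
            past_counts[name] += 1
    return result


# Hypothetical films and cast names.
films = [
    {"title": "Film A", "release_date": "2001-05-01", "cast": ["x", "y"]},
    {"title": "Film B", "release_date": "2003-06-01", "cast": ["x", "z"]},
    {"title": "Film C", "release_date": "2005-07-01", "cast": ["x", "y", "z", "w"]},
]
scored = add_star_power(films)
for film in scored:
    print(film["title"], film["lead_star_power"], film["top4_star_power"])
```

Updating the counts after computing the features keeps a film from contributing to its own star power.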

📅 Number of movie releases in the same month
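A same-month release count can be sketched with a simple counter keyed on year and month. The field names and dates are hypothetical, and this version counts the film itself among the releases, which is an assumption.

```python
from collections import Counter


def month_key(film):
    """Year-month prefix of an ISO date, e.g. '2010-07'."""
    return film["release_date"][:7]


def add_same_month_releases(films):
    """Attach the number of films released in the same calendar month."""
    counts = Counter(month_key(f) for f in films)
    return [{**f, "same_month_releases": counts[month_key(f)]} for f in films]


# Hypothetical release dates.
releases = [
    {"title": "Film A", "release_date": "2010-07-16"},
    {"title": "Film B", "release_date": "2010-07-02"},
    {"title": "Film C", "release_date": "2010-12-17"},
]
with_counts = add_same_month_releases(releases)
for film in with_counts:
    print(film["title"], film["same_month_releases"])
```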

10 of 19

Advanced Feature Engineering

Positive Effect

Some or Negative Effect

Romance x Comedy

Horror x Comedy

Fantasy x Adventure

Horror x Sci-Fi

Sci-Fi x Action

Action x Thriller

Drama x Comedy

Action x Comedy

Action x Adventure

🔀 Feature Interaction

  • Genre x Release Period
  • Release Period x Star Power
  • Genre with Lead Star vs Genre with Star Ensemble
  • Genre x Genre
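As an illustration of the Genre x Genre interaction, the sketch below turns a film's genre list into binary pair indicators. The helper name and the alphabetical ordering within each pair are assumptions, so "Romance x Comedy" on the slide appears here as "Comedy x Romance".

```python
from itertools import combinations


def genre_pair_features(genres):
    """Return {'GenreA x GenreB': 1} for every unordered genre pair."""
    unique_sorted = sorted(set(genres))
    return {f"{a} x {b}": 1 for a, b in combinations(unique_sorted, 2)}


# Hypothetical genre lists.
print(genre_pair_features(["Romance", "Comedy"]))
print(genre_pair_features(["Action", "Adventure", "Comedy"]))
```

Sorting gives each pair a single canonical name regardless of the order in which genres are listed.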

11 of 19

Basic Machine Learning Models

  • Ridge Regression
  • Random Forest
  • XGBoost

12 of 19

Basic Machine Learning Models

Random Forest (Tuned)

XGBoost (Tuned)

13 of 19

Transformer

Enriched dataset: LLM-based overview summaries, text length indicators, multilingual signals, and temporal roll-ups.

14 of 19

ROI Decision Making

15 of 19

Quick Demo

16 of 19

English fuzzy match: confidence 75%

English AI fix + fuzzy match: confidence 100%

17 of 19

English fuzzy match: failed with confidence 63%

English AI fix + fuzzy match: confidence 100%

18 of 19

Chinese fuzzy match: failed

Chinese AI fix + fuzzy match (even with a wrong name): confidence 100%
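The fuzzy-match step of the demo could look roughly like the sketch below, built on the standard library's `difflib.SequenceMatcher`. The 0.7 acceptance threshold, the function name, and the title list are assumptions; the demo's actual matcher and its AI correction step are not specified in the slides and are not reproduced here.

```python
from difflib import SequenceMatcher


def fuzzy_match(query, titles, threshold=0.7):
    """Return (best_title, score), or (None, score) if below threshold."""
    best_title, best_score = None, 0.0
    for title in titles:
        score = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        if score > best_score:
            best_title, best_score = title, score
    if best_score >= threshold:
        return best_title, best_score
    return None, best_score


# Hypothetical catalogue of known titles.
catalogue = ["Inception", "Interstellar", "The Dark Knight"]

good_match = fuzzy_match("Incepton", catalogue)
no_match = fuzzy_match("zzzz", catalogue)
print(good_match)
print(no_match)
```

A misspelled query still resolves to the right title, while an unrelated string falls below the threshold and is reported as a failed match.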

19 of 19

Thank you for your attention!