1 of 24

Genre Classification of IMDb

Group 14

110522092 林廷翰

110522110 何名曜

109526011 彭彥霖

2 of 24

Outline

  • Introduction
  • Data Analysis
  • Experiments
  • Conclusion


3 of 24

Introduction


4 of 24

Motivation

  • Labeling movie genres by hand is time-consuming.

  • IMDb contains a lot of information about each movie, e.g. release date, description, and genre.

  • We want to use what we learned in this class, together with the IMDb database, to design a model that classifies movie genres.


5 of 24

Data Analysis


6 of 24

Dataset - Genre Classification Dataset IMDb

  • Features used:
    • Title
      • e.g. The Unrecovered
    • Year
      • e.g. 2007
    • Summary
      • e.g. The film's title refers not only to the un-rec…
  • The Year feature has some missing values.
    • We treat Year as a categorical feature and use one-hot encoding.
    • Missing year values are set to “unknown”.
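The one-hot treatment of Year described above can be sketched as follows; the helper name and data layout are illustrative, not from the project code:

```python
# Sketch: one-hot encode Year as a categorical feature,
# mapping missing values to the "unknown" category.
def one_hot_year(years):
    cleaned = [str(y) if y not in (None, "") else "unknown" for y in years]
    vocab = sorted(set(cleaned))                      # category vocabulary
    index = {v: i for i, v in enumerate(vocab)}
    vectors = [[1 if index[c] == i else 0 for i in range(len(vocab))]
               for c in cleaned]                      # one 1 per row
    return vectors, vocab

vectors, vocab = one_hot_year(["2007", None, "2007", "1999"])
# the missing year becomes the "unknown" one-hot column
```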


7 of 24

Experiments


8 of 24

Data information

We use 4 types of inputs:

  • Type 1: Summary + Title + Year
  • Type 2: Summary + Title
  • Type 3: Summary + Year
  • Type 4: Summary


9 of 24

Word Embedding

  • We use 4 types of embedding models:
    • NNLM* 128 (Google News)
    • Word2Vec* 250 (Wikipedia)
    • Word2Vec* 500 (Wikipedia)
    • BERT 128 (Wikipedia and BooksCorpus)
  • Preprocessing of words in summary and title:
    • Split on whitespace
    • Lowercase
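The two preprocessing steps above amount to a one-line tokenizer; this sketch assumes nothing beyond whitespace splitting and lowercasing:

```python
# Sketch of the slide's preprocessing: lowercase, then split on whitespace.
def preprocess(text):
    return text.lower().split()

tokens = preprocess("The Unrecovered")
# tokens == ["the", "unrecovered"]
```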


*NNLM: Neural Network Language Model

*Word2Vec: skip-gram version
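Since the classifiers that follow take fixed-size inputs, the per-word vectors from these embedding models must be pooled into one vector per text. A common choice is mean pooling, sketched here with a toy two-dimensional vocabulary; whether the project pools exactly this way is an assumption:

```python
import numpy as np

# Toy embedding table; real tables (NNLM 128, Word2Vec 250/500, BERT 128)
# map words to much higher-dimensional vectors.
embedding = {
    "the": np.array([0.1, 0.2]),
    "unrecovered": np.array([0.3, 0.0]),
}

def embed_text(tokens, table, dim=2):
    vecs = [table[t] for t in tokens if t in table]   # skip out-of-vocab words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = embed_text(["the", "unrecovered"], embedding)
```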

10 of 24

Scikit-learn Model

  • KNN
    • k = 5, 10, 15, 20, 25
  • Decision Tree
    • max_depth = 5, 10, 15, 20
  • Logistic Regression
    • penalty= ‘l2’, ‘none’
  • SVM
    • default parameters
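The four scikit-learn model families and the hyperparameter grids above can be instantiated as shown; the training data and grid-building style are illustrative, not the project's actual script:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# One classifier per hyperparameter setting listed on the slide.
models = (
    [KNeighborsClassifier(n_neighbors=k) for k in (5, 10, 15, 20, 25)]
    + [DecisionTreeClassifier(max_depth=d) for d in (5, 10, 15, 20)]
    # newer scikit-learn spells the 'none' penalty as None
    + [LogisticRegression(penalty=p) for p in ("l2", None)]
    + [SVC()]  # default parameters
)
```

Each model is then fit on the embedded features and evaluated on the held-out test split.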


11 of 24

DNN Model

  • 7 hidden-unit sizes:
    • 32, 64, 128, 256, 512, 1024, 2048
  • 2 layer settings:
    • 2 layers or 4 layers
  • Activation: LeakyReLU
  • Batch normalization
  • Dropout = 0.4
  • Early stopping
  • Output size = 27 (one unit per genre)
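A minimal NumPy sketch of one hidden block in this architecture (Linear → BatchNorm → LeakyReLU → Dropout), normalizing with the current batch's statistics; the sizes and parameter initialization are illustrative, not the authors' 2048-unit configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_block(x, w, b, gamma, beta, alpha=0.01, drop=0.4, train=False):
    z = x @ w + b                                       # linear layer
    mu, var = z.mean(0), z.var(0)                       # batch statistics
    z = gamma * (z - mu) / np.sqrt(var + 1e-5) + beta   # batch normalization
    z = np.where(z > 0, z, alpha * z)                   # LeakyReLU
    if train:                                           # dropout (train only)
        z = z * (rng.random(z.shape) > drop)
    return z

x = rng.standard_normal((8, 16))                        # batch of 8, 16 features
w = rng.standard_normal((16, 32))                       # hidden size 32
out = hidden_block(x, w, np.zeros(32), np.ones(32), np.zeros(32))
```

Stacking 2 or 4 such blocks and ending with a 27-way output layer gives the model family swept over on this slide.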


12 of 24

Evaluation

We compare the different embedding methods and models using the metrics below:

  • Accuracy
  • Recall (macro)
  • Precision (macro)
  • F1 (macro)
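"Macro" here means every class contributes equally to the average, regardless of how many samples it has. A plain-Python sketch of how the macro scores are computed (not the actual evaluation script):

```python
# Macro-averaged precision, recall, and F1: compute each metric per class,
# then average with equal class weight.
def macro_scores(y_true, y_pred, classes):
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

prec, rec, f1 = macro_scores([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1])
```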


13 of 24

Results

  • Find the best input type.
  • Find the best embedding.
  • Find the best classifier model.


14 of 24

Which Input Type is better? (NNLM)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 2 | 0.40658 | 0.229477
Decision Tree (max_depth = 15) | 3 | 0.180207 | 0.104106
Logistic Regression (penalty = norm) | 1 | 0.387988 | 0.337281
SVM | 1 | 0.410342 | 0.301155
DNN - 2 layers (units = 2048) | 1 | 0.492853 | 0.352759
DNN - 4 layers (units = 1024) | 1 | 0.425655 | 0.345399

15 of 24

Which Input Type is better? (Word2Vec 250)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 2 | 0.374074 | 0.200157
Decision Tree (max_depth = 15) | 1 | 0.575786 | 0.132437
Logistic Regression (penalty = norm) | 1 | 0.310636 | 0.288147
SVM | 1 | 0.261718 | 0.23843
DNN - 2 layers (units = 1024) | 1 | 0.387333 | 0.326913
DNN - 4 layers (units = 1024) | 1 | 0.362031 | 0.323764

16 of 24

Which Input Type is better? (Word2Vec 500)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 4 | 0.36244 | 0.197064
Decision Tree (max_depth = 15) | 3 | 0.568303 | 0.135419
Logistic Regression (penalty = norm) | 1 | 0.339528 | 0.309808
SVM | 1 | 0.279512 | 0.252096
DNN - 2 layers (units = 2048) | 1 | 0.429399 | 0.351215
DNN - 4 layers (units = 1024) | 1 | 0.352299 | 0.318998

17 of 24

Which Input Type is better? (BERT 128)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 1 | 0.359 | 0.173
Decision Tree (max_depth = 15) | 2 | 0.600 | 0.134
Logistic Regression (penalty = norm) | 2 | 0.233 | 0.221
SVM | 1 | 0.193 | 0.180
DNN - 2 layers (units = 2048) | 1 | 0.339 | 0.284
DNN - 4 layers (units = 1024) | 1 | 0.296 | 0.269

18 of 24

Which Embedding is better? (with Input Type 1)


Model name | Embedding | Train F1 | Test F1
KNN (k = 5) | NNLM | 0.373766 | 0.187712
Decision Tree (max_depth = 15) | Word2Vec 250 | 0.575786 | 0.132437
Logistic Regression (penalty = norm) | NNLM | 0.387988 | 0.337281
SVM | NNLM | 0.410342 | 0.301155
DNN - 2 layers (units = 2048) | Word2Vec 500 | 0.429399 | 0.351215
DNN - 4 layers (units = 1024) | NNLM | 0.425655 | 0.345399

19 of 24

Which Classifier Model is better?


Model name | Train accuracy | Test accuracy | Train precision | Test precision | Train recall | Test recall | Train F1
KNN (k = 5) | 0.599476 | 0.417232 | 0.619982 | 0.327112 | 0.318672 | 0.164411 | 0.373766
Decision Tree (max_depth = 15) | 0.726731 | 0.32893 | 0.749377 | 0.125926 | 0.450795 | 0.099525 | 0.535664
Logistic Regression (norm) | 0.594699 | 0.573303 | 0.521879 | 0.434586 | 0.347518 | 0.30451 | 0.387988
SVM | 0.6673 | 0.58845 | 0.741006 | 0.462719 | 0.35852 | 0.268077 | 0.410342
DNN - 2 layers (units = 512) | 0.649419 | 0.588111 | 0.652319 | 0.475497 | 0.431722 | 0.325616 | 0.476395
DNN - 4 layers (units = 512) | 0.640916 | 0.587387 | 0.590735 | 0.460928 | 0.395678 | 0.323581 | 0.428486

20 of 24

Conclusion

  • For input type, using more information (summary, title, and year together) improves overall performance.

  • NNLM performs better than the other embedding methods, possibly because its pre-training dataset is more suitable for this task.

  • Among the classifier models, the DNN outperforms the traditional ML methods, possibly because it has more parameters to learn.


21 of 24

Q & A


22 of 24

Genre distribution

  • Training Data
    • Top 4: Drama, Documentary, Comedy, Short


[Pie chart: genre distribution of the training data]

23 of 24

Genre distribution

  • Testing Data
    • Top 4: Drama, Documentary, Comedy, Short


[Pie chart: genre distribution of the testing data]

24 of 24

NNLM (Neural Network Language Model)


A neural network language model is a language model based on neural networks, exploiting their ability to learn distributed representations of words.