1 of 24

Genre Classification of IMDb

Group 14

110522092 林廷翰

110522110 何名曜

109526011 彭彥霖

2 of 24

Outline

  • Introduction
  • Data Analysis
  • Experiments
  • Conclusion


3 of 24

Introduction


4 of 24

Motivation

  • Labeling movie genres by hand is time-consuming.

  • IMDb contains a lot of information about each movie, e.g. release date, description, and genre.

  • We want to use what we learned in this class, together with the IMDb database, to design a model that classifies movie genres.


5 of 24

Data Analysis


6 of 24

Dataset - Genre Classification Dataset IMDb

  • Features used:
    • Title
      • e.g. The Unrecovered
    • Year
      • e.g. 2007
    • Summary
      • e.g. The film's title refers not only to the un-rec…
  • The Year feature has some missing values.
    • We treat Year as a categorical feature and use one-hot encoding.
    • Missing year values are set to “unknown”.
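The one-hot treatment of Year described above can be sketched as follows; the helper name and data layout are illustrative, not from the project code:

```python
# Sketch: one-hot encode Year as a categorical feature,
# mapping missing values to the "unknown" category.
def one_hot_year(years):
    cleaned = [str(y) if y not in (None, "") else "unknown" for y in years]
    vocab = sorted(set(cleaned))                      # category vocabulary
    index = {v: i for i, v in enumerate(vocab)}
    vectors = [[1 if index[c] == i else 0 for i in range(len(vocab))]
               for c in cleaned]                      # one 1 per row
    return vectors, vocab

vectors, vocab = one_hot_year(["2007", None, "2007", "1999"])
# the missing year becomes the "unknown" one-hot column
```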


7 of 24

Experiments


8 of 24

Data information

We use 4 types of inputs:

  • Type 1: Summary + Title + Year
  • Type 2: Summary + Title
  • Type 3: Summary + Year
  • Type 4: Summary


9 of 24

Word Embedding

  • We use 4 types of embedding models:
    • NNLM* 128 (Google News)
    • Word2Vec* 250 (Wikipedia)
    • Word2Vec* 500 (Wikipedia)
    • BERT 128 (Wikipedia and BooksCorpus)
  • Preprocessing of words in summary and title:
    • Split on whitespace
    • Lowercase
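The two preprocessing steps above amount to a one-line tokenizer; this sketch assumes nothing beyond whitespace splitting and lowercasing:

```python
# Sketch of the slide's preprocessing: lowercase, then split on whitespace.
def preprocess(text):
    return text.lower().split()

tokens = preprocess("The Unrecovered")
# tokens == ["the", "unrecovered"]
```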


*NNLM: Neural Network Language Model

*Word2Vec: skip-gram version
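Since the classifiers that follow take fixed-size inputs, the per-word vectors from these embedding models must be pooled into one vector per text. A common choice is mean pooling, sketched here with a toy two-dimensional vocabulary; whether the project pools exactly this way is an assumption:

```python
import numpy as np

# Toy embedding table; real tables (NNLM 128, Word2Vec 250/500, BERT 128)
# map words to much higher-dimensional vectors.
embedding = {
    "the": np.array([0.1, 0.2]),
    "unrecovered": np.array([0.3, 0.0]),
}

def embed_text(tokens, table, dim=2):
    vecs = [table[t] for t in tokens if t in table]   # skip out-of-vocab words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = embed_text(["the", "unrecovered"], embedding)
```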

10 of 24

Scikit-learn Model

  • KNN
    • k = 5, 10, 15, 20, 25
  • Decision Tree
    • max_depth = 5, 10, 15, 20
  • Logistic Regression
    • penalty= ‘l2’, ‘none’
  • SVM
    • default parameters
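The four scikit-learn model families and the hyperparameter grids above can be instantiated as shown; the training data and grid-building style are illustrative, not the project's actual script:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# One classifier per hyperparameter setting listed on the slide.
models = (
    [KNeighborsClassifier(n_neighbors=k) for k in (5, 10, 15, 20, 25)]
    + [DecisionTreeClassifier(max_depth=d) for d in (5, 10, 15, 20)]
    # newer scikit-learn spells the 'none' penalty as None
    + [LogisticRegression(penalty=p) for p in ("l2", None)]
    + [SVC()]  # default parameters
)
```

Each model is then fit on the embedded features and evaluated on the held-out test split.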


11 of 24

DNN Model

  • 7 hidden-unit sizes:
    • 32, 64, 128, 256, 512, 1024, 2048
  • 2 layer settings:
    • 2 layers or 4 layers
  • Activation: LeakyReLU
  • Batch normalization
  • Dropout = 0.4
  • Early stopping
  • Output size = 27 (one unit per genre)
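A minimal NumPy sketch of one hidden block in this architecture (Linear → BatchNorm → LeakyReLU → Dropout), normalizing with the current batch's statistics; the sizes and parameter initialization are illustrative, not the authors' 2048-unit configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_block(x, w, b, gamma, beta, alpha=0.01, drop=0.4, train=False):
    z = x @ w + b                                       # linear layer
    mu, var = z.mean(0), z.var(0)                       # batch statistics
    z = gamma * (z - mu) / np.sqrt(var + 1e-5) + beta   # batch normalization
    z = np.where(z > 0, z, alpha * z)                   # LeakyReLU
    if train:                                           # dropout (train only)
        z = z * (rng.random(z.shape) > drop)
    return z

x = rng.standard_normal((8, 16))                        # batch of 8, 16 features
w = rng.standard_normal((16, 32))                       # hidden size 32
out = hidden_block(x, w, np.zeros(32), np.ones(32), np.zeros(32))
```

Stacking 2 or 4 such blocks and ending with a 27-way output layer gives the model family swept over on this slide.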


12 of 24

Evaluation

We compare the different embedding methods and models using the metrics below:

  • Accuracy
  • Recall (macro)
  • Precision (macro)
  • F1 (macro)
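"Macro" here means every class contributes equally to the average, regardless of how many samples it has. A plain-Python sketch of how the macro scores are computed (not the actual evaluation script):

```python
# Macro-averaged precision, recall, and F1: compute each metric per class,
# then average with equal class weight.
def macro_scores(y_true, y_pred, classes):
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

prec, rec, f1 = macro_scores([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1])
```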


13 of 24

Results

  • Find the best input type.
  • Find the best embedding.
  • Find the best classifier model.


14 of 24

Which Input Type is better? (NNLM)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 2 | 0.40658 | 0.229477
Decision Tree (max_depth = 15) | 3 | 0.180207 | 0.104106
Logistic Regression (penalty = norm) | 1 | 0.387988 | 0.337281
SVM | 1 | 0.410342 | 0.301155
DNN - 2 layers (units = 2048) | 1 | 0.492853 | 0.352759
DNN - 4 layers (units = 1024) | 1 | 0.425655 | 0.345399

15 of 24

Which Input Type is better? (Word2Vec 250)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 2 | 0.374074 | 0.200157
Decision Tree (max_depth = 15) | 1 | 0.575786 | 0.132437
Logistic Regression (penalty = norm) | 1 | 0.310636 | 0.288147
SVM | 1 | 0.261718 | 0.23843
DNN - 2 layers (units = 1024) | 1 | 0.387333 | 0.326913
DNN - 4 layers (units = 1024) | 1 | 0.362031 | 0.323764

16 of 24

Which Input Type is better? (Word2Vec 500)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 4 | 0.36244 | 0.197064
Decision Tree (max_depth = 15) | 3 | 0.568303 | 0.135419
Logistic Regression (penalty = norm) | 1 | 0.339528 | 0.309808
SVM | 1 | 0.279512 | 0.252096
DNN - 2 layers (units = 2048) | 1 | 0.429399 | 0.351215
DNN - 4 layers (units = 1024) | 1 | 0.352299 | 0.318998

17 of 24

Which Input Type is better? (BERT 128)


Model name | Type (inputs) | Train F1 | Test F1
KNN (k = 5) | 1 | 0.359 | 0.173
Decision Tree (max_depth = 15) | 2 | 0.600 | 0.134
Logistic Regression (penalty = norm) | 2 | 0.233 | 0.221
SVM | 1 | 0.193 | 0.180
DNN - 2 layers (units = 2048) | 1 | 0.339 | 0.284
DNN - 4 layers (units = 1024) | 1 | 0.296 | 0.269

18 of 24

Which Embedding is better? (with Input Type 1)


Model name | Embedding | Train F1 | Test F1
KNN (k = 5) | NNLM | 0.373766 | 0.187712
Decision Tree (max_depth = 15) | Word2Vec 250 | 0.575786 | 0.132437
Logistic Regression (penalty = norm) | NNLM | 0.387988 | 0.337281
SVM | NNLM | 0.410342 | 0.301155
DNN - 2 layers (units = 2048) | Word2Vec 500 | 0.429399 | 0.351215
DNN - 4 layers (units = 1024) | NNLM | 0.425655 | 0.345399

19 of 24

Which Classifier Model is better?


Model name | Train accuracy | Test accuracy | Train precision | Test precision | Train recall | Test recall | Train F1
KNN (k = 5) | 0.599476 | 0.417232 | 0.619982 | 0.327112 | 0.318672 | 0.164411 | 0.373766
Decision Tree (max_depth = 15) | 0.726731 | 0.32893 | 0.749377 | 0.125926 | 0.450795 | 0.099525 | 0.535664
Logistic Regression (norm) | 0.594699 | 0.573303 | 0.521879 | 0.434586 | 0.347518 | 0.30451 | 0.387988
SVM | 0.6673 | 0.58845 | 0.741006 | 0.462719 | 0.35852 | 0.268077 | 0.410342
DNN - 2 layers (units = 512) | 0.649419 | 0.588111 | 0.652319 | 0.475497 | 0.431722 | 0.325616 | 0.476395
DNN - 4 layers (units = 512) | 0.640916 | 0.587387 | 0.590735 | 0.460928 | 0.395678 | 0.323581 | 0.428486

20 of 24

Conclusion

  • For input type, using more information (summary, title, and year together) improves overall performance.

  • NNLM performs better than the other embedding methods, possibly because its pre-training dataset is more suitable for this task.

  • Among the classifier models, the DNN outperforms the traditional ML methods, possibly because it has more parameters to learn.


21 of 24

Q & A


22 of 24

Genre distribution

  • Training Data
    • Top 4: Drama, Documentary, Comedy, Short


[Pie chart: genre distribution of the training data]

23 of 24

Genre distribution

  • Testing Data
    • Top 4: Drama, Documentary, Comedy, Short


[Pie chart: genre distribution of the testing data]

24 of 24

NNLM (Neural Network Language Model)


A neural network language model is a language model based on neural networks, exploiting their ability to learn distributed representations of words.