Genre Classification of IMDb
Group 14
110522092 林廷翰
110522110 何名曜
109526011 彭彥霖
Outline
1. Introduction
2. Motivation
3. Data Analysis
4. Dataset - Genre Classification Dataset IMDb
5. Experiments
Data information
We use 4 types of inputs:
Word Embedding
* NNLM: Neural Network Language Model
* Word2Vec: skip-gram version
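The skip-gram variant named above predicts the surrounding context words from each center word. A minimal pure-Python sketch of the (center, context) pair extraction step (function name and window size are illustrative, not the report's setup):

```python
# Minimal sketch of skip-gram training-pair extraction:
# for each center word, emit (center, context) pairs within a window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "a thriller about a detective".split()
pairs = skipgram_pairs(sentence, window=1)
```

A Word2Vec model then trains word vectors so that each center word's vector scores its observed context words highly.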
Scikit-learn Model
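The four scikit-learn classifiers from the results tables can be set up as below; the synthetic dataset stands in for the embedded plot features, and its shape and the hyperparameters not shown on the slides are assumptions:

```python
# Sketch: the four scikit-learn classifiers used in the experiments,
# run on a synthetic stand-in for the embedded movie-plot features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=400, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "KNN(k=5)": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree(max_depth=15)": DecisionTreeClassifier(max_depth=15),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te), average="macro")
```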
DNN Model
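A sketch of the "DNN - 2 layers" classifier using scikit-learn's MLPClassifier; the hidden sizes are scaled down from the slides' 2048 units to keep the example fast, and the optimizer settings are assumptions:

```python
# Scaled-down sketch of the 2-layer DNN classifier.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=300, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)

# Two hidden layers; the real model used 2048 (or 1024) units per layer.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                    random_state=0)
dnn.fit(X, y)
train_f1 = f1_score(y, dnn.predict(X), average="macro")
```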
Evaluation
We will compare different embedding methods and models using metrics below:
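The tables that follow report accuracy, precision, recall, and F1 on both splits. A sketch of computing them with scikit-learn; macro averaging over the genre classes is our assumption:

```python
# Sketch: the four comparison metrics on a toy multi-class example.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]  # one mistake: a class-2 item predicted as 1

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec  = recall_score(y_true, y_pred, average="macro")
f1   = f1_score(y_true, y_pred, average="macro")
```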
Results
Which Input Type is better? (NNLM)

| Model name | Type (inputs) | Train F1 | Test F1 |
|---|---|---|---|
| KNN (k=5) | 2 | 0.40658 | 0.229477 |
| Decision Tree (max_depth=15) | 3 | 0.180207 | 0.104106 |
| Logistic Regression (penalty=norm) | 1 | 0.387988 | 0.337281 |
| SVM | 1 | 0.410342 | 0.301155 |
| DNN - 2 layers (units=2048) | 1 | 0.492853 | 0.352759 |
| DNN - 4 layers (units=1024) | 1 | 0.425655 | 0.345399 |
Which Input Type is better? (Word2Vec 250)

| Model name | Type (inputs) | Train F1 | Test F1 |
|---|---|---|---|
| KNN (k=5) | 2 | 0.374074 | 0.200157 |
| Decision Tree (max_depth=15) | 1 | 0.575786 | 0.132437 |
| Logistic Regression (penalty=norm) | 1 | 0.310636 | 0.288147 |
| SVM | 1 | 0.261718 | 0.23843 |
| DNN - 2 layers (units=1024) | 1 | 0.387333 | 0.326913 |
| DNN - 4 layers (units=1024) | 1 | 0.362031 | 0.323764 |
Which Input Type is better? (Word2Vec 500)

| Model name | Type (inputs) | Train F1 | Test F1 |
|---|---|---|---|
| KNN (k=5) | 4 | 0.36244 | 0.197064 |
| Decision Tree (max_depth=15) | 3 | 0.568303 | 0.135419 |
| Logistic Regression (penalty=norm) | 1 | 0.339528 | 0.309808 |
| SVM | 1 | 0.279512 | 0.252096 |
| DNN - 2 layers (units=2048) | 1 | 0.429399 | 0.351215 |
| DNN - 4 layers (units=1024) | 1 | 0.352299 | 0.318998 |
Which Input Type is better? (BERT 128)

| Model name | Type (inputs) | Train F1 | Test F1 |
|---|---|---|---|
| KNN (k=5) | 1 | 0.359 | 0.173 |
| Decision Tree (max_depth=15) | 2 | 0.600 | 0.134 |
| Logistic Regression (penalty=norm) | 2 | 0.233 | 0.221 |
| SVM | 1 | 0.193 | 0.180 |
| DNN - 2 layers (units=2048) | 1 | 0.339 | 0.284 |
| DNN - 4 layers (units=1024) | 1 | 0.296 | 0.269 |
Which Embedding is better? (with Input Type 1)

| Model name | Embedding | Train F1 | Test F1 |
|---|---|---|---|
| KNN (k=5) | NNLM | 0.373766 | 0.187712 |
| Decision Tree (max_depth=15) | Word2Vec250 | 0.575786 | 0.132437 |
| Logistic Regression (penalty=norm) | NNLM | 0.387988 | 0.337281 |
| SVM | NNLM | 0.410342 | 0.301155 |
| DNN - 2 layers (units=2048) | Word2Vec500 | 0.429399 | 0.351215 |
| DNN - 4 layers (units=1024) | NNLM | 0.425655 | 0.345399 |
Which Classifier Model is better?

| Model name | Train accuracy | Test accuracy | Train precision | Test precision | Train recall | Test recall | Train F1 |
|---|---|---|---|---|---|---|---|
| KNN (k=5) | 0.599476 | 0.417232 | 0.619982 | 0.327112 | 0.318672 | 0.164411 | 0.373766 |
| Decision Tree (max_depth=15) | 0.726731 | 0.32893 | 0.749377 | 0.125926 | 0.450795 | 0.099525 | 0.535664 |
| Logistic Regression (norm) | 0.594699 | 0.573303 | 0.521879 | 0.434586 | 0.347518 | 0.30451 | 0.387988 |
| SVM | 0.6673 | 0.58845 | 0.741006 | 0.462719 | 0.35852 | 0.268077 | 0.410342 |
| DNN - 2 layers (unit=512) | 0.649419 | 0.588111 | 0.652319 | 0.475497 | 0.431722 | 0.325616 | 0.476395 |
| DNN - 4 layers (unit=512) | 0.640916 | 0.587387 | 0.590735 | 0.460928 | 0.395678 | 0.323581 | 0.428486 |
Conclusion
Q & A
Genre distribution
[Chart: distribution of genres - Drama, Documentary, Comedy, Short]
Genre distribution
[Chart: distribution of genres - Comedy, Drama, Short, Documentary]
NNLM (Neural Network Language Model)
A neural network language model is a language model based on neural networks, exploiting their ability to learn distributed representations.
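A minimal numpy sketch of an NNLM forward pass: look up embeddings for a fixed-size word context, concatenate them, pass the result through a tanh hidden layer, and softmax over the vocabulary to predict the next word. All sizes and weight names here are illustrative assumptions:

```python
# Illustrative NNLM forward pass (embedding lookup -> hidden -> softmax).
import numpy as np

rng = np.random.default_rng(0)
V, D, H, CTX = 10, 8, 16, 3        # vocab size, embed dim, hidden, context
C  = rng.normal(size=(V, D))       # embedding table (the learned representations)
W1 = rng.normal(size=(CTX * D, H)) # concatenated context -> hidden
W2 = rng.normal(size=(H, V))       # hidden -> vocabulary logits

def nnlm_probs(context_ids):
    x = C[context_ids].reshape(-1)     # concatenate context embeddings
    h = np.tanh(x @ W1)                # hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

p = nnlm_probs([1, 4, 7])  # probability distribution over the next word
```

Training adjusts C, W1, and W2 jointly, which is how the model ends up with useful distributed word representations.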