1 of 8

Speech & NLP

(TERM PROJECT)

Anjali Jha (16IM10032)

Shubham Mawa (16IM10033)

Bhargav D (16IM10034)

Prerit Jain (16IM10035)

2 of 8

PROBLEM STATEMENT

AUTOMATIC EXTRACTION OF EVENTS FROM NEWS DOCUMENTS.

( Events depicts the occurrence of any Disaster i.e natural or man-made )

3 of 8

PROBLEM STATEMENT

TASK 1

  • Classification of Documents into predefined event types.
  • The objective is to find whether the event has been discussed in the document.

TASK 2

  • Detecting event trigger for each word vector
  • The objective is to find whether an event is being associated with the word.

4 of 8

APPROACHES

  • DATA EXTRACTION
    • Parsing the XML Document and converting it into txt format.
  • DATA TRANSFORMATION
    • Creation of Dense Document embeddings using fastText.
    • Dimension Reduction.
  • MACHINE LEARNING MODELS
    • Training multiple classifiers for different classes after splitting the dataset into Train and Test Sets.
    • Selection of appropriate classifier for the classification task.
  • MODEL SELECTION & VALIDATION:
    • Based on the different model adequacy parameters, the best model is selected.
    • Hyper Parameter tuning.

DENSE DOCUMENT EMBEDDINGS

5 of 8

APPROACHES

  • DATA EXTRACTION:
    • Parsing the XML Document and converting it into sentences.
    • Creation of Vocabulary.
  • PREPROCESSING STEPS:
    • Creation of Dense Word Embedding Matrix using fastText library for vocabulary.
    • Removing Punctuations in sentences.
  • CREATION OF DATASET
    • Transformation of sentences into words with context. ( Using appropriate window size ).
    • Words are indexed by their position in the embedding matrix.
    • Corresponding event triggers for words are stored parallely.
    • The Event Triggers are numerically encoded.

NEURAL NETWORK ARCHITECTURE

6 of 8

    • The two dimensional representation of each word is fed to a convolution layer followed by max-pooling layer.
    • Parallel Bi-directional Long Short Term Memory(Bi-LSTM) for the same input
    • The output of CNN and Bi-LSTM is concatenated.
    • The representation vector is fed to a fully connected layer.
    • Followed by a Softmax layer to get the proper event type of the current word.
    • The gradients are calculated using back-propagation.
    • Regularization is implemented by dropout.

NEURAL NETWORK ARCHITECTURE

CNN + Bi-LSTM ARCHITECTURE FOR EVENT TRIGGER CLASSIFICATION

7 of 8

RESULTS

WORD

True Positive

False Positive

False

Negative

Event Trigger

203

855

2348

NONE

103921

2285

792

NEURAL NETWORK ARCHITECTURE

DENSE DOCUMENT EMBEDDINGS

MODEL

Training Set Accuracy

Testing Set Accuracy

SVMs

95.31%

95.01%

Logistic Regression

95.16%

95.46%

Decision Tree

95.11%

95.61%

8 of 8

  • Lack of Annotated Data in Hindi.
  • Data Transformation was complex due to the structure of Data given.

( XML Tree )

  • Huge number of Parameters in the Neural Network
  • Limited Vocabulary
  • Multiclass Classification with insufficient examples for each class

CHALLENGES & LIMITATIONS