SEQ2SEQ++: A MULTI-TASKING BASED SEQ2SEQ MODEL TO GENERATE MEANINGFUL AND RELEVANT ANSWERS
PhD Candidate
Kulothunkan Palasundram (GS50783)
Supervisory Committee
Associate Prof. Dr. Nurfadhlina Mohd Sharef
Associate Prof. Dr. Azreen bin Azman
Dr. Khairul Azhar bin Kasmiran
Universiti Putra Malaysia
Faculty of Computer Science & Information Technology
Outline
Introduction
Research Problem, Research Objective & Scope
Literature Review
Research Methodology
Proposed Methods & Experiments
Results and Conclusion
Summary and Contributions
Future works
Publications
Introduction
Chatbot Types and Research Scope
Introduction
Seq2Seq is a transformation of a sequence of words (question) into another sequence of words (answer)
Seq2Seq Learning
Research Problem – Key Observations
Research Objective
Propose an MTL-based Seq2Seq model that can generate meaningful and relevant answers
Main Objective
Research Scope
Question-answering as a single-turn conversation task (a pair of question and answer) under the Multi-task Learning (MTL) framework as defined in Huang & Zhong (2018), which is a key reference for this research.
Literature Review
Literature Review – Issues & Methods
Literature Review – Approaches
Method | Strengths | Weaknesses |
Additional embeddings | Additional encodings reduce encoder overfit | Requires additional data, which may not be available for all datasets and scenarios |
Alternative Loss Functions | Offers alternative loss functions that are not influenced by the frequency of words in the dataset | Reinforcement learning can be unstable and depends on a warm start using a cross-entropy loss function. A loss function used for one dataset may not be suitable for another. Requires custom reward functions to evaluate the model |
Multi-task Learning | Auxiliary tasks can reduce question encoding and answer generation overfit. The auxiliary task can be excluded during model inference | A fixed task loss weight mechanism is inefficient and ineffective |
Attention Mechanism | Provides a mechanism to focus on certain parts of the question during decoding to address language model influence | Existing methods create an imbalance between the language model influence and the question at hand during answer generation |
Beam search | Can be used in conjunction with any other method | Not able to influence model training, thus improvements may not be that significant |
Research Objectives - Refined
Propose an MTL-based Seq2Seq model that can generate meaningful and relevant answers by addressing three issues: language model influence, answer generation overfit, and question encoder overfit.
Main Objective
Sub-objective 1
address
New attention mechanism
Language model influence issue
Sub-objective 2
address
Improved MTL loss calculation method
Answer generation overfit issue
Sub-objective 3
address
Auxiliary tasks for MTL
Question encoder overfit issue
Research Problem Statements Refined
Issue # 1 | Language Model Influence |
What | The tendency to generate the next word based on the previously generated words. Over time, the language model influence gets stronger, and the model may generate irrelevant answers |
Existing method | Global attention mechanism (Bahdanau et al., 2015). Performs computation to determine which part of the question is important. |
Gaps in existing method | The global attention mechanism focuses only on the encoder's hidden states and the decoder's final hidden state at each decoding step. Thus, the influence of the previously generated words gets diluted as decoding progresses. |
Proposal | Comprehensive Attention Mechanism (CAM) – focuses on the encoder hidden states and all the decoder hidden states |
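The idea can be illustrated with a toy dot-product attention sketch in Python (the scoring function and vector dimensions here are hypothetical simplifications, not the thesis's exact formulation): global attention scores only the encoder hidden states against the current decoder state, whereas CAM also includes all previously produced decoder hidden states in the attended set, so earlier generated words keep influencing later decoding steps.

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(states, query):
    # score each hidden state against the query, normalize the scores,
    # and return the weighted sum (context vector) plus the weights
    weights = softmax([dot(s, query) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states))
               for i in range(dim)]
    return context, weights

# Global attention: attend over the encoder hidden states only.
# CAM: attend over the encoder states PLUS all decoder states so far.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
decoder_states = [[0.5, 0.5]]   # decoder hidden states produced so far
query = [0.9, 0.1]              # current decoder hidden state

ctx_global, w_global = attend(encoder_states, query)
ctx_cam, w_cam = attend(encoder_states + decoder_states, query)
```

In both cases the attention weights form a distribution over the attended states; CAM simply widens that set so previously generated words are never excluded from the computation.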
Research Problem Statements Refined
Issue # 2 | Answer generation overfit |
What | Answer generation overfit refers to the situation where the decoder becomes overfit during training due to the high frequency of common words in the training data |
Existing method | Answer generation overfitting can be addressed by adding regularization terms to the cross-entropy loss function to compute a new loss: L_MTL = α_ag·L_ag + α_n·L_n |
Gaps in existing method | Existing MTL algorithms use a small fixed task loss weight (α), such as 0.1 or 0.01, for the auxiliary task when computing the MTL loss for backpropagation. This may not be effective in reducing decoding overfit |
Proposal | Dynamic Task Loss Weight Scheme (DL) for MTL, whereby the task loss weights are computed at each epoch and used for the MTL loss calculation |
Research Problem Statements Refined
Issue # 3 | Question encoding overfit |
What | Question encoding overfit refers to the situation where the encoder becomes overfit during training due to the high frequency of common words in training data |
Existing method | Utilize question encoding to perform additional tasks such as binary classification of answer (MTL-BC - Huang & Zhong, 2018) or first word prediction (MTL-LTS - Zhu et al., 2016) |
Gaps in existing method | Binary classification of answers is not natural, as answers can also be partially correct. First-word prediction was done in a separate network (sequential MTL). Both methods are insufficient to reduce encoder overfit |
Proposal | Multi-functional Encoder (MFE) that performs question encoding, first-word prediction and last-word prediction in parallel, together with a Ternary Classifier (TC) |
Research Methodology - Phases
Phase 1 - Planning
Phase 2 - Literature Review
Phase 3 - Design and Development
Phase 4 - Experiment & Analysis
Research Methodology - Dataset
Dataset Name | NarrativeQA (Kočiský et al., 2017) | SQuAD (Rajpurkar et al., 2016). |
Dataset summary | NarrativeQA is a fiction-based dataset and consists of proper English words. | SQuAD is a crowdsourced dataset based on Wikipedia articles. |
Training (# of question-answer pairs) | 24000 | 24819 |
Validation (# of question-answer pairs) | 4800 | 4800 |
Testing (# of question-answer pairs) | 1000 | 1000 |
Question vocabulary size (# of words) | 17294 | 19569 |
Answer vocabulary size (# of words) | 18830 | 19935 |
Maximum question length | 19 | 17 |
Maximum answer length | 16 | 17 |
Sample question-answer pair | Question: what are sleepy hollow renowned for ? Answer: ghosts and a haunting atmosphere | Question: how are plants different from animals ? Answer: primary cell wall composed of the polysaccharides cellulose |
Research Methodology - Metrics
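The evaluation metrics are BLEU, WER and Distinct-2. The latter two are simple enough to sketch in plain Python; this is a minimal illustration of the metric definitions (word-level Levenshtein distance divided by reference length for WER, distinct-to-total bigram ratio for Distinct-2), not the actual evaluation scripts used in the experiments.

```python
def wer(reference, hypothesis):
    # word error rate: word-level Levenshtein distance / reference length
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / len(r)

def distinct_2(answers):
    # ratio of distinct bigrams to total bigrams across all answers
    bigrams = []
    for answer in answers:
        words = answer.split()
        bigrams.extend(zip(words, words[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0
```

Lower WER is better (0.0 for an exact match), while a higher Distinct-2 indicates more diverse generated answers.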
Research Methodology – Proposed Methods
Issue | Proposed Method | Description | Benchmark Methods | Metrics |
Language model influence | Comprehensive Attention Mechanism (CAM) | An attention mechanism utilized during answer generation | Global attention mechanism (Bahdanau et al., 2015) utilized in STL and MTL-BC (Huang & Zhong, 2018) | BLEU, WER and Distinct-2 |
Answer generation overfit | Dynamic Task Loss Weight Scheme (DL) | An algorithm to dynamically compute the task loss weights during multi-task learning | MTL-BC (Huang & Zhong, 2018), which utilizes a fixed task loss weight scheme | BLEU, WER and Distinct-2 |
Question encoder overfit | Multi-functional Encoder (MFE) | Performs question encoding, first-word prediction and last-word prediction in parallel | MTL-LTS (Zhu et al., 2016), which performs the first-word prediction task and then answer generation sequentially | BLEU, WER and Distinct-2 |
| Ternary Classifier (TC) | Takes the question encoding from MFE and the answer encoding from the Answer Encoder (AE) to perform classification | MTL-BC (Huang & Zhong, 2018), which utilizes a binary classifier | BLEU, WER and Distinct-2 |
Research Methodology – Experiments
Experiment # | Models Involved | Purpose of Experiment |
1 | STL vs STL-CAM | To gauge the performance improvement of using CAM versus the global attention mechanism for attention computation during answer generation in single-task learning |
2 | MTL-BC-CAM versus MTL-BC | To gauge the performance improvement of using CAM versus the global attention mechanism for attention computation during answer generation in multi-task learning (MTL) |
3 | MTL-BC-DL versus MTL-BC | To gauge the performance improvement of using the dynamic task loss weight scheme versus the fixed task loss weight scheme for MTL loss calculation |
4 | MTL-TC versus MTL-BC | To gauge the performance improvement of using ternary classification over binary classification as the auxiliary task for MTL |
5 | MTL-MFE versus MTL-LTS | To gauge the performance of MFE (parallel MTL) over the LTS model (sequential MTL) |
6 | SEQ2SEQ++ versus STL-CAM, MTL-BC-CAM, MTL-BC-DL, MTL-MFE, MTL-TC | To gauge the performance of the final model (SEQ2SEQ++) against all the interim models |
7 | SEQ2SEQ++ versus STL, MTL-BC, MTL-LTS | To gauge the performance of the final model (SEQ2SEQ++) against all the benchmark models |
Proposed Methods: CAM
Comprehensive Attention Mechanism (CAM)
Models that implemented this method are STL-CAM, MTL-BC-CAM and SEQ2SEQ++
Comprehensive Attention Mechanism (CAM) vs Global Attention Mechanism
Proposed Method:
CAM
Existing Method:
Global Attention Mechanism
Proposed Methods: DL
Models that implemented this method are MTL-BC-DL, MTL-MFE and SEQ2SEQ++
Proposed Methods: DL
Algorithm 1: SEQ2SEQ++ Training Algorithm
Input: {Question (X), Answer (Y), Label (L), First Word (FW), Last Word (LW)} quintuplets, Maximum answer length (T), Maximum Epoch (E)
Steps:
Initialize Multi-functional Encoder (MFE), Answer Encoder(AE), Answer Decoder (AD) and Ternary Classifier (TC)
Initialize all losses: L_MTL = L_ag = L_tc = L_fw = L_lw = 0
Initialize all task loss weights: α_ag = α_tc = α_fw = α_lw = 0.25
For epoch 1 to Maximum Epoch (E) Do
1. For batch 1 to Number of Batches Do
1.1. Perform question encoding
1.2. Predict the first word and compute the first-word prediction loss (L_fw)
1.3. Predict the last word and compute the last-word prediction loss (L_lw)
1.4. Perform answer encoding
1.5. Perform ternary classification and compute the ternary classification loss (L_tc)
1.6. Perform answer generation using CAM and compute the answer generation loss (L_ag)
1.7. Compute the multi-task loss: L_MTL = α_ag·L_ag + α_tc·L_tc + α_fw·L_fw + α_lw·L_lw
1.8. Update the model parameters (neural network weights)
1.9. For each task tk ∈ {ag, tc, fw, lw}: update the running average loss (L_tk-avg)
End for Batch Looping
2. Calculate the total average loss for all tasks: L_avg-total = L_ag-avg + L_tc-avg + L_fw-avg + L_lw-avg
3. For each task tk ∈ {ag, tc, fw, lw}
3.1. Calculate the new task loss weight: α_tk = L_tk-avg / L_avg-total
End for Epoch Looping
Output: Trained SEQ2SEQ++ model
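Steps 2 and 3 of the algorithm above (the per-epoch dynamic task loss weight update) can be sketched in plain Python, assuming the per-task average losses have already been accumulated over the epoch's batches; the function and variable names are illustrative, not taken from the thesis code:

```python
def update_task_loss_weights(avg_losses):
    # avg_losses maps each task (ag, tc, fw, lw) to its average loss
    # over the finished epoch. Each new weight alpha_tk is that task's
    # share of the total average loss, so tasks with a higher loss get
    # a larger weight in the next epoch's L_MTL = sum(alpha_tk * L_tk).
    total = sum(avg_losses.values())
    return {task: loss / total for task, loss in avg_losses.items()}

# Example: answer generation (ag) has the largest average loss,
# so it receives the largest weight for the next epoch.
weights = update_task_loss_weights(
    {"ag": 2.0, "tc": 1.0, "fw": 0.5, "lw": 0.5})
```

By construction the weights always sum to 1, unlike a fixed scheme in which the auxiliary weights stay at a small constant such as 0.1 regardless of how the task losses evolve.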
Proposed Methods: MFE
Models that implemented this method are MTL-MFE and SEQ2SEQ++
Proposed Methods: TC
Models that implemented this method are MTL-TC and SEQ2SEQ++
Proposed Methods: TC
Proposed Methods:
Final Model – SEQ2SEQ++
Recap – Experiments Done
Experiment # | Models Involved | Purpose of Experiment |
1 | STL vs STL-CAM | To gauge the performance improvement of using CAM versus the global attention mechanism for attention computation during answer generation in single-task learning |
2 | MTL-BC-CAM versus MTL-BC | To gauge the performance improvement of using CAM versus the global attention mechanism for attention computation during answer generation in multi-task learning (MTL) |
3 | MTL-BC-DL versus MTL-BC | To gauge the performance improvement of using the dynamic task loss weight scheme versus the fixed task loss weight scheme for MTL loss calculation |
4 | MTL-TC versus MTL-BC | To gauge the performance improvement of using ternary classification over binary classification as the auxiliary task for MTL |
5 | MTL-MFE versus MTL-LTS | To gauge the performance of MFE (parallel MTL) over the LTS model (sequential MTL) |
6 | SEQ2SEQ++ versus STL-CAM, MTL-BC-CAM, MTL-BC-DL, MTL-MFE, MTL-TC | To gauge the performance of the final model (SEQ2SEQ++) against all the interim models |
7 | SEQ2SEQ++ versus STL, MTL-BC, MTL-LTS | To gauge the performance of the final model (SEQ2SEQ++) against all the benchmark models |
Experiment Result – STL vs STL-CAM
Dataset | BLEU | WER | Distinct-2 |
NarrativeQA | 7.3% | -4.8% * | 5.4% |
SQuAD | 2.8% | -7.6% * | 11.8% |
STL-CAM over STL - Percentage Improvements
Experiment Result – MTL-BC vs MTL-BC-CAM
MTL-BC-CAM over MTL-BC - Percentage Improvements
Dataset | BLEU | WER | Distinct-2 |
NarrativeQA | 11.9% | 18.7% | -2.3% * |
SQuAD | 1.8% | 9.4% | 0.6% |
CAM vs Global Attention Mechanism conclusion
The CAM-based models performed better than the global attention-based models in their respective experiments
Experiment Result – MTL-BC vs MTL-BC-DL
MTL-BC-DL over MTL-BC Percentage Improvements
Dataset | BLEU | WER | Distinct-2 |
NarrativeQA | 4.7% | 8.9% | -0.6% |
SQuAD | 1.6% | 3.0% | -0.7% |
Dynamic vs Fixed Task Loss Weight Scheme conclusion
Experiment Result – MTL-LTS vs MTL-MFE
MTL-MFE over MTL-LTS Percentage Improvements
Dataset | BLEU | WER | Distinct-2 |
NarrativeQA | 34.2% | 48.7% | 0.4% |
SQuAD | 45.8% | 52.5% | 1.2% |
Parallel task learning (MTL-MFE) vs sequential learning to start (MTL-LTS) conclusion
Experiment Result – MTL-BC vs MTL-TC
MTL-TC over MTL-BC Percentage Improvements
Dataset | BLEU | WER | Distinct-2 |
NarrativeQA | 20.5% | 28.9% | 0.9% |
SQuAD | 2.7% | 6% | -0.01% |
Ternary classifier vs Binary classifier conclusion
Experiment Result – SEQ2SEQ++ vs interim models
SEQ2SEQ++ over Interim Models Percentage Improvements
Benchmark Method | Dataset | BLEU | WER | Distinct-2 |
STL-CAM | NarrativeQA | 42.35% | 64.8% | 0.13% |
| SQuAD | 52.04% | 59.52% | -0.61% |
MTL-BC-CAM | NarrativeQA | 29.03% | 51.81% | 4.61% |
| SQuAD | 15.29% | 30.78% | 0.18% |
MTL-BC-DL | NarrativeQA | 37.90% | 57.02% | 2.93% |
| SQuAD | 15.46% | 35.34% | 1.47% |
MTL-MFE | NarrativeQA | 8.43% | 19.81% | 0.23% |
| SQuAD | 3.02% | 12.28% | 0.37% |
MTL-TC | NarrativeQA | 19.84% | 44.88% | 1.40% |
| SQuAD | 14.19% | 33.27% | 0.74% |
SEQ2SEQ++ vs interim models conclusion
Experiment Result – SEQ2SEQ++ vs benchmark models
SEQ2SEQ++ over Benchmark Models Percentage Improvements
Benchmark Method | Dataset | BLEU | WER | Distinct-2 |
STL | NarrativeQA | 52.71% | 63.04% | 5.44% |
| SQuAD | 56.1% | 65.21% | 10.9% |
MTL-BC | NarrativeQA | 44.42% | 60.82% | 2.85% |
| SQuAD | 17.31% | 37.26% | 0.73% |
MTL-LTS | NarrativeQA | 45.54% | 58.83% | 0.21% |
| SQuAD | 50.17% | 58.3% | 1.53% |
SEQ2SEQ++ vs benchmark models conclusion
Summary
Future Works
Areas with great potential for future work:
Publications
Backup
Proposed Methods: MFE
Proposed Methods: TC