1 of 53

SEQ2SEQ++: A MULTI-TASKING BASED SEQ2SEQ MODEL TO GENERATE MEANINGFUL AND RELEVANT ANSWERS

PhD Candidate

Kulothunkan Palasundram (GS50783)

Supervisory Committee

Associate Prof. Dr. Nurfadhlina Mohd Sharef

Associate Prof. Dr. Azreen bin Azman

Dr. Khairul Azhar bin Kasmiran

Universiti Putra Malaysia

Faculty of Computer Science & Information Technology

2 of 53

Outline

Introduction

Research Problem, Research Objective & Scope

Literature Review

Research Methodology

Proposed Methods & Experiments

Results and Conclusion

Summary and Contributions

Future Works

Publications

3 of 53

Introduction

Chatbot Types and Research Scope

4 of 53

Introduction

Seq2Seq is the transformation of a sequence of words (a question) into another sequence of words (an answer)

Seq2Seq Learning

5 of 53

Introduction

Seq2Seq Learning

6 of 53

Research Problem – Key Observations

  • Seq2Seq – a popular method for natural answer generation
  • However, the generated answer may not be relevant to the question
  • Result – conversations with chatbots can be meaningless and abruptly terminated by users, eventually lowering the chatbot adoption rate

7 of 53

Research Objective

Propose an MTL-based Seq2Seq model that can generate meaningful and relevant answers

Main Objective

8 of 53

Research Scope

Question answering as a single-turn conversation task (a pair of question and answer) under the Multi-task Learning (MTL) framework as defined in (Huang & Zhong, 2018), which is a key reference for this research.

9 of 53

Literature Review

  • 26 articles reviewed
  • 3 key issues identified
  • 5 methods/approaches found

10 of 53

Literature Review – Issues & Methods

11 of 53

Literature Review – Approaches

Method: Additional embeddings
  Strengths: additional encodings reduce encoder overfit
  Weaknesses: needs additional data, which may not be available for all datasets and scenarios

Method: Alternative loss functions
  Strengths: offers alternative loss functions that are not influenced by the frequency of words in the dataset
  Weaknesses: reinforcement learning can be unstable and dependent on a warm start using cross-entropy loss; a loss function used for one dataset may not be suitable for another; requires custom reward functions to evaluate the model

Method: Multi-task learning
  Strengths: auxiliary tasks can reduce question encoding and answer generation overfit; the auxiliary task can be excluded during model inference
  Weaknesses: the fixed task loss weight mechanism is very inefficient and ineffective

Method: Attention mechanism
  Strengths: provides a mechanism to focus on certain parts of the question during decoding to address language model influence
  Weaknesses: existing methods create an imbalance between the language model influence and the question at hand during answer generation

Method: Beam search
  Strengths: can be used in conjunction with any other method
  Weaknesses: not able to influence model training, thus improvements may not be significant

12 of 53

Research Objectives - Refined

Propose an MTL-based Seq2Seq model that can generate meaningful and relevant answers by addressing three issues: language model influence, answer generation overfit, and question encoder overfit.

Main Objective

Sub-objective 1: a new attention mechanism to address the language model influence issue

Sub-objective 2: an improved MTL loss calculation method to address the answer generation overfit issue

Sub-objective 3: auxiliary tasks for MTL to address the question encoding overfit issue

13 of 53

Research Problem Statements Refined

Issue # 1

Language Model Influence

What

The decoder's ability to generate the next word based on the previously generated words. Over time, the language model influence gets stronger, and the model may generate irrelevant answers.

Existing method

Global attention mechanism (Bahdanau et al., 2015), which performs computation to determine which part of the question is important.

Gaps in existing method

The global attention mechanism focuses only on the encoder's hidden states and the decoder's final hidden state at each decoding step. Thus, the influence of the previously generated words gets diluted as decoding progresses.

Proposal

Comprehensive Attention Mechanism (CAM) – focuses on the encoder hidden states and all the decoder's hidden states

14 of 53

Research Problem Statements Refined

Issue # 2

Answer generation overfit

What

Answer generation overfit refers to the situation where the decoder becomes overfit during training due to the high frequency of common words in the training data

Existing method

Answer generation overfitting can be addressed by adding regularization terms to the cross-entropy loss function to compute a new loss: LMTL = αag·Lag + αn·Ln

Gaps in existing method

Existing MTL algorithms use a small fixed task loss weight (α) (such as 0.1 or 0.01) for the auxiliary task when computing the MTL loss for backpropagation. This may not be effective in reducing decoding overfit

Proposal

Dynamic Task Loss Weight Scheme (DL) for MTL, whereby each task's loss weight is computed at each epoch and used for the MTL loss calculation
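The gap can be made concrete with a minimal sketch of the existing fixed-weight scheme (the α values here are illustrative defaults, not values from a specific benchmark):

```python
def mtl_loss_fixed(loss_ag, loss_aux, alpha_ag=1.0, alpha_aux=0.1):
    """Fixed-weight MTL loss as in existing methods:
    L_MTL = alpha_ag * L_ag + alpha_aux * L_aux.
    The alphas are constants chosen up front and never change,
    regardless of which task is lagging during training."""
    return alpha_ag * loss_ag + alpha_aux * loss_aux
```

Because the weights stay constant, a lagging auxiliary task keeps its small influence for the whole run; the proposed DL scheme replaces these constants with per-epoch values.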

15 of 53

Research Problem Statements Refined

Issue # 3

Question encoding overfit

What

Question encoding overfit refers to the situation where the encoder becomes overfit during training due to the high frequency of common words in training data

Existing method

Utilize question encoding to perform additional tasks such as binary classification of answer (MTL-BC - Huang & Zhong, 2018) or first word prediction (MTL-LTS - Zhu et al., 2016)

Gaps in existing method

Binary classification of answers is not natural, as answers can also be partially correct. First-word prediction was done in a separate network (sequential MTL). Both methods are insufficient to reduce encoder overfit

Proposal

  • Ternary Classification (TC)
  • Multifunctional Encoder (MFE)

16 of 53

Research Methodology - Phases

Phase 1 - Planning
  • Initial literature review
  • Research scope
  • Research objectives

Phase 2 - Literature Review
  • Seq2Seq issues
  • Existing approaches, strengths, limitations and gaps
  • Evaluation metrics

Phase 3 - Design and Development
  • Benchmark models (MTL-BC, STL, MTL-LTS)
  • Proposed methods (CAM, DL, TC, MFE)
  • Final model (SEQ2SEQ++)

Phase 4 - Experiment & Analysis
  • Data for training and evaluation
  • Experiments
  • Analysis
  • Thesis

17 of 53

Research Methodology - Dataset

NarrativeQA (Kočiský et al., 2017)
  Summary: a fiction-based dataset consisting of proper English words
  Training: 24,000 question-answer pairs | Validation: 4,800 | Testing: 1,000
  Question vocabulary: 17,294 words | Answer vocabulary: 18,830 words
  Maximum question length: 19 | Maximum answer length: 16
  Sample – Question: what are sleepy hollow renowned for ? Answer: ghosts and a haunting atmosphere

SQuAD (Rajpurkar et al., 2016)
  Summary: a crowdsourced dataset based on Wikipedia articles
  Training: 24,819 question-answer pairs | Validation: 4,800 | Testing: 1,000
  Question vocabulary: 19,569 words | Answer vocabulary: 19,935 words
  Maximum question length: 17 | Maximum answer length: 17
  Sample – Question: how are plants different from animals ? Answer: primary cell wall composed of the polysaccharides cellulose

18 of 53

Research Methodology - Metrics

  • Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)
      • measures answer correctness
      • gives a score from 0 to 1
      • a higher BLEU score indicates answers that are more relevant to the question

  • Word Error Rate (WER) (Mikolov et al., 2010)
      • measures model error
      • gives a score from 0 to 1
      • a lower score means fewer errors in the generated answers

  • Distinct-2 (Li, Galley, Brockett, Gao, et al., 2016)
      • measures the diversity of generated answers
      • gives a score from 0 to 1
      • a higher score means more diverse answers
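Of the three metrics, Distinct-2 is simple enough to compute directly; a minimal sketch (whitespace tokenization is an assumption here, not necessarily the tokenization used in the thesis):

```python
def distinct_2(answers):
    """Distinct-2: number of unique bigrams divided by the total number
    of bigrams across all generated answers."""
    bigrams = []
    for ans in answers:
        toks = ans.split()
        bigrams.extend(zip(toks, toks[1:]))  # consecutive word pairs
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0
```

Repeated bigrams across answers lower the score, so a model that keeps emitting the same high-frequency phrases is penalized.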

19 of 53

Research Methodology – Proposed Methods

Issue: Language model influence
  Proposed method: Comprehensive Attention Mechanism (CAM) – an attention mechanism utilized during answer generation
  Benchmark: global attention mechanism (Bahdanau et al., 2015), as utilized in STL (Bahdanau et al., 2015) and MTL-BC (Huang & Zhong, 2018)

Issue: Answer generation overfit
  Proposed method: Dynamic Tasks Loss Weight Scheme (DL) – an algorithm to dynamically compute the task loss weights during multi-task learning
  Benchmark: MTL-BC (Huang & Zhong, 2018), which utilizes a fixed tasks loss weight scheme

Issue: Question encoder overfit
  Proposed method: Multi-functional Encoder (MFE) – performs question encoding, first-word prediction and last-word prediction in a parallel approach
  Benchmark: MTL-LTS (Zhu et al., 2016), which performs the first-word prediction task and then answer generation in a sequential approach

  Proposed method: Ternary Classifier (TC) – takes in the question encoding from MFE and the answer encoding from AE to perform classification
  Benchmark: MTL-BC (Huang & Zhong, 2018), which utilizes a binary classifier

Metrics for all experiments: BLEU, WER and Distinct-2

20 of 53

Research Methodology – Experiments

Experiment 1: STL vs STL-CAM
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in single-task learning

Experiment 2: MTL-BC-CAM vs MTL-BC
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in multi-task learning (MTL)

Experiment 3: MTL-BC-DL vs MTL-BC
Purpose: to gauge the performance improvement of the dynamic tasks loss weight scheme versus the fixed tasks loss weight scheme for MTL loss calculation

Experiment 4: MTL-TC vs MTL-BC
Purpose: to gauge the performance improvement of ternary classification over binary classification as the auxiliary task for MTL

Experiment 5: MTL-MFE vs MTL-LTS
Purpose: to gauge the performance of MFE (parallel MTL) over the LTS model (sequential MTL)

Experiment 6: SEQ2SEQ++ vs STL-CAM, MTL-BC-CAM, MTL-BC-DL, MTL-MFE, MTL-TC
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the interim models

Experiment 7: SEQ2SEQ++ vs STL, MTL-BC, MTL-LTS
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the benchmark models

21 of 53

Proposed Methods: CAM

  • Comprehensive Attention Mechanism – an attention mechanism utilized during answer generation
  • In this mechanism:
    • The attention weights are computed based on all the encoder's hidden states and the sum of all the decoder's previous hidden states
    • These attention weights are then used to compute the context vector
    • The decoder then utilizes the context vector to generate the answer

  • This ensures all the hidden states are continuously considered for the next-word prediction
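The core difference from global attention can be sketched numerically. This is a hedged illustration, not the thesis implementation: the additive scoring form and the parameter matrices Wk, Wq and vector v are assumptions for the sketch; what matters is that the query summarizes all decoder hidden states so far rather than only the latest one.

```python
import numpy as np

def cam_context(enc_states, dec_states, Wk, Wq, v):
    """CAM sketch: attention over encoder hidden states where the query
    is the SUM of all decoder hidden states so far (global attention
    would use only the latest decoder state as the query)."""
    query = dec_states.sum(axis=0)                       # (d,) all past steps
    scores = np.tanh(enc_states @ Wk + query @ Wq) @ v   # (T_enc,) additive scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # softmax over encoder steps
    return weights @ enc_states                          # context vector (d,)
```

Because every past decoder state contributes to the query, earlier generated words keep influencing the attention weights instead of being diluted as decoding progresses.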

22 of 53

Comprehensive Attention Mechanism (CAM)

 

Models that implemented this method are STL-CAM, MTL-BC-CAM and SEQ2SEQ++

23 of 53

Comprehensive Attention Mechanism (CAM) vs Global Attention Mechanism

Proposed Method:

CAM

Existing Method:

Global Attention Mechanism

24 of 53

Proposed Methods: DL

  • Dynamic Tasks Loss Weight Scheme – an algorithm to dynamically compute the task loss weights during multi-task learning

  • In this algorithm:
    • The task loss weight (α) for each task is updated at each epoch
    • Each task's new loss weight is proportional to that task's share of the total model loss
    • The new weights are then used in the next epoch to compute the MTL loss

  • This means the influence of each task in each epoch is dynamically determined by the task losses of the previous epoch

Models that implemented this method are MTL-BC-DL, MTL-MFE and SEQ2SEQ++

25 of 53

Proposed Methods: DL

Algorithm 1: SEQ2SEQ++ Training Algorithm

Input: {Question (X), Answer (Y), Label (L), First Word (FW), Last Word (LW)} quintuplets, Maximum answer length (T), Maximum Epoch (E)

Steps:

Initialize Multi-functional Encoder (MFE), Answer Encoder (AE), Answer Decoder (AD) and Ternary Classifier (TC)

Initialize all losses: LMTL = Lag = Ltc = Lfw = Llw = 0

Initialize each task loss weight: αag = αtc = αfw = αlw = 0.25

For epoch 1 to Number of Epochs

1. For batch 1 to Number of Batches Do

1.1. Perform question encoding

1.2. Predict First Word and compute the first-word prediction losses (Lfw)

1.3. Predict Last Word and compute the last-word prediction losses (Llw)

1.4. Perform answer encoding

1.5. Perform ternary classification and compute ternary classification loss (Ltc)

1.6. Perform answer generation using CAM and compute answer generation loss (Lag)

1.7. Compute multi-task loss: LMTL = αag*Lag + αtc*Ltc + αfw*Lfw + αlw*Llw

1.8. Update the model parameters (neural network weights)

1.9. For each task, tk ∈ {ag, tc, fw, lw}: calculate the average loss (Ltk-avg)

End for Batch Looping

2. Calculate the total average loss for all tasks: Lavg-total = Lag-avg + Ltc-avg + Lfw-avg + Llw-avg

3. For each task, tk ∈ {ag, tc, fw, lw}

3.1. Calculate the new task loss weight: αtk = Ltk-avg / Lavg-total

End for Epoch Looping

Output: Trained SEQ2SEQ++ model
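Steps 2-3 of Algorithm 1 (the dynamic weight update) can be rendered as a minimal sketch; this is a plain restatement of the rule above, not the exact thesis code:

```python
def update_task_loss_weights(avg_losses):
    """Steps 2-3 of Algorithm 1: each task's new weight is its average
    epoch loss divided by the total average loss over all tasks, so a
    lagging task (higher loss) receives a larger weight in the next
    epoch's MTL loss computation."""
    total = sum(avg_losses.values())  # step 2: total average loss
    return {task: loss / total for task, loss in avg_losses.items()}  # step 3.1
```

The returned weights always sum to 1, and the ordering of weights matches the ordering of the task losses, which is what lets the model increase learning for whichever task is lagging.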

26 of 53

Proposed Methods: MFE

  • MFE – Multi-Functional Encoder – performs question encoding, first-word prediction and last-word prediction based on the question encoder hidden states
  • MFE learns to produce hidden states that keep the losses of all 3 tasks minimal

  • By sharing the question encoding among multiple tasks (answer generation, first-word prediction and last-word prediction), question encoding overfit can be reduced

Models that implemented this method are MTL-MFE and SEQ2SEQ++
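A schematic of the parallel heads is sketched below. The mean-pooled encoding and the linear prediction heads (W_fw, W_lw) are illustrative assumptions, not the thesis architecture; the point is that one shared encoding feeds all tasks at once:

```python
import numpy as np

def mfe_forward(enc_states, W_fw, W_lw):
    """MFE sketch: a single shared question encoding feeds three tasks
    in parallel (answer generation, first-word prediction, last-word
    prediction), so the encoder's representation cannot overfit to any
    single objective."""
    question_enc = enc_states.mean(axis=0)   # shared encoding (d,), pooling assumed
    fw_logits = question_enc @ W_fw          # first-word prediction head
    lw_logits = question_enc @ W_lw          # last-word prediction head
    return question_enc, fw_logits, lw_logits
```

Contrast with the sequential MTL-LTS setup, where first-word prediction runs in a separate network before answer generation.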

27 of 53

Proposed Methods: TC

  • TC – Ternary Classifier – performs ternary classification of an answer based on the question and answer encodings
  • Classifies answers as correct, partially correct or wrong

  • By sharing the question encoding among multiple tasks (answer generation and answer classification), question encoding overfit can be reduced

Models that implemented this method are MTL-TC and SEQ2SEQ++
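A minimal sketch of the classifier head (the single linear layer over concatenated encodings is an assumption for illustration; the thesis model may use a deeper head):

```python
import numpy as np

def ternary_classify(q_enc, a_enc, W, b):
    """TC sketch: concatenate the question encoding (from MFE) and the
    answer encoding (from AE), score the three classes, and return the
    most likely label."""
    logits = np.concatenate([q_enc, a_enc]) @ W + b   # (3,) class scores
    labels = ["correct", "partially-correct", "wrong"]
    return labels[int(np.argmax(logits))]
```

The middle class is what distinguishes TC from the binary classifier in MTL-BC: partially correct answers get their own target instead of being forced into correct/wrong.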

28 of 53

Proposed Methods: TC

 

29 of 53

Proposed Methods: Final Model – SEQ2SEQ++

30 of 53

Recap – Experiments Done

Experiment 1: STL vs STL-CAM
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in single-task learning

Experiment 2: MTL-BC-CAM vs MTL-BC
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in multi-task learning (MTL)

Experiment 3: MTL-BC-DL vs MTL-BC
Purpose: to gauge the performance improvement of the dynamic tasks loss weight scheme versus the fixed tasks loss weight scheme for MTL loss calculation

Experiment 4: MTL-TC vs MTL-BC
Purpose: to gauge the performance improvement of ternary classification over binary classification as the auxiliary task for MTL

Experiment 5: MTL-MFE vs MTL-LTS
Purpose: to gauge the performance of MFE (parallel MTL) over the LTS model (sequential MTL)

Experiment 6: SEQ2SEQ++ vs STL-CAM, MTL-BC-CAM, MTL-BC-DL, MTL-MFE, MTL-TC
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the interim models

Experiment 7: SEQ2SEQ++ vs STL, MTL-BC, MTL-LTS
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the benchmark models

31 of 53

Experiment Result – STL vs STL-CAM

STL-CAM over STL – Percentage Improvements

NarrativeQA: BLEU 7.3%, WER -4.8% *, Distinct-2 5.4%
SQuAD: BLEU 2.8%, WER -7.6% *, Distinct-2 11.8%

  • * Negative values indicate STL performed better than STL-CAM

32 of 53

Experiment Result – MTL-BC vs MTL-BC-CAM

MTL-BC-CAM over MTL-BC – Percentage Improvements

NarrativeQA: BLEU 11.9%, WER 18.7%, Distinct-2 -2.3% *
SQuAD: BLEU 1.8%, WER 9.4%, Distinct-2 0.6%

  • * Negative values indicate MTL-BC performed better than MTL-BC-CAM

33 of 53

CAM vs Global attention mechanism conclusion

  • In general, CAM-based models produced answers with higher correctness (higher BLEU scores), lower error (lower WER scores) and higher diversity (higher Distinct-2 scores) than the global attention-based models in their respective experiments

  • These performance improvements are due to the reduction in the generation of frequently occurring words in answers.
  • CAM can capture the representation of the answer more precisely without loss of important information by utilizing all the decoder's previously generated hidden states as the decoding progresses.
  • This means CAM is a more effective attention mechanism than the global attention mechanism to address the language model influence issue.
  • Both outcomes also show that CAM can be utilized and is effective in both single-task and multi-task learning frameworks

34 of 53

Experiment Result – MTL-BC vs MTL-BC-DL

MTL-BC-DL over MTL-BC – Percentage Improvements

NarrativeQA: BLEU 4.7%, WER 8.9%, Distinct-2 -0.6% *
SQuAD: BLEU 1.6%, WER 3.0%, Distinct-2 -0.7% *

  • * Negative values indicate MTL-BC performed better than MTL-BC-DL

35 of 53

Dynamic vs Fixed Tasks Loss Weight Scheme conclusion

  • The overall result shows that by utilizing the dynamic tasks loss weight scheme, the performance of an MTL model can be improved through a reduction in the occurrence of high-frequency words in the generated answers.

  • In terms of BLEU and WER scores, MTL-BC-DL performed consistently better than MTL-BC.

  • This demonstrates that by utilizing DL, answer generation overfit can be effectively reduced and the MTL-BC-DL can produce answers with higher correctness (higher BLEU scores) and lower error rates (lower WER scores).

  • By dynamically adjusting the task loss weight during each epoch and using it for the next epoch, the model can increase learning for the task that is lagging (having higher loss) thus improving overall model learning and eventually improving model performance in answer generation.

36 of 53

Experiment Result – MTL-LTS vs MTL-MFE

MTL-MFE over MTL-LTS – Percentage Improvements

NarrativeQA: BLEU 34.2%, WER 48.7%, Distinct-2 0.4%
SQuAD: BLEU 45.8%, WER 52.5%, Distinct-2 1.2%

37 of 53

Parallel task learning (MTL-MFE) vs sequential learning to start (MTL-LTS) conclusion

  • MTL-MFE performed significantly better than MTL-LTS on all the measurements (BLEU, WER and Distinct-2).

  • This demonstrates that training the first-word and last-word prediction tasks in parallel with the answer generation task (a slightly more complex task) reduces question encoder overfit, compared to performing first-word prediction first and answer generation next

38 of 53

Experiment Result – MTL-BC vs MTL-TC

MTL-TC over MTL-BC – Percentage Improvements

NarrativeQA: BLEU 20.5%, WER 28.9%, Distinct-2 0.9%
SQuAD: BLEU 2.7%, WER 6%, Distinct-2 -0.01% *

  • * Negative values indicate MTL-BC performed better than MTL-TC

39 of 53

Ternary classifier vs Binary classifier conclusion

  • MTL-TC performed consistently better than MTL-BC on BLEU and WER, with mixed results on Distinct-2

  • This demonstrates that utilizing a ternary classification task instead of a binary classification task (a more complex task) also reduces question encoder overfit, and the models improve answer generation quality by reducing the incorrect generation of high-frequency words

40 of 53

Experiment Result – SEQ2SEQ++ vs interim models

SEQ2SEQ++ over Interim Models – Percentage Improvements

STL-CAM: NarrativeQA – BLEU 42.35%, WER 64.8%, Distinct-2 0.13%; SQuAD – BLEU 52.04%, WER 59.52%, Distinct-2 -0.61%
MTL-BC-CAM: NarrativeQA – BLEU 29.03%, WER 51.81%, Distinct-2 4.61%; SQuAD – BLEU 15.29%, WER 30.78%, Distinct-2 0.18%
MTL-BC-DL: NarrativeQA – BLEU 37.90%, WER 57.02%, Distinct-2 2.93%; SQuAD – BLEU 15.46%, WER 35.34%, Distinct-2 1.47%
MTL-MFE: NarrativeQA – BLEU 8.43%, WER 19.81%, Distinct-2 0.23%; SQuAD – BLEU 3.02%, WER 12.28%, Distinct-2 0.37%
MTL-TC: NarrativeQA – BLEU 19.84%, WER 44.88%, Distinct-2 1.40%; SQuAD – BLEU 14.19%, WER 33.27%, Distinct-2 0.74%

41 of 53

SEQ2SEQ++ vs interim models conclusion

  • SEQ2SEQ++ shows better performance than the interim models

  • SEQ2SEQ++ can capitalize on the strengths of the individual methods, which are:
    • Comprehensive Attention Mechanism (CAM)
    • Dynamic Tasks Loss Weighting Scheme (DL)
    • Multifunctional Encoder (MFE)
    • Ternary Classifier (TC).

42 of 53

Experiment Result – SEQ2SEQ++ vs benchmark models

SEQ2SEQ++ over Benchmark Models – Percentage Improvements

STL: NarrativeQA – BLEU 52.71%, WER 63.04%, Distinct-2 5.44%; SQuAD – BLEU 56.1%, WER 65.21%, Distinct-2 10.9%
MTL-BC: NarrativeQA – BLEU 44.42%, WER 60.82%, Distinct-2 2.85%; SQuAD – BLEU 17.31%, WER 37.26%, Distinct-2 0.73%
MTL-LTS: NarrativeQA – BLEU 45.54%, WER 58.83%, Distinct-2 0.21%; SQuAD – BLEU 50.17%, WER 58.3%, Distinct-2 1.53%

43 of 53

SEQ2SEQ++ vs benchmark models conclusion

  • SEQ2SEQ++ shows better performance than the benchmark models. It can generate answers with higher quality (highest BLEU score), lower error rate (lowest WER score), and higher diversity (highest Distinct-2 score) in comparison with all the other benchmark models.

  • This result indicates that SEQ2SEQ++ can address all three issues (language model influence, answer generation overfit, question encoder overfit) more effectively than the benchmark models.

44 of 53

Summary

  • The MTL framework is one of the popular methods in Seq2Seq-based answer generation. It provides a mechanism to introduce one or more auxiliary tasks that can be learned in parallel with the main task of answer generation, addressing the limitations and issues of Seq2Seq learning.

  • In other words, several methods can be integrated into an MTL setting to address several issues at hand.

  • This work capitalizes on this advantage by proposing SEQ2SEQ++ to address all the three issues encountered during answer generation

49 of 53

Future Works

Areas that have great potential for future works:

  • Multi-turn conversation: investigation of SEQ2SEQ++ for multi-turn conversation

  • Diversity: further investigation to improve this method toward a more significant improvement in diversity in comparison with other existing methods

  • Pre-trained language models: investigation of how existing pre-trained language models such as Generative Pre-trained Transformer (GPT-3) can be utilized with SEQ2SEQ++

  • Pre-trained embeddings: in this research, the embeddings were jointly trained. Pre-trained embeddings such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) can also be utilized with SEQ2SEQ++ to evaluate their effectiveness

50 of 53

Publications

  1. Palasundram, K., Mohd Sharef, N., Nasharuddin, N., Kasmiran, K. & Azman, A. (2019). Sequence to Sequence Model Performance for Education Chatbot. International Journal of Emerging Technologies in Learning (iJET), 14(24), 56-68. Kassel, Germany: International Journal of Emerging Technology in Learning.

  2. K. Palasundram, N. Mohd Sharef, K. A. Kasmiran and A. Azman, "Enhancements to the Sequence-to-Sequence-Based Natural Answer Generation Models," in IEEE Access, vol. 8, pp. 45738-45752, 2020, doi: 10.1109/ACCESS.2020.2978551.

  3. K. Palasundram, N. M. Sharef, K. A. Kasmiran and A. Azman, "SEQ2SEQ++: A Multitasking-Based Seq2seq Model to Generate Meaningful and Relevant Answers," in IEEE Access, doi: 10.1109/ACCESS.2021.3133495.

51 of 53

Backup

52 of 53

Proposed Methods: MFE

 

53 of 53

Proposed Methods: TC