1 of 53

SEQ2SEQ++: A MULTI-TASKING BASED SEQ2SEQ MODEL TO GENERATE MEANINGFUL AND RELEVANT ANSWERS

PhD Candidate

Kulothunkan Palasundram (GS50783)

Supervisory Committee

Associate Prof. Dr. Nurfadhlina Mohd Sharef

Associate Prof. Dr. Azreen bin Azman

Dr. Khairul Azhar bin Kasmiran

Universiti Putra Malaysia

Faculty of Computer Science & Information Technology

2 of 53

Outline

Introduction

Research Problem, Research Objective & Scope

Literature Review

Research Methodology

Proposed Methods & Experiments

Results and Conclusion

Summary and Contributions

Future Works

Publications

3 of 53

Introduction

Chatbot Types and Research Scope

4 of 53

Introduction

Seq2Seq is the transformation of a sequence of words (a question) into another sequence of words (an answer)

Seq2Seq Learning

5 of 53

Introduction

Seq2Seq Learning

6 of 53

Research Problem – Key Observations

  • Seq2Seq – a popular method for natural answer generation
  • However, the generated answer may not be relevant to the question
  • Result – conversations with chatbots can be meaningless and abruptly terminated by users, eventually lowering the chatbot adoption rate

7 of 53

Research Objective

Propose an MTL-based Seq2Seq model that can generate meaningful and relevant answers

Main Objective

8 of 53

Research Scope

Question answering as a single-turn conversation task (a pair of question and answer) under the Multi-task Learning (MTL) framework as defined in (Huang & Zhong, 2018), which is a key reference for this research.

9 of 53

Literature Review

  • 26 articles reviewed
  • 3 key issues identified
  • 5 methods/approaches found

10 of 53

Literature Review – Issues & Methods

11 of 53

Literature Review – Approaches

Method: Additional embeddings
  Strengths: additional encodings reduce encoder overfit
  Weaknesses: needs additional data, which may not be available for all datasets and scenarios

Method: Alternative loss functions
  Strengths: offers alternative loss functions that are not influenced by the frequency of words in the dataset
  Weaknesses: reinforcement learning can be unstable and dependent on a warm start using cross-entropy loss; a loss function used for one dataset may not be suitable for another; requires custom reward functions to evaluate the model

Method: Multi-task learning
  Strengths: auxiliary tasks can reduce question encoding and answer generation overfit; the auxiliary task can be excluded during model inference
  Weaknesses: the fixed task loss weight mechanism is very inefficient and ineffective

Method: Attention mechanism
  Strengths: provides a mechanism to focus on certain parts of the question during decoding to address language model influence
  Weaknesses: existing methods create an imbalance between the language model influence and the question at hand during answer generation

Method: Beam search
  Strengths: can be used in conjunction with any other method
  Weaknesses: not able to influence model training, thus improvements may not be significant

12 of 53

Research Objectives - Refined

Propose an MTL-based Seq2Seq model that can generate meaningful and relevant answers by addressing three issues: language model influence, answer generation overfit, and question encoder overfit.

Main Objective

Sub-objective 1: a new attention mechanism to address the language model influence issue

Sub-objective 2: an improved MTL loss calculation method to address the answer generation overfit issue

Sub-objective 3: auxiliary tasks for MTL to address the question encoding overfit issue

13 of 53

Research Problem Statements Refined

Issue # 1

Language Model Influence

What

The decoder's ability to generate the next word based on the previously generated words. Over time, the language model influence gets stronger, and the model may generate irrelevant answers.

Existing method

Global attention mechanism (Bahdanau et al., 2015), which performs computation to determine which part of the question is important.

Gaps in existing method

The global attention mechanism focuses only on the encoder's hidden states and the decoder's final hidden state at each decoding step. Thus, the influence of the previously generated words gets diluted as decoding progresses.

Proposal

Comprehensive Attention Mechanism (CAM) – focuses on the encoder hidden states and all the decoder's hidden states

14 of 53

Research Problem Statements Refined

Issue # 2

Answer generation overfit

What

Answer generation overfit refers to the situation where the decoder becomes overfit during training due to the high frequency of common words in the training data

Existing method

Answer generation overfitting can be addressed by adding regularization terms to the cross-entropy loss function to compute a new loss: LMTL = αag·Lag + αn·Ln

Gaps in existing method

Existing MTL algorithms use a small fixed task loss weight (α) (such as 0.1 or 0.01) for the auxiliary task when computing the MTL loss for backpropagation. This may not be effective in reducing decoding overfit

Proposal

Dynamic Task Loss Weight Scheme (DL) for MTL, whereby each task's loss weight is computed at each epoch and used for the MTL loss calculation
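The gap can be made concrete with a minimal sketch of the existing fixed-weight scheme (the α values here are illustrative defaults, not values from a specific benchmark):

```python
def mtl_loss_fixed(loss_ag, loss_aux, alpha_ag=1.0, alpha_aux=0.1):
    """Fixed-weight MTL loss as in existing methods:
    L_MTL = alpha_ag * L_ag + alpha_aux * L_aux.
    The alphas are constants chosen up front and never change,
    regardless of which task is lagging during training."""
    return alpha_ag * loss_ag + alpha_aux * loss_aux
```

Because the weights stay constant, a lagging auxiliary task keeps its small influence for the whole run; the proposed DL scheme replaces these constants with per-epoch values.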

15 of 53

Research Problem Statements Refined

Issue # 3

Question encoding overfit

What

Question encoding overfit refers to the situation where the encoder becomes overfit during training due to the high frequency of common words in training data

Existing method

Utilize question encoding to perform additional tasks such as binary classification of answer (MTL-BC - Huang & Zhong, 2018) or first word prediction (MTL-LTS - Zhu et al., 2016)

Gaps in existing method

Binary classification of answers is not natural, as answers can also be partially correct. First-word prediction was done in a separate network (sequential MTL). Both methods are insufficient to reduce encoder overfit

Proposal

  • Ternary Classification (TC)
  • Multifunctional Encoder (MFE)

16 of 53

Research Methodology - Phases

Phase 1 - Planning
  • Initial literature review
  • Research scope
  • Research objectives

Phase 2 - Literature Review
  • Seq2Seq issues
  • Existing approaches, strengths, limitations and gaps
  • Evaluation metrics

Phase 3 - Design and Development
  • Benchmark models (MTL-BC, STL, MTL-LTS)
  • Proposed methods (CAM, DL, TC, MFE)
  • Final model (SEQ2SEQ++)

Phase 4 - Experiment & Analysis
  • Data for training and evaluation
  • Experiments
  • Analysis
  • Thesis

17 of 53

Research Methodology - Dataset

NarrativeQA (Kočiský et al., 2017)
  Summary: a fiction-based dataset consisting of proper English words
  Training: 24,000 question-answer pairs | Validation: 4,800 | Testing: 1,000
  Question vocabulary: 17,294 words | Answer vocabulary: 18,830 words
  Maximum question length: 19 | Maximum answer length: 16
  Sample – Question: what are sleepy hollow renowned for ? Answer: ghosts and a haunting atmosphere

SQuAD (Rajpurkar et al., 2016)
  Summary: a crowdsourced dataset based on Wikipedia articles
  Training: 24,819 question-answer pairs | Validation: 4,800 | Testing: 1,000
  Question vocabulary: 19,569 words | Answer vocabulary: 19,935 words
  Maximum question length: 17 | Maximum answer length: 17
  Sample – Question: how are plants different from animals ? Answer: primary cell wall composed of the polysaccharides cellulose

18 of 53

Research Methodology - Metrics

  • Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)
      • measures answer correctness
      • gives a score from 0 to 1
      • a higher BLEU score indicates answers that are more relevant to the question

  • Word Error Rate (WER) (Mikolov et al., 2010)
      • measures model error
      • gives a score from 0 to 1
      • a lower score means fewer errors in the generated answers

  • Distinct-2 (Li, Galley, Brockett, Gao, et al., 2016)
      • measures the diversity of generated answers
      • gives a score from 0 to 1
      • a higher score means more diverse answers
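Of the three metrics, Distinct-2 is simple enough to compute directly; a minimal sketch (whitespace tokenization is an assumption here, not necessarily the tokenization used in the thesis):

```python
def distinct_2(answers):
    """Distinct-2: number of unique bigrams divided by the total number
    of bigrams across all generated answers."""
    bigrams = []
    for ans in answers:
        toks = ans.split()
        bigrams.extend(zip(toks, toks[1:]))  # consecutive word pairs
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0
```

Repeated bigrams across answers lower the score, so a model that keeps emitting the same high-frequency phrases is penalized.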

19 of 53

Research Methodology – Proposed Methods

Issue: Language model influence
  Proposed method: Comprehensive Attention Mechanism (CAM) – an attention mechanism utilized during answer generation
  Benchmark: global attention mechanism (Bahdanau et al., 2015), as utilized in STL (Bahdanau et al., 2015) and MTL-BC (Huang & Zhong, 2018)

Issue: Answer generation overfit
  Proposed method: Dynamic Tasks Loss Weight Scheme (DL) – an algorithm to dynamically compute the task loss weights during multi-task learning
  Benchmark: MTL-BC (Huang & Zhong, 2018), which utilizes a fixed tasks loss weight scheme

Issue: Question encoder overfit
  Proposed method: Multi-functional Encoder (MFE) – performs question encoding, first-word prediction and last-word prediction in a parallel approach
  Benchmark: MTL-LTS (Zhu et al., 2016), which performs the first-word prediction task and then answer generation in a sequential approach

  Proposed method: Ternary Classifier (TC) – takes in the question encoding from MFE and the answer encoding from AE to perform classification
  Benchmark: MTL-BC (Huang & Zhong, 2018), which utilizes a binary classifier

Metrics for all experiments: BLEU, WER and Distinct-2

20 of 53

Research Methodology – Experiments

Experiment 1: STL vs STL-CAM
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in single-task learning

Experiment 2: MTL-BC-CAM vs MTL-BC
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in multi-task learning (MTL)

Experiment 3: MTL-BC-DL vs MTL-BC
Purpose: to gauge the performance improvement of the dynamic tasks loss weight scheme versus the fixed tasks loss weight scheme for MTL loss calculation

Experiment 4: MTL-TC vs MTL-BC
Purpose: to gauge the performance improvement of ternary classification over binary classification as the auxiliary task for MTL

Experiment 5: MTL-MFE vs MTL-LTS
Purpose: to gauge the performance of MFE (parallel MTL) over the LTS model (sequential MTL)

Experiment 6: SEQ2SEQ++ vs STL-CAM, MTL-BC-CAM, MTL-BC-DL, MTL-MFE, MTL-TC
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the interim models

Experiment 7: SEQ2SEQ++ vs STL, MTL-BC, MTL-LTS
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the benchmark models

21 of 53

Proposed Methods: CAM

  • Comprehensive Attention Mechanism – an attention mechanism utilized during answer generation
  • In this mechanism:
    • The attention weights are computed based on all the encoder's hidden states and the sum of all the decoder's previous hidden states
    • These attention weights are then used to compute the context vector
    • The decoder then utilizes the context vector to generate the answer

  • This ensures all the hidden states are continuously considered for the next-word prediction
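The core difference from global attention can be sketched numerically. This is a hedged illustration, not the thesis implementation: the additive scoring form and the parameter matrices Wk, Wq and vector v are assumptions for the sketch; what matters is that the query summarizes all decoder hidden states so far rather than only the latest one.

```python
import numpy as np

def cam_context(enc_states, dec_states, Wk, Wq, v):
    """CAM sketch: attention over encoder hidden states where the query
    is the SUM of all decoder hidden states so far (global attention
    would use only the latest decoder state as the query)."""
    query = dec_states.sum(axis=0)                       # (d,) all past steps
    scores = np.tanh(enc_states @ Wk + query @ Wq) @ v   # (T_enc,) additive scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # softmax over encoder steps
    return weights @ enc_states                          # context vector (d,)
```

Because every past decoder state contributes to the query, earlier generated words keep influencing the attention weights instead of being diluted as decoding progresses.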

22 of 53

Comprehensive Attention Mechanism (CAM)

 

Models that implemented this method are STL-CAM, MTL-BC-CAM and SEQ2SEQ++

23 of 53

Comprehensive Attention Mechanism (CAM) vs Global Attention Mechanism

Proposed Method:

CAM

Existing Method:

Global Attention Mechanism

24 of 53

Proposed Methods: DL

  • Dynamic Tasks Loss Weight Scheme – an algorithm to dynamically compute the task loss weights during multi-task learning

  • In this algorithm:
    • The task loss weight (α) for each task is updated at each epoch
    • Each task's new loss weight is proportional to that task's share of the total model loss
    • The new weights are then used in the next epoch to compute the MTL loss

  • This means the influence of each task in each epoch is dynamically determined by the task losses of the previous epoch

Models that implemented this method are MTL-BC-DL, MTL-MFE and SEQ2SEQ++

25 of 53

Proposed Methods: DL

Algorithm 1: SEQ2SEQ++ Training Algorithm

Input: {Question (X), Answer (Y), Label (L), First Word (FW), Last Word (LW)} quintuplets, Maximum answer length (T), Maximum Epoch (E)

Steps:

Initialize Multi-functional Encoder (MFE), Answer Encoder (AE), Answer Decoder (AD) and Ternary Classifier (TC)

Initialize all losses: LMTL = Lag = Ltc = Lfw = Llw = 0

Initialize each task loss weight: αag = αtc = αfw = αlw = 0.25

For epoch 1 to Number of Epochs

1. For batch 1 to Number of Batches Do

1.1. Perform question encoding

1.2. Predict First Word and compute the first-word prediction losses (Lfw)

1.3. Predict Last Word and compute the last-word prediction losses (Llw)

1.4. Perform answer encoding

1.5. Perform ternary classification and compute ternary classification loss (Ltc)

1.6. Perform answer generation using CAM and compute answer generation loss (Lag)

1.7. Compute multi-task loss: LMTL = αag*Lag + αtc*Ltc + αfw*Lfw + αlw*Llw

1.8. Update the model parameters (neural network weights)

1.9. For each task, tk ∈ {ag, tc, fw, lw}: calculate the average loss (Ltk-avg)

End for Batch Looping

2. Calculate the total average loss for all tasks: Lavg-total = Lag-avg + Ltc-avg + Lfw-avg + Llw-avg

3. For each task, tk ∈ {ag, tc, fw, lw}

3.1. Calculate the new task loss weight: αtk = Ltk-avg / Lavg-total

End for Epoch Looping

Output: Trained SEQ2SEQ++ model
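Steps 2-3 of Algorithm 1 (the dynamic weight update) can be rendered as a minimal sketch; this is a plain restatement of the rule above, not the exact thesis code:

```python
def update_task_loss_weights(avg_losses):
    """Steps 2-3 of Algorithm 1: each task's new weight is its average
    epoch loss divided by the total average loss over all tasks, so a
    lagging task (higher loss) receives a larger weight in the next
    epoch's MTL loss computation."""
    total = sum(avg_losses.values())  # step 2: total average loss
    return {task: loss / total for task, loss in avg_losses.items()}  # step 3.1
```

The returned weights always sum to 1, and the ordering of weights matches the ordering of the task losses, which is what lets the model increase learning for whichever task is lagging.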

26 of 53

Proposed Methods: MFE

  • MFE – Multi-Functional Encoder – performs question encoding, first-word prediction and last-word prediction based on the question encoder hidden states
  • MFE learns to produce hidden states that keep the losses of all 3 tasks minimal

  • By sharing the question encoding among multiple tasks (answer generation, first-word prediction and last-word prediction), question encoding overfit can be reduced

Models that implemented this method are MTL-MFE and SEQ2SEQ++
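A schematic of the parallel heads is sketched below. The mean-pooled encoding and the linear prediction heads (W_fw, W_lw) are illustrative assumptions, not the thesis architecture; the point is that one shared encoding feeds all tasks at once:

```python
import numpy as np

def mfe_forward(enc_states, W_fw, W_lw):
    """MFE sketch: a single shared question encoding feeds three tasks
    in parallel (answer generation, first-word prediction, last-word
    prediction), so the encoder's representation cannot overfit to any
    single objective."""
    question_enc = enc_states.mean(axis=0)   # shared encoding (d,), pooling assumed
    fw_logits = question_enc @ W_fw          # first-word prediction head
    lw_logits = question_enc @ W_lw          # last-word prediction head
    return question_enc, fw_logits, lw_logits
```

Contrast with the sequential MTL-LTS setup, where first-word prediction runs in a separate network before answer generation.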

27 of 53

Proposed Methods: TC

  • TC – Ternary Classifier – performs ternary classification of an answer based on the question and answer encodings
  • Classifies answers as correct, partially correct or wrong

  • By sharing the question encoding among multiple tasks (answer generation and answer classification), question encoding overfit can be reduced

Models that implemented this method are MTL-TC and SEQ2SEQ++
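A minimal sketch of the classifier head (the single linear layer over concatenated encodings is an assumption for illustration; the thesis model may use a deeper head):

```python
import numpy as np

def ternary_classify(q_enc, a_enc, W, b):
    """TC sketch: concatenate the question encoding (from MFE) and the
    answer encoding (from AE), score the three classes, and return the
    most likely label."""
    logits = np.concatenate([q_enc, a_enc]) @ W + b   # (3,) class scores
    labels = ["correct", "partially-correct", "wrong"]
    return labels[int(np.argmax(logits))]
```

The middle class is what distinguishes TC from the binary classifier in MTL-BC: partially correct answers get their own target instead of being forced into correct/wrong.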

28 of 53

Proposed Methods: TC

 

29 of 53

Proposed Methods: Final Model – SEQ2SEQ++

30 of 53

Recap – Experiments Done

Experiment 1: STL vs STL-CAM
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in single-task learning

Experiment 2: MTL-BC-CAM vs MTL-BC
Purpose: to gauge the performance improvement of CAM versus the global attention mechanism for answer generation in multi-task learning (MTL)

Experiment 3: MTL-BC-DL vs MTL-BC
Purpose: to gauge the performance improvement of the dynamic tasks loss weight scheme versus the fixed tasks loss weight scheme for MTL loss calculation

Experiment 4: MTL-TC vs MTL-BC
Purpose: to gauge the performance improvement of ternary classification over binary classification as the auxiliary task for MTL

Experiment 5: MTL-MFE vs MTL-LTS
Purpose: to gauge the performance of MFE (parallel MTL) over the LTS model (sequential MTL)

Experiment 6: SEQ2SEQ++ vs STL-CAM, MTL-BC-CAM, MTL-BC-DL, MTL-MFE, MTL-TC
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the interim models

Experiment 7: SEQ2SEQ++ vs STL, MTL-BC, MTL-LTS
Purpose: to gauge the performance of the final model (SEQ2SEQ++) against all the benchmark models

31 of 53

Experiment Result – STL vs STL-CAM

STL-CAM over STL – Percentage Improvements

NarrativeQA: BLEU 7.3%, WER -4.8% *, Distinct-2 5.4%
SQuAD: BLEU 2.8%, WER -7.6% *, Distinct-2 11.8%

  • * Negative values indicate STL performed better than STL-CAM

32 of 53

Experiment Result – MTL-BC vs MTL-BC-CAM

MTL-BC-CAM over MTL-BC – Percentage Improvements

NarrativeQA: BLEU 11.9%, WER 18.7%, Distinct-2 -2.3% *
SQuAD: BLEU 1.8%, WER 9.4%, Distinct-2 0.6%

  • * Negative values indicate MTL-BC performed better than MTL-BC-CAM

33 of 53

CAM vs Global attention mechanism conclusion

  • In general, CAM-based models produced answers with higher correctness (higher BLEU scores), lower error (lower WER scores) and higher diversity (higher Distinct-2 scores) than the global attention-based models in their respective experiments

  • These performance improvements are due to the reduction in the generation of frequently occurring words in answers.
  • CAM can capture the representation of the answer more precisely without loss of important information by utilizing all the decoder's previously generated hidden states as the decoding progresses.
  • This means CAM is a more effective attention mechanism than the global attention mechanism to address the language model influence issue.
  • Both outcomes also show that CAM can be utilized and is effective in both single-task and multi-task learning frameworks

34 of 53

Experiment Result – MTL-BC vs MTL-BC-DL

MTL-BC-DL over MTL-BC – Percentage Improvements

NarrativeQA: BLEU 4.7%, WER 8.9%, Distinct-2 -0.6% *
SQuAD: BLEU 1.6%, WER 3.0%, Distinct-2 -0.7% *

  • * Negative values indicate MTL-BC performed better than MTL-BC-DL

35 of 53

Dynamic vs Fixed Tasks Loss Weight Scheme conclusion

  • The overall result shows that by utilizing the dynamic tasks loss weight scheme, the performance of an MTL model can be improved through a reduction in the occurrence of high-frequency words in the generated answers.

  • In terms of BLEU and WER scores, MTL-BC-DL performed consistently better than MTL-BC.

  • This demonstrates that by utilizing DL, answer generation overfit can be effectively reduced and the MTL-BC-DL can produce answers with higher correctness (higher BLEU scores) and lower error rates (lower WER scores).

  • By dynamically adjusting the task loss weight during each epoch and using it for the next epoch, the model can increase learning for the task that is lagging (having higher loss) thus improving overall model learning and eventually improving model performance in answer generation.

36 of 53

Experiment Result – MTL-LTS vs MTL-MFE

MTL-MFE over MTL-LTS – Percentage Improvements

NarrativeQA: BLEU 34.2%, WER 48.7%, Distinct-2 0.4%
SQuAD: BLEU 45.8%, WER 52.5%, Distinct-2 1.2%

37 of 53

Parallel task learning (MTL-MFE) vs sequential learning to start (MTL-LTS) conclusion

  • MTL-MFE performed significantly better than MTL-LTS on all the measurements (BLEU, WER and Distinct-2).

  • This demonstrates that training the first-word and last-word prediction tasks in parallel with the answer generation task (a slightly more complex task) reduces question encoder overfit, compared to performing first-word prediction first and answer generation next

38 of 53

Experiment Result – MTL-BC vs MTL-TC

MTL-TC over MTL-BC – Percentage Improvements

NarrativeQA: BLEU 20.5%, WER 28.9%, Distinct-2 0.9%
SQuAD: BLEU 2.7%, WER 6%, Distinct-2 -0.01% *

  • * Negative values indicate MTL-BC performed better than MTL-TC

39 of 53

Ternary classifier vs Binary classifier conclusion

  • MTL-TC performed consistently better than MTL-BC on BLEU and WER, with mixed results on Distinct-2

  • This demonstrates that utilizing a ternary classification task instead of a binary classification task (a more complex task) also reduces question encoder overfit, and the models improve answer generation quality by reducing the incorrect generation of high-frequency words

40 of 53

Experiment Result – SEQ2SEQ++ vs interim models

SEQ2SEQ++ over Interim Models – Percentage Improvements

STL-CAM: NarrativeQA – BLEU 42.35%, WER 64.8%, Distinct-2 0.13%; SQuAD – BLEU 52.04%, WER 59.52%, Distinct-2 -0.61%
MTL-BC-CAM: NarrativeQA – BLEU 29.03%, WER 51.81%, Distinct-2 4.61%; SQuAD – BLEU 15.29%, WER 30.78%, Distinct-2 0.18%
MTL-BC-DL: NarrativeQA – BLEU 37.90%, WER 57.02%, Distinct-2 2.93%; SQuAD – BLEU 15.46%, WER 35.34%, Distinct-2 1.47%
MTL-MFE: NarrativeQA – BLEU 8.43%, WER 19.81%, Distinct-2 0.23%; SQuAD – BLEU 3.02%, WER 12.28%, Distinct-2 0.37%
MTL-TC: NarrativeQA – BLEU 19.84%, WER 44.88%, Distinct-2 1.40%; SQuAD – BLEU 14.19%, WER 33.27%, Distinct-2 0.74%

41 of 53

SEQ2SEQ++ vs interim models conclusion

  • SEQ2SEQ++ shows better performance than the interim models

  • SEQ2SEQ++ can capitalize on the strengths of the individual methods, which are:
    • Comprehensive Attention Mechanism (CAM)
    • Dynamic Tasks Loss Weighting Scheme (DL)
    • Multifunctional Encoder (MFE)
    • Ternary Classifier (TC).

42 of 53

Experiment Result – SEQ2SEQ++ vs benchmark models

SEQ2SEQ++ over Benchmark Models – Percentage Improvements

STL: NarrativeQA – BLEU 52.71%, WER 63.04%, Distinct-2 5.44%; SQuAD – BLEU 56.1%, WER 65.21%, Distinct-2 10.9%
MTL-BC: NarrativeQA – BLEU 44.42%, WER 60.82%, Distinct-2 2.85%; SQuAD – BLEU 17.31%, WER 37.26%, Distinct-2 0.73%
MTL-LTS: NarrativeQA – BLEU 45.54%, WER 58.83%, Distinct-2 0.21%; SQuAD – BLEU 50.17%, WER 58.3%, Distinct-2 1.53%

43 of 53

SEQ2SEQ++ vs benchmark models conclusion

  • SEQ2SEQ++ shows better performance than the benchmark models. It can generate answers with higher quality (highest BLEU score), lower error rate (lowest WER score), and higher diversity (highest Distinct-2 score) in comparison with all the other benchmark models.

  • This result indicates that SEQ2SEQ++ can address all three issues (language model influence, answer generation overfit, question encoder overfit) more effectively than the benchmark models.

44 of 53

Summary

  • The MTL framework is one of the popular methods in Seq2Seq-based answer generation. It provides a mechanism to introduce one or more auxiliary tasks that can be learned in parallel with the main task of answer generation, addressing the limitations and issues of Seq2Seq learning.

  • In other words, several methods can be integrated into an MTL setting to address several issues at hand.

  • This work capitalizes on this advantage by proposing SEQ2SEQ++ to address all the three issues encountered during answer generation

49 of 53

Future Works

Areas that have great potential for future works:

  • Multi-turn conversation: investigation of SEQ2SEQ++ for multi-turn conversation

  • Diversity: further investigation to improve this method toward a more significant improvement in diversity in comparison with other existing methods

  • Pre-trained language models: investigation of how existing pre-trained language models such as Generative Pre-trained Transformer (GPT-3) can be utilized with SEQ2SEQ++

  • Pre-trained embeddings: in this research, the embeddings were jointly trained. Pre-trained embeddings such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) can also be utilized with SEQ2SEQ++ to evaluate their effectiveness

50 of 53

Publications

  1. Palasundram, K., Mohd Sharef, N., Nasharuddin, N., Kasmiran, K. & Azman, A. (2019). Sequence to Sequence Model Performance for Education Chatbot. International Journal of Emerging Technologies in Learning (iJET), 14(24), 56-68. Kassel, Germany: International Journal of Emerging Technology in Learning.

  2. K. Palasundram, N. Mohd Sharef, K. A. Kasmiran and A. Azman, "Enhancements to the Sequence-to-Sequence-Based Natural Answer Generation Models," in IEEE Access, vol. 8, pp. 45738-45752, 2020, doi: 10.1109/ACCESS.2020.2978551.

  3. K. Palasundram, N. M. Sharef, K. A. Kasmiran and A. Azman, "SEQ2SEQ++: A Multitasking-Based Seq2seq Model to Generate Meaningful and Relevant Answers," in IEEE Access, doi: 10.1109/ACCESS.2021.3133495.

51 of 53

Backup

52 of 53

Proposed Methods: MFE

 

53 of 53

Proposed Methods: TC