By: Samira Said, IS Department
Improved Automated Essay Grading System via Natural Language Processing and Deep Learning
Under the supervision of: Assoc. Prof. Essam Hamed and Assoc. Prof. Essam Elfakharany
ABSTRACT
I present Tahsin, an advanced system for automatically grading essays, based on natural language processing and deep learning techniques.
I first ran the baseline code on the Kaggle platform and obtained a result of 73%. I then re-ran the same code on another platform, Google Colab, and again obtained 73%. After modifying the code and tuning its parameters, I reached a result of 95%; finally, using another algorithm, Bi-LSTM, I reached 96.35%.
1- Introduction
Exam question classification:
Constructed-Response (CR), e.g. essays
Selected-Response (SR), e.g. MCQs and short answers
2- Motivation
The chart shows the most important methods that have been adopted in education following the suspension of studies in various countries.
3- Problem Statement
4- Objectives
Construct an accurate model with the following characteristics:
5- Background
Automated Grading for Essays (AGE)
is the use of specialized computer algorithms to assign grades to essays written in an educational setting such as:
1. Deep Learning:
Example application: skin cancer detection
AGE systems
2. Natural Language Processing (NLP)
2. NLP Approaches related to AGE:
-- Paraphrase Identification:
-- Paraphrase identification is the process of analyzing two sentences and projecting how similar they are based on their underlying hidden semantics.
-- It is a key capability that benefits several NLP tasks such as plagiarism detection, question answering, context detection, summarization, and domain identification.
-- Question Answering:
-- An automatic question-answering system should be able to interpret a natural language question and use reasoning to return an appropriate reply. Modern knowledge bases, such as the well-known Freebase dataset, allow this field to flourish and move beyond the era when features and rule sets were hand-crafted for specific domains.
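As an illustration only, here is a minimal lexical sketch of sentence similarity (cosine similarity over bag-of-words counts). Real paraphrase identification relies on semantic models rather than surface word overlap; the function name and scoring are my own, not part of this work.

```python
from collections import Counter
import math

def cosine_similarity(sent_a: str, sent_b: str) -> float:
    """Bag-of-words cosine similarity between two sentences (lexical only)."""
    a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    # Dot product over the words the two sentences share
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Two identical sentences score 1.0 and two sentences with no shared words score 0.0; a semantic model would also catch paraphrases that share no vocabulary.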
3. Recurrent Neural Network (RNN):
4. LSTM (Long Short-Term Memory) Network:
Example: an LSTM can be trained to generate new characters, words, and bodies of text.
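The gate mechanics behind an LSTM can be sketched in a few lines. This is a toy scalar cell with hand-supplied weights; the weight dictionary `w` and its layout are assumptions for illustration, not the model trained in this work.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a scalar input and scalar state.

    w maps each gate name to an (input weight, recurrent weight, bias) triple.
    """
    f = sigmoid(w['f'][0] * x + w['f'][1] * h_prev + w['f'][2])   # forget gate
    i = sigmoid(w['i'][0] * x + w['i'][1] * h_prev + w['i'][2])   # input gate
    g = math.tanh(w['g'][0] * x + w['g'][1] * h_prev + w['g'][2]) # candidate cell value
    o = sigmoid(w['o'][0] * x + w['o'][1] * h_prev + w['o'][2])   # output gate
    c = f * c_prev + i * g       # new cell state: keep part of the old, add new
    h = o * math.tanh(c)         # new hidden state exposed to the next layer
    return h, c
```

The forget gate is what lets the cell carry information across long sequences, which is why LSTM suits essay-length text better than a plain RNN.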
6- Related Work
PAPER INFORMATION | OBJECTIVES | METHODOLOGY | DATASET | RESULTS | CONTRIBUTION | MISSING |
1- Automated Grading of Essays: A Review. Authors: Jyoti G. Borade. Publisher: Springer, 2021 | Methods for automated grading of essays and evaluating explanatory answers (applications of approaches such as Natural Language Processing and Deep Learning for AGE) | 1- Textual Similarity 2- Latent Semantic-Based Vector Space Model 3- Neural Network Based Approaches 4- Naive Bayes Classifiers | Responses for each question from different students | Accuracy obtained for simple LSTM is 83%, deep LSTM is 82%, bidirectional LSTM is 89% | This work presents a review of machine learning techniques used to assess essay-type answers | Limited dataset |
2- Automated Content Grading Using Machine Learning. Authors: Rahul K Chauhan, Ravinder Saharan, Siddhartha Singh, Priti Sharma. Publisher: ResearchGate, 2020 | Show how an algorithmic approach in machine learning can be used to automatically examine and grade theoretical content in exam answer papers | Random Forest algorithm | Standard exam answer papers taken from the mid-term (minor) examinations | Weighted kappa: Set 1: 0.46, Set 2: 0.55, Set 3: 0.61 | This work showed how content grading in a big-data-based technical domain can be solved using this approach | Low accuracy |
3- An Analysis of Automated Answer Evaluation Systems based on Machine Learning. Authors: Birpal Singh J. Kapoor, Shubham M. Nagpure. Publisher: IEEE, 2020 | Summarize the existing mechanisms and analyze the performance of systems used for automatic grading of long, descriptive answers | A methodology for partitioning a collection of essays into subsets that correspond to similar graders, using annotation reasoning and clustering | Kaggle website | Many analysts and scholars are still working hard on this problem and have created different frameworks that gave encouraging results | Features developed with corpus-based strategies, or NLP systems, as a significant part of the AI structure | Percentage of accuracy is not specified |
4- Automated Language Essay Scoring Systems: A Literature Review. Authors: Mohamed Abdellatif Hussein, Hesham Hassan. Publisher: PeerJ, 2019 | Review the literature on AES systems used for grading essay questions | 1- Project Essay Grader (PEG) 2- Intelligent Essay Assessor (IEA) 3- E-rater 4- Criterion 5- IntelliMetric 6- MY Access 7- Bayesian Essay Test Scoring System (BETSY) 8- Automatic text scoring using neural networks 9- A neural network approach to automated essay scoring 10- Neural network for automatic essay scoring | The Kaggle ASAP contest dataset | PEG 0.87, IEA 0.90, E-rater 0.91, IntelliMetric 0.83, BETSY 0.80 | The performance of these systems is evaluated by comparing the scores they assign against a set of essays scored by expert humans | The results are not specific to each method |
5- Automated Essay Grading using Machine Learning Algorithm. Authors: Ramalingam, A. Pandian, Prateek Chetry and Himanshu Nigam. Publisher: IOP, 2018 | Develop an automated essay assessment system using machine learning techniques | A machine learning technique (the e-Rater technique): essays are provided as input and compared with the essays of each set; the essays are then compared based on their polarities, the words used, and their content, and the machine combines all the results into a final score for the essay | The dataset was extracted from kaggle.com and consists of data from the competition conducted by The Hewlett Foundation | The machine is capable of assessing an essay like a human rater | This approach models language features such as fluency, grammatical correctness, and the domain information content of the essays, and tries to fit the best polynomial in the feature space using linear regression with polynomial basis functions | Accuracy percentage is not specified |
6- Intelligent Auto-grading System. Authors: Zining Wang, Jianli Liu. Publisher: Proceedings, 2018 | Present a novel automatic essay scoring system based on Natural Language Processing and Deep Learning technologies | Natural Language Processing, Deep Learning, LSTM | The dataset for training and testing is the public essay set available in the Automated Student Assessment Prize on Kaggle | Accuracy 73% | Reviews NLP, neural networks, and intelligent auto-grading systems, then attempts to build an innovative open-ended response grading machine | Accuracy is low |
7- Related Work Conclusion & Research Gap
1- Commonly used techniques:
- Machine Learning {linear regression, clustering, SVM, and Bayesian inference}
- Deep Learning
2- Different models for AGE:
- Neural Network based Approaches
- Naïve Bayes Classifiers
3- NLP for AGE
2- Text Similarity :
It takes text documents as inputs and finds similarity between them.
Lexical similarity: determines the similarity by matching contents, word-by-word.
Semantic similarity: is based on the meaning of the contents.
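A minimal sketch of the lexical case, assuming word-by-word matching with the Jaccard index (the function name is illustrative, not from this work):

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Lexical similarity: shared words divided by all distinct words (Jaccard index)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

Note that this captures only surface overlap; semantic similarity would also score "car" and "automobile" as related.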
3- AGE Evaluation Metrics:
RMSE is an extension of the mean squared error. Importantly, the square root of the error is calculated, which means that the units of the RMSE are the same as the original units of the target value being predicted.
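The RMSE described above can be computed directly; a minimal sketch:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target being predicted."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )
```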
4- Research Gap:
Most algorithms that use Natural Language Processing and Deep Learning achieve low accuracy.
8- Research Road Map
8.1- Re-implementation of the Kaggle AGE Model using the Colab platform
8.2- AGE Model Evaluation
8.3- The Enhanced AGE Model using RNN and LSTM Methodology
8.4- The Enhanced AGE Model Evaluation
8.5- Results of Comparing the Enhanced Model to the Original Model
8.1- Re-implementation of the Kaggle AGE Model using the Kaggle platform
The existing AGE model from Kaggle.
Pipeline: dataset (CSV) → file-reader libraries in Python → AGE model → question score.
8.1- Re-implementation of the Kaggle AGE Model using the Colab platform
Pipeline: dataset (CSV) → file-reader libraries in Python → model building → question score.
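The "dataset → CSV → file-reader" step might be sketched as follows. The tab-separated layout and the column names `essay` and `domain1_score` are assumptions in the shape of the ASAP data, not a confirmed description of this project's files.

```python
import csv
import io

# Hypothetical two-column sample in the shape of the ASAP data;
# the real file and its column names may differ.
sample = "essay\tdomain1_score\nDear local newspaper, I think...\t8\n"

def load_essays(f):
    """Read (essay text, integer score) pairs from a tab-separated file object."""
    reader = csv.DictReader(f, delimiter="\t")
    return [(row["essay"], int(row["domain1_score"])) for row in reader]

# io.StringIO stands in for open("training_set.tsv") here.
pairs = load_essays(io.StringIO(sample))
```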
Model building steps
Dataset:
QUESTION NUMBER | COUNT | PERCENTAGE |
1 | 1783 | 13.7% |
2 | 1800 | 13.9% |
3 | 1726 | 13.3% |
4 | 1770 | 13.6% |
5 | 1805 | 13.9% |
6 | 1800 | 13.9% |
7 | 1569 | 12.1% |
8 | 723 | 5.57% |
Total | 12976 | 100% |
TYPE OF ESSAY | TRAINING SET SIZE |
Persuasive/ Narrative/Expository | 1,783 Essays |
SCORE | RUBRIC GUIDELINES |
1 | An undeveloped response that may take a position but offers no more than very minimal support. Typical elements: |
2 | An under-developed response that may or may not take a position. Typical elements: |
3 | A minimally-developed response that may take a position, but with inadequate support and details. Typical elements: |
4 | A somewhat-developed response that takes a position and provides adequate support. Typical elements: |
5 | A developed response that takes a clear position and provides reasonably persuasive support. Typical elements: |
6 | A well-developed response that takes a clear and thoughtful position and provides persuasive support. Typical elements: |
8.2- AGE Model Evaluation
Where O is an N-by-N matrix, N being the number of possible scores; O(i, j) gives the count of essays that received score i from the first evaluator and score j from the second evaluator. The E matrix gives the expected ratings without considering any correlation between the two evaluations given by the two different evaluators. W is a matrix of the same size as the O and E matrices. It is calculated as follows:
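A sketch of the standard quadratic weighting for W, and the kappa it yields, assuming N possible scores indexed by i and j:

```latex
w_{i,j} = \frac{(i-j)^2}{(N-1)^2}, \qquad
\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}}
```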
Po is the relative observed agreement among raters.
Pe is the hypothetical probability of chance agreement, computed from the observed data by using each rater's marginal frequencies for each category: K = (Po - Pe) / (1 - Pe).
If the raters are in complete agreement, K = 1. If there is no agreement among the raters other than what would be expected by chance (as given by Pe), K = 0.
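Combining the O, E, and W matrices described above, here is a minimal pure-Python sketch of the quadratic weighted kappa; it assumes integer scores in [min_score, max_score] and is an illustration, not this project's exact implementation.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores."""
    n = max_score - min_score + 1
    total = len(rater_a)
    # Observed matrix O: counts of (score_a, score_b) pairs
    O = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_score][b - min_score] += 1
    # Expected matrix E from the two marginal score histograms
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    E = [[hist_a[i] * hist_b[j] / total for j in range(n)] for i in range(n)]
    # Quadratic disagreement weights W
    W = [[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)]
    num = sum(W[i][j] * O[i][j] for i in range(n) for j in range(n))
    den = sum(W[i][j] * E[i][j] for i in range(n) for j in range(n))
    return 1.0 - num / den
```

Perfect agreement gives 1.0, and systematic disagreement can drive the value negative, which matches the K = 1 / K = 0 interpretation above.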
MODEL | BATCH SIZE | EPOCHS | LOSS | KAPPA SCORE ( Accuracy ) |
LSTM Kaggle | 64 | 2 | 40.79 | 0.7351 |
LSTM Colab | 64 | 2 | 40.79 | 0.7351 |
8.3- Proposed Enhanced AGE Model using RNN and LSTM Methodology
Pipeline: dataset → enhanced model building using deep learning (RNN and LSTM) → training the network → model evaluation → running the model.
THE PROPOSED MODEL
Model parameters
Re-implementation of the Kaggle AGE Model using the Colab platform
Results of Using the LSTM Algorithm
LSTM Experiments | Batch size | Epochs | Kappa score | Avg. time (min) |
Experiment 1 | 64 | 2 | 73.53% | 6 |
Experiment 2 | 32 | 10 | 93.46% | 10 |
Experiment 3 | 32 | 20 | 95.29% | 17 |
Experiment 4 | 32 | 25 | 95.53% | 21 |
Experiment 5 | 32 | 30 | 95.67% | 27 |
Experiment 6 | 32 | 32 | 95.96% | 24 |
Experiment 7 | 32 | 35 | 95.89% | 30.2 |
Experiment 8 | 64 | 10 | 91.32% | 11 |
Experiment 9 | 64 | 20 | 94.53% | 11.5 |
Experiment 10 | 64 | 25 | 95.04% | 12 |
Experiment 11 | 64 | 30 | 95.23% | 14.2 |
Experiment 12 | 64 | 32 | 95.39% | 16.8 |
Experiment 13 | 64 | 35 | 95.80% | 27.4 |
Steps of LSTM
Modified Algorithm
Kappa Score across the Experiments
Bidirectional LSTM Algorithm Results
Steps of Bidirectional LSTM
Loss in the Bidirectional LSTM Algorithm
Kappa Ratio for each Bidirectional LSTM Experiment
Comparing Results for the Different Models
Model | Batch size | Epochs | Kappa score | Time |
base LSTM model | 64 | 2 | 73.53% | 6 min |
modified LSTM | 32 | 32 | 95.96% | 24 min |
Bidirectional LSTM | 32 | 32 | 96.35% | 51 min |
12- References
Thank you !