1 of 53

By: Samira Said, IS Department

Improved Automated Essay Grading System Via Natural Language Processing and Deep Learning

Under the supervision of: Assoc. Prof. Essam Hamed and Assoc. Prof. Essam Elfakharany

2 of 53

ABSTRACT

I present an improved system for automatically grading essays, based on natural language processing and deep learning techniques.

I tested the code on the Kaggle platform, and the result was 73%. I then tested the code on another platform, Google Colab, and the result was again 73%. Finally, I modified the code and tuned its parameters, which raised the result to 95%. I then used another algorithm, Bi-LSTM, and reached a result of 96.35%.

3 of 53

1- Introduction

Exam questions classification

Constructed-response (CR) like Essays

Selected-Response (SR) like MCQ. Short answers


4 of 53

2- Motivation

  1. Due to the COVID-19 pandemic, educational institutions (universities and schools) have moved to online education, which includes online exams, and teachers are not accustomed to online correction.
  2. There is a need for an automated essay grading system that reduces cost and time.


5 of 53

2- Motivation

The chart shows the most important methods that have been adopted in education following the suspension of studies in various countries.


6 of 53

3- Problem Statement

  • Manual correction suffers from inaccurate results, long correction time, high effort and increased cost. Because the correction is done by humans, and each grader has a personal point of view on essay answers, one paper is evaluated by more than one person to ensure that there is no bias.

  • So, there is a need for an automated essay grading system that reduces cost and time and determines an accurate and reliable score.
  • This task requires a smart system that helps correct essay-question papers electronically.

  • The problem statement is:
    1- How to extract the answer (knowledge) from the text written by the student.
    2- How to automatically evaluate the answer knowledge and assign a grade.


7 of 53

4- Objectives

Construct an accurate model with the following characteristics:

  1. Extract the answer from the text written by the student (knowledge).
  2. Automatically evaluate the answer knowledge and assign a grade using deep learning techniques.


8 of 53

5- Background

Automated Grading for Essays (AGE)

is the use of specialized computer algorithms to assign grades to essays written in an educational setting. It builds on techniques such as:

  1. Deep Learning
  2. Natural Language Processing (NLP)
  3. Recurrent Neural Network (RNN)
  4. LSTM (Long Short-Term Memory network)


9 of 53

5- Background

1. Deep Learning:

  • An AI function that mimics the workings of the human brain in processing data for use in detecting objects, recognizing speech, translating languages, and making decisions.
  • It is defined as a cascade of layers performing nonlinear processing to learn multiple levels of data representations. Its goal is to speed up the learning period.

  • Unlike conventional machine learning and data mining techniques, deep learning is able to generate very high-level data representations from massive volumes of raw data. So, it has provided a solution to many real-world applications.


10 of 53

5- Background

  • Deep Learning applications:
    • Skin cancer detection
    • AGE systems

11 of 53

5- Background


2. Natural Language Processing (NLP)

  • NLP is a series of algorithms and techniques that mainly focus on teaching computers to understand human language. Some NLP tasks include document classification, translation, paraphrase identification, text similarity, summarization, and question answering.

  • NLP development is challenging due to the complexity and ambiguous structure of human language. Moreover, natural language is highly context specific: literal meanings change based on the form of words and on the domain.

  • Most NLP models follow similar preprocessing steps: (1) the input text is broken down into words through tokenization, and then (2) these words are represented as vectors or n-grams. Representing words in a low-dimensional space is important for capturing the similarities and differences between words. The challenge arises when deciding how many words each n-gram should contain. This choice is context specific and requires prior domain knowledge.
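
The two preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tokenizer used in this work: the regular expression, the function names `tokenize` and `ngrams`, and the sample sentence are all hypothetical.

```python
import re

def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Step 2: return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, used only for illustration.
tokens = tokenize("Computers help people learn.")
bigrams = ngrams(tokens, 2)
print(tokens)   # word tokens
print(bigrams)  # pairs of neighbouring words
```

Changing `n` changes how much context each n-gram carries, which is the length decision the slide describes as context specific.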

12 of 53

5- Background


2. NLP Approaches related to AGE:

  • Some of the highly impactful approaches in solving the most well-known NLP tasks are:

-- Paraphrase Identification:

-- Paraphrase identification is the process of analyzing two sentences and projecting how similar they are based on their underlying hidden semantics.

-- It is a key feature that is beneficial for several NLP jobs such as plagiarism detection, answers to questions, context detection, summarization, and domain identification.

-- Question Answering:

-- An automatic question-and-answering system should be able to interpret a natural language question and use reasoning to return an appropriate reply. Modern knowledge bases, such as the famous FREEBASE dataset, allow this field to flourish and leap out of the times when features and rule sets were hand-crafted to specific domains.

13 of 53

5- Background

3. Recurrent Neural Network (RNN):

  • It is another widely used and popular algorithm in deep learning, especially in NLP and speech processing.

  • Unlike traditional neural networks, RNN utilizes the sequential information in the network. This property is essential in many applications where the embedded structure in the data sequence conveys useful knowledge.
  • For example, to understand a word in a sentence, it is necessary to know the context. Therefore, an RNN can be seen as short-term memory units that include the input layer x, hidden (state) layers, and output layer y.

  • Three deep RNN approaches have been introduced: deep “Input-to-Hidden,” “Hidden-to-Output,” and “Hidden-to-Hidden.” Based on these three solutions, a deep RNN is proposed that not only takes advantage of a deeper RNN but also reduces the difficulty of learning in deep networks.
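
To make the role of the hidden state concrete, the following is a minimal sketch of a single-layer RNN forward pass in plain Python. It assumes the common update rule h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h); the weight values, the names `rnn_step` and `rnn_forward`, and the two-step toy sequence are hypothetical and are not taken from the model in this work.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for j in range(len(b_h)):
        s = b_h[j]
        s += sum(W_xh[j][k] * x[k] for k in range(len(x)))
        s += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(s))
    return h

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run the sequence through the RNN and return the final hidden state."""
    h = [0.0] * len(b_h)  # initial hidden state
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h

# Toy (hypothetical) weights: 2 inputs, 2 hidden units.
W_xh = [[0.5, -0.2], [0.1, 0.3]]
W_hh = [[0.0, 0.4], [-0.3, 0.0]]
b_h = [0.0, 0.0]

seq = [[1.0, 0.0], [0.0, 1.0]]
final_state = rnn_forward(seq, W_xh, W_hh, b_h)
reversed_state = rnn_forward(seq[::-1], W_xh, W_hh, b_h)
print(final_state, reversed_state)
```

Feeding the same inputs in reverse order gives a different final state, which is the sense in which an RNN uses sequential information.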


14 of 53

5- Background

  • The RNN approach automatically learns the relation between an essay and its grade.

  • Since the system is based on RNNs, it can use non-linear neural layers to identify and learn complex patterns in the data, and to encode all the information required for essay evaluation and scoring.


15 of 53

5- Background

4. LSTM (Long Short-Term Memory network):

  • LSTM forms a memory about a sequence of inputs over time. It is an artificial recurrent neural network (RNN) architecture used in the field of deep learning.

  • It is applicable to tasks such as unsegmented, connected handwriting recognition and speech recognition.

Example: an LSTM can be trained to generate new characters, words, and bodies of text.
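
How an LSTM "forms a memory" can be illustrated with one cell step written in plain Python. This is a sketch of the standard LSTM equations (forget, input and output gates plus a candidate value), assuming each gate reads the concatenation of the previous hidden state and the current input. The parameter names (`Wf`, `Wi`, `Wo`, `Wg` and their biases) and the toy values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; every gate sees the concatenation [h_prev, x]."""
    z = h_prev + x  # list concatenation
    f = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]    # forget gate
    i = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]    # input gate
    o = [sigmoid(a) for a in affine(p["Wo"], z, p["bo"])]    # output gate
    g = [math.tanh(a) for a in affine(p["Wg"], z, p["bg"])]  # candidate memory
    # New cell state: keep part of the old memory, add part of the candidate.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # New hidden state: a gated view of the cell state.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Toy (hypothetical) parameters: hidden size 1, input size 1.
params = {
    "Wf": [[0.1, 0.2]], "bf": [0.0],
    "Wi": [[0.3, -0.1]], "bi": [0.0],
    "Wo": [[0.2, 0.4]], "bo": [0.0],
    "Wg": [[-0.2, 0.5]], "bg": [0.0],
}

h, c = [0.0], [0.0]
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The cell state `c` is the memory carried across the sequence, while the gates decide how much of it is kept, updated and exposed at each step.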


16 of 53

6- Related Work

PAPER INFORMATION: 1- Automated Grading of Essays: A Review. Authors: Jyoti G. Borade and Laxman D. Netak. Publisher: Springer, 2021.

OBJECTIVES: Methods for automated grading of essays and evaluating explanatory answers (applications of approaches such as Natural Language Processing and Deep Learning for AGE).

METHODOLOGY:
  1- Textual Similarity
  2- Latent Semantic-Based Vector Space Model
  3- Neural Network Based Approaches
  4- Naive Bayes Classifiers

DATASET: Responses for each question from different students.

RESULTS: Accuracy obtained for simple LSTM is 83, Deep LSTM is 82, Bi-directional LSTM is 89.

CONTRIBUTION: This work presents a review of machine learning techniques used to assess essay-type answers.

MISSING: Limited dataset

17 of 53

6- Related Work

PAPER INFORMATION: 2- Automated Content Grading Using Machine Learning. Authors: Rahul K Chauhan, Ravinder Saharan, Siddhartha Singh, Priti Sharma. Publisher: ResearchGate, 2020.

OBJECTIVES: An algorithmic approach in machine learning can be used to automatically examine and grade theoretical content in exam answer papers.

METHODOLOGY: Random Forest Algorithm

DATASET: The standard exam answer papers were taken from the mid-term (minor) examinations.

RESULTS: Weighted Kappa: Set 1: 0.46, Set 2: 0.55, Set 3: 0.61

CONTRIBUTION: This work represented how content grading in a big-data based technical domain can be solved using this approach.

MISSING: Low accuracy

18 of 53

6- Related Work

PAPER INFORMATION: 3- An Analysis of Automated Answer Evaluation Systems based on Machine Learning. Authors: Birpal Singh J. Kapoor, Shubham M. Nagpure. Publisher: IEEE, 2020.

OBJECTIVES: Summarizes the existing mechanisms and analyses the performance of the systems used for automatic grading of long and descriptive answers.

METHODOLOGY: A methodology for detaching a course of action of expositions into subsets that address similar graders, which uses an explanation reasoning and bunching.

DATASET: Kaggle website

RESULTS: Many analysts and scholars are still working hard and have created different frameworks that gave encouraging results.

CONTRIBUTION: Features developed with the corpus based strategies, or NLP systems as a significant section of AI structure.

MISSING: Percentage of accuracy is not specified

19 of 53

6- Related Work

PAPER INFORMATION: 4- Automated language essay scoring systems: a literature review. Authors: Mohamed Abdellatif Hussein, Hesham Hassan and Mohammad Nassef. Publisher: PeerJ, 2019.

OBJECTIVES: Review the literature for the AES systems used for grading the essay questions.

METHODOLOGY:
  1- Project Essay Grader (PEG)
  2- Intelligent Essay Assessor (IEA)
  3- E-rater
  4- Criterion
  5- IntelliMetric
  6- MY Access
  7- Bayesian Essay Test Scoring System (BETSY)
  8- Automatic text scoring using neural networks
  9- A neural network approach to automated essay scoring
  10- Neural network for automatic essay scoring

DATASET: The Kaggle ASAP contest dataset

RESULTS: PEG 0.87, IEA 0.90, E-rater 0.91, IntelliMetric 0.83, BETSY 0.80

CONTRIBUTION: The performance of these systems is evaluated based on the comparison of the scores assigned to a set of essays scored by expert humans.

MISSING: The results are not specific to each method

20 of 53

6- Related Work

PAPER INFORMATION: 5- Automated Essay Grading using Machine Learning Algorithm. Authors: V. V. Ramalingam, A. Pandian, Prateek Chetry and Himanshu Nigam. Publisher: IOP, 2018.

OBJECTIVES: Develop an automated essay assessment system by use of machine learning techniques.

METHODOLOGY: Machine learning technique, e-Rater technique. The essay is provided as input and then compared with the essays of each set; the essays are then compared based on their polarities, the words used and the content of the essay. The machine finally generates a score for the essay by combining all the results into a final score.

DATASET: The dataset used has been extracted from kaggle.com; it consists of the data from the competition conducted by The Hewlett Foundation.

RESULTS: The machine is capable of assessing an essay like a human rater.

CONTRIBUTION: This approach tries to model language features such as language fluency, grammatical correctness and domain information content of the essays, and attempts to fit the best polynomial in the feature space using linear regression with polynomial basis functions.

MISSING: Accuracy percentage is not specified

21 of 53

6- Related Work

PAPER INFORMATION: 6- Intelligent Auto-grading System. Authors: Zining Wang, Jianli Liu. Publisher: Proceedings, 2018.

OBJECTIVES: Present a novel automatic essay scoring system based on Natural Language Processing and Deep Learning technologies.

METHODOLOGY: Natural Language Processing, Deep Learning, LSTM

DATASET: The dataset for training and testing is the public essay set available in the Automated Student Assessment Prize on Kaggle.

RESULTS: Accuracy 73%

CONTRIBUTION: The NLP, Neural Network and intelligent auto-grading system and then attempt to build an innovative open-minded response grading machine.

MISSING: Accuracy is low

22 of 53

7- Related Work Conclusion & Research Gap

1- Commonly used techniques

1- Machine learning
  1. Support vector machine (SVM)
  2. Random Forest
  3. Latent semantic analysis (LSA)

2- Deep Learning
  1. N-layer neural network
  2. It is used to implement a scoring function.

3- NLP for AGE
  1. It is used to handle linguistic issues such as multiple meanings of words in different contexts.
  2. It helps to extract linguistic features.

23 of 53

2- Different Models for AGE

1- Machine Learning {linear regression, clustering, SVM, and Bayesian inference}

2- Text Similarity:

It takes text documents as inputs and finds similarity between them.

Lexical similarity: determines the similarity by matching contents, word-by-word.

Semantic similarity: is based on the meaning of the contents.

3- Neural Network based Approaches

4- Naïve Bayes Classifiers
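
As a minimal sketch of the lexical (word-by-word) similarity described on this slide, the example below uses the Jaccard index over word sets. The Jaccard index is one common choice and is an assumption here; the slide does not name a specific measure. Semantic similarity would additionally need a model of word meaning (for example word embeddings), which this sketch does not attempt.

```python
def jaccard_similarity(text_a, text_b):
    """Lexical similarity: shared words divided by all distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical sentences, used only for illustration.
score = jaccard_similarity("the cat sat on the mat", "the cat lay on the mat")
synonyms = jaccard_similarity("car", "automobile")
print(score, synonyms)
```

The second call shows the limitation of purely lexical matching: two words with the same meaning but different spelling share no tokens, so their lexical similarity is zero.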

24 of 53

3- AGE Evaluation Metrics:

  • Root Mean Squared Error (RMSE):

RMSE is an extension of the mean squared error. Importantly, the square root of the error is calculated, which means that the units of the RMSE are the same as the original units of the target value that is being predicted.

  • Quadratic Weighted Kappa (QWK):
    • A weighted kappa is an evaluation metric used to measure the agreement between the predicted and the actual evaluation.
    • It yields a score of 1.0 when the predicted and actual evaluations are the same.
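
The RMSE definition above can be written directly as code. This is a minimal sketch; the function name `rmse` and the sample scores are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)  # back in the units of the score itself

# Hypothetical human scores and model predictions.
human = [4, 3, 5]
predicted = [4, 2, 3]
error = rmse(human, predicted)
print(error)
```

Here the squared errors are 0, 1 and 4, so the mean squared error is 5/3 and the RMSE is its square root, expressed in score points.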

25 of 53

4- Research Gap:

Most algorithms that use Natural Language Processing and Deep Learning achieve low accuracy.

  • There is a need to bridge this gap (improve the accuracy).

26 of 53

8-Research Road Map

8.1- Re-implementation of Kaggle AGE Model using Colab platform

8.2- AGE Model Evaluation

8.3- The Enhanced AGE Model using RNN and LSTM Methodology

8.4- The Enhanced AGE Model Evaluation

8.5- Results of comparing the enhanced model to the original model


27 of 53

8.1 Re-implementation of Kaggle AGE Model using Kaggle platform

Pipeline (diagram): dataset (CSV) → file reader libraries in Python → AGE model (the existing AGE from Kaggle) → question score

28 of 53

8.1 Re-implementation of Kaggle AGE Model using Colab platform

  1. The framework starts by collecting the data into one dataset in a CSV file. Then the Colab editor for Python is used to prepare the experiment.
  2. The Python library Keras is used to implement the machine learning model that predicts the essay scores.
  3. Finally, the kappa metric is used to assess the machine learning model and present the results.

Pipeline (diagram): dataset (CSV) → file reader libraries in Python → model building → question score

29 of 53

Model building steps

  1. Stop words, which carry little meaning, are removed.
  2. Then the essay is converted into a group of sentences and each sentence is converted into a list of words.
  3. Then the embedding (vector representation) of each word in each sentence is obtained, and an average vector is calculated to represent the essay.
  4. Each essay's vector is added to a list, which is fed into the model for training.
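
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the code used in this work: the stop-word list, the 3-dimensional `EMBEDDINGS` table and the sample essays are hypothetical stand-ins for a real stop-word list and trained word embeddings.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and", "to", "in"}

# Hypothetical 3-dimensional word embeddings (real ones are learned from data).
EMBEDDINGS = {
    "computers": [1.0, 0.0, 0.0],
    "help": [0.0, 1.0, 0.0],
    "people": [0.0, 0.0, 1.0],
    "learn": [1.0, 1.0, 0.0],
}
DIM = 3

def essay_to_sentences(essay):
    """Steps 1-2: split into sentences, then into word lists without stop words."""
    result = []
    for sentence in re.split(r"[.!?]+", essay):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower())
                 if w not in STOP_WORDS]
        if words:
            result.append(words)
    return result

def essay_vector(essay):
    """Step 3: average the embeddings of all known words in the essay."""
    words = [w for sent in essay_to_sentences(essay) for w in sent
             if w in EMBEDDINGS]
    if not words:
        return [0.0] * DIM
    return [sum(EMBEDDINGS[w][d] for w in words) / len(words) for d in range(DIM)]

# Step 4: one vector per essay, collected into a list for training.
essays = ["Computers help people. People learn.", "The computers are on."]
features = [essay_vector(e) for e in essays]
print(features)
```

Averaging gives every essay a fixed-length vector regardless of its length, which is what allows the list of vectors to be passed to a single model.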


30 of 53

Model building steps


31 of 53

Dataset:

  • Kaggle website: https://www.kaggle.com/c/asap-aes

  • Data Description:
    • The data set contains ~13,000 essays, divided into 8 different sets based on context.
    • The number of essays for each of the 8 questions is given in the following table and chart.


32 of 53

Dataset:

  • Distribution sets of Essays:

| QUESTION NUMBER | COUNT | PERCENTAGE |
|---|---|---|
| 1 | 1783 | 13.7% |
| 2 | 1800 | 13.9% |
| 3 | 1726 | 13.3% |
| 4 | 1770 | 13.6% |
| 5 | 1805 | 13.9% |
| 6 | 1800 | 13.9% |
| 7 | 1569 | 12.1% |
| 8 | 723 | 5.57% |
| Total | 12976 | 100% |

33 of 53

Dataset:

  • Q1: Write a letter to your local newspaper in which you state your opinion on the effects computers have on people. Persuade the readers to agree with you.

TYPE OF ESSAY: Persuasive / Narrative / Expository

TRAINING SET SIZE: 1,783 essays

34 of 53

Dataset:

SCORE: RUBRIC GUIDELINES

Score 1: An undeveloped response that may take a position but offers no more than very minimal support. Typical elements:

  • Contains few or vague details.
  • Is awkward and fragmented.
  • May be difficult to read and understand.
  • May show no awareness of audience.

Score 2: An under-developed response that may or may not take a position. Typical elements:

  • Contains only general reasons with unelaborated and/or list-like details.
  • Shows little or no evidence of organization.
  • May be awkward and confused or simplistic.
  • May show little awareness of audience.

Score 3: A minimally-developed response that may take a position, but with inadequate support and details. Typical elements:

  • Has reasons with minimal elaboration and more general than specific details.
  • Shows some organization.
  • May be awkward in parts with few transitions.
  • Shows some awareness of audience.

Score 4: A somewhat-developed response that takes a position and provides adequate support. Typical elements:

  • Has adequately elaborated reasons with a mix of general and specific details.
  • Shows satisfactory organization.
  • May be somewhat fluent with some transitional language.
  • Shows adequate awareness of audience.

Score 5: A developed response that takes a clear position and provides reasonably persuasive support. Typical elements:

  • Has moderately well elaborated reasons with mostly specific details.
  • Exhibits generally strong organization.
  • May be moderately fluent with transitional language throughout.
  • May show a consistent awareness of audience.

Score 6: A well-developed response that takes a clear and thoughtful position and provides persuasive support. Typical elements:

  • Has fully elaborated reasons with specific details.
  • Exhibits strong organization.
  • Is fluent and uses sophisticated transitional language.
  • May show a heightened awareness of audience.

35 of 53

Dataset:

  • Data Split:
    • 5-fold cross-validation is used to estimate model accuracy: in each fold, 80% of the data is used for training and the remaining 20% for testing.
  • Types of Essays:
    • 50% (persuasive / narrative / expository)
    • 50% (source-dependent responses)
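
The 5-fold data split described on this slide can be sketched with the standard library alone. The function name `k_fold_indices`, the fixed seed and the 10-sample toy size are hypothetical; in practice a library routine would typically do the same job.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Toy example with 10 samples: each fold trains on 8 (80%) and tests on 2 (20%).
splits = list(k_fold_indices(10, k=5))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_no, len(train_idx), len(test_idx))
```

Every sample appears in the test part of exactly one fold, so the averaged score reflects performance on the whole data set.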


36 of 53

8.2- AGE Model Evaluation

  • QWK (Quadratic Weighted Kappa):
  • Quadratic Weighted Kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The definition of K is

$$K = 1 - \frac{\sum_{i,j} W(i,j)\, O(i,j)}{\sum_{i,j} W(i,j)\, E(i,j)}$$

where O is a C-by-C matrix built from the N essays: O(i, j) gives the count of essays that obtained a score i from the first evaluator and a score j from the second evaluator. The E matrix gives the expected ratings without considering any correlation between the two evaluations given by the two evaluators, scaled so that E and O have the same sum. W is a matrix of the same size as the O and E matrices. It is calculated as follows:

$$W(i,j) = \frac{(i-j)^2}{(C-1)^2}$$


37 of 53

8.2- AGE model evaluation

$$K = \frac{P_o - P_e}{1 - P_e}$$

$P_o$ is the relative observed agreement among raters.

$P_e$ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement, then K = 1. If there is no agreement among the raters other than what would be expected by chance (as given by $P_e$), K = 0.
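
The following is a minimal sketch of Quadratic Weighted Kappa built from the O, E and W matrices described in this section, assuming the standard quadratic weights (i - j)^2 / (C - 1)^2. The function name, its arguments and the sample ratings are hypothetical, and this is not the scoring code used for the reported results.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two integer score lists; 1.0 means perfect agreement."""
    n_cat = max_rating - min_rating + 1
    if n_cat < 2 or len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need at least 2 categories and equal, non-empty inputs")
    n = len(rater_a)

    # O: observed counts of (score by rater A, score by rater B).
    O = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n_cat)) for j in range(n_cat)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            w = (i - j) ** 2 / (n_cat - 1) ** 2   # quadratic weight W(i, j)
            e = hist_a[i] * hist_b[j] / n          # expected count E(i, j)
            numerator += w * O[i][j]
            denominator += w * e
    if denominator == 0.0:
        return 1.0  # both raters used one single category for every essay
    return 1.0 - numerator / denominator

# Hypothetical ratings on a 1-3 scale.
same = quadratic_weighted_kappa([1, 2, 3, 3], [1, 2, 3, 3], 1, 3)
opposite = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
print(same, opposite)
```

Identical ratings put all counts on the diagonal of O, where the weight is zero, so the score is 1. Large disagreements are penalised quadratically and can push the score below zero.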


38 of 53

8.2- AGE model evaluation

| MODEL | BATCH SIZE | EPOCHS | LOSS | KAPPA SCORE (Accuracy) |
|---|---|---|---|---|
| LSTM Kaggle | 64 | 2 | 40.79 | 0.7351 |
| LSTM Colab | 64 | 2 | 40.79 | 0.7351 |

  • The re-implemented model reaches an accuracy of about 73%, the same as reported in related work #6.

39 of 53

8.3- Proposed Enhanced AGE Model using RNN and LSTM Methodology

Diagram blocks: dataset; enhanced model building using deep learning, RNN and LSTM; training network; running the model; model evaluation

40 of 53

THE PROPOSED MODEL

41 of 53

Model parameters

  • Number of hidden layers (300).
  • Number of neurons in each hidden layer (300).
  • Number of neurons in input layers (50).
  • Number of neurons in output layer (1-12).
  • Dataset matrix (mathematical matrix of the dataset):
    • Size = 781 × 13,000 × 50 = 507,650,000
  • Essay matrix:
    • Size = 781 × 50 = 39,050

42 of 53

Re-implementation of Kaggle AGE Model using Colab platform

  1. I logged in to Kaggle, downloaded the code, and uploaded it to the Colab platform. The dataset file was then downloaded from Kaggle and uploaded to Google Drive. Google Drive and Colab were connected using a link, and the code was run; the result was 73%. In the third stage, the code parameters were edited and another algorithm was used with the code. It was run again, with a result of 96%.

43 of 53

Results Of Using the LSTM Algorithm

LSTM

| Experiment | Batch size | Epochs | Kappa score | Avg. time (min) |
|---|---|---|---|---|
| Experiment 1 | 64 | 2 | 73.53% | 6 |
| Experiment 2 | 32 | 10 | 93.46% | 10 |
| Experiment 3 | 32 | 20 | 95.29% | 17 |
| Experiment 4 | 32 | 25 | 95.53% | 21 |
| Experiment 5 | 32 | 30 | 95.67% | 27 |
| Experiment 6 | 32 | 32 | 95.96% | 24 |
| Experiment 7 | 32 | 35 | 95.89% | 30.2 |
| Experiment 8 | 64 | 10 | 91.32% | 11 |
| Experiment 9 | 64 | 20 | 94.53% | 11.5 |
| Experiment 10 | 64 | 25 | 95.04% | 12 |
| Experiment 11 | 64 | 30 | 95.23% | 14.2 |
| Experiment 12 | 64 | 32 | 95.39% | 16.8 |
| Experiment 13 | 64 | 35 | 95.80% | 27.4 |

44 of 53

Steps of LSTM

45 of 53

Modified Algorithm

46 of 53

Kappa Score for Each Experiment

47 of 53

Bidirectional LSTM Algorithm Results

48 of 53

Steps Of Bidirectional-LSTM

49 of 53

Loss in Bidirectional LSTM algorithm

50 of 53

Kappa Ratio with each Bidirectional LSTM experiment

51 of 53

Comparing Results for Different Models

| Model | Batch size | Epochs | Kappa score | Time |
|---|---|---|---|---|
| Base LSTM model | 64 | 2 | 73.53% | 6 min |
| Modified LSTM | 32 | 32 | 95.96% | 24 min |
| Bidirectional LSTM | 32 | 32 | 96.35% | 51 min |

52 of 53

12- References

  1. Jyoti G. Borade and Laxman D. Netak, “Automated Grading of Essays: A Review”, 2021, Springer.
  2. Rahul K Chauhan, Ravinder Saharan, Siddhartha Singh and Priti Sharma, “Automated Content Grading Using Machine Learning”, 2020, ResearchGate.
  3. Birpal Singh J. Kapoor, Shubham M. Nagpure, Sushil S. Kolhatkar, Prajwal G. Chanore and Mohan M. Vishwakarma, “An Analysis of Automated Answer Evaluation Systems based on Machine Learning”, 2020, IEEE.
  4. Mohamed Abdellatif Hussein, Hesham Hassan and Mohammad Nassef, “Automated language essay scoring systems: a literature review”, 2019, PeerJ.
  5. V. V. Ramalingam, A. Pandian, Prateek Chetry and Himanshu Nigam, “Automated Essay Grading using Machine Learning Algorithm”, 2018, IOP.
  6. Zining Wang, Jianli Liu and Ruihai Dong, “Intelligent Auto-grading System”, 2018, Proceedings.


53 of 53

Thank you !