
Improving Network Intrusion Detection System using Imbalance Reduction Techniques

Created By

Mounil Shah (181070058)

Shivani Pawar (181071046)

Mona Gandhi (181071021)

Ketaki Urankar (181071069)

Final Year B.Tech.

Computer Engineering Department

Under the guidance of

Prof. Vaibhav D. Dhore


Agenda

  • Introduction
  • Motivation
  • Problem Statement
  • Project Idea
  • Objectives
  • Timeline of Project Execution
  • Literature Survey
  • Literature Gap
  • System Architecture
  • Methodologies
  • Results
  • Conclusion
  • Future Scope


Overview


Introduction

  • Massive expansion in the usage of the Internet
  • Attacks breach the security principles of the CIA triad: Confidentiality, Integrity, Availability
  • Hence the need for techniques such as the Network Intrusion Detection System (NIDS)

How can we improve NIDS?


Motivation

  • Various datasets were created: NSL-KDD | CSE-CIC-IDS-2017 | CSE-CIC-IDS-2018
  • NSL-KDD
    • Traditional
    • Imbalanced
    • Multiclass (5 including benign)
    • Features (43)
  • CSE-CIC-IDS-2018
    • Latest
    • Imbalanced
    • Multiclass (14 including benign)
    • Features (83)
  • Problems:
    • Algorithms find it difficult to efficiently distinguish normal from malicious traffic in an imbalanced environment
    • Malicious traffic can hide in normal traffic
    • Hard for an NIDS to ensure both accuracy and timeliness of detection


Problem Statement

  • Use imbalance reduction techniques (IRTs) to obtain better results on multiclass classification of the CSE-CIC-IDS-2018 and NSL-KDD datasets.
  • Compare various imbalance reduction methods:
    • Undersampling (US) techniques
    • Oversampling (OS) techniques
    • Ensembles of different undersampling techniques
    • Ensembles of different oversampling techniques
    • The DSSTE algorithm (a combination of US and OS techniques)


Project Idea

  • Aim:
    • Reduce the imbalance problem in the CSE-CIC-IDS-2018 and NSL-KDD datasets using various ensembles of imbalance reduction techniques
    • Thereby improve the model's ability to detect intrusions
  • Design a pipeline that:
    • Applies various ensembles of imbalance reduction methods to the dataset
    • Trains Random Forest, Decision Tree, and XGBoost models on the resampled data
  • Compare and analyze the results obtained to deduce the best techniques


Objectives

Main:

  • Obtain results for the pre-existing imbalance reduction techniques and compare them across various machine learning models
  • Improve the ability to correctly classify the various classes of the CSE-CIC-IDS-2018 and NSL-KDD datasets

Additionally:

  • Propose an efficient method for reducing imbalance in a dataset


Timeline of Project Execution


Literature


Literature Survey

  • CSE-CIC-IDS-2018 Dataset
  • NSL-KDD Dataset
  • Imbalance Reduction Methods


CSE-CIC-IDS-2018 Dataset

  • Latest and comprehensive
  • Updated version of the 2017 dataset
  • 83 features
  • 16,232,943 records
  • 13 attack types + 1 benign


CSE-CIC-IDS-2018 Dataset

  • Detecting web attacks using random undersampling and ensemble learners [1]: Eight random undersampling ratios are studied on the CSE-CIC-IDS-2018 dataset using Decision Tree, Random Forest, CatBoost, LightGBM, XGBoost, and Logistic Regression models.

  • Building Auto-Encoder Intrusion Detection System Based on Random Forest Feature Selection [2]: This paper proposes an effective deep learning method, AE-IDS (Auto-Encoder Intrusion Detection System), based on the random forest algorithm. The method constructs the training set using feature selection and feature grouping.

  • A novel time efficient learning based approach for smart intrusion detection system [3]: A hybrid feature selection approach that aims to reduce prediction latency, without affecting attack prediction performance, by lowering the model's complexity.

[1] Richard Zuech, John Hancock, and Taghi M. Khoshgoftaar. "Detecting web attacks using random undersampling and ensemble learners". In: Journal of Big Data 8.1 (Dec. 2021).

[2] Xu Kui Li et al. "Building Auto-Encoder Intrusion Detection System based on random forest feature selection". In: Computers and Security 95 (Aug. 2020).

[3] Sugandh Seth, Gurvinder Singh, and Kuljit Kaur Chahal. "A novel time efficient learning-based approach for smart intrusion detection system". In: Journal of Big Data 8.1 (Dec. 2021).


NSL-KDD Dataset

  • Made in 2009
  • Updated version of the KDD Cup 99 dataset
  • 42 features
  • 25,192 records
  • 4 attack types + 1 benign

Category | Number of original records
Normal   | 13,449
Probe    | 2,289
DoS      | 9,234
U2R      | 11
R2L      | 209
Total    | 25,192


NSL-KDD Dataset

  • A Survey of Intrusion Detection Models based on NSL-KDD Data Set [4]: A comprehensive review of research on machine-learning-based IDS using the NSL-KDD data set.

  • A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms [5]: The NSL-KDD data set is analysed and used to study the effectiveness of various classification algorithms in detecting anomalies in network traffic patterns.

  • A Hybrid Data Mining Approach for Intrusion Detection on Imbalanced NSL-KDD Dataset [6]: A hybrid approach combining the synthetic minority oversampling technique (SMOTE) with cluster center and nearest neighbor (CANN) is proposed for imbalance reduction.

[4] "A Survey of Intrusion Detection Models based on NSL-KDD Data Set". In: ITT 2018, Information Technology Trends (Higher Colleges of Technology, United Arab Emirates / IEEE), 28–29 November 2018.

[5] L. Dhanabal and S. P. Shantharajah. "A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms".

[6] Mohammad Reza Parsaei, Samaneh Miri Rostami, and Reza Javidan. "A Hybrid Data Mining Approach for Intrusion Detection on Imbalanced NSL-KDD Dataset". In: International Journal of Advanced Computer Science and Applications (IJACSA) 7.6 (2016).


Imbalance Reduction Methods

  • Class Imbalance Problem in the Network Intrusion Detection Systems [7]: The effect of class imbalance on the benchmark NSL-KDD dataset is evaluated using four popular classification techniques, and the results are analyzed.

  • Combating Imbalance in Network Intrusion Datasets [8]: The paper focuses on rule learning with RIPPER on highly imbalanced intrusion datasets, with the objective of improving the true positive rate (intrusions) without significantly increasing the false positives.

  • Intrusion Detection of Imbalanced Network Traffic Based on Machine Learning and Deep Learning [9]: Proposes a novel Difficult Set Sampling Technique (DSSTE) algorithm to tackle the class imbalance problem.

[7] Sireesha Rodda and Uma Shankar Rao Erothi. "Class imbalance problem in the Network Intrusion Detection Systems". In: International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT 2016), Nov. 2016.

[8] David A. Cieslak, Nitesh V. Chawla, and Aaron Striegel. "Combating imbalance in network intrusion datasets". In: 2006 IEEE International Conference on Granular Computing (2006).

[9] Lan Liu et al. "Intrusion Detection of Imbalanced Network Traffic Based on Machine Learning and Deep Learning". In: IEEE Access 9 (2021), pp. 7550–7563. ISSN: 2169-3536.


Literature Gap

  • Most research so far addresses binary classification (benign vs. attack) in NIDS, with comparatively little on multiclass classification.
  • Most imbalance reduction research targets NSL-KDD, with far less on the 2018 dataset.
  • The 2018 dataset is enormous; very few papers consider all the attack types, and most consider only subgroups of attacks, such as just the web attacks or just the DoS attacks.


Our Approach

  • Explore various oversampling and undersampling techniques on the 2018 dataset (along with NSL-KDD)
  • Explore multiclass classification on the 2018 dataset (and on NSL-KDD)
  • Consider all 13 attack types + 1 benign type of the 2018 dataset (and correspondingly for NSL-KDD)
  • Explore reducing class imbalance by applying the various methods at our disposal and forming their ensemble; this can help us get the best out of all the techniques used


Methodologies


Methodologies

Data Pre-Processing

  • Categorical encoding
  • Delete duplicate records
  • Remove null values
  • Delete unimportant features and perform feature transformation
  • Standardize the data
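
A minimal sketch of this pre-processing in pandas/scikit-learn; the "Label" column name and the numeric-only feature filter are illustrative assumptions, not the project's exact steps:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    def preprocess(df: pd.DataFrame, label_col: str = "Label"):
        # Delete duplicate records
        df = df.drop_duplicates()
        # Remove null (and infinite) values
        df = df.replace([np.inf, -np.inf], np.nan).dropna()
        # Categorical encoding of the target labels
        y = LabelEncoder().fit_transform(df[label_col])
        # Drop the label; keep numeric features only (a stand-in for the
        # unimportant-feature removal and feature transformation steps)
        X = df.drop(columns=[label_col]).select_dtypes(include=[np.number])
        # Standardize features to zero mean and unit variance
        X = StandardScaler().fit_transform(X)
        return X, y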


Oversampling Methods

  • Random OverSampling: Randomly duplicates examples from the minority class and adds them to the training dataset.

  • SMOTE (Synthetic Minority OverSampling Technique): Generates virtual training records by linear interpolation for the minority class. These synthetic records are created by randomly selecting one or more of the k-nearest neighbors of each minority-class example.


Oversampling Methods

  • ADASYN (Adaptive Synthetic): Similar to SMOTE, and derived from it, with one important difference: it biases the sampling (the likelihood that a particular point is chosen for duplication) towards points that do not lie in homogeneous neighborhoods, i.e. points that are harder to learn.

  • Borderline SMOTE: Classifies a minority observation as a noise point if all of its neighbors belong to the majority class, and ignores it while creating synthetic data. Points with both majority and minority neighbors are classified as border points, and resampling is done entirely from these.
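
All four oversamplers above are available in imbalanced-learn. A minimal sketch, assuming X_train and y_train come from the pre-processing step earlier (the parameters shown are illustrative defaults):

    from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN

    oversamplers = {
        "ROS": RandomOverSampler(random_state=42),            # duplicate minority rows
        "SMOTE": SMOTE(k_neighbors=5, random_state=42),       # interpolate between neighbors
        "BorderlineSMOTE": BorderlineSMOTE(random_state=42),  # resample only border points
        "ADASYN": ADASYN(random_state=42),                    # bias towards hard neighborhoods
    }
    for name, sampler in oversamplers.items():
        X_os, y_os = sampler.fit_resample(X_train, y_train)
        print(name, "->", len(y_os), "training samples")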


Undersampling Methods

  • Random Undersampling: Randomly selects examples from the majority class and deletes them from the training dataset.

  • Tomek Links: Tomek links are pairs of ambiguous points on the class boundary; the majority-class members of such pairs are identified and removed, as they are difficult to classify.

  • Cluster Centroid: Replaces clusters of samples with the cluster centroids of a K-means algorithm, where the number of clusters is set by the level of undersampling.


Undersampling Methods

  • Edited Nearest Neighbour (ENN): Finds the k-nearest neighbors of each observation and checks whether the majority class among those neighbors matches the observation's own class; observations whose neighborhoods disagree are removed.

  • Instance Hardness Threshold: Instance hardness is the probability of an observation being misclassified, i.e. 1 minus the probability the classifier assigns to its true class; observations with high hardness are removed.
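
These five undersamplers are likewise available in imbalanced-learn; a minimal sketch under the same assumptions as the oversampling example:

    from imblearn.under_sampling import (
        RandomUnderSampler, TomekLinks, ClusterCentroids,
        EditedNearestNeighbours, InstanceHardnessThreshold,
    )

    undersamplers = {
        "RUS": RandomUnderSampler(random_state=42),             # randomly drop majority rows
        "TomekLinks": TomekLinks(),                             # remove majority ends of boundary pairs
        "ClusterCentroids": ClusterCentroids(random_state=42),  # replace clusters by K-means centroids
        "ENN": EditedNearestNeighbours(n_neighbors=3),          # drop rows whose neighbors disagree
        "IHT": InstanceHardnessThreshold(random_state=42),      # drop hard-to-classify rows
    }
    for name, sampler in undersamplers.items():
        X_us, y_us = sampler.fit_resample(X_train, y_train)
        print(name, "->", len(y_us), "training samples")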


DSSTE Algorithm

Step 1: Divide the dataset into an easy set and a difficult set

Step 2: Undersample the majority class in the difficult set

Step 3: Oversample the minority-class samples in the difficult set

Finally: Concatenate all the resultant sets of data


Difficult Set Sampling Technique (DSSTE)

Framework of DSSTE [9]
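
The following is only a loose sketch of the DSSTE idea, not the exact algorithm from [9]: a k-NN disagreement test stands in for the paper's easy/difficult split, K-means cluster centroids compress the difficult-set majority, and SMOTE stands in for the paper's zoom-based minority augmentation. Classes with very few difficult samples may need a smaller SMOTE neighborhood.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from imblearn.under_sampling import ClusterCentroids
    from imblearn.over_sampling import SMOTE

    def dsste_like(X, y, k=5, random_state=42):
        X, y = np.asarray(X), np.asarray(y)
        # Step 1: samples whose k-NN vote disagrees with their own label
        # form the difficult set; the rest form the easy set
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        hard = knn.predict(X) != y
        X_easy, y_easy = X[~hard], y[~hard]
        X_hard, y_hard = X[hard], y[hard]
        # Step 2: compress the difficult-set majority classes to K-means centroids
        X_u, y_u = ClusterCentroids(random_state=random_state).fit_resample(X_hard, y_hard)
        # Step 3: synthesize extra minority samples from the difficult set
        # (imblearn returns the original rows first, then the synthetic ones)
        X_s, y_s = SMOTE(random_state=random_state).fit_resample(X_hard, y_hard)
        X_syn, y_syn = X_s[len(y_hard):], y_s[len(y_hard):]
        # Finally: concatenate easy set, compressed difficult set, synthetic minority
        return np.vstack([X_easy, X_u, X_syn]), np.concatenate([y_easy, y_u, y_syn])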


Ensemble of Techniques

Steps followed while forming the ensemble of IRTs:

  1. Splitting into training and test data
  2. Applying each individual IRT separately on the training dataset
  3. Concatenating all the new training datasets thus generated
  4. Removing all the duplicates from the concatenated dataset to obtain the final training dataset.
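
A sketch of these four steps, using one of the combinations listed on the next slide (SMOTE + RandomOverSampler); ensemble_resample is our illustrative helper name, and X, y are assumed to be the pre-processed features and labels:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE, RandomOverSampler

    def ensemble_resample(X, y, samplers):
        # Step 2: apply each IRT separately to the same training data
        parts = [sampler.fit_resample(X, y) for sampler in samplers]
        # Step 3: concatenate all the resampled training sets
        X_cat = np.vstack([Xp for Xp, _ in parts])
        y_cat = np.concatenate([yp for _, yp in parts])
        # Step 4: drop exact duplicates (same feature values and label)
        df = pd.DataFrame(X_cat)
        df["label"] = y_cat
        df = df.drop_duplicates()
        return df.drop(columns="label").to_numpy(), df["label"].to_numpy()

    # Step 1: split first, so resampling never touches the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    X_res, y_res = ensemble_resample(
        X_train, y_train, [SMOTE(random_state=42), RandomOverSampler(random_state=42)])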


Ensembles implemented

Ensemble of Oversampling Techniques:

  • Ensemble of SMOTE + RandomOverSampler
  • Ensemble of SMOTE + BorderlineSMOTE + ADASYN + RandomOverSampler

Ensemble of Undersampling Techniques:

  • Ensemble of Tomek Links + Cluster Centroid + Edited Nearest Neighbors + Instance Hardness Threshold


Models used

  • Random Forest: As its name implies, a random forest consists of a large number of individual decision trees that operate as an ensemble. Each tree outputs a class prediction, and the class with the most votes becomes the model's prediction.

  • Decision Tree: A classification tree is a supervised learning algorithm whose outcome variable is categorical/discrete. In this tree-structured classifier, internal nodes represent dataset attributes, branches represent decision rules, and each leaf node provides an outcome category.

  • XGBoost: A gradient-boosted decision tree model that combines the idea of boosting with parallelized tree construction, improving on the classic gradient boosting decision tree.
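
A sketch of training the three classifiers on the resampled data from the ensemble step; the hyperparameters shown are common defaults, not necessarily those used in the project:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        # XGBClassifier expects integer labels 0..n_classes-1, as produced by LabelEncoder
        "XGBoost": XGBClassifier(eval_metric="mlogloss", random_state=42),
    }
    for name, model in models.items():
        model.fit(X_res, y_res)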


Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1-Score
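
Continuing the sketch above, all four metrics come directly from scikit-learn; for imbalanced multiclass data, the per-class and macro-averaged numbers matter more than raw accuracy:

    from sklearn.metrics import accuracy_score, classification_report

    for name, model in models.items():
        y_pred = model.predict(X_test)
        print(f"{name} accuracy: {accuracy_score(y_test, y_pred):.4f}")
        # Per-class precision, recall and F1, plus macro/weighted averages
        print(classification_report(y_test, y_pred, digits=4))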


Results


Dataset obtained after Imbalance Reduction

Attack Type     | Preprocessed dataset | Oversampling (Ensemble Selected) | Undersampling (ENN)
Benign          | 1042603              | 834091                           | 770172
Bot             | 286191               | 832110                           | 228920
BruteForce-Web  | 611                  | 780376                           | 266
BruteForce-XSS  | 230                  | 832466                           | 99
DDOS-Hoic       | 686012               | 834039                           | 548923
DDOS-LOIC-UDP   | 1730                 | 834091                           | 1350
DOS-GoldenEye   | 41508                | 833949                           | 33038
DOS-Hulk        | 461912               | 833784                           | 360094
DOS-SlowHTTP    | 139890               | 358851                           | 42296
DOS-SlowLoris   | 10990                | 777626                           | 8674
FTP-BruteForce  | 193354               | 416442                           | 90419
Infiltration    | 160639               | 832663                           | 73141
SQL-Injection   | 87                   | 799243                           | 67
SSH-BruteForce  | 187589               | 660565                           | 149452


NSL-KDD DATASET


Random Forest: NSL-KDD


Decision Tree: NSL-KDD


XGBoost: NSL-KDD


Conclusion for NSL-KDD

  • Amongst the oversampling techniques, ADASYN has consistently given better results than all the others.
  • Out of all the undersampling techniques, the best results have been obtained with Random Undersampling.
  • These two techniques have shown comparable, if not better, results across all three models.
  • They have outperformed the DSSTE algorithm for almost all models.
  • Aside from these, the ensemble of all oversampling methods and the ensemble of all undersampling methods have also given comparatively good results.


CSE-CIC-IDS 2018 DATASET


Random Forest: CSE-CIC-IDS-2018


Decision Tree: CSE-CIC-IDS-2018


XGBoost: CSE-CIC-IDS-2018


Conclusion for CSE-CIC-IDS 2018 dataset

  • Out of all the oversampling techniques, the ensemble of selected OS techniques (Random Oversampling and SMOTE) has consistently given the best results.
  • These results are close to, or slightly better than, those on the unaltered dataset.
  • Out of all the undersampling techniques, the best results have been obtained with Tomek Links and Edited Nearest Neighbour (ENN).
  • Both of these techniques have performed better than the DSSTE algorithm for all three models.
  • Aside from these, the ensemble of all oversampling methods and the ensemble of all undersampling methods have also given comparatively good results.


Conclusion and Future Scope


Conclusion

  • This project proposes a novel method, an ensemble of imbalance reduction techniques, for bringing down the disparity in the sizes of the different classes.
  • We evaluated this ensemble method against other standard oversampling and undersampling methods using three models.
  • Through our experiments we conclude that not all techniques are suitable for all datasets: no single imbalance reduction technique outperformed all others across all datasets and all models.
  • Ensembles of certain techniques did prove to give better results than the individual IRTs.
  • The results have varied depending on the combination of techniques used in the ensemble; however, in all scenarios, ensembles have provided comparable, if not better, results.


Future Scope

  • Try further combinations of ensembles of imbalance reduction techniques and see which ones provide better results.
  • Train machine learning models beyond those used here on these datasets and experiment with the imbalance reduction techniques.
  • DSSTE is the only technique studied that combines both oversampling and undersampling; other such combinations can be tried.
  • Use the trained models for intrusion detection in real-world scenarios to learn how efficient they really are.


Thank You


Q&A
