
Improving Network Intrusion Detection System using Imbalance Reduction Techniques

Created By

Mounil Shah (181070058)

Shivani Pawar (181071046)

Mona Gandhi (181071021)

Ketaki Urankar (181071069)

Final Year B.Tech.

Computer Engineering Department

Under the guidance of

Prof. Vaibhav D. Dhore


Agenda

  • Introduction
  • Motivation
  • Problem Statement
  • Project Idea
  • Objectives
  • Timeline of Project Execution
  • Literature Survey
  • Literature Gap
  • System Architecture
  • Methodologies
  • Results
  • Conclusion
  • Future Scope


Overview


Introduction

  • Massive expansion in the usage of the Internet
  • Attacks breach the security principles of the CIA triad: Confidentiality, Integrity, Availability
  • Hence the need for techniques such as the Network Intrusion Detection System (NIDS)

How can we improve NIDS?


Motivation

  • Various datasets were created: NSL-KDD | CSE-CIC-IDS-2017 | CSE-CIC-IDS-2018
  • NSL-KDD
    • Traditional
    • Imbalanced
    • Multiclass (5 including benign)
    • Features (43)
  • CSE-CIC-IDS-2018
    • Latest
    • Imbalanced
    • Multiclass (14 including benign)
    • Features (83)
  • Problems:
    • Algorithms find it difficult to efficiently distinguish normal from malicious traffic in an imbalanced environment
    • Malicious traffic can hide in normal traffic
    • Hard for an NIDS to ensure both accuracy and timeliness of detection


Problem Statement

  • Use imbalance reduction techniques (IRTs) to obtain better results on multiclass classification of the CSE-CIC-IDS-2018 and NSL-KDD datasets.
  • Compare various imbalance reduction methods:
    • Undersampling (US) techniques
    • Oversampling (OS) techniques
    • Ensembles of different undersampling techniques
    • Ensembles of different oversampling techniques
    • The DSSTE algorithm (a combination of US and OS techniques)


Project Idea

  • Aim:
    • Reduce the imbalance problem in the CSE-CIC-IDS-2018 and NSL-KDD datasets using various ensembles of imbalance reduction techniques
    • Thereby improve the model's ability to detect intrusions
  • Design a pipeline that:
    • Applies various ensembles of imbalance reduction methods to the dataset
    • Trains Random Forest, Decision Tree, and XGBoost models on the resampled data
  • Compare and analyze the results obtained to deduce the best techniques


Objectives

Main:

  • Obtain results for the pre-existing imbalance reduction techniques and compare them across various machine learning models
  • Improve the ability to correctly classify the various classes of the CSE-CIC-IDS-2018 and NSL-KDD datasets

Additionally:

  • Propose an efficient method for reducing imbalance in a dataset


Timeline of Project Execution


Literature


Literature Survey

  • CSE-CIC-IDS-2018 Dataset
  • NSL-KDD Dataset
  • Imbalance Reduction Methods


CSE-CIC-IDS-2018 Dataset

  • Latest and comprehensive
  • Updated version of the 2017 dataset
  • 83 features
  • 16,232,943 records
  • 13 attack types + 1 benign


CSE-CIC-IDS-2018 Dataset

  • Detecting web attacks using random undersampling and ensemble learners [1]: Eight random undersampling ratios are studied on the CSE-CIC-IDS-2018 dataset using Decision Tree, Random Forest, CatBoost, LightGBM, XGBoost, and Logistic Regression models.

  • Building Auto-Encoder Intrusion Detection System Based on Random Forest Feature Selection [2]: This paper proposes an effective deep learning method, AE-IDS (Auto-Encoder Intrusion Detection System), based on the random forest algorithm. The method constructs the training set using feature selection and feature grouping.

  • A novel time efficient learning based approach for smart intrusion detection system [3]: A hybrid feature selection approach that aims to reduce prediction latency, without affecting attack prediction performance, by lowering the model's complexity.

[1] Richard Zuech, John Hancock, and Taghi M. Khoshgoftaar. "Detecting web attacks using random undersampling and ensemble learners". In: Journal of Big Data 8.1 (Dec. 2021).

[2] Xu Kui Li et al. "Building Auto-Encoder Intrusion Detection System based on random forest feature selection". In: Computers and Security 95 (Aug. 2020).

[3] Sugandh Seth, Gurvinder Singh, and Kuljit Kaur Chahal. "A novel time efficient learning-based approach for smart intrusion detection system". In: Journal of Big Data 8.1 (Dec. 2021).


NSL-KDD Dataset

  • Made in 2009
  • Updated version of the KDD Cup 99 dataset
  • 42 features
  • 25,192 records
  • 4 attack types + 1 benign

Category | Number of original records
Normal   | 13,449
Probe    | 2,289
DoS      | 9,234
U2R      | 11
R2L      | 209
Total    | 25,192


NSL-KDD Dataset

  • A Survey of Intrusion Detection Models based on NSL-KDD Data Set [4]: A comprehensive review of research on machine-learning-based IDS using the NSL-KDD data set.

  • A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms [5]: The NSL-KDD data set is analysed and used to study the effectiveness of various classification algorithms in detecting anomalies in network traffic patterns.

  • A Hybrid Data Mining Approach for Intrusion Detection on Imbalanced NSL-KDD Dataset [6]: A hybrid approach combining the synthetic minority oversampling technique (SMOTE) with cluster center and nearest neighbor (CANN) is proposed for imbalance reduction.

[4] "A Survey of Intrusion Detection Models based on NSL-KDD Data Set". In: ITT 2018, Information Technology Trends (Higher Colleges of Technology, United Arab Emirates / IEEE), 28–29 November 2018.

[5] L. Dhanabal and S. P. Shantharajah. "A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms".

[6] Mohammad Reza Parsaei, Samaneh Miri Rostami, and Reza Javidan. "A Hybrid Data Mining Approach for Intrusion Detection on Imbalanced NSL-KDD Dataset". In: International Journal of Advanced Computer Science and Applications (IJACSA) 7.6 (2016).


Imbalance Reduction Methods

  • Class Imbalance Problem in the Network Intrusion Detection Systems [7]: The effect of class imbalance on the benchmark NSL-KDD dataset is evaluated using four popular classification techniques, and the results are analyzed.

  • Combating Imbalance in Network Intrusion Datasets [8]: The paper focuses on rule learning with RIPPER on highly imbalanced intrusion datasets, with the objective of improving the true positive rate (intrusions) without significantly increasing the false positives.

  • Intrusion Detection of Imbalanced Network Traffic Based on Machine Learning and Deep Learning [9]: Proposes a novel Difficult Set Sampling Technique (DSSTE) algorithm to tackle the class imbalance problem.

[7] Sireesha Rodda and Uma Shankar Rao Erothi. "Class imbalance problem in the Network Intrusion Detection Systems". In: International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT 2016), Nov. 2016.

[8] David A. Cieslak, Nitesh V. Chawla, and Aaron Striegel. "Combating imbalance in network intrusion datasets". In: 2006 IEEE International Conference on Granular Computing (2006).

[9] Lan Liu et al. "Intrusion Detection of Imbalanced Network Traffic Based on Machine Learning and Deep Learning". In: IEEE Access 9 (2021), pp. 7550–7563. ISSN: 2169-3536.


Literature Gap

  • Most research so far addresses binary classification (benign vs. attack) in NIDS, with comparatively little on multiclass classification.
  • Most imbalance reduction research targets NSL-KDD, with far less on the 2018 dataset.
  • The 2018 dataset is enormous; very few papers consider all the attack types, and most consider only subgroups of attacks, such as just the web attacks or just the DoS attacks.


Our Approach

  • Explore various oversampling and undersampling techniques on the 2018 dataset (along with NSL-KDD)
  • Explore multiclass classification on the 2018 dataset (and on NSL-KDD)
  • Consider all 13 attack types + 1 benign type of the 2018 dataset (and correspondingly for NSL-KDD)
  • Explore reducing class imbalance by applying the various methods at our disposal and forming their ensemble; this can help us get the best out of all the techniques used


Methodologies


Methodologies

Data Pre-Processing

  • Categorical encoding
  • Delete duplicate records
  • Remove null values
  • Delete unimportant features and perform feature transformation
  • Standardize the data
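
A minimal sketch of this pre-processing in pandas/scikit-learn; the "Label" column name and the numeric-only feature filter are illustrative assumptions, not the project's exact steps:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    def preprocess(df: pd.DataFrame, label_col: str = "Label"):
        # Delete duplicate records
        df = df.drop_duplicates()
        # Remove null (and infinite) values
        df = df.replace([np.inf, -np.inf], np.nan).dropna()
        # Categorical encoding of the target labels
        y = LabelEncoder().fit_transform(df[label_col])
        # Drop the label; keep numeric features only (a stand-in for the
        # unimportant-feature removal and feature transformation steps)
        X = df.drop(columns=[label_col]).select_dtypes(include=[np.number])
        # Standardize features to zero mean and unit variance
        X = StandardScaler().fit_transform(X)
        return X, y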


Oversampling Methods

  • Random OverSampling: Randomly duplicates examples from the minority class and adds them to the training dataset.

  • SMOTE (Synthetic Minority OverSampling Technique): Generates virtual training records by linear interpolation for the minority class. These synthetic records are created by randomly selecting one or more of the k-nearest neighbors of each minority-class example.


Oversampling Methods

  • ADASYN (Adaptive Synthetic): Similar to SMOTE, and derived from it, with one important difference: it biases the sampling (the likelihood that a particular point is chosen for duplication) towards points that do not lie in homogeneous neighborhoods, i.e. points that are harder to learn.

  • Borderline SMOTE: Classifies a minority observation as a noise point if all of its neighbors belong to the majority class, and ignores it while creating synthetic data. Points with both majority and minority neighbors are classified as border points, and resampling is done entirely from these.
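
All four oversamplers above are available in imbalanced-learn. A minimal sketch, assuming X_train and y_train come from the pre-processing step earlier (the parameters shown are illustrative defaults):

    from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN

    oversamplers = {
        "ROS": RandomOverSampler(random_state=42),            # duplicate minority rows
        "SMOTE": SMOTE(k_neighbors=5, random_state=42),       # interpolate between neighbors
        "BorderlineSMOTE": BorderlineSMOTE(random_state=42),  # resample only border points
        "ADASYN": ADASYN(random_state=42),                    # bias towards hard neighborhoods
    }
    for name, sampler in oversamplers.items():
        X_os, y_os = sampler.fit_resample(X_train, y_train)
        print(name, "->", len(y_os), "training samples")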


Undersampling Methods

  • Random Undersampling: Randomly selects examples from the majority class and deletes them from the training dataset.

  • Tomek Links: Tomek links are pairs of ambiguous points on the class boundary; the majority-class members of such pairs are identified and removed, as they are difficult to classify.

  • Cluster Centroid: Replaces clusters of samples with the cluster centroids of a K-means algorithm, where the number of clusters is set by the level of undersampling.


Undersampling Methods

  • Edited Nearest Neighbour (ENN): Finds the k-nearest neighbors of each observation and checks whether the majority class among those neighbors matches the observation's own class; observations whose neighborhoods disagree are removed.

  • Instance Hardness Threshold: Instance hardness is the probability of an observation being misclassified, i.e. 1 minus the probability the classifier assigns to its true class; observations with high hardness are removed.
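
These five undersamplers are likewise available in imbalanced-learn; a minimal sketch under the same assumptions as the oversampling example:

    from imblearn.under_sampling import (
        RandomUnderSampler, TomekLinks, ClusterCentroids,
        EditedNearestNeighbours, InstanceHardnessThreshold,
    )

    undersamplers = {
        "RUS": RandomUnderSampler(random_state=42),             # randomly drop majority rows
        "TomekLinks": TomekLinks(),                             # remove majority ends of boundary pairs
        "ClusterCentroids": ClusterCentroids(random_state=42),  # replace clusters by K-means centroids
        "ENN": EditedNearestNeighbours(n_neighbors=3),          # drop rows whose neighbors disagree
        "IHT": InstanceHardnessThreshold(random_state=42),      # drop hard-to-classify rows
    }
    for name, sampler in undersamplers.items():
        X_us, y_us = sampler.fit_resample(X_train, y_train)
        print(name, "->", len(y_us), "training samples")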


DSSTE Algorithm

Step 1: Divide the dataset into an easy set and a difficult set

Step 2: Undersample the majority class in the difficult set

Step 3: Oversample the minority-class samples in the difficult set

Finally: Concatenate all the resultant sets of data


Difficult Set Sampling Technique (DSSTE)

Framework of DSSTE [9]
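
The following is only a loose sketch of the DSSTE idea, not the exact algorithm from [9]: a k-NN disagreement test stands in for the paper's easy/difficult split, K-means cluster centroids compress the difficult-set majority, and SMOTE stands in for the paper's zoom-based minority augmentation. Classes with very few difficult samples may need a smaller SMOTE neighborhood.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from imblearn.under_sampling import ClusterCentroids
    from imblearn.over_sampling import SMOTE

    def dsste_like(X, y, k=5, random_state=42):
        X, y = np.asarray(X), np.asarray(y)
        # Step 1: samples whose k-NN vote disagrees with their own label
        # form the difficult set; the rest form the easy set
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        hard = knn.predict(X) != y
        X_easy, y_easy = X[~hard], y[~hard]
        X_hard, y_hard = X[hard], y[hard]
        # Step 2: compress the difficult-set majority classes to K-means centroids
        X_u, y_u = ClusterCentroids(random_state=random_state).fit_resample(X_hard, y_hard)
        # Step 3: synthesize extra minority samples from the difficult set
        # (imblearn returns the original rows first, then the synthetic ones)
        X_s, y_s = SMOTE(random_state=random_state).fit_resample(X_hard, y_hard)
        X_syn, y_syn = X_s[len(y_hard):], y_s[len(y_hard):]
        # Finally: concatenate easy set, compressed difficult set, synthetic minority
        return np.vstack([X_easy, X_u, X_syn]), np.concatenate([y_easy, y_u, y_syn])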


Ensemble of Techniques

Steps followed while forming the ensemble of IRTs:

  1. Splitting into training and test data
  2. Applying each individual IRT separately on the training dataset
  3. Concatenating all the new training datasets thus generated
  4. Removing all the duplicates from the concatenated dataset to obtain the final training dataset.
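
A sketch of these four steps, using one of the combinations listed on the next slide (SMOTE + RandomOverSampler); ensemble_resample is our illustrative helper name, and X, y are assumed to be the pre-processed features and labels:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE, RandomOverSampler

    def ensemble_resample(X, y, samplers):
        # Step 2: apply each IRT separately to the same training data
        parts = [sampler.fit_resample(X, y) for sampler in samplers]
        # Step 3: concatenate all the resampled training sets
        X_cat = np.vstack([Xp for Xp, _ in parts])
        y_cat = np.concatenate([yp for _, yp in parts])
        # Step 4: drop exact duplicates (same feature values and label)
        df = pd.DataFrame(X_cat)
        df["label"] = y_cat
        df = df.drop_duplicates()
        return df.drop(columns="label").to_numpy(), df["label"].to_numpy()

    # Step 1: split first, so resampling never touches the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    X_res, y_res = ensemble_resample(
        X_train, y_train, [SMOTE(random_state=42), RandomOverSampler(random_state=42)])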


Ensembles implemented

Ensemble of Oversampling Techniques:

  • Ensemble of SMOTE + RandomOverSampler
  • Ensemble of SMOTE + BorderlineSMOTE + ADASYN + RandomOverSampler

Ensemble of Undersampling Techniques:

  • Ensemble of Tomek Links + Cluster Centroid + Edited Nearest Neighbors + Instance Hardness Threshold


Models used

  • Random Forest: As its name implies, a random forest consists of a large number of individual decision trees that operate as an ensemble. Each tree outputs a class prediction, and the class with the most votes becomes the model's prediction.

  • Decision Tree: A classification tree is a supervised learning algorithm whose outcome variable is categorical/discrete. In this tree-structured classifier, internal nodes represent dataset attributes, branches represent decision rules, and each leaf node provides an outcome category.

  • XGBoost: A gradient-boosted decision tree model that combines the idea of boosting with parallelized tree construction, improving on the classic gradient boosting decision tree.
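
A sketch of training the three classifiers on the resampled data from the ensemble step; the hyperparameters shown are common defaults, not necessarily those used in the project:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        # XGBClassifier expects integer labels 0..n_classes-1, as produced by LabelEncoder
        "XGBoost": XGBClassifier(eval_metric="mlogloss", random_state=42),
    }
    for name, model in models.items():
        model.fit(X_res, y_res)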


Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1-Score
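
Continuing the sketch above, all four metrics come directly from scikit-learn; for imbalanced multiclass data, the per-class and macro-averaged numbers matter more than raw accuracy:

    from sklearn.metrics import accuracy_score, classification_report

    for name, model in models.items():
        y_pred = model.predict(X_test)
        print(f"{name} accuracy: {accuracy_score(y_test, y_pred):.4f}")
        # Per-class precision, recall and F1, plus macro/weighted averages
        print(classification_report(y_test, y_pred, digits=4))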


Results


Dataset obtained after Imbalance Reduction

Attack Type     | Preprocessed dataset | Oversampling (Ensemble Selected) | Undersampling (ENN)
Benign          | 1042603              | 834091                           | 770172
Bot             | 286191               | 832110                           | 228920
BruteForce-Web  | 611                  | 780376                           | 266
BruteForce-XSS  | 230                  | 832466                           | 99
DDOS-Hoic       | 686012               | 834039                           | 548923
DDOS-LOIC-UDP   | 1730                 | 834091                           | 1350
DOS-GoldenEye   | 41508                | 833949                           | 33038
DOS-Hulk        | 461912               | 833784                           | 360094
DOS-SlowHTTP    | 139890               | 358851                           | 42296
DOS-SlowLoris   | 10990                | 777626                           | 8674
FTP-BruteForce  | 193354               | 416442                           | 90419
Infiltration    | 160639               | 832663                           | 73141
SQL-Injection   | 87                   | 799243                           | 67
SSH-BruteForce  | 187589               | 660565                           | 149452


NSL-KDD DATASET


Random Forest: NSL-KDD


Decision Tree: NSL-KDD


XGBoost: NSL-KDD


Conclusion for NSL-KDD

  • Amongst the oversampling techniques, ADASYN has consistently given better results than all the others.
  • Out of all the undersampling techniques, the best results have been obtained with Random Undersampling.
  • These two techniques have shown comparable, if not better, results across all three models.
  • They have outperformed the DSSTE algorithm for almost all models.
  • Aside from these, the ensemble of all oversampling methods and the ensemble of all undersampling methods have also given comparatively good results.


CSE-CIC-IDS 2018 DATASET


Random Forest: CSE-CIC-IDS-2018


Decision Tree: CSE-CIC-IDS-2018


XGBoost: CSE-CIC-IDS-2018


Conclusion for CSE-CIC-IDS 2018 dataset

  • Out of all the oversampling techniques, the ensemble of selected OS techniques (Random Oversampling and SMOTE) has consistently given the best results.
  • These results are close to, or slightly better than, those on the unaltered dataset.
  • Out of all the undersampling techniques, the best results have been obtained with Tomek Links and Edited Nearest Neighbour (ENN).
  • Both of these techniques have performed better than the DSSTE algorithm for all three models.
  • Aside from these, the ensemble of all oversampling methods and the ensemble of all undersampling methods have also given comparatively good results.


Conclusion and Future Scope


Conclusion

  • This project proposes a novel method, an ensemble of imbalance reduction techniques, for bringing down the disparity in the sizes of the different classes.
  • We evaluated this ensemble method against other standard oversampling and undersampling methods using three models.
  • Through our experiments we conclude that not all techniques are suitable for all datasets: no single imbalance reduction technique outperformed all others across all datasets and all models.
  • Ensembles of certain techniques did prove to give better results than the individual IRTs.
  • The results have varied depending on the combination of techniques used in the ensemble; however, in all scenarios, ensembles have provided comparable, if not better, results.


Future Scope

  • Try further combinations of ensembles of imbalance reduction techniques and see which ones provide better results.
  • Train machine learning models beyond those used here on these datasets and experiment with the imbalance reduction techniques.
  • DSSTE is the only technique studied that combines both oversampling and undersampling; other such combinations can be tried.
  • Use the trained models for intrusion detection in real-world scenarios to learn how efficient they really are.


Thank You


Q&A
