1 of 47

Good || Evil: Defending Infrastructure at Scale with Anomaly and Classification Based Network Intrusion Detection

Master Thesis Project #75

Philippe Partarrieu <ppartarrieu@os3.nl>

Philipp Mieden <pmieden@os3.nl>

Supervisors

Joao Novaismarques <joao.novaismarques@kpn.com>

Jordi Scharloo <jordi.scharloo@kpn.com>

Giovanni Sileno <g.sileno@uva.nl>

2 of 47

The problems with signature-based solutions

  • Signatures can easily be bypassed by changing key attack characteristics
  • The signature database grows without bound over time
  • Requires an up-to-date signature database

Anomaly detection can help here

  • The system learns a baseline of normal network behaviour
  • Alerts are generated when behaviour deviates from that baseline
  • Many different algorithms to choose from
  • Known to work well for many similar data science problems

Research Question: Which algorithms are best suited for the task of intrusion detection in computer networks?


3 of 47

Dataset: CIC IDS 2018

Large-scale, modern dataset with traffic from over 400 different machines

  • 450 GB of pcaps
  • 80+ network-flow-based features provided as CSV
  • 6 types of attacks (brute-force, DoS, DDoS, bot, injection, infiltration)

4 of 47

Dataset: Exploration


5 of 47

Dataset: Exploration

HTTP request to IP 18.219.211.138:8080 instead of to a domain name

6 of 47

Dataset: Exploration - Botnet activity


7 of 47

Dataset: Issues

  • The original network flow CSV was missing information
    • Flow ID, Src IP, Dst IP, Src Port

  • The labelling tool used in the study is not open source

  • One provided pcap was missing attack traffic
    • one day (Thursday-15-02-2018) contains no attack traffic at all

8 of 47

Dataset: Imbalance

An issue for some network intrusion detection algorithms

  • Some algorithms require an equal distribution of positive and negative classes
  • Some algorithms require a high ratio of anomalies

Different strategies exist to deal with dataset imbalance (a minimal sketch follows below):

  • Class weights
  • Oversampling
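Below is a minimal sketch (not the exact code used in our experiments) of both strategies in Python, assuming NumPy and scikit-learn: inverse-frequency class weights in the form accepted by Keras' model.fit, and naive random oversampling of the minority class.

```python
# Hypothetical illustration of class weights and oversampling for an
# imbalanced label vector. Not the exact code used in our experiments.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 99 + [1] * 1)  # 99 benign samples, 1 attack sample

# Inverse-frequency class weights, e.g. for Keras' model.fit(class_weight=...)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))
print(class_weight)  # {0: ~0.505, 1: ~50.0}

# Naive random oversampling: repeat minority-class samples until balanced
minority_idx = np.where(y == 1)[0]
extra = np.random.choice(minority_idx, size=(y == 0).sum() - len(minority_idx))
y_balanced = np.concatenate([y, y[extra]])
```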


9 of 47

Metrics: Confusion Matrix

                  | Actual: attack | Actual: normal
Predicted: attack | True Positive  | False Positive
Predicted: normal | False Negative | True Negative

Correct predictions: True Positive, True Negative. Incorrect predictions: False Positive, False Negative.

10 of 47

Metrics: Accuracy

Accuracy is not a suitable metric when dealing with imbalanced data.

A simple example: assume a dataset with 99 benign samples and 1 malicious sample.

An algorithm that always predicts "normal" would score 99% accuracy: it has identified the 99 benign events but missed the anomaly.

                  | Actual: attack | Actual: normal
Predicted: attack | TP = 0         | FP = 0
Predicted: normal | FN = 1         | TN = 99

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 99) / 100 = 99%

11 of 47

Metrics: F1 score

For the same dataset, let's calculate the F1 score.

                  | Actual: attack | Actual: normal
Predicted: attack | TP = 0         | FP = 0
Predicted: normal | FN = 1         | TN = 99

F1 score = 2 * (precision * recall) / (precision + recall)

where precision = TP / (TP + FP) and recall = TP / (TP + FN)

Here recall = 0 / (0 + 1) = 0 (and precision is 0/0, conventionally treated as 0), so we obtain F1 = 0.

The F1 score takes the relevance of the different error types into account!
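A minimal sketch of both metrics computed on this toy confusion matrix, with the 0/0 precision case conventionally treated as 0:

```python
# Accuracy vs. F1 on the toy example above (TP=0, FP=0, FN=1, TN=99).
tp, fp, fn, tn = 0, 0, 1, 99

accuracy = (tp + tn) / (tp + tn + fp + fn)              # 0.99
precision = tp / (tp + fp) if (tp + fp) else 0.0        # 0/0 treated as 0
recall = tp / (tp + fn) if (tp + fn) else 0.0           # 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)                 # 0.0

print(accuracy, f1)  # accuracy looks great, F1 exposes the missed attack
```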

12 of 47

Experiment Design

We compared deep learning models and shallow learning models.

For the experiments we used both the connection audit records we generated ourselves and the CIC IDS network flow data provided by the dataset authors.

13 of 47

Experiment Design: Model Choice

We compared the models along three dimensions: whether they learn online, whether they require supervision (labels), and whether they are deep learning models.

  • Deep Neural Network
  • Auto Encoders
  • Gradient Boosting
  • Isolation Forest

14 of 47

Feature Extraction

Feature extraction should ideally be parallelized to take advantage of multi-core processors.

Concurrency causes problems though:

  • Race conditions
  • Shared state

We want features that are lightweight to compute yet expressive enough to capture network trends

  • Solution: bi-directional network flow summaries (Connection Audit Records), sketched below
    • Requires keeping only minimal state, aggregates subflows
    • Processing rate: 300K pkts/second on a Ryzen 9 16-core processor
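The actual extraction is implemented in Go as part of NETCAP; the following Python sketch only illustrates the core idea of bi-directional aggregation, where both directions of a connection map to the same flow key (field names are illustrative):

```python
# Minimal sketch of bi-directional flow aggregation (illustrative only; the
# real feature extraction lives in NETCAP and is written in Go).
from collections import defaultdict

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    # Sort the two endpoints so that both directions map to the same key.
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto, *sorted([a, b]))

flows = defaultdict(lambda: {"packets": 0, "bytes": 0})

def add_packet(pkt):
    key = flow_key(pkt["src_ip"], pkt["src_port"],
                   pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
    flows[key]["packets"] += 1
    flows[key]["bytes"] += pkt["size"]

add_packet({"src_ip": "10.0.0.1", "src_port": 54321,
            "dst_ip": "10.0.0.2", "dst_port": 80, "proto": "TCP", "size": 60})
```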


15 of 47

Connection Audit Records

We chose a deliberately simple set of features for establishing a baseline.

Additional features can always be added later and their effect on prediction quality measured.

We also added counts of specific TCP flags and the mean TCP window size.


We preserved the DstPort even when dropping address information
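As an illustration, a connection audit record can be thought of as a flat record along the following lines; the field names below are hypothetical and not the exact NETCAP schema:

```python
# Hypothetical sketch of a connection audit record; field names are
# illustrative, not the exact NETCAP schema.
from dataclasses import dataclass

@dataclass
class ConnectionAuditRecord:
    duration: float          # seconds
    dst_port: int            # preserved even when addresses are dropped
    num_packets: int
    total_bytes: int
    syn_count: int           # per-flag counters for SYN, ACK, FIN, RST
    ack_count: int
    fin_count: int
    rst_count: int
    mean_window_size: float  # mean TCP window size
```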

16 of 47

Labelling: adding attack information


We implemented unit tests to ensure labelling works as expected

Labelling can be applied to any audit record from NETCAP
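NETCAP and its labelling tests are written in Go; the Python sketch below merely illustrates the kind of property such a unit test asserts, using hypothetical field names: a record that falls inside a known attack window and targets the victim host receives the attack label, everything else stays normal.

```python
# Illustrative only: not the actual NETCAP labelling code or test suite.
def label(record, attacks):
    for a in attacks:
        if a["start"] <= record["timestamp"] <= a["end"] and record["dst_ip"] == a["victim"]:
            return a["name"]
    return "normal"

def test_labelling():
    attacks = [{"name": "bruteforce", "victim": "172.31.0.2",
                "start": 100.0, "end": 200.0}]
    # inside the attack window and targeting the victim -> attack label
    assert label({"timestamp": 150.0, "dst_ip": "172.31.0.2"}, attacks) == "bruteforce"
    # outside the attack window -> normal
    assert label({"timestamp": 250.0, "dst_ip": "172.31.0.2"}, attacks) == "normal"
```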

17 of 47

Labelling: adding attack information


18 of 47

Data Encoding

Categorical data (e.g. strings) must be transformed into numeric values

Multiple approaches for encoding categorical data:

  • Enumeration
  • One Hot
  • Learned Embedding

We chose enumeration because it does not increase the feature dimensionality!
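A minimal sketch of enumeration (ordinal) encoding, shown here with pandas as one of several libraries that can do this; every distinct category is mapped to an integer, so the column count stays the same:

```python
# Enumeration (ordinal) encoding: one integer per distinct category,
# so the feature dimension does not grow.
import pandas as pd

df = pd.DataFrame({"proto": ["TCP", "UDP", "TCP", "ICMP"]})
codes, categories = pd.factorize(df["proto"])
df["proto"] = codes            # [0, 1, 0, 2] -- still a single column
# For comparison, one-hot encoding would add one column per category:
# pd.get_dummies(df["proto"])
```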


19 of 47

Data Normalization

Numeric values must be normalized so that they fall within a common range

We used:

  • Z-score (standard score): z = (x - mean) / standard deviation
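A minimal sketch of Z-score normalization using NumPy (scikit-learn's StandardScaler does the same per feature):

```python
# Z-score normalization: z = (x - mean) / std.
import numpy as np

x = np.array([40.0, 60.0, 1500.0, 52.0])   # e.g. packet sizes
z = (x - x.mean()) / x.std()
print(z)  # values centered around 0 with unit variance
```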


20 of 47

Tensorflow

Free and open-source software library for machine learning from Google

Supports different backends for computation: CPU, GPU, FPGA, etc.

Can be run in cluster mode to distribute processing jobs across multiple hardware devices
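For example, TensorFlow can report which compute devices it sees on the current machine:

```python
# Check which compute devices TensorFlow can use.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # [] if only the CPU backend is available
print(tf.config.list_physical_devices("CPU"))
```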


21 of 47

TensorBoard

22 of 47

Deep Neural Network: Baseline Model

We chose a deliberately small network for the baseline experiments

Bigger does not always mean better, as later experiment results confirmed
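A minimal sketch of such a small feed-forward classifier in tf.keras; the layer sizes and the input dimension are illustrative and not the exact baseline architecture:

```python
# Small feed-forward classifier in tf.keras. Layer sizes and input dimension
# are illustrative, not our exact baseline model.
import tensorflow as tf

num_features = 20  # e.g. the number of connection audit record features

model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary: normal vs. attack
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
# model.fit(x_train, y_train, epochs=5, class_weight=class_weight)
```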


23 of 47

Deep Neural Network: GPU acceleration

GEFORCE RTX 3090

  • coreClock: 1.74GHz
  • coreCount: 82
  • deviceMemorySize: 23.67GiB
  • deviceMemoryBandwidth: 871.81GiB/s

Processing 6 million connection audit records takes ~2 s per epoch (one pass over the entire data) during training and testing.

24 of 47

Results: Audit record generation performance


25 of 47

Results: DNN* with address information

Day   | Attack Labels           | Attack Ratio (%)                | F1
14/02 | Brute-force             | 0.48                            | 0.99
15/02 | DoS                     | Pcaps contain no attack traffic | -
16/02 | DoS                     | 0.59                            | 1
20/02 | DDoS                    | 4.61                            | 0.94
21/02 | DDoS                    | 2.74                            | 0.94
22/02 | Brute-force / Injection | 0.000035                        | 0
23/02 | Brute-force / Injection | 0.000048                        | 0
28/02 | Infiltration            | 1.06                            | 0.97
01/03 | Infiltration            | 2.1                             | 0.97
02/03 | Bot                     | 1.6                             | 0.90

* One model trained per day, with address information and tweaked class weights; binary classification: normal or attack

26 of 47

Results: DNN* without address information

Day   | Attack Labels           | Attack Ratio (%)                | F1
14/02 | Brute-force             | 0.48                            | 0.98
15/02 | DoS                     | Pcaps contain no attack traffic | -
16/02 | DoS                     | 0.59                            | 0.99
20/02 | DDoS                    | 4.61                            | 0.94
21/02 | DDoS                    | 2.74                            | 0.94
22/02 | Brute-force / Injection | 0.000035                        | 0
23/02 | Brute-force / Injection | 0.000048                        | 0
28/02 | Infiltration            | 1.06                            | 0.83
01/03 | Infiltration            | 2.1                             | 0.80
02/03 | Bot                     | 1.6                             | 0.79

* One model trained per day, no address information and tuned class weights; binary classification: normal or attack

27 of 47

Results: DNN trained on entire dataset

One model trained for the entire dataset, binary classification: normal or attack

Variation | F1
A + TCW   | 0.95
NA + TCW  | 0.87

One model trained for the entire dataset, multi-class classification: normal, denial-of-service, brute-force, injection, infiltration or botnet

Variation | F1
A         | 0.95
NA        | 0.94

A = with address information, NA = without address information, TCW = tuned class weights

28 of 47

Results: Isolation Forest

Day   | Attack Labels           | Attack Ratio (%) | F1
14/02 | Brute-force             | 5.38             | 0.95
15/02 | DoS                     | 0.8              | 0.99
16/02 | DoS                     | 12.99            | 0.91
20/02 | DDoS                    | 7.29             | 0.95
21/02 | DDoS                    | 12.9             | 0.88
22/02 | Brute-force / Injection | 0.01             | 0.99
23/02 | Brute-force / Injection | 0.79             | 0.99
28/02 | Infiltration            | 11.24            | 0.77
01/03 | Infiltration            | 28.11            | 0.56
02/03 | Bot                     | 3.48             | 0.99

Run on enriched CIC network flow data, with IP address information

29 of 47

Discussion: Isolation Forest

Based on the idea that anomalies are more susceptible to isolation under random partitioning

Doesn't perform well when anomaly clusters are large and dense

To get the best results, it requires specifying a "contamination rate": the expected fraction of anomalies in the data
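A minimal sketch using scikit-learn's IsolationForest; the contamination value and the random stand-in data are illustrative only:

```python
# Isolation Forest with an explicit contamination rate (the assumed fraction
# of anomalies); the value 0.05 and the random data are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.rand(1000, 10)                 # stand-in for flow features
clf = IsolationForest(contamination=0.05, random_state=42)
pred = clf.fit_predict(X)                    # +1 = normal, -1 = anomaly
print((pred == -1).sum(), "flows flagged as anomalous")
```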

30 of 47

Results: Gradient Boosting (with pruning)

Day   | Attack Labels           | F1
14/02 | Brute-force             | 0.94
15/02 | DoS                     | 0.99
16/02 | DoS                     | 0.87
20/02 | DDoS                    | 0.93
21/02 | DDoS                    | 0.87
22/02 | Brute-force / Injection | 0.99
23/02 | Brute-force / Injection | 0.99
28/02 | Infiltration            | 0.68
01/03 | Infiltration            | 0.72
02/03 | Bot                     | 0.96

Run on enriched CIC network flow data

31 of 47

Discussion: Gradient Boosting

Likely overfitting the dataset (the model works by minimising the mean squared error)

Tuned using cross-validated hyperparameter search on the same dataset, as sketched below

* Overfitting is when a model performs well on a given dataset but the results aren't generalisable: the trained model will perform poorly when evaluated on a different dataset
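A minimal sketch of cross-validated hyperparameter search for a gradient boosting classifier with scikit-learn; the parameter grid and the random stand-in data are illustrative, not our exact setup:

```python
# Cross-validated hyperparameter search for gradient boosting (illustrative).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(500, 10)                  # stand-in for flow features
y = np.random.randint(0, 2, size=500)        # stand-in labels: normal vs. attack

param_grid = {"n_estimators": [50, 100],
              "max_depth": [2, 3],
              "learning_rate": [0.05, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```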

32 of 47

Results: Ensemble of Auto Encoders (Kitsune)

Day   | Attack Labels | Attack Ratio (%) | F1
14/02 | Brute-force   | 0.0048           | 0.65
15/02 | DoS           | 0.54             | 0.51
16/02 | DoS           | 0.0059           | 0.59
20/02 | DDoS          | 0.046            | 0.65
21/02 | DDoS          | 0.027            | X
22/02 | Brute-force   | 0.000037         | 0.68
23/02 | Brute-force   | 0.000049         | 0.68
28/02 | Infiltration  | 0.011            | 0.53
01/03 | Infiltration  | 0.21             | 0.45
02/03 | Bot           | 0.016            | 0.43

Run on connection audit records, with IP address information, on the first 1 million lines of each file

33 of 47

Discussion: Ensemble of Auto Encoders (Kitsune)

Results are poor because the model is under-exposed to anomalies

Takes > 24 h to run on a single day with 6 million samples

34 of 47

Results Recap

F1 score per day; one model per day, binary classification, with address information.

Model       | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 | Training Time
DNN A + TCW | 0.99  | -     | 1     | 0.94  | 0.94  | 0     | 0     | 0.97  | 0.97  | 0.94  | ~2 min (per epoch) *
Kitsune     | 0.65  | 0.51  | 0.59  | 0.65  | X     | 0.68  | 0.68  | 0.53  | 0.45  | 0.43  | 4 hours †
iForest     | 0.95  | 0.99  | 0.91  | 0.95  | 0.88  | 0.99  | 0.99  | 0.77  | 0.56  | 0.99  | 3 min †
GBoost      | 0.94  | 0.99  | 0.87  | 0.93  | 0.87  | 0.99  | 0.99  | 0.89  | 0.72  | 0.96  | 30 min †

Attack labels per day: 14/02 Brute-force, 15/02 DoS, 16/02 DoS, 20/02 DDoS, 21/02 DDoS, 22/02 Brute-force / Injection, 23/02 Brute-force / Injection, 28/02 Infiltration, 01/03 Infiltration, 02/03 Bot.

* run on GPU: GEFORCE RTX 3090
† run on CPU: AMD Ryzen 5 3600 6-Core @ 3.6GHz

35 of 47

Conclusion

High success rate for the supervised strategies, even without address information

  • Knowledge transfer between network topologies should be possible

  • A GPU or parallelisation is essential for processing large amounts of data

  • Overfitting of certain models can be mitigated to make them generalisable

36 of 47

Future Work

Complete alert pipeline and test analysis in Maltego / Elastic

Further research and more experiments with unsupervised algorithms


37 of 47

Recap and contributions

Analyzed a modern dataset for network intrusion detection using state-of-the-art algorithms for anomaly detection

Found numerous errors in the dataset and reported them to the authors

Created our own feature extraction and labelling logic and open-sourced it

Created a DNN using TensorFlow and evaluated its performance

Created a generic analyzer with support for many other online and offline models, including Isolation Forests, Gradient Boosting, Kitsune and more

38 of 47

Recap and contributions

Bootstrapped a pipeline for feeding the generated alerts into a modern analytics platform (Elastic / Kibana or Maltego)

Open-sourced our entire experiment testbed and internal documentation for reproducibility

Evaluated the novel Kitsune autoencoder-ensemble framework on the CIC IDS 2018 dataset

39 of 47

Questions?


40 of 47


41 of 47

DNN Train / Test Split


42 of 47

DNN Train / Test Split


43 of 47


44 of 47


45 of 47

UNIX socket processing


46 of 47


47 of 47