1 of 47

Good || Evil: Defending Infrastructure at Scale with Anomaly and Classification Based Network Intrusion Detection

Master Thesis Project #75

Philippe Partarrieu <ppartarrieu@os3.nl>

Philipp Mieden <pmieden@os3.nl>

Supervisors

Joao Novaismarques <joao.novaismarques@kpn.com>

Jordi Scharloo <jordi.scharloo@kpn.com>

Giovanni Sileno <g.sileno@uva.nl>

2 of 47

The problems with signature-based solutions

  • Signatures can easily be bypassed by changing key attack characteristics
  • The signature database grows without bound over time
  • Requires an up-to-date signature database

Anomaly detection can help here

  • The system learns a baseline of normal network behaviour
  • Alerts are generated when behaviour deviates from that baseline
  • Many different algorithms to choose from
  • Known to work well for many similar data science problems

Research Question: Which algorithms are best suited for the task of intrusion detection in computer networks?


3 of 47

Dataset: CIC IDS 2018

Large-scale, modern dataset with traffic from over 400 different machines

  • 450 GB of pcaps
  • 80+ network-flow-based features provided as CSV
  • 6 types of attacks (brute-force, DoS, DDoS, bot, injection, infiltration)

4 of 47

Dataset: Exploration


5 of 47

Dataset: Exploration

HTTP request to IP 18.219.211.138:8080 instead of to a domain name

6 of 47

Dataset: Exploration - Botnet activity


7 of 47

Dataset: Issues

  • The original network flow CSV was missing information
    • Flow ID, Src IP, Dst IP, Src Port

  • The labelling tool used in the study is not open source

  • One provided pcap was missing attack traffic
    • one day (Thursday-15-02-2018) contains no attack traffic at all

8 of 47

Dataset: Imbalance

An issue for some network intrusion detection algorithms

  • Some algorithms require an equal distribution of positive and negative classes
  • Some algorithms require a high ratio of anomalies

Different strategies exist to deal with dataset imbalance (a minimal sketch follows below):

  • Class weights
  • Oversampling
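Below is a minimal sketch (not the exact code used in our experiments) of both strategies in Python, assuming NumPy and scikit-learn: inverse-frequency class weights in the form accepted by Keras' model.fit, and naive random oversampling of the minority class.

```python
# Hypothetical illustration of class weights and oversampling for an
# imbalanced label vector. Not the exact code used in our experiments.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 99 + [1] * 1)  # 99 benign samples, 1 attack sample

# Inverse-frequency class weights, e.g. for Keras' model.fit(class_weight=...)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))
print(class_weight)  # {0: ~0.505, 1: ~50.0}

# Naive random oversampling: repeat minority-class samples until balanced
minority_idx = np.where(y == 1)[0]
extra = np.random.choice(minority_idx, size=(y == 0).sum() - len(minority_idx))
y_balanced = np.concatenate([y, y[extra]])
```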


9 of 47

Metrics: Confusion Matrix

                  | Actual: attack | Actual: normal
Predicted: attack | True Positive  | False Positive
Predicted: normal | False Negative | True Negative

Correct predictions: True Positive, True Negative. Incorrect predictions: False Positive, False Negative.

10 of 47

Metrics: Accuracy

Accuracy is not a suitable metric when dealing with imbalanced data.

A simple example: assume a dataset with 99 benign samples and 1 malicious sample.

An algorithm that always predicts "normal" would score 99% accuracy: it has identified the 99 benign events but missed the anomaly.

                  | Actual: attack | Actual: normal
Predicted: attack | TP = 0         | FP = 0
Predicted: normal | FN = 1         | TN = 99

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 99) / 100 = 99%

11 of 47

Metrics: F1 score

For the same dataset, let's calculate the F1 score.

                  | Actual: attack | Actual: normal
Predicted: attack | TP = 0         | FP = 0
Predicted: normal | FN = 1         | TN = 99

F1 score = 2 * (precision * recall) / (precision + recall)

where precision = TP / (TP + FP) and recall = TP / (TP + FN)

Here recall = 0 / (0 + 1) = 0 (and precision is 0/0, conventionally treated as 0), so we obtain F1 = 0.

The F1 score takes the relevance of the different error types into account!
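A minimal sketch of both metrics computed on this toy confusion matrix, with the 0/0 precision case conventionally treated as 0:

```python
# Accuracy vs. F1 on the toy example above (TP=0, FP=0, FN=1, TN=99).
tp, fp, fn, tn = 0, 0, 1, 99

accuracy = (tp + tn) / (tp + tn + fp + fn)              # 0.99
precision = tp / (tp + fp) if (tp + fp) else 0.0        # 0/0 treated as 0
recall = tp / (tp + fn) if (tp + fn) else 0.0           # 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)                 # 0.0

print(accuracy, f1)  # accuracy looks great, F1 exposes the missed attack
```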

12 of 47

Experiment Design

We compared deep learning models and shallow learning models.

For the experiments we used both the connection audit records we generated ourselves and the CIC IDS network flow data provided by the dataset authors.

13 of 47

Experiment Design: Model Choice

We compared the models along three dimensions: whether they learn online, whether they require supervision (labels), and whether they are deep learning models.

  • Deep Neural Network
  • Auto Encoders
  • Gradient Boosting
  • Isolation Forest

14 of 47

Feature Extraction

Feature extraction should ideally be parallelized to take advantage of multi-core processors.

Concurrency causes problems though:

  • Race conditions
  • Shared state

We want features that are lightweight to compute yet expressive enough to capture network trends

  • Solution: bi-directional network flow summaries (Connection Audit Records), sketched below
    • Requires keeping only minimal state, aggregates subflows
    • Processing rate: 300K pkts/second on a Ryzen 9 16-core processor
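The actual extraction is implemented in Go as part of NETCAP; the following Python sketch only illustrates the core idea of bi-directional aggregation, where both directions of a connection map to the same flow key (field names are illustrative):

```python
# Minimal sketch of bi-directional flow aggregation (illustrative only; the
# real feature extraction lives in NETCAP and is written in Go).
from collections import defaultdict

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    # Sort the two endpoints so that both directions map to the same key.
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto, *sorted([a, b]))

flows = defaultdict(lambda: {"packets": 0, "bytes": 0})

def add_packet(pkt):
    key = flow_key(pkt["src_ip"], pkt["src_port"],
                   pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
    flows[key]["packets"] += 1
    flows[key]["bytes"] += pkt["size"]

add_packet({"src_ip": "10.0.0.1", "src_port": 54321,
            "dst_ip": "10.0.0.2", "dst_port": 80, "proto": "TCP", "size": 60})
```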


15 of 47

Connection Audit Records

We chose a deliberately simple set of features for establishing a baseline.

Additional features can always be added later and their effect on prediction quality measured.

We also added counts of specific TCP flags and the mean TCP window size.


We preserved the DstPort even when dropping address information
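As an illustration, a connection audit record can be thought of as a flat record along the following lines; the field names below are hypothetical and not the exact NETCAP schema:

```python
# Hypothetical sketch of a connection audit record; field names are
# illustrative, not the exact NETCAP schema.
from dataclasses import dataclass

@dataclass
class ConnectionAuditRecord:
    duration: float          # seconds
    dst_port: int            # preserved even when addresses are dropped
    num_packets: int
    total_bytes: int
    syn_count: int           # per-flag counters for SYN, ACK, FIN, RST
    ack_count: int
    fin_count: int
    rst_count: int
    mean_window_size: float  # mean TCP window size
```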

16 of 47

Labelling: adding attack information


We implemented unit tests to ensure labelling works as expected

Labelling can be applied to any audit record from NETCAP
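NETCAP and its labelling tests are written in Go; the Python sketch below merely illustrates the kind of property such a unit test asserts, using hypothetical field names: a record that falls inside a known attack window and targets the victim host receives the attack label, everything else stays normal.

```python
# Illustrative only: not the actual NETCAP labelling code or test suite.
def label(record, attacks):
    for a in attacks:
        if a["start"] <= record["timestamp"] <= a["end"] and record["dst_ip"] == a["victim"]:
            return a["name"]
    return "normal"

def test_labelling():
    attacks = [{"name": "bruteforce", "victim": "172.31.0.2",
                "start": 100.0, "end": 200.0}]
    # inside the attack window and targeting the victim -> attack label
    assert label({"timestamp": 150.0, "dst_ip": "172.31.0.2"}, attacks) == "bruteforce"
    # outside the attack window -> normal
    assert label({"timestamp": 250.0, "dst_ip": "172.31.0.2"}, attacks) == "normal"
```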

17 of 47

Labelling: adding attack information


18 of 47

Data Encoding

Categorical data (e.g. strings) must be transformed into numeric values

Multiple approaches for encoding categorical data:

  • Enumeration
  • One Hot
  • Learned Embedding

We chose enumeration because it does not increase the feature dimensionality!
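A minimal sketch of enumeration (ordinal) encoding, shown here with pandas as one of several libraries that can do this; every distinct category is mapped to an integer, so the column count stays the same:

```python
# Enumeration (ordinal) encoding: one integer per distinct category,
# so the feature dimension does not grow.
import pandas as pd

df = pd.DataFrame({"proto": ["TCP", "UDP", "TCP", "ICMP"]})
codes, categories = pd.factorize(df["proto"])
df["proto"] = codes            # [0, 1, 0, 2] -- still a single column
# For comparison, one-hot encoding would add one column per category:
# pd.get_dummies(df["proto"])
```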


19 of 47

Data Normalization

Numeric values must be normalized so that they fall within a common range

We used:

  • Z-score (standard score): z = (x - mean) / standard deviation
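A minimal sketch of Z-score normalization using NumPy (scikit-learn's StandardScaler does the same per feature):

```python
# Z-score normalization: z = (x - mean) / std.
import numpy as np

x = np.array([40.0, 60.0, 1500.0, 52.0])   # e.g. packet sizes
z = (x - x.mean()) / x.std()
print(z)  # values centered around 0 with unit variance
```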


20 of 47

Tensorflow

Free and open-source software library for machine learning from Google

Supports different backends for computation: CPU, GPU, FPGA, etc.

Can be run in cluster mode to distribute processing jobs across multiple hardware devices
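For example, TensorFlow can report which compute devices it sees on the current machine:

```python
# Check which compute devices TensorFlow can use.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # [] if only the CPU backend is available
print(tf.config.list_physical_devices("CPU"))
```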


21 of 47

TensorBoard

22 of 47

Deep Neural Network: Baseline Model

We chose a deliberately small network for the baseline experiments

Bigger does not always mean better, as later experiment results confirmed
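A minimal sketch of such a small feed-forward classifier in tf.keras; the layer sizes and the input dimension are illustrative and not the exact baseline architecture:

```python
# Small feed-forward classifier in tf.keras. Layer sizes and input dimension
# are illustrative, not our exact baseline model.
import tensorflow as tf

num_features = 20  # e.g. the number of connection audit record features

model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary: normal vs. attack
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
# model.fit(x_train, y_train, epochs=5, class_weight=class_weight)
```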


23 of 47

Deep Neural Network: GPU acceleration

GEFORCE RTX 3090

  • coreClock: 1.74GHz
  • coreCount: 82
  • deviceMemorySize: 23.67GiB
  • deviceMemoryBandwidth: 871.81GiB/s

Processing 6 million connection audit records takes ~2 s per epoch (one pass over the entire data) during training and testing.

24 of 47

Results: Audit record generation performance


25 of 47

Results: DNN* with address information

Day   | Attack Labels           | Attack Ratio (%)                | F1
14/02 | Brute-force             | 0.48                            | 0.99
15/02 | DoS                     | Pcaps contain no attack traffic | -
16/02 | DoS                     | 0.59                            | 1
20/02 | DDoS                    | 4.61                            | 0.94
21/02 | DDoS                    | 2.74                            | 0.94
22/02 | Brute-force / Injection | 0.000035                        | 0
23/02 | Brute-force / Injection | 0.000048                        | 0
28/02 | Infiltration            | 1.06                            | 0.97
01/03 | Infiltration            | 2.1                             | 0.97
02/03 | Bot                     | 1.6                             | 0.90

* One model trained per day, with address information and tweaked class weights; binary classification: normal or attack

26 of 47

Results: DNN* without address information

Day   | Attack Labels           | Attack Ratio (%)                | F1
14/02 | Brute-force             | 0.48                            | 0.98
15/02 | DoS                     | Pcaps contain no attack traffic | -
16/02 | DoS                     | 0.59                            | 0.99
20/02 | DDoS                    | 4.61                            | 0.94
21/02 | DDoS                    | 2.74                            | 0.94
22/02 | Brute-force / Injection | 0.000035                        | 0
23/02 | Brute-force / Injection | 0.000048                        | 0
28/02 | Infiltration            | 1.06                            | 0.83
01/03 | Infiltration            | 2.1                             | 0.80
02/03 | Bot                     | 1.6                             | 0.79

* One model trained per day, no address information and tuned class weights; binary classification: normal or attack

27 of 47

Results: DNN trained on entire dataset

One model trained for the entire dataset, binary classification: normal or attack

Variation | F1
A + TCW   | 0.95
NA + TCW  | 0.87

One model trained for the entire dataset, multi-class classification: normal, denial-of-service, brute-force, injection, infiltration or botnet

Variation | F1
A         | 0.95
NA        | 0.94

A = with address information, NA = without address information, TCW = tuned class weights

28 of 47

Results: Isolation Forest

Day   | Attack Labels           | Attack Ratio (%) | F1
14/02 | Brute-force             | 5.38             | 0.95
15/02 | DoS                     | 0.8              | 0.99
16/02 | DoS                     | 12.99            | 0.91
20/02 | DDoS                    | 7.29             | 0.95
21/02 | DDoS                    | 12.9             | 0.88
22/02 | Brute-force / Injection | 0.01             | 0.99
23/02 | Brute-force / Injection | 0.79             | 0.99
28/02 | Infiltration            | 11.24            | 0.77
01/03 | Infiltration            | 28.11            | 0.56
02/03 | Bot                     | 3.48             | 0.99

Run on enriched CIC network flow data, with IP address information

29 of 47

Discussion: Isolation Forest

Based on the idea that anomalies are more susceptible to isolation under random partitioning

Doesn't perform well when anomaly clusters are large and dense

To get the best results, it requires specifying a "contamination rate": the expected fraction of anomalies in the data
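A minimal sketch using scikit-learn's IsolationForest; the contamination value and the random stand-in data are illustrative only:

```python
# Isolation Forest with an explicit contamination rate (the assumed fraction
# of anomalies); the value 0.05 and the random data are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.rand(1000, 10)                 # stand-in for flow features
clf = IsolationForest(contamination=0.05, random_state=42)
pred = clf.fit_predict(X)                    # +1 = normal, -1 = anomaly
print((pred == -1).sum(), "flows flagged as anomalous")
```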

30 of 47

Results: Gradient Boosting (with pruning)

Day   | Attack Labels           | F1
14/02 | Brute-force             | 0.94
15/02 | DoS                     | 0.99
16/02 | DoS                     | 0.87
20/02 | DDoS                    | 0.93
21/02 | DDoS                    | 0.87
22/02 | Brute-force / Injection | 0.99
23/02 | Brute-force / Injection | 0.99
28/02 | Infiltration            | 0.68
01/03 | Infiltration            | 0.72
02/03 | Bot                     | 0.96

Run on enriched CIC network flow data

31 of 47

Discussion: Gradient Boosting

Likely overfitting the dataset (the model works by minimising the mean squared error)

Tuned using cross-validated hyperparameter search on the same dataset, as sketched below

* Overfitting is when a model performs well on a given dataset but the results aren't generalisable: the trained model will perform poorly when evaluated on a different dataset
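A minimal sketch of cross-validated hyperparameter search for a gradient boosting classifier with scikit-learn; the parameter grid and the random stand-in data are illustrative, not our exact setup:

```python
# Cross-validated hyperparameter search for gradient boosting (illustrative).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(500, 10)                  # stand-in for flow features
y = np.random.randint(0, 2, size=500)        # stand-in labels: normal vs. attack

param_grid = {"n_estimators": [50, 100],
              "max_depth": [2, 3],
              "learning_rate": [0.05, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```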

32 of 47

Results: Ensemble of Auto Encoders (Kitsune)

Day   | Attack Labels | Attack Ratio (%) | F1
14/02 | Brute-force   | 0.0048           | 0.65
15/02 | DoS           | 0.54             | 0.51
16/02 | DoS           | 0.0059           | 0.59
20/02 | DDoS          | 0.046            | 0.65
21/02 | DDoS          | 0.027            | X
22/02 | Brute-force   | 0.000037         | 0.68
23/02 | Brute-force   | 0.000049         | 0.68
28/02 | Infiltration  | 0.011            | 0.53
01/03 | Infiltration  | 0.21             | 0.45
02/03 | Bot           | 0.016            | 0.43

Run on connection audit records, with IP address information, on the first 1 million lines of each file

33 of 47

Discussion: Ensemble of Auto Encoders (Kitsune)

Results are poor because the model is under-exposed to anomalies

Takes > 24 h to run on a single day with 6 million samples

34 of 47

Results Recap

F1 score per day; one model per day, binary classification, with address information.

Model       | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 | Training Time
DNN A + TCW | 0.99  | -     | 1     | 0.94  | 0.94  | 0     | 0     | 0.97  | 0.97  | 0.94  | ~2 min (per epoch) *
Kitsune     | 0.65  | 0.51  | 0.59  | 0.65  | X     | 0.68  | 0.68  | 0.53  | 0.45  | 0.43  | 4 hours †
iForest     | 0.95  | 0.99  | 0.91  | 0.95  | 0.88  | 0.99  | 0.99  | 0.77  | 0.56  | 0.99  | 3 min †
GBoost      | 0.94  | 0.99  | 0.87  | 0.93  | 0.87  | 0.99  | 0.99  | 0.89  | 0.72  | 0.96  | 30 min †

Attack labels per day: 14/02 Brute-force, 15/02 DoS, 16/02 DoS, 20/02 DDoS, 21/02 DDoS, 22/02 Brute-force / Injection, 23/02 Brute-force / Injection, 28/02 Infiltration, 01/03 Infiltration, 02/03 Bot.

* run on GPU: GEFORCE RTX 3090
† run on CPU: AMD Ryzen 5 3600 6-Core @ 3.6GHz

35 of 47

Conclusion

High success rate for the supervised strategies, even without address information

  • Knowledge transfer between network topologies should be possible

  • A GPU or parallelisation is essential for processing large amounts of data

  • Overfitting of certain models can be mitigated to make them generalisable

36 of 47

Future Work

Complete alert pipeline and test analysis in Maltego / Elastic

Further research and more experiments with unsupervised algorithms


37 of 47

Recap and contributions

Analyzed a modern dataset for network intrusion detection using state-of-the-art algorithms for anomaly detection

Found numerous errors in the dataset and reported them to the authors

Created our own feature extraction and labelling logic and open-sourced it

Created a DNN using TensorFlow and evaluated its performance

Created a generic analyzer with support for many other online and offline models, including Isolation Forests, Gradient Boosting, Kitsune and more

38 of 47

Recap and contributions

Bootstrapped a pipeline for feeding the generated alerts into a modern analytics platform (Elastic / Kibana or Maltego)

Open-sourced our entire experiment testbed and internal documentation for reproducibility

Evaluated the novel Kitsune autoencoder-ensemble framework on the CIC IDS 2018 dataset

39 of 47

Questions?


40 of 47


41 of 47

DNN Train / Test Split


42 of 47

DNN Train / Test Split


43 of 47


44 of 47


45 of 47

UNIX socket processing


46 of 47


47 of 47