Good || Evil: Defending Infrastructure at Scale with Anomaly and Classification Based Network Intrusion Detection
Master Thesis Project #75
Philippe Partarrieu <ppartarrieu@os3.nl>
Philipp Mieden <pmieden@os3.nl>
Supervisors
Joao Novaismarques <joao.novaismarques@kpn.com>
Jordi Scharloo <jordi.scharloo@kpn.com>
Giovanni Sileno <g.sileno@uva.nl>
The problems with signature-based solutions
Anomaly detection can help here
Research Question: Which algorithms are best suited for the task of intrusion detection in computer networks?
Dataset: CIC IDS 2018
Large-scale, modern dataset with traffic from over 400 different machines
Dataset: Exploration
Dataset: Exploration
HTTP request made to IP 18.219.211.138:8080 instead of to a domain name
Dataset: Exploration - Botnet activity
Dataset: Issues
Dataset: Imbalance
An issue for some network intrusion detection algorithms
Different strategies to deal with dataset imbalance exist; two common ones are sketched below
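A minimal sketch of two such strategies, class weighting and random oversampling; the toy labels and the use of scikit-learn here are illustrative assumptions, not the exact setup used in the thesis:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mirroring the example on the metrics slides:
# 99 benign samples (0), 1 malicious sample (1)
y = np.array([0] * 99 + [1])

# Strategy 1: class weighting -- give the rare class a larger loss weight
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # {0: 0.505..., 1: 50.0}

# Strategy 2: random oversampling -- duplicate minority samples until balanced
minority = np.where(y == 1)[0]
balanced_idx = np.concatenate([np.arange(len(y)), np.random.choice(minority, 98)])
y_balanced = y[balanced_idx]  # now 99 benign and 99 malicious samples
```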
Metrics: Confusion Matrix
|                   | Actual: attack             | Actual: benign             |
| Predicted: attack | True Positive (correct)    | False Positive (incorrect) |
| Predicted: benign | False Negative (incorrect) | True Negative (correct)    |
Metrics: Accuracy
|                   | Actual: attack    | Actual: benign    |
| Predicted: attack | True Positive: 0  | False Positive: 0 |
| Predicted: benign | False Negative: 1 | True Negative: 99 |
Accuracy is never suitable as a metric when dealing with imbalanced data
A simple example: assume a dataset with 99 benign and 1 malicious sample
An algorithm that always predicts that the sample is normal would score 99% accuracy. It has identified 99 benign events but missed the anomaly
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 99}{0 + 99 + 0 + 1} = 0.99$
Metrics: F1 score
|                   | Actual: attack    | Actual: benign    |
| Predicted: attack | True Positive: 0  | False Positive: 0 |
| Predicted: benign | False Negative: 1 | True Negative: 99 |
For the same dataset, let’s calculate the F1 score
$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
where $\text{precision} = \frac{TP}{TP + FP}$ and $\text{recall} = \frac{TP}{TP + FN}$
With TP = 0, both precision and recall are 0 (taking 0/0 as 0 by convention), so we obtain $F_1 = 0$
The F1 score takes the relevance of the different error types into account!
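The same toy example worked through in plain Python, with the 0/0 cases defined as 0 to match the convention above:

```python
# Confusion matrix from the toy example: 99 benign, 1 malicious,
# and a model that always predicts "normal"
tp, fp, fn, tn = 0, 0, 1, 99

accuracy = (tp + tn) / (tp + tn + fp + fn)        # 0.99, looks great
precision = tp / (tp + fp) if tp + fp else 0.0    # 0/0 -> 0 by convention
recall = tp / (tp + fn) if tp + fn else 0.0       # 0.0, the attack was missed
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)             # 0.0

print(f"accuracy={accuracy:.2f} f1={f1:.2f}")     # accuracy=0.99 f1=0.00
```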
Experiment Design
We compared deep learning models and shallow learning models
For the experiments we used both the Connection audit records generated by us and the CIC IDS network flow data provided by the dataset authors
Experiment Design: Model Choice
Model | Online | Supervised | Deep Learning |
Deep Neural Network | X | ✓ | ✓ |
Auto Encoders | ✓ | X | ✓ |
Gradient Boosting | X | ✓ | X |
Isolation Forest | X | X | X |
Feature Extraction
Ideally parallelized to take advantage of multi-core processors
Concurrency causes problems of its own, though (see the sketch below)
Features should ideally be lightweight to compute, yet expressive enough to capture network trends
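One way to parallelize while avoiding shared state, sketched with Python's multiprocessing; the function and file names are hypothetical, not the project's actual extraction pipeline:

```python
from multiprocessing import Pool

def extract_features(pcap_path: str) -> list:
    """Hypothetical per-file feature extractor (placeholder body).

    Giving each worker its own capture file keeps all connection state
    process-local, which sidesteps most concurrency problems.
    """
    return []  # parse packets and aggregate per-connection statistics here

if __name__ == "__main__":
    files = ["day-1.pcap", "day-2.pcap"]  # hypothetical input captures
    with Pool() as pool:                  # one worker per CPU core by default
        per_file_records = pool.map(extract_features, files)
```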
Connection Audit Records
We chose a deliberately simple set of features for establishing a baseline.
Additional features can always be added later and their effect on prediction quality measured.
We also added statistics for the counts of specific TCP flags and the mean TCP window size.
We preserved the DstPort even when dropping address information
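As an illustration, a possible shape for such a record; the field names are assumptions based on the features listed above, not NETCAP's exact schema:

```python
from dataclasses import dataclass

@dataclass
class ConnectionRecord:
    """Illustrative connection audit record (hypothetical field names)."""
    duration: float          # seconds between first and last packet
    num_packets: int
    total_bytes: int
    dst_port: int            # DstPort is kept even when addresses are dropped
    num_syn: int             # per-flag TCP counters
    num_ack: int
    num_fin: int
    num_rst: int
    mean_window_size: float  # mean TCP window size over the connection
```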
Labelling: adding attack information
We implemented unit tests to ensure labelling works as expected
Labelling can be applied to any audit record from NETCAP
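A minimal sketch of the labelling idea, matching records against known attack windows and endpoints; the `Attack` fields and the matching rule are assumptions, not the exact logic used in the thesis:

```python
from dataclasses import dataclass

@dataclass
class Attack:
    name: str       # e.g. "Bruteforce"
    src_ip: str     # attacker address
    dst_ip: str     # victim address
    start: float    # attack window as unix timestamps
    end: float

def label(ts: float, src: str, dst: str, attacks: list) -> str:
    """Return the attack name if the record falls inside a known attack
    window and involves its endpoints, otherwise 'normal'."""
    for a in attacks:
        if a.start <= ts <= a.end and {src, dst} == {a.src_ip, a.dst_ip}:
            return a.name
    return "normal"
```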
Labelling: adding attack information
Data Encoding
Categorical data (e.g. strings) must be transformed into numeric values
Multiple approaches for encoding categorical data exist (e.g. one-hot encoding, enumeration)
We chose enumeration because it does not alter the feature dimension!
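A minimal sketch of enumeration on a toy categorical feature, contrasted with one-hot encoding:

```python
protocols = ["tcp", "udp", "tcp", "icmp"]  # toy categorical feature

# Enumeration (label encoding): one column in, one column out
mapping = {}
encoded = [mapping.setdefault(p, len(mapping)) for p in protocols]
print(encoded)  # [0, 1, 0, 2]
print(mapping)  # {'tcp': 0, 'udp': 1, 'icmp': 2}

# One-hot encoding, by contrast, would expand this single feature into
# three columns (one per distinct value), altering the feature dimension.
```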
Data Normalization
Numeric values must be normalized to reside within a certain range
We used:
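For illustration, a minimal min-max scaler; this is one common normalization choice and an assumption here, not necessarily the exact scaler used in the thesis:

```python
import numpy as np

def min_max(x: np.ndarray) -> np.ndarray:
    """Scale each feature column into the range [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.where(hi > lo, hi - lo, 1)  # guard constant columns

features = np.array([[20.0, 1500.0],
                     [80.0,   40.0],
                     [443.0, 900.0]])
print(min_max(features))
```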
TensorFlow
Free and open-source machine learning library from Google
Supports different computation backends: CPU, GPU, TPU, etc.
Can run in a distributed mode to spread processing jobs across multiple machines and devices
TensorBoard
Deep Neural Network: Baseline Model
We chose a deliberately small network for the baseline experiments
Bigger does not always mean better, as later experiment results confirmed
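A minimal Keras sketch of such a small baseline; the layer sizes and feature count are assumptions, the actual architecture is documented in the report:

```python
import tensorflow as tf

num_features = 12  # assumption: one input per connection-record feature

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary: normal vs attack
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Class weights (the "TCW" variants in the result tables) are passed at
# training time, e.g.: model.fit(x, y, epochs=5, class_weight={0: 1.0, 1: 50.0})
```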
Deep Neural Network: GPU acceleration
GeForce RTX 3090
Processing 6 million Connection audit records takes ~2 s per epoch (one pass over the entire data) during training and testing.
Results: Audit record generation performance
Results: DNN* with address information
Day | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 |
Attack Labels | Brute-force | DoS | DoS | DDoS | DDoS | Brute-force / Injection | Brute-force / Injection | Infiltration | Infiltration | Bot |
Attack Ratio (%) | 0.48 | Pcaps contain no attack traffic | 0.59 | 4.61 | 2.74 | 0.000035 | 0.000048 | 1.06 | 2.1 | 1.6 |
F1 | 0.99 | - | 1 | 0.94 | 0.94 | 0 | 0 | 0.97 | 0.97 | 0.90 |
*one model trained per day, with address information and tuned class weights, binary classification: normal or attack
Results: DNN* without address information
Day | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 |
Attack Labels | Brute-force | DoS | DoS | DDoS | DDoS | Brute-force / Injection | Brute-force / Injection | Infiltration | Infiltration | Bot |
Attack Ratio (%) | 0.48 | Pcaps contain no attack traffic | 0.59 | 4.61 | 2.74 | 0.000035 | 0.000048 | 1.06 | 2.1 | 1.6 |
F1 | 0.98 | - | 0.99 | 0.94 | 0.94 | 0 | 0 | 0.83 | 0.80 | 0.79 |
*one model trained per day, no address information and tuned class weights, binary classification: normal or attack
Results: DNN trained on entire dataset
Variation | A + TCW | NA + TCW |
F1 | 0.95 | 0.87 |
one model trained for the entire dataset, binary classification: normal or attack

Variation | A | NA |
F1 | 0.95 | 0.94 |
one model trained for the entire dataset, multi-class classification: normal, denial-of-service, brute-force, injection, infiltration or botnet

(A = with address information, NA = no address information, TCW = tuned class weights)
Results: Isolation Forest
Day | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 |
Attack Labels | Brute-force | DoS | DoS | DDoS | DDoS | Brute-force / Injection | Brute-force / Injection | Infiltration | Infiltration | Bot |
Attack Ratio (%) | 5.38 | 0.8 | 12.99 | 7.29 | 12.9 | 0.01 | 0.79 | 11.24 | 28.11 | 3.48 |
F1 | 0.95 | 0.99 | 0.91 | 0.95 | 0.88 | 0.99 | 0.99 | 0.77 | 0.56 | 0.99 |
Run on enriched CIC network flow data
With IP address information
Discussion: Isolation Forest
Based on the idea that anomalies are more susceptible to isolation under random partitioning
Doesn’t perform well when anomaly clusters are large and dense
To get the best results, it requires a “contamination rate”: the expected fraction of anomalies in the data (see the sketch below)
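A minimal scikit-learn sketch showing where the contamination rate enters; the synthetic data is illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (990, 4)),   # dense benign cluster
               rng.normal(6, 1, (10, 4))])   # sparse anomalies

# contamination = expected fraction of anomalies; the forest uses it
# to place the decision threshold on its anomaly scores
clf = IsolationForest(contamination=0.01, random_state=0).fit(x)
pred = clf.predict(x)  # +1 = inlier, -1 = anomaly
print((pred == -1).sum(), "samples flagged as anomalous")
```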
Results: Gradient Boosting (with pruning)
Day | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 |
Attack Labels | Brute-force | DoS | DoS | DDoS | DDoS | Brute-force / Injection | Brute-force / Injection | Infiltration | Infiltration | Bot |
F1 | 0.94 | 0.99 | 0.87 | 0.93 | 0.87 | 0.99 | 0.99 | 0.68 | 0.72 | 0.96 |
Run on enriched CIC network flow data
Discussion: Gradient Boosting
Likely overfitting the dataset (works by minimising the mean squared error)
Hyperparameters were tuned via cross-validation on the same dataset (see the sketch below)
* Overfitting is when a model performs well on a given dataset but the results aren't generalisable: the trained model will perform poorly if evaluated on a different dataset
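A sketch of such a cross-validated tuning loop with scikit-learn; the parameter grid is a hypothetical example, not the thesis's actual search space:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

grid = {  # hypothetical search space
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(GradientBoostingClassifier(), grid, scoring="f1", cv=5)
# search.fit(x_train, y_train)
# Note: cross-validating on the *same* dataset can still overstate
# real-world performance, as discussed above.
```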
Results: Ensemble of Auto Encoders (Kitsune)
Day | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 |
Attack Labels | Brute-force | DoS | DoS | DDoS | DDoS | Brute-force | Brute-force | Infiltration | Infiltration | Bot |
Attack Ratio (%) | 0.0048 | 0.54 | 0.0059 | 0.046 | 0.027 | 0.000037 | 0.000049 | 0.011 | 0.21 | 0.016 |
F1 | 0.65 | 0.51 | 0.59 | 0.65 | X | 0.68 | 0.68 | 0.53 | 0.45 | 0.43 |
Run on connection audit records
With IP address information
On the first 1 million lines of each file
Discussion: Ensemble of Auto Encoders (Kitsune)
Results are poor because of under-exposure to anomalies
Takes > 24h to run on a single day with 6 million samples
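Kitsune itself uses an ensemble of small autoencoders over automatically grouped features; as a simplified illustration of the underlying principle only (not Kitsune's actual implementation), a single-autoencoder detector that scores samples by reconstruction error:

```python
import numpy as np
import tensorflow as tf

num_features = 12  # assumption

# Autoencoder with a narrow bottleneck, trained to reconstruct benign traffic
inputs = tf.keras.layers.Input(shape=(num_features,))
bottleneck = tf.keras.layers.Dense(4, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(num_features)(bottleneck)
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

x_benign = np.random.rand(1000, num_features).astype("float32")  # placeholder data
autoencoder.fit(x_benign, x_benign, epochs=3, verbose=0)

# Anomaly score = RMSE between a sample and its reconstruction;
# rarely-seen (anomalous) patterns reconstruct poorly and score high
x_new = np.random.rand(5, num_features).astype("float32")
rmse = np.sqrt(((x_new - autoencoder.predict(x_new, verbose=0)) ** 2).mean(axis=1))
alerts = rmse > 0.5  # the threshold is a tuning choice
```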
Results Recap
Day | 14/02 | 15/02 | 16/02 | 20/02 | 21/02 | 22/02 | 23/02 | 28/02 | 01/03 | 02/03 | Training Time |
Attack Labels | Brute-force | DoS | DoS | DDoS | DDoS | Brute-force / Injection | Brute-force / Injection | Infiltration | Infiltration | Bot | |
DNN A + TCW | 0.99 | - | 1 | 0.94 | 0.94 | 0 | 0 | 0.97 | 0.97 | 0.94 | ~2 min (per epoch)✝ |
Kitsune | 0.65 | 0.51 | 0.59 | 0.65 | X | 0.68 | 0.68 | 0.53 | 0.45 | 0.43 | 4 hours ‡ |
iForest | 0.95 | 0.99 | 0.91 | 0.95 | 0.88 | 0.99 | 0.99 | 0.77 | 0.56 | 0.99 | 3 min ‡ |
GBoost | 0.94 | 0.99 | 0.87 | 0.93 | 0.87 | 0.99 | 0.99 | 0.89 | 0.72 | 0.96 | 30 min ‡ |
One model per day, binary classification, with address info
✝ run on GPU: GeForce RTX 3090
‡ run on CPU: AMD Ryzen 5 3600 6-Core @ 3.6 GHz
Conclusion
High success rates for the supervised strategies, even without address information
Future Work
Complete alert pipeline and test analysis in Maltego / Elastic
Further research and more experiments with unsupervised algorithms
Recap and contributions
Analyzed a modern dataset for network intrusion detection using state-of-the-art algorithms for anomaly detection
Found numerous errors in the dataset and reported them back to the authors
Created our own feature extraction and labelling logic and open-sourced it
Created a DNN using TensorFlow and evaluated its performance
Created a generic analyzer with support for many other online and offline models, including isolation forests, gradient boosting, Kitsune and more
Recap and contributions
Bootstrapped a pipeline for feeding the generated alerts into a modern analytics platform, Elastic / Kibana or Maltego
Open-sourced our entire experiment testbed and internal documentation for reproducibility
Evaluated Kitsune, a novel autoencoder-ensemble framework, on the CIC IDS 2018 dataset
Questions?
Links:
https://github.com/dreadl0ck/netcap
https://github.com/dreadl0ck/masterthesis
https://github.com/ppartarr/anomaly
https://rp.os3.nl/2020-2021/p75/report.pdf
More questions? Send us an email!
DNN Train / Test Split
DNN Train / Test Split
UNIX socket processing