1 of 31

Machine Learning

in Security

Nosacz Meetup #128.10.20.19

Mariusz Wołoszyn

2 of 31

You've probably seen this one?

If it’s in Python it can be Machine Learning,

If it’s in PowerPoint it’s AI

3 of 31

ML is everywhere

Hype

4 of 31

Gun Detection

Image recognition

Supervised learning, image classification, object detection

API: https://valossa.com/image-recognition-demo-is-now-live/

5 of 31

Adversarial attacks

6 of 31

Adversarial training

7 of 31

Fooling humans

8 of 31

Mouse and cat

“Facebook AI Research (FAIR) has developed a state-of-the-art ‘de-identification’ system that works on video, including even live video. It works by altering key facial features of a video subject in real time using machine learning, to trick a facial recognition system into improperly identifying the subject.”

9 of 31

Summary

  1. There’s AI hype all over
  2. For each ML use there’s a way to “hack it”
  3. ML capabilities are much like humans’ (with their strengths and weaknesses)
  4. For each advance there is a countermeasure

10 of 31

Is ML in Security a thing?

  • Probably, someone wrote a book about it...

11 of 31

Is ML in Security a thing?

  • Probably, someone wrote a book about it...
  • ...and another...

12 of 31

Is ML in Security a thing?

  • Probably, someone wrote a book about it...
  • ...and another…
  • ... yet another...

13 of 31

Is ML in Security a thing?

  • Probably, someone wrote a book about it...
  • ...and another…
  • ... yet another…
  • ...and another…

14 of 31

Is ML in Security a thing?

  • Probably, someone wrote a book about it...
  • ...and another…
  • ... yet another…
  • ...and another…
  • ...and one more...

15 of 31

Is ML in Security a thing?

  • Probably, someone wrote a book about it...
  • ...and another…
  • ... yet another…
  • ...and another…
  • ...and one more…
  • ...and more...

16 of 31

Is ML in Security a thing?

  • Probably, someone wrote a book about it...
  • and another…
  • yet another…
  • and another…
  • and one more…
  • and more…
  • and more...

OK, I was cheating a bit, but you get the point.

17 of 31

ML in Security

Machine Learning

In Security

Pattern recognition

Anomaly detection

18 of 31

Pattern recognition

  • Spam detection
  • Malware detection
  • Botnet detection
  • Identity verification
  • ...

19 of 31

Anomaly detection

  • User authentication?
  • Behavioral analysis?
  • Network outlier detection
  • Malicious URL detection

20 of 31

Caveats

  • There are many metrics to optimise (not just accuracy, precision and recall).
  • At 99% precision, correctly flagging 100,000 sessions per day means also interrupting about 1,010 legitimate connections. Is that acceptable?
  • What costs more: too many false positives or lower recall?
  • Maintain a Bayesian viewpoint: assess your prior and incorporate it into your system.
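
The 1,010 figure above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the 100,000 sessions are the correctly flagged ones (true positives); the function names and the rates in the Bayesian example are illustrative, not taken from the talk.

```python
def false_positives_at_precision(true_positives, precision):
    """Precision = TP / (TP + FP), so FP = TP * (1 - precision) / precision."""
    return true_positives * (1 - precision) / precision


def posterior_malicious(prior, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability that a flagged session really is malicious."""
    hit = prior * true_positive_rate
    false_alarm = (1 - prior) * false_positive_rate
    return hit / (hit + false_alarm)


# Assumption: 100,000 true positives per day at 99% precision.
interrupted = false_positives_at_precision(100_000, 0.99)
print(round(interrupted))

# Assumed, illustrative rates: 1 in 1,000 sessions is malicious, the detector
# catches 99% of them and wrongly flags 1% of legitimate sessions.
print(posterior_malicious(prior=0.001, true_positive_rate=0.99,
                          false_positive_rate=0.01))
```

With a low prior, most alerts are false alarms even for a seemingly accurate detector, which is why the prior belongs in the design.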

21 of 31

Choose wisely

  • Pick the right algorithm (people tend to use a sledgehammer to crack a nut).
    • In a security context it’s often better to have a slightly less reliable answer sooner,
    • and it’s easier to run a cheaper, less accurate algorithm at scale than the best but prohibitively expensive one.
  • Is explainability required? (Usually it’s beneficial even if not strictly required.)

22 of 31

Calibrate your models

Using default thresholds is usually a bad idea (0.5 for probability scores or 0 for SVM).

Build your models around decision_function, not predict_proba, in sklearn.
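
One way to pick a threshold instead of taking the default is sketched below. The scores stand in for what `clf.decision_function(X_val)` would return on a validation set; the helper `pick_threshold`, the toy data, and the 95% precision target are all assumptions for illustration.

```python
def pick_threshold(scores, labels, target_precision):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision meets the target, or None if no threshold does."""
    total_positives = sum(labels)
    best = None
    # Walk thresholds from strictest to loosest; recall only grows on the way
    # down, so the last qualifying threshold has the highest recall.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_positives = sum(flagged)
        precision = true_positives / len(flagged)
        recall = true_positives / total_positives if total_positives else 0.0
        if precision >= target_precision:
            best = (threshold, precision, recall)
    return best


# Toy validation data: raw scores and ground truth (1 = malicious).
scores = [2.1, 1.4, 0.9, 0.3, 0.1, -0.2, -0.8, -1.5]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

print(pick_threshold(scores, labels, target_precision=0.95))
```

On this toy data the default threshold of 0 gives 80% precision, while the selected threshold trades some recall for the precision that was asked for.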

23 of 31

Retrain your models regularly

  • Patterns do change...
  • so do customs, preferences, environment.
  • New technologies and trends do come out.
  • Limit training data span (6 to 12 months is usually good for behavioral patterns).
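
Limiting the training span can be as simple as a cutoff date applied before every retraining run. A minimal sketch, assuming each record carries a `timestamp` field (a hypothetical layout) and using 365 days as the upper end of the 6 to 12 month range:

```python
from datetime import datetime, timedelta


def training_window(records, now, max_age_days=365):
    """Keep only records no older than max_age_days relative to now."""
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r["timestamp"] >= cutoff]


# Hypothetical behavioral records; only the timestamp matters here.
now = datetime(2019, 10, 28)
records = [
    {"user": "a", "timestamp": datetime(2019, 9, 1)},
    {"user": "b", "timestamp": datetime(2019, 1, 15)},
    {"user": "c", "timestamp": datetime(2018, 3, 2)},
]

recent = training_window(records, now)
print([r["user"] for r in recent])
```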

24 of 31

Example Malware Detection

  • Ergo (a command line tool that makes machine learning with Keras easier)
  • Under the hood it’s classification (a binary one).
  • We need a lot of input features (characteristics of the file to be examined).
  • Throw the features at classification algorithm(s):
    • Logistic regression, SVM, tree ensembles (say, XGBoost) or NN.
  • Assess the results in terms of precision, recall, AUROC, execution (inference/predict) time, resource utilization, explainability.
  • Build a process and a pipeline!

https://www.evilsocket.net/2019/05/22/How-to-create-a-Malware-detection-system-with-Machine-Learning/#.XOXPbZSbnyM.twitter
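
Comparing candidate classifiers on more than accuracy can start with a small harness like the one below. It is a sketch only: `toy_predict` and the `suspicious_api_calls` feature are hypothetical stand-ins for a trained model and real file features, and this is not how Ergo itself evaluates models.

```python
import time


def evaluate(predict, samples, labels):
    """Report precision, recall and average prediction time for a classifier."""
    start = time.perf_counter()
    predictions = [predict(sample) for sample in samples]
    elapsed = time.perf_counter() - start

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "seconds_per_sample": elapsed / len(samples),
    }


def toy_predict(sample):
    # Hypothetical rule standing in for a trained model.
    return 1 if sample["suspicious_api_calls"] >= 3 else 0


samples = [{"suspicious_api_calls": n} for n in (5, 3, 0, 2, 4)]
labels = [1, 0, 0, 1, 1]  # 1 = malware

report = evaluate(toy_predict, samples, labels)
print(report)
```

Running the same harness over every candidate model makes the precision, recall and speed trade-off visible side by side.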

25 of 31

Feature Engineering

  • Extract file characteristics (https://github.com/lief-project/LIEF)
  • Bytes histogram
  • Used APIs
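
The byte histogram is the easiest of these features to compute without extra tooling. A minimal sketch using only the standard library; normalizing to relative frequencies is an assumption, since the slide does not say whether raw counts or frequencies are used.

```python
from collections import Counter


def byte_histogram(data):
    """Return 256 relative frequencies, one per possible byte value."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


# Tiny in-memory example; for a real file, read it with open(path, "rb").
histogram = byte_histogram(b"\x00\x00\xff\x41")
print(histogram[0x00], histogram[0x41], histogram[0xFF])
```

The result is a fixed-length vector regardless of file size, so it can be fed directly to any of the classifiers mentioned earlier.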

26 of 31

Model

27 of 31

EDA

28 of 31

XGBoost

  1. Explainability.
  2. Speed (parallelizes nicely on CPU).
  3. No need for GPU (cheaper).
  4. Thanks to explainability, we can remove features with little relevance, boosting performance even further.
  5. Knowing which features are important, we can come up with other features.
  6. Accuracy: 0.987548132219833

https://github.com/emsi/ergo/tree/master/eda-notebooks

29 of 31

There’s always a way to fool it...

30 of 31

31 of 31

The following slides were censored