1 of 99

Final Year Project: Automatic Malware Analysis using Deep Learning and Sandboxing Technology

By: Ee Sheng


2 of 99

Introduction

Project 1: To explore deep learning on malware binaries and classify them by threat group.

Project 2: To explore open source sandboxing tools for automatic malware analysis.


3 of 99

Table of Contents

1. Deep Learning

> From Chaos to Control: Multi-classification in Malware Analysis

2. Research on Android & Linux Sandboxes

> The Hunt for Zero-Days: Android and Linux Exploits

3. Enhancing CAPEv2

> A Malware Analyst's Secret Weapon

4. Conclusion

> The Epilogue: Putting it all together

4 of 99

Final Year Project Timeline (W1-6)

Week 1

  • Research on various CNN architectures (VGG16)
  • Data pre-processing

Week 2

  • Research on how to build own CNN models
  • Data pre-processing
  • Start on Sandbox project: setting up tools

Week 3

  • Data pre-processing
  • Build own CNN models
  • Research on potential enhancements to CAPEv2

Week 4

  • Data pre-processing
  • Build own CNN models
  • Research and testing of Linux and Android sandboxes

Week 5

  • Reconfigure VGG16 architecture to suit grayscale images
  • Build own CNN models

Week 6

  • Fine-tune my CNN models
  • Set up the Linux & Android tools

5 of 99

Final Year Project Timeline (W7-12)

Week 7

  • Fine-tune my CNN models
  • Set up the Linux & Android tools
  • Midterm presentation

Week 8

  • Install & dual boot Ubuntu 22.04 LTS
  • Set up latest CAPEv2 sandbox
  • Fix tiny bugs on latest CAPEv2

Week 9

  • Set up latest CAPEv2 sandbox
  • Implement enhanced features

Week 10

  • Implement enhanced features
  • Fix bugs with latest CAPEv2

Week 11

  • Fix code at the developers' request
  • Ensure test cases pass

Week 12

  • Final presentation

6 of 99

Deep Learning

Decoding the Mysteries of the CNN Universe


7 of 99

“What are Convolutional Neural Networks (CNNs)”

  • CNNs are a powerful type of deep learning algorithm that is particularly well-suited for analyzing images and videos.
  • Examples: object recognition and face detection

8 of 99

“From pixels to predictions: Understanding the Architecture of CNNs”


9 of 99

"Building Brains: Understanding the Training of a CNN"

  1. "Optimization Algorithm"
    1. Helps the network adjust its settings (weights).
    2. Adaptive gradient algorithms (Adam, Adagrad, RMSprop)

  2. "Loss Function"
    1. Measures how well the network is doing.
    2. Mean squared error (MSE)

  3. "Evaluation Metrics"
    1. Check whether the network's predictions are accurate.
    2. Accuracy, F1 score
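The two evaluation metrics named above can be sketched in a few lines of plain Python. This is only an illustration: the slides do not say how the F1 score was averaged across threat groups, so the macro average used here is an assumption, and the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["group_a", "group_a", "group_b", "group_c"]
y_pred = ["group_a", "group_b", "group_b", "group_c"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```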

10 of 99

“CNNs in Action: Real-world Solutions and Their Impact”


  1. Real-life examples of the application of CNNs, such as:
    1. CNNs being used in self-driving cars to detect and recognize objects on the road.
    2. CNNs being used in the medical field to analyze medical images and detect diseases.
    3. CNNs being used to automatically extract features from malware samples and classify them into different families.

11 of 99

Deep Learning

From Chaos to Control - Multi-classification in Malware Analysis


12 of 99

“Uncovering the Magic of Data Pre-processing: A Crucial Step in Data Analysis”


Data pre-processing is the process of preparing raw data for analysis by cleaning, formatting, and transforming it into a usable form.

  1. Ensure that data is accurate, consistent and relevant to the analysis.
  2. Without proper data pre-processing, the results of data analysis may be unreliable or misleading.

13 of 99

“Mastering the Art of Data Preprocessing: A Step-by-Step Guide”


Step 1: Identify the purpose of the analysis and the types of data needed

Step 2: Collect and import the raw data

Step 3: Clean and format the data in a usable format

Step 4: Transform the data as needed to make it more suitable for analysis (e.g., pixel normalization)

14 of 99

Identify the purpose of the analysis and the types of data needed


  • Executable files and other related files associated with the malware

15 of 99

Step 1: Identify the purpose of the analysis and the types of data needed


16 of 99

Step 1: Identify the purpose of the analysis and the types of data needed (Cont’d)


17 of 99

Step 2: Collect and import the raw data

  1. CSIT samples (password: malware, internal zip password: infected)
  2. Script to import data: unzip.py
    • Recursively unzips .7z, .rar, and .zip archives regardless of OS.
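A minimal sketch of recursive extraction, not the project's unzip.py. It assumes only .zip archives, because Python's standard library cannot open .7z or .rar files (those need third-party packages), and the standard `zipfile` module can only decrypt legacy ZipCrypto passwords. The function name is hypothetical.

```python
import zipfile
from pathlib import Path


def unzip_recursively(archive, dest, password=None):
    """Extract a .zip archive, then any .zip archives found inside it.

    `password` must be bytes (for example b"infected") or None.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest, pwd=password)
    # Snapshot the list first: the recursion below creates new files.
    for inner in list(dest.rglob("*.zip")):
        # Unpack each nested archive into a sibling folder named after it,
        # then delete the archive so it is not processed twice.
        unzip_recursively(inner, inner.with_suffix(""), password)
        inner.unlink()
```

Using `pathlib` keeps the path handling the same on Windows, Linux and macOS. Keep the destination folder separate from the folder holding the original archive.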

18 of 99

Step 2: Collect and import the raw data (unzip.py)


19 of 99

Step 3: Clean and format the data in a usable format

  1. Remove sample files from classes with low sample counts
  2. Remove unusable sample files (e.g., HTML, XML files)
  3. Scripts to format data:
    • convert_bin_to_img.py
      • Converts compiled malware (e.g., .msi, .exe), as well as PDFs and Word documents, into grayscale images.
    • resize_recursively.py
      • Resizes the original images in a directory **recursively** to a specific width and height (1024 x 1024 px)
    • pad_resize_recursively.py
      • Pads the original images in a directory **recursively** to a specific width and height (1024 x 1024 px)
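The idea behind the conversion and padding scripts can be sketched with the standard library alone. This is not the project's code: the default image width, the PGM output format, and the centered zero padding are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def bytes_to_grayscale_rows(data, width=None):
    """Map each byte (0-255) to one pixel and wrap the bytes into rows."""
    if width is None:
        # Assumed default: a roughly square image.
        width = max(1, math.ceil(math.sqrt(len(data))))
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))  # zero-pad the final row
        rows.append(row)
    return rows


def write_pgm(rows, path):
    """Save the rows as a binary PGM (P5) grayscale image."""
    if not rows:
        raise ValueError("cannot write an empty image")
    height, width = len(rows), len(rows[0])
    with open(path, "wb") as fh:
        fh.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            fh.write(bytes(row))


def pad_to_square(rows, size, fill=0):
    """Center the image on a size x size canvas filled with `fill`."""
    height, width = len(rows), len(rows[0])
    if height > size or width > size:
        raise ValueError("image is larger than the target canvas")
    top = (size - height) // 2
    left = (size - width) // 2
    canvas = [[fill] * size for _ in range(size)]
    for y, row in enumerate(rows):
        canvas[top + y][left:left + width] = row
    return canvas
```

Padding keeps the byte layout of a small sample intact, whereas resizing stretches it to fill the target size.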

20 of 99

Step 3: Clean and format the data in a usable format (convert_bin_to_img.py)


21 of 99

Step 3: Clean and format the data in a usable format (resize_recursively.py)


22 of 99

Step 3: Clean and format the data in a usable format (pad_resize_recursively.py)


23 of 99

resize_recursively.py

pad_resize_recursively.py


24 of 99

resize_recursively.py

pad_resize_recursively.py


25 of 99

Step 4: Transform the data for analysis

Pixel-wise Normalization: The process of adjusting the pixel values of the image so that they have a common scale, without distorting the content of the image.

  • Use min-max normalization: set the range to 0 to 1
  • Puts all pixel values on a common scale. Because 8-bit pixel values always lie between 0 and 255, dividing by 255 is min-max normalization with fixed bounds.

Step 1: Resize images from (1024x1024) to (224x224)

Step 2: Convert to 1 Dimension Array

Step 3: Divide the arrays by 255
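The last step can be sketched as follows. This is a minimal illustration on a tiny hand-written image, not the project's code; it skips the resize step and treats the "array" as a single-channel grid of pixel values, which is an assumption.

```python
def min_max_normalize(rows, lo=0, hi=255):
    """Min-max normalization: (pixel - lo) / (hi - lo).

    With the fixed 8-bit bounds lo=0 and hi=255 this is simply pixel / 255.
    """
    span = hi - lo
    return [[(pixel - lo) / span for pixel in row] for row in rows]


# Hypothetical 2 x 2 grayscale image.
image = [[0, 51], [102, 255]]
normalized = min_max_normalize(image)
print(normalized)
```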

26 of 99

Pixel-wise Normalization: Resize images and convert to arrays

27 of 99

Pixel-wise Normalization: Resize images and convert to arrays (Cont’d)

28 of 99

Pixel-wise Normalization: Convert arrays into range of 0 to 1

29 of 99

Pixel-wise Normalization: Convert arrays into range of 0 to 1 (Cont’d)

Why is this important?

  1. A CNN can make use of the entire range of values, and it performs better when learning patterns and features.

  2. Imagine a picture of a cat where the colors and brightness are all over the place.

    • Normalizing the image puts all the pixels on a consistent scale, which makes it easier for the computer to understand what's in the image.

30 of 99

"Unleashing the Power of Machine Learning: A Step-by-Step Guide to Model Training"

Step 1: Choose models and define the training process

Step 2: Train the models

Step 3: Evaluate the models

Step 4: Fine-tune the models


31 of 99

Step 1: Choose models and define the training process (First CNN)

32 of 99

Step 1: Choose models and define the training process (First CNN) (Cont’d)

33 of 99

Step 2: Train the models

34 of 99

Step 2: Train the models

| Dataset | Split Ratio | Developed CNN Models |
|---|---|---|
| Samples >50 (Unpadded) | 6:4 | 3 |
| Samples >50 (Unpadded) | 8:2 | 3 |
| Samples >100 (Unpadded) | 6:4 | 3 |
| Samples >100 (Unpadded) | 8:2 | 3 |
| Samples >50 (Padded) | 6:4 | 3 |
| Samples >50 (Padded) | 8:2 | 3 |
| Samples >100 (Padded) | 6:4 | 3 |
| Samples >100 (Padded) | 8:2 | 3 |
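A split such as 6:4 or 8:2 could be produced as sketched below. The slides do not say how the split was done, so two things here are assumptions: that the ratio means train:test, and that the split is stratified so every threat group keeps the same share in both parts. The function name and file names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_fraction, seed=0):
    """Split (path, label) pairs so each label keeps the same train share."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical dataset: two threat groups with ten images each.
samples = [(f"a{i}.png", "group_a") for i in range(10)]
samples += [(f"b{i}.png", "group_b") for i in range(10)]
train, test = stratified_split(samples, train_fraction=0.6)  # a 6:4 split
```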

35 of 99

Step 3: Evaluate the models (Padded Dataset)

| Model | Convolutional Layers | Dense Layers | Training Data Accuracy | Testing Data Accuracy | F1 Score |
|---|---|---|---|---|---|
| Model 1 | 4 | 2 | 100.0 % | 81.72 % | 0.822 |
| Model 2 | 3 | 2 | 100.0 % | 78.92 % | 0.798 |
| Model 3 | 2 | 2 | 99.46 % | 75.05 % | 0.768 |

36 of 99

Step 3: Evaluate the models (Padded Dataset)

37 of 99

Step 4: Fine-tune the models

Bias regularization is like putting a weight on some of the blocks, so they're harder to move around. This makes it harder to build the tower in one spot, so it becomes more stable.

Dropout regularization is like taking away some of the blocks from the box randomly, so you don't have as many blocks to work with. This makes it so the tower has to be built with a lot of different blocks and not just rely on a few blocks.

Result: Strong & Stable Tower
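Both regularizers can be sketched in plain Python. This is an illustration rather than the project's training code: it assumes the common "inverted" form of dropout and an L2 penalty on the biases, neither of which the slides specify.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`
    and scale the survivors by 1 / (1 - rate). Requires rate < 1."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def l2_bias_penalty(biases, strength):
    """Extra loss term that grows with the size of the biases (assumed L2)."""
    return strength * sum(b * b for b in biases)


rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))
print(l2_bias_penalty([1.0, -2.0], strength=0.005))
```

Dropout is applied only during training. The penalty is added to the loss, so large biases become "harder to move", as in the tower analogy.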

38 of 99

Step 4: Fine-tune the models (Cont’d)

| Model | Convolutional Layers | Dense Layers | Regularization | Training Data Accuracy | Testing Data Accuracy | F1 Score |
|---|---|---|---|---|---|---|
| Model 1 | 4 | 2 | 3 Dropout(0.50) | 99.57 % | 81.94 % | 0.818 |
| Model 1 | 4 | 2 | 2 Dropout(0.80), 1 Dropout(0.50), 6 Bias(0.001) | 87.55 % | 72.9 % | 0.727 |
| Model 1 | 4 | 2 | 2 Dropout(0.80), 1 Dropout(0.50), 6 Bias(0.005) | 94.08 % | 73.76 % | 0.749 |
| Model 1 | 4 | 2 | 3 Dropout(0.80), 6 Bias(0.005) | 91.36 % | 74.62 % | 0.746 |

39 of 99

Step 4: Fine-tune the models (Cont’d)

| Model | Convolutional Layers | Dense Layers | Regularization | Training Data Accuracy | Testing Data Accuracy | F1 Score |
|---|---|---|---|---|---|---|
| Model 2 | 3 | 2 | 2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001) | 96.52 % | 80.43 % | 0.801 |
| Model 2 | 3 | 2 | 2 Dropout(0.30) | 99.95 % | 80.65 % | 0.803 |
| Model 3 | 2 | 2 | 1 Dropout(0.20) | 99.29 % | 77.42 % | 0.779 |
| Model 3 | 2 | 2 | 2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001) | 99.4 % | 76.13 % | 0.77 |

40 of 99

Step 4: Fine-tune the models (Cont’d)

Learning Rates

1e-3, 1e-4, 4e-4, 4e-3, 5e-3, 3e-3, 2e-3, 5e-4, 3e-4

The learning rate is a value that controls how much the model's parameters are updated in each iteration.

When we teach a computer to do something, like recognizing pictures, we need to adjust how it's thinking and understanding.
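The effect of the learning rate can be seen on a toy problem. The sketch below uses plain gradient descent on f(w) = w², which is an assumption made for illustration; the models in this project would normally be trained with an adaptive optimizer on a real loss.

```python
def gradient_descent(learning_rate, steps=50, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient  # the learning rate scales each update
    return w


for lr in (1e-4, 1e-3, 5e-3):
    print(f"lr={lr:g}: w after 50 steps = {gradient_descent(lr):.4f}")
```

On this toy problem a larger learning rate moves w closer to the minimum at 0 in the same number of steps. On a real network, a rate that is too large can overshoot and make training unstable, which is why several values are tried.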

41 of 99

Fine-tune the models

| Model | Convolutional Layers | Dense Layers | Regularization | Learning Rate | Training Data Accuracy | Testing Data Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| Model 1 | 4 | 2 | 3 Dropout(0.50) | 1e-3 | 99.57 % | 81.94 % | 0.818 |
| Model 2 | 3 | 2 | 2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001) | 1e-3 | 96.52 % | 80.43 % | 0.801 |
| Model 3 | 2 | 2 | 1 Dropout(0.20) | 1e-3 | 99.29 % | 77.42 % | 0.779 |

42 of 99

43 of 99

Addition: Transfer Learning: (VGG16)

  • Chosen VGG16 Model
    • Pre-trained on ImageNet. The full ImageNet database holds 14,197,122 images; the standard VGG16 weights were trained on its 1,000-class ILSVRC subset.
    • Used pre-trained VGG16 weights to train on “Grayscale images > 3 Dimensions > Arrays” (transfer learning)
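The ImageNet weights for VGG16 expect 3-channel (RGB) input, while the malware images are single-channel grayscale. One common way to bridge the gap is to repeat the grayscale value in all three channels. Reading "3 Dimensions" as 3 channels is an assumption, and the helper below is a hypothetical sketch, not the project's code.

```python
def gray_to_three_channels(rows):
    """Repeat each grayscale pixel three times: (H, W) -> (H, W, 3)."""
    return [[[pixel, pixel, pixel] for pixel in row] for row in rows]


# Hypothetical 1 x 2 grayscale image.
print(gray_to_three_channels([[0, 255]]))
```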

44 of 99

Addition: Transfer Learning: (VGG16) (Cont’d)

| Model | Convolutional Layers | Dense Layers | Regularization |
|---|---|---|---|
| VGG16 | 13 | 3 | 2 Dropout(0.50) |

45 of 99

Addition: extract_pe_features & extract_opcodes

46 of 99

Addition: Flask Web App with deployed AI Model

47 of 99

The Hunt for Zero-Days:

Android and Linux Exploits


48 of 99

Android and Linux Malware

  • Android Package (.apk)
  • Executable and Linkable Format (.elf)

49 of 99

Successful Android Analysis Tools: MobSF

Mobile Security Framework (MobSF)

  • Best static and dynamic analysis sandboxing tool for Android
  • Supports Android/iOS/Windows
  • Supports APK, XAPK, IPA & APPX
  • Easy installation
  • Displays the most information
  • Multiple VM options (Android Studio, Genymotion)

50 of 99

Successful Android Analysis Tools: MobSF

51 of 99

Successful Android Analysis Tools: APKLeaks

APKLeaks

  • Static analysis only
  • Scans APK files for Uniform Resource Identifiers (URIs), endpoints and secrets
  • Saves a copy of the output to a .txt file
  • Not dynamic
  • Best suited for finding vulnerabilities

  • LinkFinder

52 of 99

Successful Android Analysis Tools: APKLeaks

53 of 99

Successful Android Analysis Tools: RiskinDroid

RiskinDroid

  • Analyses an APK and calculates a risk level from 0 to 100
  • Analyses: declared permissions, exploited permissions, ghost permissions, useless permissions
  • The higher the score, the bigger the threat
  • Only takes permissions into consideration
  • Inspects and rates permissions
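The four permission categories can be expressed as set operations. The definitions below follow our reading of the tool's documentation and should be treated as an assumption: exploited means declared and used, ghost means used but not declared, useless means declared but never used. The sketch does not reproduce how the tool computes its 0-100 score, and the sample permission names are only for illustration.

```python
def classify_permissions(declared, used):
    """Group permissions from the manifest (`declared`) and the code (`used`)."""
    declared, used = set(declared), set(used)
    return {
        "declared": declared,
        "exploited": declared & used,  # declared and actually used
        "ghost": used - declared,      # used but never declared
        "useless": declared - used,    # declared but never used
    }


# Hypothetical app, for illustration only.
result = classify_permissions(
    declared={"INTERNET", "CAMERA", "READ_SMS"},
    used={"INTERNET", "READ_SMS", "RECORD_AUDIO"},
)
print(result)
```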

54 of 99

Successful Android Analysis Tools: RiskinDroid

55 of 99

Successful Android Analysis Tools: Quark-Engine

Quark-Engine

  • Made for penetration testing
  • Open source (early stage)
  • Static analysis
  • Detects crimes committed by applications
  • Reusable & shareable (customisable script)

56 of 99

Successful Android Analysis Tools: Quark-Engine

57 of 99

Successful Android Analysis Tools

Mobile Security Framework (MobSF)

  • Best static and dynamic analysis sandboxing tool for Android
  • Supports Android/iOS/Windows
  • Supports APK, XAPK, IPA & APPX
  • Easy installation
  • Displays the most information
  • Multiple VM options (Android Studio, Genymotion)

Quark-Engine

  • Made for penetration testing
  • Open source (early stage)
  • Static analysis
  • Detects crimes committed by applications
  • Reusable & shareable (customisable script)

APKLeaks

  • Static analysis only
  • Scans APK files for Uniform Resource Identifiers (URIs), endpoints and secrets
  • Saves a copy of the output to a .txt file
  • Not dynamic
  • Best suited for finding vulnerabilities
  • LinkFinder

RiskinDroid

  • Analyses an APK and calculates a risk level from 0 to 100
  • Analyses: declared permissions, exploited permissions, ghost permissions, useless permissions
  • The higher the score, the bigger the threat
  • Only takes permissions into consideration
  • Inspects and rates permissions

58 of 99

Comparison Table

Tools compared (columns): MobSF, RiskinDroid, Quark-Engine, APKLeaks

Criteria (rows): Risk Assessment, File Details, Malicious Indicator, Suspicious Indicator, Other Information, Related Sandbox Artifacts, File Permission, File Activities, File Receivers, File Certificates, Extracted Strings, Extracted Files

[Table cells not recoverable]

59 of 99

Successful Linux Analysis & Sandbox Tool: LiSa

  • Provides an overview, static analysis, dynamic analysis, and network analysis of submitted Linux files:

Overview: File Overview, Downloads, Anomalies

Static Analysis: ELF Info, Imports, Exports, Libraries, Relocations, Symbols, Sections, Strings

Dynamic Analysis: Process Tree, Opened Files, Syscalls

Network Analysis: Endpoints, HTTP Requests, DNS Questions, Telnet Data, IRC Messages

60 of 99

Successful Linux Analysis & Sandbox Tool: LiSa


61 of 99

Using & Enhancing CAPEv2

A Malware Analyst's Secret weapon


62 of 99

Memory Snapshot At Different Phases

  • To better understand the different processes, services, etc. that are running during the execution of the malware, we can create memory dumps of the VM during different phases of execution.

  1. Modify CAPEv2 scheduler script to dump the memory of the VM before and during the analysis at customizable time intervals

  2. Utilize Volatility API to process and parse the memory dump for readability

  3. Create Python scripts and HTML templates to display the memory dump information in the web interface

  4. Modify existing configuration files and relevant Python scripts to ensure that the memory dumps before and during the analysis would be processed and displayed on the web interface
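The "customizable time intervals" in step 1 come down to a list of points in time at which the VM memory is dumped. The helper below is a hypothetical sketch of that calculation only. It is not CAPEv2's scheduler code and does not call any CAPEv2 or Volatility API.

```python
def snapshot_offsets(timeout, interval, include_start=True):
    """Seconds from analysis start at which to dump the VM memory.

    Offset 0 stands for the dump taken before the sample runs.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    offsets = [0] if include_start else []
    offsets.extend(range(interval, timeout, interval))
    return offsets


# Hypothetical settings: a 200-second analysis, one dump per minute.
print(snapshot_offsets(timeout=200, interval=60))
```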

63 of 99

Memory Snapshot At Different Phases


Memory Dumps Before, During, and After the Analysis

(using Cerber Ransomware as an example)

notable differences

64 of 99

CAPEv2 Threat Attribution (with MISP)


  • As part of the client requirements, having threat attribution for samples submitted to CAPEv2 would help to provide better context (i.e. relevant events, threat actors, IOCs, etc.)

  • Malware Information Sharing Platform (MISP) is an “open source software solution for collecting, storing, distributing and sharing cyber security indicators and threats about cyber security” which can be integrated with other solutions

65 of 99

CAPEv2 Threat Attribution (with MISP)


Examples of MISP Events:

Information:

  • Country
  • Target
  • Sector
  • Attack Tool
  • Attack Pattern
  • Threat Actor
  • etc.
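In a MISP event exported as JSON, much of this context is carried by galaxy tags with names such as `misp-galaxy:threat-actor="..."`. The sketch below groups those tags by kind. It is not the code merged into CAPEv2; the event is hand-written and hypothetical, and the exact galaxies attached to an event depend on the MISP instance.

```python
import re
from collections import defaultdict

# Galaxy tags look like: misp-galaxy:<kind>="<value>"
GALAXY_TAG = re.compile(r'^misp-galaxy:(?P<kind>[^=]+)="(?P<value>.*)"$')


def summarize_event_tags(event):
    """Group an event's galaxy tags by kind (threat-actor, sector, ...)."""
    summary = defaultdict(list)
    for tag in event.get("Event", {}).get("Tag", []):
        match = GALAXY_TAG.match(tag.get("name", ""))
        if match:
            summary[match["kind"]].append(match["value"])
    return dict(summary)


example_event = {  # hypothetical, hand-written event
    "Event": {
        "info": "Example campaign",
        "Tag": [
            {"name": 'misp-galaxy:threat-actor="Example Group"'},
            {"name": 'misp-galaxy:sector="Finance"'},
            {"name": "tlp:white"},  # not a galaxy tag, so it is skipped
        ],
    }
}
print(summarize_event_tags(example_event))
```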

66 of 99

CAPEv2 Threat Attribution (with MISP)


Detection: Agentb

Screenshots of Successfully Correlated Samples from VX-Underground

67 of 99

Backup of Reports to Google Drive

  • To provide backups of analysis results, we created a Google Service Account and used the Google API to authenticate and upload the reports to a shared Google Drive folder
  • The credentials file is stored in the directory /opt/CAPEv2/utils, which is the working directory of the reporting script
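An upload with a service account could look roughly like the sketch below. It is not the reporting module merged into CAPEv2. It assumes the `google-auth` and `google-api-python-client` packages are installed, and the function names, folder ID and file paths are placeholders.

```python
from pathlib import Path

# Assumed scope: full Drive access, so the account can write to the shared folder.
SCOPES = ["https://www.googleapis.com/auth/drive"]


def build_file_metadata(report_path, folder_id):
    """Drive metadata: the file name plus the shared folder as parent."""
    return {"name": Path(report_path).name, "parents": [folder_id]}


def upload_report(report_path, folder_id, credentials_file):
    """Upload one report and return the ID of the created Drive file."""
    # Imported lazily so the helper above works without the Google packages.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    creds = service_account.Credentials.from_service_account_file(
        credentials_file, scopes=SCOPES
    )
    service = build("drive", "v3", credentials=creds)
    created = service.files().create(
        body=build_file_metadata(report_path, folder_id),
        media_body=MediaFileUpload(report_path),
        fields="id",
    ).execute()
    return created["id"]
```

The shared folder has to be shared with the service account's e-mail address, otherwise the upload is rejected.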

68 of 99

CAPEv2 WHOIS Lookup (with WhoisXML)


  • Some forms of malware may utilize Command-and-Control (C2) servers to receive commands and send data from the infiltrated devices back to the servers, e.g. botnets, trojans, etc.

  • An example is Emotet, a banking trojan aimed at stealing banking credentials from infected hosts

  • When we submit an Emotet sample to CAPEv2, this is a section of the analysis result:

By default, CAPEv2 would return the IP addresses contacted, as well as countries that they are located in, using GeoIP

To gather more information about the contacted domains, WHOIS Lookup can be utilized

69 of 99

CAPEv2 WHOIS Lookup (with WhoisXML)


  • For the CAPEv2 development, we utilized the WHOIS lookup service to obtain more information about contacted domains (if any)

Information:

  • Expiry Date
  • Registrant Org.
  • Registrant State
  • Registrant Country
  • Name Servers
  • etc.

70 of 99

CAPEv2 WHOIS Lookup (with WhoisXML)


  • WhoisXML API (https://www.whoisxmlapi.com/) provides services such as domain discovery, email verification, DNS lookup, WHOIS lookup, etc.
  • For the CAPEv2 development, we utilized the WHOIS lookup service to obtain more information about contacted domains (if any)

  • Below is a section of the JSON output for the WHOIS lookup against domain “google.com”:

Information:

  • Expiry Date
  • Registrant Org.
  • Registrant State
  • Registrant Country
  • Name Servers
  • etc.

71 of 99

CAPEv2 WHOIS Lookup (with WhoisXML)

  1. Register for a WhoisXML account and generate an API key (each free account is entitled to only 500 WHOIS lookup API requests)

  2. Modify CAPEv2 processing scripts, configuration files and HTML templates to extract and display the domain information in the web interface

network.py: If the submitted sample contacts any hosts, perform a WHOIS lookup against each host and obtain the output in JSON format

processing.conf: Add the information network.py needs to perform the API query and obtain domain information
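The lookup itself can be sketched with the standard library. This is not the code added to network.py. The endpoint, query parameters and JSON field names below follow our understanding of the WhoisXML API and should be checked against its documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the WHOIS lookup service.
WHOIS_ENDPOINT = "https://www.whoisxmlapi.com/whoisserver/WhoisService"


def build_whois_url(domain, api_key):
    """Build the request URL for one domain, asking for JSON output."""
    query = urlencode(
        {"apiKey": api_key, "domainName": domain, "outputFormat": "JSON"}
    )
    return f"{WHOIS_ENDPOINT}?{query}"


def fetch_whois(domain, api_key, timeout=10):
    """Perform the lookup (each call uses up one API request)."""
    with urlopen(build_whois_url(domain, api_key), timeout=timeout) as resp:
        return json.load(resp)


def summarize_whois(response):
    """Pick out the fields shown in the web interface (assumed field names)."""
    record = response.get("WhoisRecord", {})
    registrant = record.get("registrant", {})
    return {
        "expiry_date": record.get("expiresDate"),
        "registrant_org": registrant.get("organization"),
        "registrant_state": registrant.get("state"),
        "registrant_country": registrant.get("country"),
        "name_servers": record.get("nameServers", {}).get("hostNames", []),
    }
```

Because a free account has a limited number of requests, it helps to look up each contacted domain only once per analysis.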

72 of 99

CAPEv2 WHOIS Lookup (with WhoisXML)


Screenshots of WHOIS lookup results for malware that contacts domains

Detection: Zebrocy

Detection: Emotet

73 of 99

Challenges


74 of 99

Software Engineer Problems: Trial & Error


75 of 99

Software Engineer Problems: Pull Request Conflicts


76 of 99

Software Engineer Problems: Failed Test Cases


77 of 99

Software Engineer Problems: Failed Ruff Tests


78 of 99

Software Engineer Problems: Trial & Error (Cont’d)


79 of 99

Software Engineer: Pending Outcome


80 of 99

Enhancements Merged - Google Backup


81 of 99

Enhancements Merged - Google Backup (Cont’d)


82 of 99

Enhancements Merged - Google Backup (Cont’d)


83 of 99

Enhancements Merged - Google Backup (Cont’d)


84 of 99

Enhancements Merged - MISP


85 of 99

Enhancements Merged - MISP (Cont’d)


86 of 99

Enhancements - WHOIS (Pending)


87 of 99

Enhancements - WHOIS (Pending)


88 of 99

Enhancements - Memory Analysis (Rejected)


89 of 99

Enhancements Community Merged


90 of 99

Enhancements Community Merged - WHOIS


91 of 99

Enhancements Community Merged - MISP


92 of 99

Enhancements Community Merged - MISP HTML


93 of 99

Enhancements Community Merged - MISP HTML


94 of 99

Drakvuf Setup Ubuntu 22.04 LTS - In Progress


95 of 99

The Epilogue

Putting it all together


96 of 99

Challenges & issues faced

  • No Godfather to sponsor GPU
  • Not enough data
  • Ambiguous documentation
  • Inexperienced in malware analysis
  • Updating predecessors' code

97 of 99

Takeaways from the Project


  • Deep Learning Malware Classification
  • Linux & Android sandboxes
  • The processes of automated malware analysis
  • Time management for 2 major projects

98 of 99

Further exploration

  • Test out other malware analysis tools
  • Obfuscate the malware
  • Compare results with other AI models
  • Combine models and average their predictions

99 of 99

Thank you!


Credits: This presentation template was created by Slidesgo, and includes icons by Flaticon, and infographics & images by Freepik