Final Year Project: Automatic Malware Analysis using Deep Learning and Sandboxing Technology

By: Ee Sheng

Introduction

Project 1: Explore deep learning on malware binaries to classify them into threat groups.

Project 2: Explore open-source sandboxing tools for automatic malware analysis.

Table of Contents

1. Deep Learning
> From Chaos to Control: Multi-classification in Malware Analysis

2. Research on Android & Linux Sandboxes

3. Enhancing CAPEv2
> A Malware Analyst's Secret weapon

4. Conclusion
> The Epilogue: Putting it all together

Final Year Project Timeline (W1-6)

Week 1
Research on various CNN architectures (VGG16)
Data pre-processing

Week 2
Research on how to build own CNN models
Data pre-processing
Start on Sandbox project: setting up tools

Week 3
Data pre-processing
Build own CNN models
Research on potential enhancements to CAPEv2

Week 4
Data pre-processing
Build own CNN models
Research and testing of Linux and Android sandboxes

Week 5
Reconfigure VGG16 architecture to suit grayscale images
Build own CNN models

Week 6
Fine-tune my CNN models
Set up the Linux & Android tools

Final Year Project Timeline (W7-12)

Week 7
Fine-tune my CNN models
Set up the Linux & Android tools
Midterm presentation

Week 8
Install and dual-boot Ubuntu 22.04 LTS
Set up latest CAPEv2 sandbox
Fix minor bugs in latest CAPEv2

Week 9
Set up latest CAPEv2 sandbox
Implement enhanced features

Week 10
Implement enhanced features
Fix bugs in latest CAPEv2

Week 11
Fix code at the developers' request
Ensure test cases pass

Week 12
Final presentation

Deep Learning

Decoding the Mysteries of the CNN Universe

“What are Convolutional Neural Networks (CNNs)?”

CNNs are a powerful type of deep learning model that is particularly well suited to analyzing images and videos.
Examples: object recognition and face detection.

“From pixels to predictions: Understanding the Architecture of CNNs”

Input layer: This is where an image is fed into the network to be analyzed.
Convolutional layers: These layers process the image and identify important features within it. Think of them as a set of filters that slide over the image, like a magnifying glass, to pick out certain aspects of it.
Pooling layers: These layers simplify the network by reducing the amount of information being processed. They shrink each feature map a little, which makes the network cheaper to compute and less sensitive to small shifts in the image.
Fully connected layers: These layers take all the important information identified by the previous layers and combine it to give a final output. This is the final step in analyzing the image, where the network makes a decision about what it is looking at.
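
To make the convolution and pooling steps concrete, here is a minimal pure-Python sketch on a toy 5x5 image. The image values and the `edge_kernel` filter are made up for illustration; a real CNN learns its filter values during training and would normally be built with a deep learning library instead.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy grayscale image: dark on the left, bright on the right.
image = [[0, 0, 9, 9, 9] for _ in range(5)]
# Hypothetical filter that responds where brightness rises from left to right.
edge_kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, edge_kernel)  # 4x4 feature map, strongest at the edge
pooled = max_pool(features)            # 2x2 summary of the feature map
```

The feature map is non-zero only where the dark-to-bright edge sits, and pooling keeps that response while quartering the amount of data passed on.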

"Building Brains: Understanding the Training of a CNN"

"Optimization Algorithm"

Helps the network adjust its settings (weights).
Adaptive gradient algorithms (Adam, Adagrad, RMSprop)

"Loss Function"

Measures how well the network is doing.
Mean squared error (MSE)

"Evaluation Metrics"

Check whether the network is accurate.
Accuracy, F1 score
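
As a rough illustration of the two evaluation metrics named above, the sketch below computes accuracy and a macro-averaged F1 score by hand. The deck does not say which F1 averaging was used, so macro averaging is an assumption here, and the labels are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Average of the per-class F1 scores, each class weighted equally."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented labels for illustration only.
y_true = ["emotet", "emotet", "cerber", "cerber", "zebrocy"]
y_pred = ["emotet", "cerber", "cerber", "cerber", "zebrocy"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Accuracy alone can look good when one class dominates the dataset, which is why an F1 score is usually reported alongside it for multi-class problems.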

“CNNs in Action: Real-world Solutions and Their Impact”

Real-life examples of the application of CNNs, such as:

CNNs being used in self-driving cars to detect and recognize objects on the road.
CNNs being used in the medical field to analyze medical images and detect diseases.
CNNs being used to automatically extract features from malware samples and classify them into different families.

Deep Learning

From Chaos to Control - Multi-classification in Malware Analysis

“Uncovering the Magic of Data Pre-processing: A Crucial Step in Data Analysis”

Ensure that data is accurate, consistent and relevant to the analysis.
Without proper data pre-processing, the results of data analysis may be unreliable or misleading.

“Mastering the Art of Data Preprocessing: A Step-by-Step Guide”

Step 1: Identify the purpose of the analysis and the types of data needed

Step 2: Collect and import the raw data

Step 3: Clean and format the data in a usable format

Step 4: Transform the data as needed to make it more suitable for analysis (e.g., pixel normalization)

Step 1: Identify the purpose of the analysis and the types of data needed

Data needed: executable files and other related files associated with the malware

Step 2: Collect and import the raw data

CSIT Samples (Password: malware, internal zip password: infected)
Script to import data: unzip.py

- Recursively unzips .7z, .rar and .zip archives regardless of OS.

Step 2: Collect and import the raw data (unzip.py)
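
The sketch below shows one way recursive extraction can work, and it is not the project's unzip.py. It is an assumption-laden simplification: it handles only .zip archives through the standard library, whose `zipfile` module can decrypt only legacy ZIP passwords, while .7z and .rar would need third-party packages. The function name `extract_nested_zips` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(root, password=None):
    """Extract every .zip under root, including archives found inside archives.

    Returns the number of archives that were extracted successfully.
    """
    root = Path(root)
    pwd = password.encode() if password else None
    seen = set()
    extracted = 0
    while True:
        # Re-scan after each pass so newly unpacked archives are picked up.
        pending = [p for p in root.rglob("*.zip") if p.is_file() and p not in seen]
        if not pending:
            return extracted
        for archive in pending:
            seen.add(archive)
            target = archive.with_suffix("")  # sample.zip -> sample/
            try:
                with zipfile.ZipFile(archive) as zf:
                    zf.extractall(target, pwd=pwd)
                extracted += 1
            except (zipfile.BadZipFile, RuntimeError):
                # Corrupt archive, wrong password or unsupported encryption: skip it.
                continue
```

Using `pathlib` keeps the path handling the same on Windows and Linux, which matches the "regardless of OS" goal. Malware archives should only ever be unpacked inside an isolated analysis environment.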

Step 3: Clean and format the data in a usable format

Remove samples from families with low sample counts
Remove unusable sample files (e.g., HTML and XML files)
Scripts to format data:

convert_bin_to_img.py

Converts compiled malware (e.g., .msi, .exe), as well as PDFs and Word documents, into grayscale images.

resize_recursively.py

Recursively resizes the original images in a directory to a specific width and height (1024 × 1024 px)

pad_resize_recursively.py

Recursively pads the original images in a directory to a specific width and height (1024 × 1024 px)

Step 3: Clean and format the data in a usable format (convert_bin_to_img.py)
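
The idea behind a binary-to-image conversion can be sketched as follows: each byte (0 to 255) becomes one grayscale pixel, and the byte stream is wrapped into rows. This is not the project's convert_bin_to_img.py. The fixed row width of 256 and the PGM output format are assumptions chosen to keep the example dependency-free, and the actual script may well use an imaging library and a different width rule.

```python
import math


def bytes_to_grayscale_rows(data, width=256):
    """Map each byte to one pixel and wrap the stream into rows of `width` pixels.

    The last row is zero-padded so that the image is rectangular.
    """
    if not data:
        raise ValueError("cannot build an image from an empty file")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def write_pgm(rows, path):
    """Save the pixel rows as a binary PGM (P5) grayscale image."""
    height, width = len(rows), len(rows[0])
    with open(path, "wb") as fh:
        fh.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            fh.write(bytes(row))
```

For example, `write_pgm(bytes_to_grayscale_rows(open(path, "rb").read()), "sample.pgm")` would turn one file into one image, where `path` is whatever sample is being converted.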

Step 3: Clean and format the data in a usable format (resize_recursively.py)

Step 3: Clean and format the data in a usable format (pad_resize_recursively.py)
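
The difference between resizing and padding can be shown on a tiny image. This is a sketch under assumptions, not the project's scripts: resizing uses nearest-neighbour sampling, and padding places the original in the top-left corner and fills the rest with zeros. The deck does not state which interpolation or padding position was actually used.

```python
def resize_nearest(rows, size):
    """Stretch or shrink the image to size x size using nearest-neighbour sampling."""
    src_h, src_w = len(rows), len(rows[0])
    return [
        [rows[r * src_h // size][c * src_w // size] for c in range(size)]
        for r in range(size)
    ]


def pad_to_square(rows, size, fill=0):
    """Leave the pixels unchanged and fill the rest of a size x size canvas with `fill`."""
    src_h, src_w = len(rows), len(rows[0])
    if src_h > size or src_w > size:
        raise ValueError("image is larger than the target canvas")
    padded = [row + [fill] * (size - src_w) for row in rows]
    padded += [[fill] * size for _ in range(size - src_h)]
    return padded


tiny = [[1, 2], [3, 4]]
resized = resize_nearest(tiny, 4)  # every pixel is stretched to a 2x2 patch
padded = pad_to_square(tiny, 3)    # original pixels kept, border filled with 0
```

Resizing distorts the byte layout to fit the target shape, while padding preserves it at the cost of blank regions, which is why both variants are worth training on and comparing.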

Comparison of outputs: resize_recursively.py vs. pad_resize_recursively.py

Step 4: Transform the data for analysis

Use min-max normalization: rescale pixel values to the range 0 to 1
This puts all inputs on a common scale. Because pixel values are already bounded between 0 and 255, dividing by 255 is sufficient.

Step 1: Resize images from (1024x1024) to (224x224)

Step 2: Convert to 1 Dimension Array

Step 3: Arrays / 255
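
Steps 2 and 3 above can be sketched in a few lines of plain Python. The resizing step is left out, the 2x2 image is made up for illustration, and the helper names are hypothetical.

```python
def flatten(rows):
    """Turn a 2D image (list of rows) into a 1D list of pixel values."""
    return [pixel for row in rows for pixel in row]


def normalize_pixels(pixels, lo=0, hi=255):
    """Min-max scale pixel values from [lo, hi] to [0.0, 1.0]."""
    span = hi - lo
    return [(p - lo) / span for p in pixels]


image = [[0, 51], [102, 255]]
vector = normalize_pixels(flatten(image))  # values now lie between 0.0 and 1.0
```

With `lo=0` and `hi=255` the formula reduces to dividing every value by 255, which is exactly Step 3.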

Pixel-wise Normalization: Resize images and convert to arrays

Pixel-wise Normalization: Convert arrays into range of 0 to 1

Why is this important?

With inputs on a common 0-to-1 scale, the CNN trains more stably and learns patterns and features more easily.

Imagine a picture of a cat in which the raw pixel values range anywhere from 0 to 255.

Normalizing the image maps every pixel onto the same 0-to-1 scale without changing the picture itself, which gives the network inputs of consistent magnitude to learn from.

"Unleashing the Power of Machine Learning: A Step-by-Step Guide to Model Training"

Step 1: Choose models and define the training process

Step 2: Train the models

Step 3: Evaluate the models

Step 4: Fine-tune the models

Step 1: Choose models and define the training process (First CNN)

Step 2: Train the models

Datasets	Split Ratio (train:test)	Developed CNN Models
Samples >50 (Unpadded)	6:4	3 models
Samples >50 (Unpadded)	8:2	3 models
Samples >100 (Unpadded)	6:4	3 models
Samples >100 (Unpadded)	8:2	3 models
Samples >50 (Padded)	6:4	3 models
Samples >50 (Padded)	8:2	3 models
Samples >100 (Padded)	6:4	3 models
Samples >100 (Padded)	8:2	3 models
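
A split such as 8:2 or 6:4 can be sketched as below. The deck does not say whether the split was stratified, so splitting per class (keeping each family's ratio the same in both sets) is an assumption, as are the function name and the fixed seed.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split (path, label) pairs so each label keeps roughly the same train/test ratio."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

Calling `stratified_split(samples, 0.8)` gives the 8:2 split and `stratified_split(samples, 0.6)` the 6:4 split, where `samples` is a list of `(image_path, family)` pairs.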

Step 3: Evaluate the models (Padded Dataset)

Models	Convolutional Layers	Dense Layer	Training Data Accuracy	Testing Data Accuracy	F1 Score
Model 1	4	2	100.0 %	81.72 %	0.822
Model 2	3	2	100.0 %	78.92 %	0.798
Model 3	2	2	99.46 %	75.05 %	0.768

Step 4: Fine-tune the models

Picture training a model as building a tower from a box of blocks.

Bias regularization is like putting a weight on some of the blocks so they are harder to move around. This makes it harder to pile the whole tower up in one spot, so it becomes more stable.

Dropout regularization is like randomly taking some of the blocks out of the box, so there are fewer blocks to work with at any one time. The tower then has to be built from many different blocks instead of relying on just a few.

Result: Strong & Stable Tower
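
In code, the two ideas look roughly like this. It is a minimal sketch: the dropout shown is the common "inverted" variant, and the penalty is assumed to be an L2 term, since the deck gives only the strengths (for example 0.001) and not the penalty type.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the survivors."""
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


def l2_penalty(values, strength):
    """Regularization term added to the loss: strength * sum of squared values."""
    return strength * sum(v * v for v in values)


rng = random.Random(0)
dropped = dropout([1.0] * 1000, rate=0.5, rng=rng)  # about half become 0.0
penalty = l2_penalty([0.5, -0.5], strength=0.001)   # small cost for large biases
```

Dropout is applied only while training; at prediction time all units are kept. The penalty grows with the size of the regularized values, which nudges the optimizer towards smaller ones.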

Step 4: Fine-tune the models (Cont’d)

Models	Convolutional Layers	Dense Layer	Regularization	Training Data Accuracy	Testing Data Accuracy	F1 Score
Model 1	4	2	3 Dropout(0.50)	99.57 %	81.94 %	0.818
Model 1	4	2	2 Dropout(0.80), 1 Dropout(0.50), 6 Bias(0.001)	87.55 %	72.9 %	0.727
Model 1	4	2	2 Dropout(0.80), 1 Dropout(0.50), 6 Bias(0.005)	94.08 %	73.76 %	0.749
Model 1	4	2	3 Dropout(0.80), 6 Bias(0.005)	91.36 %	74.62 %	0.746

Step 4: Fine-tune the models (Cont’d)

Models	Convolutional Layers	Dense Layer	Regularization	Training Data Accuracy	Testing Data Accuracy	F1 Score
Model 2	3	2	2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001)	96.52 %	80.43 %	0.801
Model 2	3	2	2 Dropout(0.30)	99.95 %	80.65 %	0.803
Model 3	2	2	1 Dropout(0.20)	99.29 %	77.42 %	0.779
Model 3	2	2	2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001)	99.4 %	76.13 %	0.77

Step 4: Fine-tune the models (Cont’d)

Learning rates tried: 1e-3, 1e-4, 4e-4, 4e-3, 5e-3, 3e-3, 2e-3, 5e-4, 3e-4

The learning rate is a value that controls how much the model's parameters are updated in each iteration.

When we teach a computer to do something, like recognizing pictures, we need to adjust how it thinks and understands. The learning rate sets how large each adjustment is: too large and training overshoots, too small and it learns slowly.
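
The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes f(w) = w² with plain gradient descent; it is not the Adam-style optimizer a CNN would typically use, and the rates 0.1 and 1.1 are toy values chosen to show the extremes, not values from the project.

```python
def gradient_descent(lr, steps=50, start=5.0):
    """Minimize f(w) = w**2 with plain gradient descent; the gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w  # step against the gradient, scaled by the learning rate
    return w


slow = gradient_descent(1e-3)     # barely moves away from the start value
good = gradient_descent(0.1)      # ends very close to the minimum at 0
diverged = gradient_descent(1.1)  # every step overshoots and the value blows up
```

The same trade-off drives the sweep of learning rates listed above: each candidate is trained and the one with the best test performance is kept.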

Fine-tune the models

Models	Convolutional Layers	Dense Layer	Regularization	Training Data Accuracy	Testing Data Accuracy	F1 Score
Model 1	4	2	3 Dropout(0.50), lr=1e-3	99.57 %	81.94 %	0.818
Model 2	3	2	2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001), lr=1e-3	96.52 %	80.43 %	0.801
Model 3	2	2	1 Dropout(0.20), lr=1e-3	99.29 %	77.42 %	0.779

Addition: Transfer Learning (VGG16)

Chosen VGG16 Model

Pre-trained on ImageNet. The full ImageNet collection holds 14,197,122 images; the commonly distributed VGG16 weights were trained on its 1,000-class ILSVRC subset.
Used the pre-trained VGG16 weights to train on “Grayscale images > 3 Dimensions > Arrays”
(Transfer Learning)
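
VGG16's pre-trained weights expect 3-channel (RGB) input, so grayscale images need a third dimension before they can be fed in. One common approach, assumed here because the deck does not show the actual conversion, is to repeat the single gray channel three times:

```python
def gray_to_three_channels(rows):
    """Repeat each grayscale pixel three times so the image has shape (H, W, 3)."""
    return [[[pixel, pixel, pixel] for pixel in row] for row in rows]


gray = [[0, 128], [255, 64]]
rgb_like = gray_to_three_channels(gray)  # shape 2 x 2 x 3
```

The image looks the same as before, but its shape now matches what the pre-trained convolutional layers were built for.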

Addition: Transfer Learning (VGG16) (Cont’d)

Models	Convolutional Layers	Dense Layer	Regularization
VGG16	13	3	2 Dropout(0.50)

Addition: extract_pe_features & extract_opcodes

Addition: Flask Web App with deployed AI Model

The Hunt for Zero-Days:

Android and Linux Exploits

Android and Linux Malware

Android Package (.apk)
Executable and Linkable Format (.elf)

Successful Android Analysis Tools: MobSF

Mobile Security Framework (MobSF)

Best static and dynamic analysis sandboxing tool for Android among the tools tested
Supports Android/iOS/Windows
Supports APK, XAPK, IPA & APPX
Easy installation
Most information displayed
Multiple VM options (Android Studio, Genymotion)

Successful Android Analysis Tools: APKLeaks

APKLeaks

Static analysis only (no dynamic analysis)
Scans APK files for Uniform Resource Identifiers (URIs), endpoints and secrets
Saves a copy of the output to a .txt file
Best suited for finding vulnerabilities

Key feature: LinkFinder

Successful Android Analysis Tools: RiskInDroid

RiskInDroid

Analyses an APK and calculates a risk level from 0 to 100
Analyses declared, exploited, ghost and useless permissions
The higher the score, the bigger the threat
Only takes permissions into consideration

Key feature: Inspects and rates permissions
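
The four permission categories can be expressed as simple set operations. The definitions below reflect my understanding of the terms (exploited = declared and used, ghost = used but not declared, useless = declared but not used) and should be checked against the RiskInDroid documentation. The sketch does not reproduce how the tool turns these sets into a 0 to 100 score, and the permission names are illustrative.

```python
def categorize_permissions(declared, used):
    """Group an app's permissions by how they are declared versus actually used."""
    declared, used = set(declared), set(used)
    return {
        "declared": declared,           # listed in the manifest
        "exploited": declared & used,   # listed and used in code
        "ghost": used - declared,       # used in code but never listed
        "useless": declared - used,     # listed but never used
    }


categories = categorize_permissions(
    declared={"INTERNET", "CAMERA"},
    used={"INTERNET", "READ_SMS"},
)
```

In this example `READ_SMS` is a ghost permission and `CAMERA` is a useless one, the kind of mismatch a permission-based risk score can take into account.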

Successful Android Analysis Tools: Quark-Engine

Quark-Engine

Made for penetration testing
Open source (early stage)
Static analysis
Detects “crimes” (malicious behaviour) committed by applications

Key feature: Reusable & shareable (customisable script)

Successful Android Analysis Tools

Comparison Table

Feature	MobSF	RiskInDroid	Quark-Engine	APKLeaks
Risk Assessment	✅	❌	✅	❌
File Details	✅	❌	✅	❌
Malicious Indicator	✅	❌	✅	❌
Suspicious Indicator	✅	❌	✅	❌
Other Information	✅	❌	❌	❌
Related Sandbox Artifacts	✅	❌	❌	❌
File Permission	✅	✅	❌	❌
File Activities	✅	❌	❌	❌
File Receivers	✅	❌	❌	❌
File Certificates	✅	❌	❌	❌
Extracted Strings	✅	❌	❌	❌
Extracted Files	✅	❌	❌	❌

I selected these 4 tools from all the tools tested based on the type and amount of information displayed, installation complexity and how useful the tool might be. I chose RiskInDroid because it is the only tool that calculates risk from an app's permissions. I chose MobSF because it gave the most in-depth and detailed dynamic analysis of the tools I tested, and its installation is very simple. APKLeaks is the only tool that includes LinkFinder, a Python script written to discover endpoints and their parameters in JavaScript files, which helps penetration testers find vulnerabilities. I chose Quark-Engine because its GitHub page describes it as “Re-Usable & Sharable”: similar to CAPE, a customised script can be written once and run for every analysis.

Successful Linux Analysis & Sandbox Tool: LiSa

An open-source sandbox project providing automated Linux malware analysis on various CPU architectures (GitHub: danieluhricek/LiSa)

Provides an overview, static analysis, dynamic analysis and network analysis of submitted Linux files:

Overview:

File Overview, Downloads, Anomalies

Static Analysis:

ELF Info, Imports, Exports, Libraries, Relocations, Symbols, Sections, Strings

Dynamic Analysis:

Process Tree, Opened Files, Syscalls

Network Analysis:

Endpoints, HTTP Requests, DNS Questions, Telnet Data, IRC Messages

Using & Enhancing CAPEv2

A Malware Analyst's Secret weapon

Memory Snapshot At Different Phases

To better understand the different processes, services, etc. that are running during the execution of the malware, we can create memory dumps of the VM during different phases of execution.

Modify CAPEv2 scheduler script to dump the memory of the VM before and during the analysis at customizable time intervals

Utilize Volatility API to process and parse the memory dump for readability

Create Python scripts and HTML templates to display the memory dump information in the web interface

Modify existing configuration files and relevant Python scripts to ensure that the memory dumps before and during the analysis would be processed and displayed on the web interface
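
The "before and during the analysis at customizable time intervals" behaviour can be sketched as a small scheduling loop. This is not CAPEv2 code: `take_snapshot` is a hypothetical callback standing in for whatever actually dumps the VM's memory, and the label format is invented.

```python
import time


def run_with_snapshots(timeout, interval, take_snapshot, sleep=time.sleep):
    """Call take_snapshot(label) once before the run, then every `interval` seconds.

    `timeout` is the total analysis time in seconds. `sleep` is injectable so
    the schedule can be exercised without actually waiting.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    take_snapshot("before")  # baseline dump before the sample executes
    elapsed = 0
    while elapsed + interval <= timeout:
        sleep(interval)
        elapsed += interval
        take_snapshot(f"during_{elapsed}s")
```

For a 200-second analysis with a 60-second interval this yields a baseline dump plus dumps at 60, 120 and 180 seconds, which can then be compared to see what the malware changed over time.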

Memory Snapshot At Different Phases

(using Cerber Ransomware as an example)

notable differences

CAPEv2 Threat Attribution (with MISP)

As part of the client requirements, having threat attribution for samples submitted to CAPEv2 would help to provide better context (i.e. relevant events, threat actors, IOCs, etc.)

Malware Information Sharing Platform (MISP) is an “open source software solution for collecting, storing, distributing and sharing cyber security indicators and threats about cyber security incidents analysis and malware analysis”, which can be integrated with other solutions

Information:

Country
Target
Sector
Attack Tool
Attack Pattern
Threat Actor
etc.

Screenshots of Successfully Correlated Samples from VX-Underground

More test results are available at https://docs.google.com/document/d/1O-GqjsxMuTSLMuxnk822z4u1gSQSwpcG_Hie_86MErU/edit?usp=sharing

Backup of Reports to Google Drive

To back up analysis results, we created a Google service account and used the Google API to authenticate and upload the reports to a shared Google Drive folder
The credentials file is stored in the directory /opt/CAPEv2/utils, which is the working directory of the reporting script

CAPEv2 WHOIS Lookup (with WhoisXML)

Some forms of malware may utilize Command-and-Control (C2) servers to receive commands and send data from the infiltrated devices back to the servers, e.g. botnets, trojans, etc.

An example is Emotet, a banking trojan aimed at stealing banking credentials from infected hosts

When we submit an Emotet sample to CAPEv2, this is a section of the analysis result:

By default, CAPEv2 would return the IP addresses contacted, as well as countries that they are located in, using GeoIP

To gather more information about the contacted domains, WHOIS Lookup can be utilized

For the CAPEv2 development, we utilized the WHOIS lookup service to obtain more information about contacted domains (if any)

Information:

Expiry Date
Registrant Org.
Registrant State
Registrant Country
Name Servers
etc.

WhoisXML API (https://www.whoisxmlapi.com/) provides services such as domain discovery, email verification, DNS lookup, WHOIS lookup, etc.

Below is a section of the JSON output for the WHOIS lookup against domain “google.com”:

Register for a WhoisXML account and generate API key (each free account is entitled to only 500 WHOIS lookup API requests)

Modify CAPEv2 processing scripts, configuration files and HTML templates to extract and display the domain information in the web interface

network.py

If the submitted sample contacts any hosts, perform a WHOIS lookup against each host and obtain the output in JSON format

processing.conf

Add necessary information for network.py to perform the API query and obtain domain information
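
A minimal sketch of the lookup logic follows; it is not the actual change made to network.py. The endpoint URL, the query parameter names and the JSON field names (`WhoisRecord`, `expiresDate`, `registrant`, `nameServers.hostNames`) are assumptions based on my reading of the WhoisXML API and should be verified against its documentation. The sample response is an illustrative placeholder, not real WHOIS data, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Assumed WHOIS endpoint of the WhoisXML API.
WHOIS_ENDPOINT = "https://www.whoisxmlapi.com/whoisserver/WhoisService"


def build_whois_url(domain, api_key):
    """Build the lookup URL for one contacted domain, requesting JSON output."""
    query = urlencode({"apiKey": api_key, "domainName": domain, "outputFormat": "JSON"})
    return f"{WHOIS_ENDPOINT}?{query}"


def summarize_whois(payload):
    """Pick out expiry date, registrant details and name servers; missing keys become None."""
    record = payload.get("WhoisRecord", {})
    registrant = record.get("registrant", {})
    return {
        "expiry_date": record.get("expiresDate"),
        "registrant_org": registrant.get("organization"),
        "registrant_state": registrant.get("state"),
        "registrant_country": registrant.get("country"),
        "name_servers": record.get("nameServers", {}).get("hostNames", []),
    }


# Illustrative placeholder response, not real WHOIS data.
sample = json.loads("""
{"WhoisRecord": {"expiresDate": "2030-01-01T00:00:00Z",
                 "registrant": {"organization": "Example Org", "state": "CA", "country": "US"},
                 "nameServers": {"hostNames": ["ns1.example.com"]}}}
""")
summary = summarize_whois(sample)
url = build_whois_url("example.com", "YOUR_API_KEY")
```

Keeping the API key in a configuration file rather than in the code, as processing.conf does, means the key can be rotated once the free quota of 500 requests runs out without touching the script.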

Detection: Zebrocy

Detection: Emotet

Challenges

Software Engineer Problems: Trial & Error

Software Engineer Problems: Pull Request Conflicts

Software Engineer Problems: Failed Test Cases

Software Engineer Problems: Failed Ruff Tests

Software Engineer: Pending Outcome

Enhancements Merged - Google Backup

Enhancements Merged - MISP

Enhancements - WHOIS (Pending)

Enhancements - Memory Analysis (Rejected)

Enhancements Community Merged

Enhancements Community Merged - WHOIS

Enhancements Community Merged - MISP

Enhancements Community Merged - MISP HTML

Drakvuf Setup Ubuntu 22.04 LTS - In Progress

The Epilogue

Putting it all together

Challenges & issues faced

No Godfather to sponsor GPU
Not enough data
Ambiguous documentation
Inexperienced in malware analysis
Updating predecessors' code

Takeaways from the Project

Deep Learning Malware Classification
Linux & Android sandboxes
The processes of automated malware analysis
Time management for 2 major projects

Further exploration

Test out other malware analysis tools
Obfuscate the malware
Compare results with other AI models
Combine models and average their predictions

Thank you!

Credits: This presentation template was created by Slidesgo, and includes icons by Flaticon, and infographics & images by Freepik
