Final Year Project: Automatic Malware Analysis using Deep Learning and Sandboxing Technology
By: Ee Sheng
Introduction
Project 1: To explore deep learning on malware binaries and classify them into threat groups.
Project 2: To explore open-source sandboxing tools for automatic malware analysis.
Table of Contents
1. Deep Learning > From Chaos to Control: Multi-classification in Malware Analysis
2. Research on Android & Linux Sandboxes > The Hunt for Zero-Days: Android and Linux Exploits
3. Enhancing CAPEv2 > A Malware Analyst's Secret Weapon
4. Conclusion > The Epilogue: Putting It All Together
Final Year Project Timeline (W1-6)
Week 1
- Research on various CNN architectures (VGG16)
- Data pre-processing
Week 2
- Research on how to build own CNN models
- Data pre-processing
- Start on sandbox project - setting up tools
Week 3
- Data pre-processing
- Build own CNN models
- Research on potential enhancements to CAPEv2
Week 4
- Data pre-processing
- Build own CNN models
- Research and testing of Linux and Android sandboxes
Week 5
- Reconfigure VGG16 architecture to suit grayscale images
- Build own CNN models
Week 6
- Fine-tune CNN models
- Set up the Linux & Android tools
Final Year Project Timeline (W7-12)
Week 7
- Fine-tune CNN models
- Set up the Linux & Android tools
- Midterm presentation
Week 8
- Install & dual-boot Ubuntu 22.04 LTS
- Set up the latest CAPEv2 sandbox
- Fix minor bugs in the latest CAPEv2
Week 9
- Set up the latest CAPEv2 sandbox
- Implement enhanced features
Week 10
- Implement enhanced features
- Fix bugs in the latest CAPEv2
Week 11
- Fix code upon developers' requests
- Ensure test cases pass
Week 12
- Final presentation
Deep Learning
Decoding the Mysteries of the CNN Universe
“What are Convolutional Neural Networks (CNNs)?”
“From pixels to predictions: Understanding the Architecture of CNNs”
"Building Brains: Understanding the Training of a CNN"
“CNNs in Action: Real-world Solutions and Their Impact”
Deep Learning
From Chaos to Control - Multi-classification in Malware Analysis
“Uncovering the Magic of Data Pre-processing: A Crucial Step in Data Analysis”
Data pre-processing is the process of preparing raw data for analysis by cleaning, formatting, and transforming it into a usable form.
“Mastering the Art of Data Preprocessing: A Step-by-Step Guide”
Step 1: Identify the purpose of the analysis and the types of data needed
Step 2: Collect and import the raw data
Step 3: Clean and format the data in a usable format
Step 4: Transform the data as needed to make it more suitable for analysis (e.g., pixel-wise normalization)
Step 1: Identify the purpose of the analysis and the types of data needed
Step 1: Identify the purpose of the analysis and the types of data needed (Cont’d)
Step 2: Collect and import the raw data
Step 2: Collect and import the raw data (unzip.py)
- Recursively unzip .7z, .rar, and .zip archives regardless of OS.
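The actual unzip.py is not reproduced in the slides; a minimal sketch of the recursive-extraction idea, using only the standard-library `zipfile` module, might look like the following (handling .7z and .rar as the slide describes would additionally need third-party packages such as `py7zr` and `rarfile`):

```python
import zipfile
from pathlib import Path

ARCHIVE_SUFFIXES = {".zip"}  # .7z / .rar would need py7zr / rarfile

def unzip_recursively(root: str) -> list:
    """Walk `root` and extract every archive into a sibling folder.

    Newly extracted archives are picked up on subsequent passes, so
    nested archives are handled regardless of depth or host OS.
    """
    extracted = []
    pending = True
    while pending:
        pending = False
        for path in list(Path(root).rglob("*")):
            if path.suffix.lower() in ARCHIVE_SUFFIXES and path not in extracted:
                target = path.with_suffix("")          # e.g. a.zip -> a/
                with zipfile.ZipFile(path) as zf:
                    zf.extractall(target)
                extracted.append(path)
                pending = True                         # rescan for nested archives
    return extracted
```

Because `pathlib` abstracts path separators, the same sketch runs unchanged on Windows and Linux.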
Step 3: Clean and format the data in a usable format
Step 3: Clean and format the data in a usable format (convert_bin_to_img.py)
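convert_bin_to_img.py itself is not shown; the standard technique for visualising a binary as a grayscale image treats each byte as one 0-255 pixel. A sketch of that core step, with the row `width` as an assumed parameter:

```python
import numpy as np

def bin_to_grayscale(data: bytes, width: int = 1024) -> np.ndarray:
    """Interpret raw malware bytes as an 8-bit grayscale image.

    Each byte becomes one pixel (0-255); the stream is zero-padded to a
    multiple of `width` and reshaped into rows.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    pad = (-len(arr)) % width          # zero-fill the last partial row
    arr = np.pad(arr, (0, pad))
    return arr.reshape(-1, width)
```

Pillow's `Image.fromarray(img, mode="L").save(path)` could then write the array out as a PNG, though the author's exact script may differ.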
Step 3: Clean and format the data in a usable format (resize_recursively.py)
Step 3: Clean and format the data in a usable format (pad_resize_recursively.py)
resize_recursively.py
pad_resize_recursively.py
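The two scripts differ in how they reach the target size: plain resizing may distort the aspect ratio, while zero-padding to a square first preserves it. A rough sketch of both ideas (a simple nearest-neighbour resize in NumPy, not the author's actual code):

```python
import numpy as np

def resize_nearest(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Nearest-neighbour resize to (size, size); may distort aspect ratio."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[rows][:, cols]

def pad_then_resize(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Zero-pad to a square before resizing, preserving aspect ratio."""
    side = max(img.shape)
    padded = np.zeros((side, side), dtype=img.dtype)
    padded[: img.shape[0], : img.shape[1]] = img
    return resize_nearest(padded, size)
```

The padded variant trades some blank (zero) border area for an undistorted byte layout, which is why the deck evaluates both padded and unpadded datasets.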
Step 4: Transform the data for analysis
Pixel-wise Normalization: The process of adjusting the pixel values of the image so that they have a common scale, without distorting the content of the image.
Step 1: Resize images from (1024x1024) to (224x224)
Step 2: Convert to 1 Dimension Array
Step 3: Divide the arrays by 255
Pixel-wise Normalization: Resize images and convert to arrays
Pixel-wise Normalization: Resize images and convert to arrays (Cont’d)
Pixel-wise Normalization: Convert arrays into range of 0 to 1
Pixel-wise Normalization: Convert arrays into range of 0 to 1 (Cont’d)
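The flatten-and-scale steps above can be sketched in NumPy (the 1024x1024 to 224x224 resize is assumed to have been done already):

```python
import numpy as np

def normalize_pixels(img: np.ndarray) -> np.ndarray:
    """Flatten an 8-bit image and scale every pixel into [0, 1]."""
    flat = img.reshape(-1).astype(np.float32)   # step 2: 1-D array
    return flat / 255.0                          # step 3: common 0-1 scale
```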
Why is this important?
"Unleashing the Power of Machine Learning: A Step-by-Step Guide to Model Training"
Step 1: Choose models and define the training process
Step 2: Train the models
Step 3: Evaluate the models
Step 4: Fine-tune the models
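As an illustration only (the slides give layer counts but not the exact architecture), "Model 1" with 4 convolutional layers and 2 dense layers might be defined in Keras like this, assuming 224x224 grayscale inputs and a hypothetical `num_classes`:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model_1(num_classes: int) -> keras.Model:
    """Sketch of 'Model 1': 4 convolutional blocks + 2 dense layers."""
    model = keras.Sequential([
        layers.Input(shape=(224, 224, 1)),              # grayscale malware image
        layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),            # dense layer 1
        layers.Dense(num_classes, activation="softmax"), # dense layer 2
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training (step 2) would then be a call such as `model.fit(x_train, y_train, validation_split=0.2, epochs=...)`; filter counts and the optimizer here are assumptions, not the deck's recorded settings.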
Step 1: Choose models and define the training process (First CNN)
Step 1: Choose models and define the training process (First CNN) (Cont’d)
Step 2: Train the models
| Datasets | Split Ratio | Developed CNN Models |
|---|---|---|
| Samples >50 (Unpadded) | 6:4 | 3 models each |
| Samples >50 (Unpadded) | 8:2 | 3 models each |
| Samples >100 (Unpadded) | 6:4 | 3 models each |
| Samples >100 (Unpadded) | 8:2 | 3 models each |
| Samples >50 (Padded) | 6:4 | 3 models each |
| Samples >50 (Padded) | 8:2 | 3 models each |
| Samples >100 (Padded) | 6:4 | 3 models each |
| Samples >100 (Padded) | 8:2 | 3 models each |
Step 3: Evaluate the models (Padded Dataset)
| Models | Convolutional Layers | Dense Layers | Training Accuracy | Testing Accuracy | F1 Score |
|---|---|---|---|---|---|
| Model 1 | 4 | 2 | 100.0 % | 81.72 % | 0.822 |
| Model 2 | 3 | 2 | 100.0 % | 78.92 % | 0.798 |
| Model 3 | 2 | 2 | 99.46 % | 75.05 % | 0.768 |
Step 3: Evaluate the models (Padded Dataset)
Step 4: Fine-tune the models
Bias regularization is like putting a weight on some of the blocks, so they're harder to move around. This makes it harder to build the tower in one spot, so it becomes more stable.
Dropout regularization is like taking away some of the blocks from the box randomly, so you don't have as many blocks to work with. This makes it so the tower has to be built with a lot of different blocks and not just rely on a few blocks.
Result: Strong & Stable Tower
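The block-tower intuition for dropout corresponds to "inverted dropout", which Keras applies at training time: random units are zeroed and the survivors rescaled so the expected activation is unchanged. A small NumPy illustration:

```python
import numpy as np

def inverted_dropout(activations: np.ndarray, rate: float, rng) -> np.ndarray:
    """Zero out roughly a `rate` fraction of units and rescale the rest,
    so the expected activation stays unchanged (as Keras Dropout does
    during training)."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)
```

In Keras these choices appear as `layers.Dropout(0.5)` and, for the bias penalty, `layers.Dense(..., bias_regularizer=keras.regularizers.l2(0.001))`.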
Step 4: Fine-tune the models (Cont’d)
| Models | Convolutional Layers | Dense Layers | Regularization | Training Accuracy | Testing Accuracy | F1 Score |
|---|---|---|---|---|---|---|
| Model 1 | 4 | 2 | 3 Dropout(0.50) | 99.57 % | 81.94 % | 0.818 |
| Model 1 | 4 | 2 | 2 Dropout(0.80), 1 Dropout(0.50), 6 Bias(0.001) | 87.55 % | 72.90 % | 0.727 |
| Model 1 | 4 | 2 | 2 Dropout(0.80), 1 Dropout(0.50), 6 Bias(0.005) | 94.08 % | 73.76 % | 0.749 |
| Model 1 | 4 | 2 | 3 Dropout(0.80), 6 Bias(0.005) | 91.36 % | 74.62 % | 0.746 |
Step 4: Fine-tune the models (Cont’d)
| Models | Convolutional Layers | Dense Layers | Regularization | Training Accuracy | Testing Accuracy | F1 Score |
|---|---|---|---|---|---|---|
| Model 2 | 3 | 2 | 2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001) | 96.52 % | 80.43 % | 0.801 |
| Model 2 | 3 | 2 | 2 Dropout(0.30) | 99.95 % | 80.65 % | 0.803 |
| Model 3 | 2 | 2 | 1 Dropout(0.20) | 99.29 % | 77.42 % | 0.779 |
| Model 3 | 2 | 2 | 2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001) | 99.40 % | 76.13 % | 0.770 |
Step 4: Fine-tune the models (Cont’d)
Learning rates tested: 1e-3, 2e-3, 3e-3, 4e-3, 5e-3, 1e-4, 3e-4, 4e-4, 5e-4
The learning rate is a value that controls how much the model's parameters are updated in each iteration.
When we teach a computer to do something, like recognizing pictures, we need to adjust how it's thinking and understanding.
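A toy gradient-descent run on f(x) = x^2 (gradient 2x) makes the trade-off concrete: the learning rate scales every parameter update, so a tiny rate barely moves, a good rate converges, and an oversized rate overshoots and diverges.

```python
def gradient_descent(lr: float, steps: int = 50, x0: float = 10.0) -> float:
    """Minimise f(x) = x^2 by repeatedly stepping against the gradient 2x.

    Each update is x -= lr * f'(x); the learning rate `lr` controls the
    step size, so convergence speed and stability depend entirely on it.
    """
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x
```

With these numbers, `lr=0.1` settles near 0, `lr=0.001` has barely moved after 50 steps, and `lr=1.5` blows up; the deck's grid of rates around 1e-3 probes exactly this trade-off.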
Fine-tune the models
| Models | Convolutional Layers | Dense Layers | Regularization | Learning Rate | Training Accuracy | Testing Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| Model 1 | 4 | 2 | 3 Dropout(0.50) | 1e-3 | 99.57 % | 81.94 % | 0.818 |
| Model 2 | 3 | 2 | 2 Dropout(0.80), 2 Bias(0.005), 2 Bias(0.003), 1 Bias(0.001) | 1e-3 | 96.52 % | 80.43 % | 0.801 |
| Model 3 | 2 | 2 | 1 Dropout(0.20) | 1e-3 | 99.29 % | 77.42 % | 0.779 |
Addition: Transfer Learning: (VGG16)
Addition: Transfer Learning: (VGG16) (Cont’d)
| Models | Convolutional Layers | Dense Layers | Regularization |
|---|---|---|---|
| VGG16 | 13 | 3 | 2 Dropout(0.50) |
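A sketch of how such a transfer-learning setup is commonly built in Keras; the table above only fixes 13 convolutional layers, 3 dense layers, and 2 Dropout(0.50), so the head sizes and optimizer here are assumptions. Grayscale inputs are assumed to be replicated across 3 channels, since VGG16 expects RGB.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_vgg16_classifier(num_classes: int, weights="imagenet") -> keras.Model:
    """Transfer learning: freeze VGG16's 13 pretrained convolutional
    layers and train a new dense head on top."""
    base = keras.applications.VGG16(include_top=False,
                                    weights=weights,
                                    input_shape=(224, 224, 3))
    base.trainable = False                       # keep pretrained features
    model = keras.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),                     # 2 Dropout(0.50) per the table
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the base means only the new head's weights are updated, which is why transfer learning trains quickly even on a modest malware-image dataset.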
Addition: extract_pe_features & extract_opcodes
Addition: Flask Web App with deployed AI Model
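The deployed web app is not reproduced in the slides; a hypothetical minimal Flask endpoint (the route name, payload shape, and the placeholder model are all assumptions, not the author's actual code) might look like:

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_probs(pixels: np.ndarray) -> np.ndarray:
    """Placeholder for the deployed CNN, e.g. the output of
    keras.models.load_model('model.h5').predict(pixels)."""
    return np.full(10, 0.1)

@app.route("/classify", methods=["POST"])
def classify():
    # Expect the 224*224 normalized pixel array as JSON
    pixels = np.array(request.get_json()["pixels"], dtype=np.float32)
    probs = predict_probs(pixels.reshape(1, 224, 224, 1))
    return jsonify({"family": int(np.argmax(probs)),
                    "confidence": float(np.max(probs))})
```

A client would POST the pre-processed pixel array and receive the predicted family index plus its softmax confidence.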
The Hunt for Zero-Days: Android and Linux Exploits
Android and Linux Malware
Successful Android Analysis Tools: MobSF
Mobile Security Framework (MobSF)
Successful Android Analysis Tools: APKLeaks
APKLeaks
Successful Android Analysis Tools: RiskinDroid
RiskinDroid
Successful Android Analysis Tools: Quark-Engine
Quark-Engine
Successful Android Analysis Tools
Mobile Security Framework (MobSF)
Quark-Engine
APKLeaks
RiskinDroid
Comparison Table

| Feature | MobSF | RiskinDroid | Quark-Engine | APKLeaks |
|---|---|---|---|---|
| Risk Assessment | ✅ | ❌ | ✅ | ❌ |
| File Details | ✅ | ❌ | ✅ | ❌ |
| Malicious Indicator | ✅ | ❌ | ✅ | ❌ |
| Suspicious Indicator | ✅ | ❌ | ✅ | ❌ |
| Other Information | ✅ | ❌ | ❌ | ❌ |
| Related Sandbox Artifacts | ✅ | ❌ | ❌ | ❌ |
| File Permissions | ✅ | ✅ | ❌ | ❌ |
| File Activities | ✅ | ❌ | ❌ | ❌ |
| File Receivers | ✅ | ❌ | ❌ | ❌ |
| File Certificates | ✅ | ❌ | ❌ | ❌ |
| Extracted Strings | ✅ | ❌ | ❌ | ❌ |
| Extracted Files | ✅ | ❌ | ❌ | ❌ |
Successful Linux Analysis & Sandbox Tool: LiSa
Overview:
File Overview, Downloads, Anomalies
Static Analysis:
ELF Info, Imports, Exports, Libraries, Relocations, Symbols, Sections, Strings
Dynamic Analysis:
Process Tree, Opened Files, Syscalls
Network Analysis:
Endpoints, HTTP Requests, DNS Questions, Telnet Data, IRC Messages
Successful Linux Analysis & Sandbox Tool: LiSa
Using & Enhancing CAPEv2
A Malware Analyst's Secret Weapon
Memory Snapshot At Different Phases
Memory Dumps Before, During, and After the Analysis
(using Cerber Ransomware as an example)
Notable differences can be observed between the dumps.
CAPEv2 Threat Attribution (with MISP)
Examples of MISP Events:
Information:
CAPEv2 Threat Attribution (with MISP)
Detection: Agentb
Screenshots of Successfully Correlated Samples from VX-Underground
More test results are available at https://docs.google.com/document/d/1O-GqjsxMuTSLMuxnk822z4u1gSQSwpcG_Hie_86MErU/edit?usp=sharing
Backup of Reports to Google Drive
CAPEv2 WHOIS Lookup (with WhoisXML)
By default, CAPEv2 returns the IP addresses contacted, as well as the countries they are located in, using GeoIP.
To gather more information about the contacted domains, a WHOIS lookup can be used.
network.py
- If the submitted sample contacts any hosts, perform a WHOIS lookup against them and obtain the output in JSON format.
processing.conf
- Add the necessary information for network.py to perform the API query and obtain domain information.
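A hypothetical sketch of that lookup: the exact fields the author's network.py records are not shown, and the endpoint and parameters follow WhoisXML's public API, with the API key assumed to come from processing.conf.

```python
import requests

WHOIS_URL = "https://www.whoisxmlapi.com/whoisserver/WhoisService"

def summarize_whois(record: dict) -> dict:
    """Pick a few analyst-relevant fields out of a WhoisXML JSON record."""
    rec = record.get("WhoisRecord", {})
    return {
        "domain": rec.get("domainName"),
        "registrar": rec.get("registrarName"),
        "created": rec.get("createdDate"),
    }

def whois_lookup(domain: str, api_key: str) -> dict:
    """Query WhoisXML for `domain` and summarize the JSON output."""
    resp = requests.get(WHOIS_URL, params={
        "apiKey": api_key,            # read from processing.conf in CAPEv2
        "domainName": domain,
        "outputFormat": "JSON",
    }, timeout=30)
    resp.raise_for_status()
    return summarize_whois(resp.json())
```

Keeping the summarization separate from the HTTP call makes the report-enrichment step easy to unit-test without network access.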
CAPEv2 WHOIS Lookup (with WhoisXML)
Screenshots of WHOIS lookup results for malware that contacts domains
Detection: Zebrocy
Detection: Emotet
Challenges
Software Engineer Problems: Trial & Error
Software Engineer Problems: Pull Request Conflicts
Software Engineer Problems: Failed Test Cases
Software Engineer Problems: Failed Ruff Tests
Software Engineer Problems: Trial & Error (Cont’d)
Software Engineer: Pending Outcome
Enhancements Merged - Google Backup
Enhancements Merged - Google Backup (Cont’d)
Enhancements Merged - Google Backup (Cont’d)
Enhancements Merged - Google Backup (Cont’d)
Enhancements Merged - MISP
Enhancements Merged - MISP (Cont’d)
Enhancements - WHOIS (Pending)
Enhancements - Memory Analysis (Rejected)
Enhancements Community Merged
Enhancements Community Merged - WHOIS
Enhancements Community Merged - MISP
Enhancements Community Merged - MISP HTML
Drakvuf Setup Ubuntu 22.04 LTS - In Progress
The Epilogue
Putting it all together
Challenges & issues faced
Takeaways from the Project
Further exploration