1 of 14

Explainable Neural Binary Analysis

Jane Adams and Michael Davinroy

2 of 14

Glossary

Neural Binary Analysis

Using machine learning to infer analysis information about binary executables in a cybersecurity context, often for malware analysis and/or reverse engineering

Basic Block

The simplest ‘unit’ of code: a straight-line sequence of instructions with a single entry and a single exit, and no control-flow transfers into or out of its middle

Control Flow Graph (CFG)

A graph representation of a binary in which nodes are basic blocks and edges are control-flow transfers between basic blocks (ACFG: Attributed CFG, where each node additionally carries block-level features)
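As a concrete illustration of these terms, here is a minimal sketch of an ACFG as a networkx DiGraph, where each node is a basic block; the attribute names are illustrative placeholders, not the actual feature set used by the model.

```python
import networkx as nx

acfg = nx.DiGraph()

# Each node is a basic block; the attributes make it an *attributed* CFG.
acfg.add_node(0, instructions=["push rbp", "mov rbp, rsp", "cmp edi, 0", "je .L2"],
              num_instructions=4, is_entry=True)
acfg.add_node(1, instructions=["mov eax, 1", "jmp .L3"], num_instructions=2, is_entry=False)
acfg.add_node(2, instructions=["mov eax, 0"], num_instructions=1, is_entry=False)
acfg.add_node(3, instructions=["pop rbp", "ret"], num_instructions=2, is_entry=False)

# Edges are control-flow transfers between basic blocks.
acfg.add_edges_from([(0, 1), (0, 2), (1, 3), (2, 3)])
```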

3 of 14

Motivation / Problem Statement

What is the problem you want to solve?

Develop a visualization dashboard for viewing the results of a GNN model that assesses code similarity for cybersecurity purposes

Who has this problem?

Cybersecurity analysts / researchers, for communicating their research to outside stakeholders

Why is it relevant / interesting?

Evaluating model/data integrity; Identifying and targeting security risks in enterprise systems

4 of 14

Background / Related Work

We are using the GMN-SNN model from DeepMind as a demo

We are using the binary function similarity dataset from Cisco-Talos

The dashboard could also be useful for the other binary similarity models evaluated on that dataset, including: Asm2Vec, CodeCMR, Trex, Catalog1, FunctionSimSearch, GNN-S2V, SAFE, and Zeek
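A rough sketch of how similarity labels can be derived for such a dataset, assuming positive pairs are binary functions compiled from the same source function across different compilers, architectures, and optimization levels (our reading of the dataset's design); the field names below are illustrative only, not the dataset's actual schema.

```python
from itertools import combinations

# Hypothetical per-function records keyed by originating source function.
functions = [
    {"id": "f1", "source": "openssl::BN_add", "arch": "x86_64", "opt": "O0"},
    {"id": "f2", "source": "openssl::BN_add", "arch": "arm64",  "opt": "O2"},
    {"id": "f3", "source": "zlib::inflate",   "arch": "x86_64", "opt": "O2"},
]

# Same source function -> positive pair (1); otherwise negative (0).
pairs = [(a["id"], b["id"], int(a["source"] == b["source"]))
         for a, b in combinations(functions, 2)]

print(pairs)  # [('f1', 'f2', 1), ('f1', 'f3', 0), ('f2', 'f3', 0)]
```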

5 of 14

The Datasets

[Diagram: Function A and Function B, each decomposed into a graph of basic blocks]

6–12 of 14

[Dashboard screenshots. Instructions are shown on hover; entry nodes are highlighted in red.]

Results

  • The overall graph view gives a useful global perspective on the structures of ACFGs that the network considers similar
  • The color scheme further aids comparison at a global level when the graph structures are similar but the basic blocks have different lengths
  • Being able to inspect basic block contents and see which nodes are the entry nodes helps verify that the functions are indeed similar
  • The feature explorer provides high-level summary statistics about graph structures
  • The model explorer gives a clear overview of which example pairs are classified as positive and which as negative, and what their respective Euclidean distances are (see the sketch after this list)
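A minimal sketch of the positive/negative split shown in the model explorer, assuming the GNN produces a fixed-size embedding per function (the embedding step itself is omitted) and pairs are compared by Euclidean distance against a threshold; the threshold value here is illustrative only.

```python
import numpy as np

def euclidean_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Euclidean distance between two function embeddings."""
    return float(np.linalg.norm(emb_a - emb_b))

def classify_pair(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 1.0) -> bool:
    """True = predicted similar (positive), False = dissimilar (negative)."""
    return euclidean_distance(emb_a, emb_b) < threshold
```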

13 of 14

Lessons Learned

  • Explaining graph neural networks is difficult (provably impossible, actually!)
  • Similarity scores overlap more than expected and are positive only
  • Many function graphs contain multiple components (unexpected)
  • ACFG construction can be messy (not sure we can trust this data!)

Future Work

  • Attention between nodes in the graph seems promising
  • Color nodes by specific instructions (e.g. make all basic blocks containing a “jump” instruction purple); see the sketch after this list
  • Align graphs such that similar nodes are in similar locations in our graph layout
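A minimal sketch of the proposed instruction-based coloring, assuming ACFGs are stored as networkx DiGraphs with an "instructions" attribute per node (our assumption, matching the sketch in the Glossary section); the mnemonic list and color choices are illustrative.

```python
import networkx as nx

JUMP_MNEMONICS = {"jmp", "je", "jne", "jz", "jnz", "ja", "jb"}  # example x86 jumps

def node_colors(acfg: nx.DiGraph, entry=None):
    """Return one color per node: red for the entry block (as in the dashboard),
    purple for blocks containing a jump instruction, grey otherwise."""
    colors = []
    for node, data in acfg.nodes(data=True):
        mnemonics = {ins.split()[0].lower() for ins in data.get("instructions", []) if ins}
        if node == entry:
            colors.append("red")
        elif mnemonics & JUMP_MNEMONICS:
            colors.append("purple")
        else:
            colors.append("lightgrey")
    return colors

# Usage (assuming a force-directed layout, as in our dashboard):
# nx.draw(acfg, node_color=node_colors(acfg, entry=0), with_labels=True)
```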

14 of 14