1 of 14

Explainable Neural Binary Analysis

Jane Adams and Michael Davinroy

2 of 14

Glossary

Neural Binary Analysis

Using machine learning to infer analysis information about binary executables in a cybersecurity context, often for malware analysis and/or reverse engineering

Basic Block

The simplest ‘unit’ of code: a straight-line sequence of instructions with a single entry and a single exit, and no control-flow transfers into or out of its middle

Control Flow Graph (CFG)

A graph representation of a binary in which nodes are basic blocks and edges are control-flow transfers between basic blocks (ACFG: Attributed CFG, where each node additionally carries block-level features)
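As a concrete illustration of these terms, here is a minimal sketch of an ACFG as a networkx DiGraph, where each node is a basic block; the attribute names are illustrative placeholders, not the actual feature set used by the model.

```python
import networkx as nx

acfg = nx.DiGraph()

# Each node is a basic block; the attributes make it an *attributed* CFG.
acfg.add_node(0, instructions=["push rbp", "mov rbp, rsp", "cmp edi, 0", "je .L2"],
              num_instructions=4, is_entry=True)
acfg.add_node(1, instructions=["mov eax, 1", "jmp .L3"], num_instructions=2, is_entry=False)
acfg.add_node(2, instructions=["mov eax, 0"], num_instructions=1, is_entry=False)
acfg.add_node(3, instructions=["pop rbp", "ret"], num_instructions=2, is_entry=False)

# Edges are control-flow transfers between basic blocks.
acfg.add_edges_from([(0, 1), (0, 2), (1, 3), (2, 3)])
```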

3 of 14

Motivation / Problem Statement

What is the problem you want to solve?

Develop a visualization dashboard for viewing the results of a GNN model that assesses code similarity for cybersecurity purposes

Who has this problem?

Cybersecurity analysts / researchers, for communicating their research to outside stakeholders

Why is it relevant / interesting?

Evaluating model/data integrity; Identifying and targeting security risks in enterprise systems

4 of 14

Background / Related Work

We are using the GMN-SNN model from DeepMind as a demo

We are using the binary function similarity dataset from Cisco-Talos

The dashboard could also be useful for the other binary similarity models evaluated on that dataset, including: Asm2Vec, CodeCMR, Trex, Catalog1, FunctionSimSearch, GNN-S2V, SAFE, and Zeek
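A rough sketch of how similarity labels can be derived for such a dataset, assuming positive pairs are binary functions compiled from the same source function across different compilers, architectures, and optimization levels (our reading of the dataset's design); the field names below are illustrative only, not the dataset's actual schema.

```python
from itertools import combinations

# Hypothetical per-function records keyed by originating source function.
functions = [
    {"id": "f1", "source": "openssl::BN_add", "arch": "x86_64", "opt": "O0"},
    {"id": "f2", "source": "openssl::BN_add", "arch": "arm64",  "opt": "O2"},
    {"id": "f3", "source": "zlib::inflate",   "arch": "x86_64", "opt": "O2"},
]

# Same source function -> positive pair (1); otherwise negative (0).
pairs = [(a["id"], b["id"], int(a["source"] == b["source"]))
         for a, b in combinations(functions, 2)]

print(pairs)  # [('f1', 'f2', 1), ('f1', 'f3', 0), ('f2', 'f3', 0)]
```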

5 of 14

The Datasets

[Diagram: Function A and Function B, each decomposed into a graph of basic blocks]

6–12 of 14

[Dashboard screenshots. Instructions are shown on hover; entry nodes are highlighted in red.]

Results

  • The overall graph view gives a useful global perspective on the structures of ACFGs that the network considers similar
  • The color scheme further aids comparison at a global level when the graph structures are similar but the basic blocks have different lengths
  • Being able to inspect basic block contents and see which nodes are the entry nodes helps verify that the functions are indeed similar
  • The feature explorer provides high-level summary statistics about graph structures
  • The model explorer gives a clear overview of which example pairs are classified as positive and which as negative, and what their respective Euclidean distances are (see the sketch after this list)
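A minimal sketch of the positive/negative split shown in the model explorer, assuming the GNN produces a fixed-size embedding per function (the embedding step itself is omitted) and pairs are compared by Euclidean distance against a threshold; the threshold value here is illustrative only.

```python
import numpy as np

def euclidean_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Euclidean distance between two function embeddings."""
    return float(np.linalg.norm(emb_a - emb_b))

def classify_pair(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 1.0) -> bool:
    """True = predicted similar (positive), False = dissimilar (negative)."""
    return euclidean_distance(emb_a, emb_b) < threshold
```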

13 of 14

Lessons Learned

  • Explaining graph neural networks is difficult (provably impossible, actually!)
  • Similarity scores overlap more than expected and are positive only
  • Many function graphs contain multiple components (unexpected)
  • ACFG construction can be messy (not sure we can trust this data!)

Future Work

  • Attention between nodes in the graph seems promising
  • Color nodes by specific instructions (e.g. make all basic blocks containing a “jump” instruction purple); see the sketch after this list
  • Align graphs such that similar nodes are in similar locations in our graph layout
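A minimal sketch of the proposed instruction-based coloring, assuming ACFGs are stored as networkx DiGraphs with an "instructions" attribute per node (our assumption, matching the sketch in the Glossary section); the mnemonic list and color choices are illustrative.

```python
import networkx as nx

JUMP_MNEMONICS = {"jmp", "je", "jne", "jz", "jnz", "ja", "jb"}  # example x86 jumps

def node_colors(acfg: nx.DiGraph, entry=None):
    """Return one color per node: red for the entry block (as in the dashboard),
    purple for blocks containing a jump instruction, grey otherwise."""
    colors = []
    for node, data in acfg.nodes(data=True):
        mnemonics = {ins.split()[0].lower() for ins in data.get("instructions", []) if ins}
        if node == entry:
            colors.append("red")
        elif mnemonics & JUMP_MNEMONICS:
            colors.append("purple")
        else:
            colors.append("lightgrey")
    return colors

# Usage (assuming a force-directed layout, as in our dashboard):
# nx.draw(acfg, node_color=node_colors(acfg, entry=0), with_labels=True)
```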

14 of 14