1 of 17

TOWARDS LEARNING REPRESENTATIONS OF BINARY EXECUTABLE FILES

Shushan Arakelyan, Christophe Hauser, Erik Kline and Aram Galstyan

Information Sciences Institute

Hi, my name is Shushan and I will be presenting our work on Learning Representations of Binary executable files that I did with my collaborators here.

In order to apply a machine learning model to a task, we need a good vector representation of the data points of that task. For some types of data, turning a data point to a vector of numbers is intuitive and easy, whereas for some other tasks, that include binary programs doing so is non-trivial.�

On the other hand, static binary program analysis is traditionally relying on a combination of formal methods and heuristics, because precise analysis of binaries is computationally prohibitive. But we know that machine learning has a proven reputation of improving performance compared to heuristics. So our goal was coming up with a task-independent distributed representation for binary executable programs, that can be then used with ML models to solve binary program analysis tasks.

2 of 17

Contributions

A task-independent distributed representation learning model for binary executable programs

Modelling binary executable programs’ structure and computations

Improvement on state of the art and commercial tools on two problems – algorithm classification and vulnerability discovery

Information Sciences Institute

3 of 17

;mov edx,dword ptr [rbp-8]

;mov eax,dword ptr [rbp-0xC]

sub edx, eax

Source code

//init a and b

c = a – b;

Decompiled assembly

Lifted to Vex IR

1: t4=GET:I64(rdx)

2: t3=64to32(t4)

3: t6=GET:I64(rax)

4: t5=64to32(t6)

5: t0=Sub32(t3,t5)

6: PUT(cc_op)=0x7

7: t7=32Uto64(t3)

8: PUT(cc_dep1)=t7

9: t8=32Uto64(t5)

10: PUT(cc_dep2)=t8

11: t9=32Uto64(t0)

12: PUT(rdx)=t9

Information Sciences Institute

4 of 17

Constructing a Program Graph

1: t4=GET:I64(rdx) 7: t7=32Uto64(t3)

2: t5=64to32(t4) 8: PUT(cc_dep1)=t7

3: t6=GET:I64(rax) 9: t8=32Uto64(t5)

4: t3=64to32(t6) 10: PUT(cc_dep2)=t8

5: t0=Sub32(t5,t3) 11: t9=32Uto64(t0)

6: PUT(cc_op)=0x7 12: PUT(rdx)=t9

Information Sciences Institute

5 of 17

Program Graphs: Construction

1: t4=GET:I64(rdx) 7: t7=32Uto64(t3)

2: t3=64to32(t4) 8: PUT(cc_dep1)=t7

3: t6=GET:I64(rax) 9: t8=32Uto64(t5)

4: t5=64to32(t6) 10: PUT(cc_dep2)=t8

5: t0=Sub32(t3,t5) 11: t9=32Uto64(t0)

6: PUT(cc_op)=0x7 12: PUT(rdx)=t9

Information Sciences Institute

6 of 17

Program Graphs: Removing Redundancy

Information Sciences Institute

7 of 17

Removing Redundancy

Information Sciences Institute

8 of 17

Graph Convolutional Neural Networks

“Semi-supervised Classification with Graph Convolutional Networks”, T.N. Kipf, M. Welling, ICLR 2017

Information Sciences Institute

9 of 17

Algorithm Classification: Description, pt 1

Example prompts for programming competition problems (from ACM Timus):

Information Sciences Institute

10 of 17

Algorithm Classification: Description, pt 2

Problem 1	Problem 2	…	Problem 104
1. Solution for P1	1. Solution for P2		1. Solution for P104
2. Solution for P1	2. Solution for P2		2. Solution for P104
…	…	…	…
500. Solution for P1	500. Solution for P2		500. Solution for P104

Dataset introduced in “Convolutional Neural Networks over Tree Structures for Programming Language Processing”, L.Mou, G.Li, L.Zhang, T. Wang, Z. Jin

Information Sciences Institute

11 of 17

Algorithm Classification: Description, pt 3

Solution for ?

Information Sciences Institute

12 of 17

Algorithm Classification: Description, pt 3

Solution for ?

Problem 1	Problem 2	…	Problem 104

Information Sciences Institute

13 of 17

Algorithm Classification: Results

Model	Accuracy
TBCNN [1]	0.94
inst2vec [2]	0.9483
SVM baseline	0.93
Ours	0.97

Operates on source code

Operates on compiled code

[1] “Convolutional Neural Networks over Tree Structures for Programming Language Processing”, L.Mou, G.Li, L.Zhang, T. Wang, Z. Jin; AAAI 2016

[2] “Neural Code Comprehension: A Learnable Representation of Code Semantics”, T. Ben-Nun, A.S.Jakobovits, T.Hoefler; NeurIPS 2018

Three other methods learn representations for instructions, while ours learn a joint representation for the entire program.

Information Sciences Institute

14 of 17

Vulnerability Discovery: Task Description

CWE-ID	#examples	CWE-ID	#examples	CWE-ID	#examples
CWE121	9486	CWE197	2664	CWE476	888
CWE122	11946	CWE23	2960	CWE563	1116
CWE124	3612	CWE36	2960	CWE590	6954
CWE126	2639	CWE369	2736	CWE606	760
CWE127	3612	CWE400	2280	CWE617	918
CWE134	3800	CWE401	4176	CWE680	1776
CWE190	12093	CWE415	2588	CWE690	2368
CWE191	9048	CWE416	888	CWE758	1046
CWE194	3552	CWE427	740	CWE761	888
CWE195	3552	CWE457	2104	CWE762	6429

Dataset introduced in “Juliet 1.1 C/C++ and Java Test Suite”, T. Boland and P.E.Black

Information Sciences Institute

15 of 17

Vulnerability Discovery: Results, pt 1

We beat the baseline in all cases but two: CWE-ID 590 and CWE-ID 761, where SVM is marginally better.

Information Sciences Institute

16 of 17

Vulnerability Discovery: Results, pt 2

We can indirectly compare our results to two surveys ([3],[4]) that use Juliet Test Suite as a benchmark for commercial static vulnerability discovery tools.

[3] “Towards modeling the behavior of static code analysis tools” by L.M.R. Velicheti, D.C. Feiock, M.Peiris, R. Raje, J.J.Hill; CISR 2014

Our results are better than results for all tools reported, and some by margin of more than 20% accuracy.

[4] “On the capability of static code analysis to detect security vulnerabilities” by K. Goseva-Popstojanova, A.Perhinschi at IST, 2015

We consistently outperform three out of four reported tools; the fourth one we outperform in all cases but two.

Information Sciences Institute

17 of 17

THANK YOU

shushana@usc.edu

@sharakelyan

Information Sciences Institute