1 of 17

TOWARDS LEARNING REPRESENTATIONS OF BINARY EXECUTABLE FILES

Shushan Arakelyan, Christophe Hauser, Erik Kline and Aram Galstyan

Information Sciences Institute

2 of 17

Contributions

  • A task-independent distributed representation learning model for binary executable programs

  • Modelling binary executable programs’ structure and computations

  • Improvement on state of the art and commercial tools on two problems – algorithm classification and vulnerability discovery

Information Sciences Institute

3 of 17

 

;mov edx,dword ptr [rbp-8]

;mov eax,dword ptr [rbp-0xC]

sub edx, eax

Source code

//init a and b

c = a – b;

Decompiled assembly

Lifted to Vex IR

1: t4=GET:I64(rdx)

2: t3=64to32(t4)

3: t6=GET:I64(rax)

4: t5=64to32(t6)

5: t0=Sub32(t3,t5)

6: PUT(cc_op)=0x7

7: t7=32Uto64(t3)

8: PUT(cc_dep1)=t7

9: t8=32Uto64(t5)

10: PUT(cc_dep2)=t8

11: t9=32Uto64(t0)

12: PUT(rdx)=t9

Information Sciences Institute

4 of 17

Constructing a Program Graph

1: t4=GET:I64(rdx) 7: t7=32Uto64(t3)

2: t5=64to32(t4) 8: PUT(cc_dep1)=t7

3: t6=GET:I64(rax) 9: t8=32Uto64(t5)

4: t3=64to32(t6) 10: PUT(cc_dep2)=t8

5: t0=Sub32(t5,t3) 11: t9=32Uto64(t0)

6: PUT(cc_op)=0x7 12: PUT(rdx)=t9

Information Sciences Institute

5 of 17

Program Graphs: Construction

1: t4=GET:I64(rdx) 7: t7=32Uto64(t3)

2: t3=64to32(t4) 8: PUT(cc_dep1)=t7

3: t6=GET:I64(rax) 9: t8=32Uto64(t5)

4: t5=64to32(t6) 10: PUT(cc_dep2)=t8

5: t0=Sub32(t3,t5) 11: t9=32Uto64(t0)

6: PUT(cc_op)=0x7 12: PUT(rdx)=t9

Information Sciences Institute

6 of 17

Program Graphs: Removing Redundancy

Information Sciences Institute

7 of 17

Removing Redundancy

Information Sciences Institute

8 of 17

Graph Convolutional Neural Networks

Semi-supervised Classification with Graph Convolutional Networks”, T.N. Kipf, M. Welling, ICLR 2017

Information Sciences Institute

9 of 17

Algorithm Classification: Description, pt 1

Example prompts for programming competition problems (from ACM Timus):

Information Sciences Institute

10 of 17

Algorithm Classification: Description, pt 2

Problem 1

Problem 2

Problem 104

1. Solution for P1

1. Solution for P2

1. Solution for P104

2. Solution for P1

2. Solution for P2

2. Solution for P104

500. Solution for P1

500. Solution for P2

500. Solution for P104

Dataset introduced in “Convolutional Neural Networks over Tree Structures for Programming Language Processing”, L.Mou, G.Li, L.Zhang, T. Wang, Z. Jin

Information Sciences Institute

11 of 17

Algorithm Classification: Description, pt 3

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Information Sciences Institute

12 of 17

Algorithm Classification: Description, pt 3

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Solution for ?

Problem 1

Problem 2

Problem 104

Information Sciences Institute

13 of 17

Algorithm Classification: Results

Model

Accuracy

TBCNN [1]

0.94

inst2vec [2]

0.9483

SVM baseline

0.93

Ours

0.97

Operates on source code

Operates on compiled code

[1] “Convolutional Neural Networks over Tree Structures for Programming Language Processing”, L.Mou, G.Li, L.Zhang, T. Wang, Z. Jin; AAAI 2016

[2] “Neural Code Comprehension: A Learnable Representation of Code Semantics”, T. Ben-Nun, A.S.Jakobovits, T.Hoefler; NeurIPS 2018

Three other methods learn representations for instructions, while ours learn a joint representation for the entire program.

Information Sciences Institute

14 of 17

Vulnerability Discovery: Task Description

CWE-ID

#examples

CWE-ID

#examples

CWE-ID

#examples

CWE121

9486

CWE197

2664

CWE476

888

CWE122

11946

CWE23

2960

CWE563

1116

CWE124

3612

CWE36

2960

CWE590

6954

CWE126

2639

CWE369

2736

CWE606

760

CWE127

3612

CWE400

2280

CWE617

918

CWE134

3800

CWE401

4176

CWE680

1776

CWE190

12093

CWE415

2588

CWE690

2368

CWE191

9048

CWE416

888

CWE758

1046

CWE194

3552

CWE427

740

CWE761

888

CWE195

3552

CWE457

2104

CWE762

6429

Dataset introduced in “Juliet 1.1 C/C++ and Java Test Suite”, T. Boland and P.E.Black

Information Sciences Institute

15 of 17

Vulnerability Discovery: Results, pt 1

We beat the baseline in all cases but two: CWE-ID 590 and CWE-ID 761, where SVM is marginally better.

Information Sciences Institute

16 of 17

Vulnerability Discovery: Results, pt 2

We can indirectly compare our results to two surveys ([3],[4]) that use Juliet Test Suite as a benchmark for commercial static vulnerability discovery tools.

  • [3] “Towards modeling the behavior of static code analysis tools” by L.M.R. Velicheti, D.C. Feiock, M.Peiris, R. Raje, J.J.Hill; CISR 2014
    • Our results are better than results for all tools reported, and some by margin of more than 20% accuracy.
  • [4] “On the capability of static code analysis to detect security vulnerabilities” by K. Goseva-Popstojanova, A.Perhinschi at IST, 2015
    • We consistently outperform three out of four reported tools; the fourth one we outperform in all cases but two.

Information Sciences Institute

17 of 17

THANK YOU

shushana@usc.edu

@sharakelyan

Information Sciences Institute