TOWARDS LEARNING REPRESENTATIONS OF BINARY EXECUTABLE FILES
Shushan Arakelyan, Christophe Hauser, Erik Kline and Aram Galstyan
Information Sciences Institute
Contributions
Information Sciences Institute
;mov edx,dword ptr [rbp-8]
;mov eax,dword ptr [rbp-0xC]
sub edx, eax
Source code
//init a and b
c = a – b;
Decompiled assembly
Lifted to Vex IR
1: t4=GET:I64(rdx)
2: t3=64to32(t4)
3: t6=GET:I64(rax)
4: t5=64to32(t6)
5: t0=Sub32(t3,t5)
6: PUT(cc_op)=0x7
7: t7=32Uto64(t3)
8: PUT(cc_dep1)=t7
9: t8=32Uto64(t5)
10: PUT(cc_dep2)=t8
11: t9=32Uto64(t0)
12: PUT(rdx)=t9
Information Sciences Institute
Constructing a Program Graph
1: t4=GET:I64(rdx) 7: t7=32Uto64(t3)
2: t5=64to32(t4) 8: PUT(cc_dep1)=t7
3: t6=GET:I64(rax) 9: t8=32Uto64(t5)
4: t3=64to32(t6) 10: PUT(cc_dep2)=t8
5: t0=Sub32(t5,t3) 11: t9=32Uto64(t0)
6: PUT(cc_op)=0x7 12: PUT(rdx)=t9
Information Sciences Institute
Program Graphs: Construction
1: t4=GET:I64(rdx) 7: t7=32Uto64(t3)
2: t3=64to32(t4) 8: PUT(cc_dep1)=t7
3: t6=GET:I64(rax) 9: t8=32Uto64(t5)
4: t5=64to32(t6) 10: PUT(cc_dep2)=t8
5: t0=Sub32(t3,t5) 11: t9=32Uto64(t0)
6: PUT(cc_op)=0x7 12: PUT(rdx)=t9
Information Sciences Institute
Program Graphs: Removing Redundancy
Information Sciences Institute
Removing Redundancy
Information Sciences Institute
Graph Convolutional Neural Networks
“Semi-supervised Classification with Graph Convolutional Networks”, T.N. Kipf, M. Welling, ICLR 2017
Information Sciences Institute
Algorithm Classification: Description, pt 1
|
Example prompts for programming competition problems (from ACM Timus):
Information Sciences Institute
Algorithm Classification: Description, pt 2
Problem 1 | Problem 2 | … | Problem 104 |
1. Solution for P1 | 1. Solution for P2 | | 1. Solution for P104 |
2. Solution for P1 | 2. Solution for P2 | | 2. Solution for P104 |
… | … | … | … |
500. Solution for P1 | 500. Solution for P2 | | 500. Solution for P104 |
Dataset introduced in “Convolutional Neural Networks over Tree Structures for Programming Language Processing”, L.Mou, G.Li, L.Zhang, T. Wang, Z. Jin
Information Sciences Institute
Algorithm Classification: Description, pt 3
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Information Sciences Institute
Algorithm Classification: Description, pt 3
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Solution for ?
Problem 1 | Problem 2 | … | Problem 104 |
Information Sciences Institute
Algorithm Classification: Results
Model | Accuracy |
TBCNN [1] | 0.94 |
inst2vec [2] | 0.9483 |
SVM baseline | 0.93 |
Ours | 0.97 |
Operates on source code
Operates on compiled code
[1] “Convolutional Neural Networks over Tree Structures for Programming Language Processing”, L.Mou, G.Li, L.Zhang, T. Wang, Z. Jin; AAAI 2016
[2] “Neural Code Comprehension: A Learnable Representation of Code Semantics”, T. Ben-Nun, A.S.Jakobovits, T.Hoefler; NeurIPS 2018
Three other methods learn representations for instructions, while ours learn a joint representation for the entire program.
Information Sciences Institute
Vulnerability Discovery: Task Description
CWE-ID | #examples | CWE-ID | #examples | CWE-ID | #examples |
CWE121 | 9486 | CWE197 | 2664 | CWE476 | 888 |
CWE122 | 11946 | CWE23 | 2960 | CWE563 | 1116 |
CWE124 | 3612 | CWE36 | 2960 | CWE590 | 6954 |
CWE126 | 2639 | CWE369 | 2736 | CWE606 | 760 |
CWE127 | 3612 | CWE400 | 2280 | CWE617 | 918 |
CWE134 | 3800 | CWE401 | 4176 | CWE680 | 1776 |
CWE190 | 12093 | CWE415 | 2588 | CWE690 | 2368 |
CWE191 | 9048 | CWE416 | 888 | CWE758 | 1046 |
CWE194 | 3552 | CWE427 | 740 | CWE761 | 888 |
CWE195 | 3552 | CWE457 | 2104 | CWE762 | 6429 |
Dataset introduced in “Juliet 1.1 C/C++ and Java Test Suite”, T. Boland and P.E.Black
Information Sciences Institute
Vulnerability Discovery: Results, pt 1
We beat the baseline in all cases but two: CWE-ID 590 and CWE-ID 761, where SVM is marginally better.
Information Sciences Institute
Vulnerability Discovery: Results, pt 2
We can indirectly compare our results to two surveys ([3],[4]) that use Juliet Test Suite as a benchmark for commercial static vulnerability discovery tools.
Information Sciences Institute
THANK YOU
shushana@usc.edu
@sharakelyan
Information Sciences Institute