Modern Approaches in Human-Centric Decompilation
An exploration of past, present, and future decompilation techniques.
Slides: https://tinyurl.com/3mzb797t
Who is Zion?
@mahal0z on Twitter
What is Decompilation?
(sourceless compiled program)
    1010101010101010101010010101010010110101
Decompilation
(program source code)
    if (...) {
        // code
    }
    else {
        // other code
    }
Why would you want source?
    if (...) {
        // code
    }
    else {
        // other code
    }
Human-Centric Decompilation
    if (...) {
        // code
    }
    else {
        // other code
    }
Cognitive Load of Analysis
Decompilation
Talk Outline
Decompilation Origins
Origins: dcc decompiler
(Dissertation, July 1994) [1]
[1] Cifuentes, Cristina. Reverse compilation techniques. Queensland University of Technology, Brisbane, 1994.
Origins: dcc decompiler - pipeline
Machine code
    mov [rbp-0x4], edi
    mov [rbp-0x10], rsi
    cmp [rbp-0x4], 0x3
    jne 1173
Lifting
Control Flow Graph (CFG)
Structuring
    if (...) {
        // code
    }
    else {
        // other code
    }
Optimizing
    c = (a == 3) ? a : b;
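As a rough illustration of the optimizing step, the sketch below places an if/else that assigns one variable in both branches next to the ternary form from the slide. The function names, and the assumption that both branches assign `c`, are hypothetical; the slide does not show the bodies of the branches.

```c
/* Structured form: both branches assign the same variable (assumed). */
int select_if_else(int a, int b)
{
    int c;
    if (a == 3) {
        c = a;
    }
    else {
        c = b;
    }
    return c;
}

/* Optimized form: the same decision folded into a single expression. */
int select_ternary(int a, int b)
{
    int c = (a == 3) ? a : b;
    return c;
}
```

Both functions return the same value for every input, which is what lets a decompiler emit the shorter form without changing behavior.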
Origins: dcc decompiler - optimization
Origins: dcc decompiler - structuring
Pattern Match
    if (...) {
        // code
    }
    else {
        // other code
    }
Origins: dcc decompiler - structuring patterns
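One way to picture pattern-matched structuring is a check for the if-else "diamond" shape in a CFG. This is a minimal sketch, not dcc's actual implementation: the `Node` layout and `match_if_else` are hypothetical, successors are assumed to fill slot 0 first, and a real matcher would also verify that the two branch nodes have no other predecessors.

```c
#include <stdbool.h>

/* A CFG node with at most two successors, stored as indices into a
 * node array. -1 means "no successor"; slot 0 is filled first. */
typedef struct {
    int succ[2];
} Node;

static int succ_count(const Node *n)
{
    return (n->succ[0] >= 0) + (n->succ[1] >= 0);
}

/* Returns true if `head` starts an if-else diamond: it branches to two
 * distinct nodes that each flow into the same join node. On success,
 * *join receives the index of that join node. */
bool match_if_else(const Node *cfg, int head, int *join)
{
    const Node *h = &cfg[head];
    if (succ_count(h) != 2 || h->succ[0] == h->succ[1])
        return false;

    const Node *then_node = &cfg[h->succ[0]];
    const Node *else_node = &cfg[h->succ[1]];
    if (succ_count(then_node) != 1 || succ_count(else_node) != 1)
        return false;
    if (then_node->succ[0] != else_node->succ[0])
        return false;

    *join = then_node->succ[0];
    return true;
}
```

When the match succeeds, the three nodes can be collapsed into a single if/else region and matching continues on the smaller graph; regions that match no pattern are where gotos come from.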
Decompilation Research Areas
Research Areas
Control Flow Graph Recovery
Control Flow Structuring
Type Inferencing & Variable Recovery
Fundamentals
Symbol Recovery
Usability
CFG Recovery
Machine code
    mov [rbp-0x4], edi
    mov [rbp-0x10], rsi
    cmp [rbp-0x4], 0x3
    jne 1173
Lifting
Control Flow Graph (CFG)
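A first step in going from a linear instruction listing to a CFG is finding where basic blocks begin. The sketch below uses the textbook "leader" rule (first instruction, branch targets, and instructions following a branch). The `Insn` record and `find_leaders` are hypothetical and assume instructions are already correctly disassembled with known direct branch targets, which is exactly the part that is hard in practice.

```c
#include <stdbool.h>
#include <stddef.h>

/* A simplified, already-decoded instruction (hypothetical layout). */
typedef struct {
    unsigned addr;    /* address of the instruction */
    bool is_branch;   /* true for jumps such as jne */
    unsigned target;  /* branch target; meaningful only if is_branch */
} Insn;

/* Marks leaders[i] = true when instruction i starts a basic block and
 * returns the number of basic blocks found. */
size_t find_leaders(const Insn *insns, size_t n, bool *leaders)
{
    for (size_t i = 0; i < n; i++)
        leaders[i] = false;
    if (n > 0)
        leaders[0] = true;               /* entry point */

    for (size_t i = 0; i < n; i++) {
        if (!insns[i].is_branch)
            continue;
        if (i + 1 < n)
            leaders[i + 1] = true;       /* fall-through after a branch */
        for (size_t j = 0; j < n; j++)
            if (insns[j].addr == insns[i].target)
                leaders[j] = true;       /* branch target */
    }

    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += leaders[i] ? 1 : 0;
    return count;
}
```

For the four instructions above, the `jne 1173` ends the first block; the instruction after it and the instruction at address 1173 each begin a new one.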
CFG Recovery: phases
CFG Recovery
[2] Flores-Montoya, Antonio, and Eric Schulte. "Datalog disassembly." Proceedings of the 29th USENIX Conference on Security Symposium. 2020.
[3] Y. Shoshitaishvili et al., "SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis," 2016 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 2016.
[4] Di Federico, Alessandro, Mathias Payer, and Giovanni Agosta. "rev.ng: a unified binary analysis framework to recover CFGs and function boundaries." Proceedings of the 26th International Conference on Compiler Construction. 2017.
[5] Kim, Sun Hyoung, et al. "Refining Indirect Call Targets at the Binary Level." NDSS. 2021.
[6] Pang, Chengbin, et al. "Ground Truth for Binary Disassembly is Not Easy." 31st USENIX Security Symposium (USENIX Security 22). 2022.
CFG Recovery: Problems
Type Inf & Var Recovery
    mov [rbp-0x4], edi
    mov [rbp-0x10], rsi
    cmp [rbp-0x4], 0x3
    jne 1173
Lifting
    int c = 3;
Type Inf & Var Recovery: Approaches
[7] Lee, JongHyup, Thanassis Avgerinos, and David Brumley. "TIE: Principled reverse engineering of types in binary programs." (2011).
[8] Noonan, Matt, Alexey Loginov, and David Cok. "Polymorphic type inference for machine code." Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2016.
[9] Chen, Qibin, et al. "Augmenting decompiler output with learned variable names and types." 31st USENIX Security Symposium (USENIX Security 22). 2022.
Control Flow Structuring
Control Flow Graph (CFG)
Structuring
    if (...) {
        // code
    }
    else {
        // other code
    }
Control Flow Structuring
[10] Schwartz, Edward J., et al. "Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring." Proceedings of the USENIX Security Symposium. Vol. 16. 2013.
[11] Yakdan, Khaled, et al. "No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations." NDSS. 2015.
[12] Gussoni, Andrea, et al. "A comb for decompiled c code." Proceedings of the 15th ACM Asia Conference on Computer and Communications Security. 2020.
Structuring: very impactful for readability
foo()
bar()
    if (a || b)
        foo();
    else
        bar();

    if (a)
        goto lab_1;

    if (b) {
    lab_1:
        foo();
    }
    else
        bar();
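To make the comparison concrete, the two outputs above can be wrapped into compilable functions and checked against each other. This is only a sketch: the `last_called` bookkeeping and the wrapper names are hypothetical additions used to observe which branch ran.

```c
/* Records which callee ran: 1 for foo, 2 for bar (hypothetical helper). */
static int last_called;

static void foo(void) { last_called = 1; }
static void bar(void) { last_called = 2; }

/* Well-structured output: one short-circuit condition. */
int run_structured(int a, int b)
{
    if (a || b)
        foo();
    else
        bar();
    return last_called;
}

/* Goto-based output: same control flow, harder to read. */
int run_with_goto(int a, int b)
{
    if (a)
        goto lab_1;

    if (b) {
lab_1:
        foo();
    }
    else
        bar();
    return last_called;
}
```

Both functions call `foo` whenever `a` or `b` is nonzero and `bar` otherwise, so the difference between them is purely one of readability.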
Auxiliary Techniques
[13] Burk, Kevin, et al. "Decomperson: How Humans Decompile and What We Can Learn From It." 31st USENIX Security Symposium (USENIX Security 22). 2022.
[14] Yakdan, Khaled, et al. "Helping johnny to analyze malware: A usability-optimized decompiler and malware analysis user study." 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016.
Short Break
Any questions?
Modern Decompilers
DREAM (angr)
Phoenix (angr)
IDA Pro (Hex-Rays)
Ghidra
So who is the best?
Previous ways to measure quality
Coreutils: fmt (best in class)
Source
    // ...
      int err = ferror_unlocked(f) ? 0 : -1;
      if (f == stdin)
        clearerr_unlocked(f);
      else if (rpl_fclose(f) != 0 && err < 0)
        err = (*__errno_location());
      if (0 <= err)
        error(0, err, err ? "%s" : dcgettext(((void *)0), "read error", 5),
              quotearg_n_style_colon( 0, shell_escape_quoting_style, file) );
      return err < 0;
    }
IDA Pro 8.0
    // ...
        v3 = quotearg_n_style_colon(0LL, 3LL, file);
        goto LABEL_8;
      }
      if ( f == stdin )
      {
        clearerr_unlocked(f);
        return 1;
      }
      else
      {
        if ( (unsigned int)rpl_fclose(f) )
        {
          v6 = *_errno_location();
          if ( v6 >= 0 )
          {
            v7 = (const char *)quotearg_n_style_colon(0LL, 3LL, file);
            v3 = (__int64)v7;
            if ( v6 )
            {
              error(0, v6, "%s", v7);
              return paragraph;
            }
    LABEL_8:
            v4 = dcgettext(0LL, "read error", 5);
            error(0, 0, v4, v3);
            return paragraph;
          }
        }
        return 1;
      }
Ghidra 10.2
    // ...
        piVar5 = __errno_location();
        iVar2 = *piVar5;
        if (iVar2 < 0) {
          return 1;
        }
        uVar3 = quotearg_n_style_colon(0,3,param_2);
        if (iVar2 != 0) {
          error(0,iVar2,"%s",uVar3);
          return uVar1;
        }
      }
      else {
        if (param_1 == stdin) {
          clearerr_unlocked(param_1);
        }
        else {
          rpl_fclose();
        }
        uVar3 = quotearg_n_style_colon(0,3,param_2);
      }
      uVar4 = dcgettext(0,"read error",5);
      error(0,0,uVar4,uVar3);
      return uVar1;
    }
Coreutils: fmt (academia)
Source
    // ...
      int err = ferror_unlocked(f) ? 0 : -1;
      if (f == stdin)
        clearerr_unlocked(f);
      else if (rpl_fclose(f) != 0 && err < 0)
        err = (*__errno_location());
      if (0 <= err)
        error(0, err, err ? "%s" : dcgettext(((void *)0), "read error", 5),
              quotearg_n_style_colon( 0, shell_escape_quoting_style, file) );
      return err < 0;
    }
Phoenix
    // ...
        v6 = quotearg_n_style_colon(0x0, 0x3, a1);
        if (v4 == 0)
        {
          goto LABEL_400e43;
        }
        error(0x0, v4, "%s");
      }
      else
      {
        if (a0 != *(&stdin))
        {
          rpl_fclose();
        }
        else
        {
          clearerr_unlocked(a0);
        }
        v5 = quotearg_n_style_colon(0x0, 0x3, a1);
    LABEL_400e43:
        error(0x0, 0x0, dcgettext(NULL, "read error", 0x5));
      }
      return get_paragraph(a0, a1, a2, a3, a4, a5);
    }
DREAM
    // ...
      v7 = quotearg_n_style_colon(0x0, 0x3, a1);
    }
    else if (a0 != *(&stdin))
    {
      v4 = rpl_fclose();
      if (v4 != 0) {
        v5 = __errno_location();
        v6 = *(v5);
        if (*(v5) >= 0) {
          v8 = quotearg_n_style_colon(0x0, 0x3, a1);
          if (v6 != 0)
            error(0x0, v6, "%s");
        }
      }
      if (v4 == 0 || *(v5) < 0)
        return 1;
    }
    else {
      clearerr_unlocked(a0);
      return 1;
    }
    if (v3 != 0 || v6 == 0 && v4 != 0 && a0 != *(&stdin) && *(v5) >= 0)
      error(0x0, 0x0, dcgettext(NULL, "read error", 0x5));

    if (v3 != 0 || v4 != 0 && a0 != *(&stdin) && *(v5) >= 0)
      return get_paragraph(a0, a1, a2, a3, a4, a5);
Coreutils: Unstructured Code
Total number of gotos in the decompilation of all binaries in the Coreutils package, compiled at the "-O2" optimization level, across 895 unique functions.
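A goto total like the one described above can be approximated by scanning decompiler output for the `goto` keyword. The sketch below is an assumption about how such a count might be done, not the tooling used for the talk; `count_gotos` is a hypothetical helper, and it does not skip string literals or comments.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* True for characters that can appear inside a C identifier. */
static int is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Counts occurrences of `goto` that stand alone as a token, so that
 * identifiers such as `gotos` or `no_goto` are not counted. */
size_t count_gotos(const char *code)
{
    size_t count = 0;
    const char *p = code;

    while ((p = strstr(p, "goto")) != NULL) {
        int starts_token = (p == code) || !is_ident_char(p[-1]);
        int ends_token = !is_ident_char(p[4]);
        if (starts_token && ends_token)
            count++;
        p += 4;
    }
    return count;
}
```

Running this over each decompiled function and summing the results gives one number per decompiler, which is the kind of comparison the chart makes.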
NoMoreGotos: TooManyConditions
IDA Pro 8.0 and DREAM output for the same fmt function, as shown under "Coreutils: fmt" above.
NoMoreGotos: TooManyConditions
Total number of boolean operators in the decompilation of all binaries in the Coreutils package, compiled at the "-O2" optimization level, across 895 unique functions.
So does anyone win?!?
Total cyclomatic complexity, computed per function and summed, in the decompilation of all binaries in the Coreutils package, compiled at the "-O2" optimization level, across 895 unique functions.
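For reference, cyclomatic complexity is conventionally defined on a function's control flow graph as M = E - N + 2 for a single connected graph with E edges and N nodes. The sketch below applies that standard definition and sums it over functions; how the talk's measurements were actually computed is not stated, and `CfgSize` and both function names are hypothetical.

```c
/* Edge and node counts of one function's CFG (hypothetical record). */
typedef struct {
    int edges;
    int nodes;
} CfgSize;

/* McCabe cyclomatic complexity of a single connected CFG. */
int cyclomatic_complexity(int num_edges, int num_nodes)
{
    return num_edges - num_nodes + 2;
}

/* Sums the per-function complexity over `n` functions. */
int total_complexity(const CfgSize *funcs, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += cyclomatic_complexity(funcs[i].edges, funcs[i].nodes);
    return total;
}
```

A straight-line function (one node, no edges) scores 1 and a single if/else diamond (four nodes, four edges) scores 2, so the metric rises with every additional decision point a reader has to follow.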
Future Areas of Research
Evaluating Decompilation Quality
Control Flow Structuring
CFG Recovery
Usability
Usability: AI
Questions?
Thank You