1 of 78

Modern Approaches in Human-Centric Decompilation

An exploration of past, present, and future decompilation techniques.

Slides: https://tinyurl.com/3mzb797t

2 of 78

Who is Zion?

3 of 78

Who is Zion?

@mahal0z on Twitter

4 of 78

Who is Zion?

@mahal0z on Twitter

5 of 78

Who is Zion?

@mahal0z on Twitter

6 of 78

What is Decompilation?

7 of 78

What is Decompilation?

(sourceless program)

8 of 78

1010101010101010101010010101010010110101

What is Decompilation?

(sourceless compiled program)

9 of 78

1010101010101010101010010101010010110101

What is Decompilation?

(sourceless compiled program)

if(...) {� // code�}�else {� // other code�}

(program source code)

10 of 78

1010101010101010101010010101010010110101

What is Decompilation?

(sourceless compiled program)

if(...) {� // code�}�else {� // other code�}

(program source code)

Decompilation

11 of 78

Why would you want source?

12 of 78

Why would you want source?

if(...) {� // code�}�else {� // other code�}

13 of 78

Why would you want source?

if(...) {� // code�}�else {� // other code�}

14 of 78

Why would you want source?

if(...) {� // code�}�else {� // other code�}

15 of 78

Human-Centric Decompilation

Cognitive Load of Analysis

16 of 78

Human-Centric Decompilation

if(...) {�// code�}�else {

// other code�}

Cognitive Load of Analysis

Decompilation

17 of 78

Talk Outline

  1. Decompilation Origins
  2. Decompilation Research Areas
    1. Fundamental Techniques
    2. Auxiliary Techniques
  3. Modern Decompilers
    • Recent Decompilers
    • Who’s Winning?
    • Why are they winning?
    • What about DREAM?
  4. Future Techniques

18 of 78

Decompilation Origins

19 of 78

Origins: dcc decompiler

(Dissertation, July 1994) [1]

[1] Cifuentes, Cristina. Reverse compilation techniques. Queensland University of Technology, Brisbane, 1994.

20 of 78

Origins: dcc decompiler - pipeline

mov [rbp-0x4], edimov [rbp-0x10], rsicmp [rbp-0x4], 0x3jne 1173

Machine code

21 of 78

Origins: dcc decompiler - pipeline

mov [rbp-0x4], edimov [rbp-0x10], rsicmp [rbp-0x4], 0x3jne 1173

Machine code

Control Flow Graph (CFG)

Lifting

22 of 78

Origins: dcc decompiler - pipeline

mov [rbp-0x4], edimov [rbp-0x10], rsicmp [rbp-0x4], 0x3jne 1173

Machine code

Control Flow Graph (CFG)

Lifting

Structuring

if(...) {�// code�}�else {�// other code�}

23 of 78

Origins: dcc decompiler - pipeline

mov [rbp-0x4], edimov [rbp-0x10], rsicmp [rbp-0x4], 0x3jne 1173

Machine code

Control Flow Graph (CFG)

Lifting

Structuring

if(...) {�// code�}�else {�// other code�}

Optimizing

c = (a == 3) ? a : b;

24 of 78

Origins: dcc decompiler - optimization

  • Use data-flow analysis
  • Based on Compiler Optimizations:
    • Dead Variable Elimination
    • Constant Propagation

25 of 78

Origins: dcc decompiler - structuring

Pattern Match

if(...) {�// code�}�else {�// other code�}

26 of 78

Origins: dcc decompiler - structuring patterns

27 of 78

Decompilation Research Areas

28 of 78

Research Areas

Control Flow Graph Recovery

Control Flow Structuring

Type Inferencing & Variable Recovery

Fundamentals

Symbol Recovery

Usability

29 of 78

Research Areas

Control Flow Graph Recovery

Control Flow Structuring

Type Inferencing & Variable Recovery

Fundamentals

Symbol Recovery

Usability

30 of 78

CFG Recovery

mov [rbp-0x4], edimov [rbp-0x10], rsicmp [rbp-0x4], 0x3jne 1173

Machine code

Control Flow Graph (CFG)

Lifting

31 of 78

CFG Recovery: phases

  1. Disassemble [2], [6]
  2. Lift to Intermediate Representation [3]
  3. Identify Function Boundaries [4]
  4. Resolve Indirect Control Flow [5], [6]

32 of 78

CFG Recovery

[2] Flores-Montoya, Antonio, and Eric Schulte. "Datalog disassembly." Proceedings of the 29th USENIX Conference on Security Symposium. 2020.

[3] Y. Shoshitaishvili et al., "SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis," 2016 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 2016.

[4] Di Federico, Alessandro, Mathias Payer, and Giovanni Agosta. "rev. ng: a unified binary analysis framework to recover CFGs and function boundaries." Proceedings of the 26th International Conference on Compiler Construction. 2017.

[5] Kim, Sun Hyoung, et al. "Refining Indirect Call Targets at the Binary Level." NDSS. 2021.

[6] Pang, Chengbin, et al. "Ground Truth for Binary Disassembly is Not Easy." 31st USENIX Security Symposium (USENIX Security 22). 2022.

  • Disassemble [2], [6]
  • Lift to Intermediate Representation [3]
  • Identify Function Boundaries [4]
  • Resolve Indirect Control Flow [5], [6]

33 of 78

CFG Recovery: Problems

  • Measuring Accuracy
  • Resolving Indirect Jumps (more)

34 of 78

Research Areas

Control Flow Graph Recovery

Control Flow Structuring

Type Inferencing & Variable Recovery

Fundamentals

Symbol Recovery

Usability

35 of 78

Type Inf & Var Recovery

mov [rbp-0x4], edimov [rbp-0x10], rsicmp [rbp-0x4], 0x3jne 1173

Lifting

int c = 3;

36 of 78

Type Inf & Var Recovery: Approaches

  • Dependency Analysis [7], [8]
  • Storage Location [7]
  • Machine Learning [9]

37 of 78

Type Inf & Var Recovery: Approaches

  • Dependency Analysis [7], [8]
  • Storage Location [7]
  • Machine Learning [9]

[7] Lee, JongHyup, Thanassis Avgerinos, and David Brumley. "TIE: Principled reverse engineering of types in binary programs." (2011).

[8] Noonan, Matt, Alexey Loginov, and David Cok. "Polymorphic type inference for machine code." Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2016.

[9] Chen, Qibin, et al. "Augmenting decompiler output with learned variable names and types." 31st USENIX Security Symposium (USENIX Security 22). 2022.

38 of 78

Research Areas

Control Flow Graph Recovery

Control Flow Structuring

Type Inferencing & Variable Recovery

Fundamentals

Symbol Recovery

Usability

39 of 78

Control Flow Structuring

Control Flow Graph (CFG)

Structuring

if(...) {�// code�}�else {�// other code�}

40 of 78

Control Flow Structuring

  • Pattern matching, semantic preserving [10] (Phoenix)
  • Region based, guarding conditions [11] (DREAM)
    • Effect: has no gotos in output
  • Guarding with block duplication [12] (rev.ng)

41 of 78

Control Flow Structuring

  • Pattern matching, semantic preserving [10] (Phoenix)
  • Region based, guarding conditions [11] (DREAM)
    • Effect: has no gotos in output
  • Guarding with block duplication [12] (rev.ng)

[10] Schwartz, Edward J., et al. "Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring." Proceedings of the USENIX Security Symposium. Vol. 16. 2013.

[11] Yakdan, Khaled, et al. "No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations." NDSS. 2015.

[12] Gussoni, Andrea, et al. "A comb for decompiled c code." Proceedings of the 15th ACM Asia Conference on Computer and Communications Security. 2020.

42 of 78

Control Flow Structuring

  • Pattern matching, semantic preserving [10] (Phoenix)
  • Region based, guarding conditions [11] (DREAM)
    • Effect: has no gotos in output
  • Guarding with block duplication [12] (rev.ng)

[10] Schwartz, Edward J., et al. "Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring." Proceedings of the USENIX Security Symposium. Vol. 16. 2013.

[11] Yakdan, Khaled, et al. "No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations." NDSS. 2015.

[12] Gussoni, Andrea, et al. "A comb for decompiled c code." Proceedings of the 15th ACM Asia Conference on Computer and Communications Security. 2020.

43 of 78

Structuring: very impactful for readability

foo()

bar()

44 of 78

Structuring: very impactful for readability

foo()

bar()

if (a || b)� foo();else� bar();

45 of 78

Structuring: very impactful for readability

foo()

bar()

if (a || b)� foo();�else� bar();

if (a)� goto lab_1;��if (b) {� lab_1:� foo();�}�else� bar();

46 of 78

Structuring: very impactful for readability

foo()

bar()

if (a || b)� foo();�else� bar();

if (a)� goto lab_1;��if (b) {� lab_1:� foo();�}�else� bar();

47 of 78

Structuring: very impactful for readability

foo()

bar()

if (a || b)� foo();�else� bar();

if (a)� goto lab_1;��if (b) {� lab_1:� foo();�}�else� bar();

48 of 78

Research Areas

Control Flow Graph Recovery

Control Flow Structuring

Type Inferencing & Variable Recovery

Fundamentals

Symbol Recovery

Usability

49 of 78

Auxiliary Techniques

  • Usability:
    • Understanding decompilation techniques [13], [14]
    • Better information displaying [14]
    • External analysis integration
  • Symbol Recovery
    • Variable Names [9]
    • Function Summarization

[13] Burk, Kevin, et al. "Decomperson: How Humans Decompile and What We Can Learn From It." 31st USENIX Security Symposium (USENIX Security 22). 2022.

[14] Yakdan, Khaled, et al. "Helping johnny to analyze malware: A usability-optimized decompiler and malware analysis user study." 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016.

50 of 78

Short Break

Any questions?

51 of 78

Modern Decompilers

52 of 78

Modern Decompilers

  • IDA Pro: $$$, considered industry standard
  • Binary Ninja: $
  • Ghidra: Open-source, considered best free option
  • angr: Open-source, research oriented
  • Reko: Open-source, easy to understand
  • DREAM: Open-source, Windows XP-only
  • Phoenix: Not available

53 of 78

Modern Decompilers

  • IDA Pro: $$$, considered industry standard
  • Binary Ninja: $
  • Ghidra: Open-source, considered best free option
  • angr: Open-source, research oriented
    • (Phoenix Implementation)
    • (DREAM Implementation)
  • Reko: Open-source, easy to understand
  • DREAM: Open-source, Windows XP-only
  • Phoenix: Not available

54 of 78

DREAM (angr)

  • Control Flow Structuring: region identification, condition guard, pattern match loops

foo()

bar()

55 of 78

Phoenix (angr)

  • Control Flow Structuring: pattern matching, with no knowledge regions

56 of 78

IDA Pro (Hexrays)

  • Control Flow Structuring: (guessed) pattern matching, with no knowledge regions, extended pattern knowledge
  • Naming scheme based on signature (symbol recovery)
  • Integration with dynamic analysis
  • Highly optimized variable defs and use

57 of 78

Ghidra

  • Control Flow Structuring: pattern matching, with no knowledge regions, extended pattern knowledge
  • Integration with dynamic analysis
  • Some symbol recovery

58 of 78

So who is the best?

59 of 78

Previous ways to measure quality

  • Goto count [10], [11], [12]
  • Recompilability [10]
  • Number of lines [11]
  • Cyclomatic Complexity [12]
    • ~ “Linear paths through a program”

60 of 78

Coreutils: fmt (best in class)

// ...int err = ferror_unlocked(f) ? 0 : -1;� if (f == stdin)� clearerr_unlocked(f);� else if (rpl_fclose(f) != 0 && err < 0)� err = (*__errno_location());� if (0 <= err)� error(0, err, err ? "%s" :

dcgettext(((void *)0), "read error", 5),� quotearg_n_style_colon(

0, shell_escape_quoting_style, file)

);� return err < 0;�}

Source

61 of 78

Coreutils: fmt (best in class)

// ...� v3 = quotearg_n_style_colon(0LL, 3LL, file);� goto LABEL_8;�}�if ( f == stdin )�{� clearerr_unlocked(f);� return 1;�}�else�{� if ( (unsigned int)rpl_fclose(f) )� {� v6 = *_errno_location();� if ( v6 >= 0 )� {� v7 = (const char *)quotearg_n_style_colon(0LL, 3LL, file);� v3 = (__int64)v7;� if ( v6 )� {� error(0, v6, "%s", v7);� return paragraph;� }�LABEL_8:� v4 = dcgettext(0LL, "read error", 5);� error(0, 0, v4, v3);� return paragraph;� }� }� return 1;�}

// ...int err = ferror_unlocked(f) ? 0 : -1;� if (f == stdin)� clearerr_unlocked(f);� else if (rpl_fclose(f) != 0 && err < 0)� err = (*__errno_location());� if (0 <= err)� error(0, err, err ? "%s" :

dcgettext(((void *)0), "read error", 5),� quotearg_n_style_colon(

0, shell_escape_quoting_style, file)

);� return err < 0;�}

Source

IDA Pro 8.0

62 of 78

Coreutils: fmt (best in class)

// ...� v3 = quotearg_n_style_colon(0LL, 3LL, file);� goto LABEL_8;�}�if ( f == stdin )�{� clearerr_unlocked(f);� return 1;�}�else�{� if ( (unsigned int)rpl_fclose(f) )� {� v6 = *_errno_location();� if ( v6 >= 0 )� {� v7 = (const char *)quotearg_n_style_colon(0LL, 3LL, file);� v3 = (__int64)v7;� if ( v6 )� {� error(0, v6, "%s", v7);� return paragraph;� }�LABEL_8:� v4 = dcgettext(0LL, "read error", 5);� error(0, 0, v4, v3);� return paragraph;� }� }� return 1;�}

// ...int err = ferror_unlocked(f) ? 0 : -1;� if (f == stdin)� clearerr_unlocked(f);� else if (rpl_fclose(f) != 0 && err < 0)� err = (*__errno_location());� if (0 <= err)� error(0, err, err ? "%s" :

dcgettext(((void *)0), "read error", 5),� quotearg_n_style_colon(

0, shell_escape_quoting_style, file)

);� return err < 0;�}

Source

IDA Pro 8.0

// ...� piVar5 = __errno_location();� iVar2 = *piVar5;� if (iVar2 < 0)

{� return 1;� }

uVar3 = quotearg_n_style_colon(0,3,param_2);

if (iVar2 != 0) {� error(0,iVar2,"%s",uVar3);� return uVar1;� }� }� else

{� if (param_1 == stdin)

{� clearerr_unlocked(param_1);� }� else

{� rpl_fclose();� }

uVar3 = quotearg_n_style_colon(0,3,param_2);� }� uVar4 = dcgettext(0,"read error",5);� error(0,0,uVar4,uVar3);� return uVar1;�}

Ghidra 10.2

63 of 78

Coreutils: fmt (best in class)

// ...� v3 = quotearg_n_style_colon(0LL, 3LL, file);� goto LABEL_8;�}�if ( f == stdin )�{� clearerr_unlocked(f);� return 1;�}�else�{� if ( (unsigned int)rpl_fclose(f) )� {� v6 = *_errno_location();� if ( v6 >= 0 )� {� v7 = (const char *)quotearg_n_style_colon(0LL, 3LL, file);� v3 = (__int64)v7;� if ( v6 )� {� error(0, v6, "%s", v7);� return paragraph;� }�LABEL_8:� v4 = dcgettext(0LL, "read error", 5);� error(0, 0, v4, v3);� return paragraph;� }� }� return 1;�}

// ...int err = ferror_unlocked(f) ? 0 : -1;� if (f == stdin)� clearerr_unlocked(f);� else if (rpl_fclose(f) != 0 && err < 0)� err = (*__errno_location());� if (0 <= err)� error(0, err, err ? "%s" :

dcgettext(((void *)0), "read error", 5),� quotearg_n_style_colon(

0, shell_escape_quoting_style, file)

);� return err < 0;�}

Source

IDA Pro 8.0

// ...� piVar5 = __errno_location();� iVar2 = *piVar5;� if (iVar2 < 0)

{� return 1;� }

uVar3 = quotearg_n_style_colon(0,3,param_2);

if (iVar2 != 0) {� error(0,iVar2,"%s",uVar3);� return uVar1;� }� }� else

{� if (param_1 == stdin)

{� clearerr_unlocked(param_1);� }� else

{� rpl_fclose();� }

uVar3 = quotearg_n_style_colon(0,3,param_2);� }� uVar4 = dcgettext(0,"read error",5);� error(0,0,uVar4,uVar3);� return uVar1;�}

Ghidra 10.2

64 of 78

Coreutils: fmt (academia)

// ...int err = ferror_unlocked(f) ? 0 : -1;� if (f == stdin)� clearerr_unlocked(f);� else if (rpl_fclose(f) != 0 && err < 0)� err = (*__errno_location());� if (0 <= err)� error(0, err, err ? "%s" :

dcgettext(((void *)0), "read error", 5),� quotearg_n_style_colon(

0, shell_escape_quoting_style, file)

);� return err < 0;�}

Source

Phoenix

//

v6 = quotearg_n_style_colon(0x0, 0x3, a1);� if (v4 == 0)� {� goto LABEL_400e43;� }� error(0x0, v4, "%s");� }� else� {� if (a0 != *(&stdin))� {� rpl_fclose();� }� else� {� clearerr_unlocked(a0);� }� v5 = quotearg_n_style_colon(0x0, 0x3, a1);�LABEL_400e43:� error(0x0, 0x0, dcgettext(NULL, "read error", 0x5));� }

return get_paragraph(a0, a1, a2, a3, a4, a5);�}

// ... � v7 = quotearg_n_style_colon(0x0, 0x3, a1);�}�else if (a0 != *(&stdin))�{� v4 = rpl_fclose();� if (v4 != 0) {

v5 = __errno_location();� v6 = *(v5);� if (*(v5) >= 0) { � v8 = quotearg_n_style_colon(0x0, 0x3, a1);� if (v6 != 0)� error(0x0, v6, "%s");� } � }� if (v4 == 0 || *(v5) < 0)

return 1;�}�else {� clearerr_unlocked(a0);� return 1;

}�if (v3 != 0 || v6 == 0 && v4 != 0 && a0 != *(&stdin) && *(v5) >= 0)� error(0x0, 0x0, dcgettext(NULL, "read error", 0x5));��if (v3 != 0 || v4 != 0 && a0 != *(&stdin) && *(v5) >= 0)� return get_paragraph(a0, a1, a2, a3, a4, a5);

DREAM

65 of 78

Coreutils: Unstructured Code

The summation of all gotos present in decompilation of all binaries in the Coreutils package compiled with “-O2” optimization level. Across 895 unique functions.

66 of 78

Coreutils: Unstructured Code

The summation of all gotos present in decompilation of all binaries in the Coreutils package compiled with “-O2” optimization level. Across 895 unique functions.

67 of 78

NoMoreGotos: TooManyConditions

// ...� v3 = quotearg_n_style_colon(0LL, 3LL, file);� goto LABEL_8;�}�if ( f == stdin )�{� clearerr_unlocked(f);� return 1;�}�else�{� if ( (unsigned int)rpl_fclose(f) )� {� v6 = *_errno_location();� if ( v6 >= 0 )� {� v7 = (const char *)quotearg_n_style_colon(0LL, 3LL, file);� v3 = (__int64)v7;� if ( v6 )� {� error(0, v6, "%s", v7);� return paragraph;� }�LABEL_8:� v4 = dcgettext(0LL, "read error", 5);� error(0, 0, v4, v3);� return paragraph;� }� }� return 1;�}

// ... � v7 = quotearg_n_style_colon(0x0, 0x3, a1);�}�else if (a0 != *(&stdin))�{� v4 = rpl_fclose();� if (v4 != 0) {

v5 = __errno_location();� v6 = *(v5);� if (*(v5) >= 0) { � v8 = quotearg_n_style_colon(0x0, 0x3, a1);� if (v6 != 0)� error(0x0, v6, "%s");� } � }� if (v4 == 0 || *(v5) < 0)

return 1;�}�else {� clearerr_unlocked(a0);� return 1;

}�if (v3 != 0 || v6 == 0 && v4 != 0 && a0 != *(&stdin) && *(v5) >= 0)� error(0x0, 0x0, dcgettext(NULL, "read error", 0x5));��if (v3 != 0 || v4 != 0 && a0 != *(&stdin) && *(v5) >= 0)� return get_paragraph(a0, a1, a2, a3, a4, a5);

DREAM

IDA Pro 8.0

68 of 78

NoMoreGotos: TooManyConditions

The summation of all boolean operators present in decompilation of all binaries in the Coreutils package compiled with “-O2” optimization level. Across 895 unique functions.

69 of 78

So does anyone win?!?

The summation of all cyclomatic complexity by function present in the decompilation of all binaries in the Coreutils package compiled with “-O2” optimization level. Across 895 unique functions.

70 of 78

Future Areas of Research

71 of 78

Evaluating Decompilation Quality

  1. How close was the decompilation to source?
  2. Is quality based on closeness?
  3. Can we use source-code metrics?
    1. Gotos, Cyclomatic Complexity, If-stmt amount…
  4. Understand how humans perceive difficulty in Decompilation

72 of 78

Control Flow Structuring

  • What is the middle ground between IDA and DREAM?
  • Is our fundamental structuring approach correct?
  • Why are there still gotos?

73 of 78

CFG Recovery

  • How do we deal with abstraction like classes?
  • How inaccurate is our indirect jump resolution?

74 of 78

Usability

75 of 78

Usability: AI

  • Variable name recovery
  • Types
  • Structuring?
  • Summarization?

76 of 78

Usability: AI

  • Variable name recovery
  • Types
  • Structuring?
  • Summarization

77 of 78

Questions?

78 of 78

Thank You