1 of 153

The (Long) Journey To A Multi-Architecture Disassembler

Nicolas Falliere, Joan Calvet, Cedric Lucas

PNF Software

REcon Montréal 2019

2 of 153

Who Are We?

  • PNF Software: small company, founded in 2013
  • Main activity: developing JEB decompiler

2

You?

We are hiring

JEB Decompiler - PNF Software - 2019

3 of 153

PNF Software In Three Dates

  • 2012: JEB 1.0 release�Decompiles Android apps into Java code. Interactive UI. Scripting.

  • 2014: JEB 2.0 release�Decompiles Windows/Linux executables (x86, ARM, MIPS,...) into C code

  • 2018: JEB 3.0 release�Decompiles non-native platforms (Ethereum, WebAssembly)

3

JEB Decompiler - PNF Software - 2019

4 of 153

JEB’s Native Decompilation Pipeline

Simplified View

4

Disassembler

Native Model

(routines, data…)

IR Converters

Raw IR�(low-level)

IR Optimizers

Final IR�(higher-level, typed)

AST Builder

AST Optimizers

Raw AST

Native Code File

High Level Representation (e.g. pseudo-C)

Focus of this presentation!

JEB Decompiler - PNF Software - 2019

5 of 153

JEB’s Disassembler (1)�Informal Definitions

  • Static analysis producing an assembly-based view of a whole binary, representing what can possibly happen at runtime
    • Note: we are not talking merely about individual machine instructions translation�

5

JEB Decompiler - PNF Software - 2019

6 of 153

JEB’s Disassembler (1)�Informal Definitions

  • Static analysis producing an assembly-based view of a whole binary, representing what can possibly happen at runtime
    • Note: we are not talking merely about individual machine instructions translation�
  • Intended to be a “safe” analysis, i.e. avoiding the following mistakes (ordered by seriousness):
    • Data considered as code (domino effect with wrong cross-references, branches, etc)
    • Code considered as data
    • Missed code and data

6

JEB Decompiler - PNF Software - 2019

7 of 153

JEB’s Disassembler (2)�Why Do We Need It?

  • The disassembler provides foundations for JEB’s decompilation pipeline, in particular:
    • Code versus data separation
    • Control flow
    • Data flow
    • Abstractions: routines, control flow graphs, basic blocks…�
  • It is also useful for manual analysis (especially when decompilation fails…)

7

JEB Decompiler - PNF Software - 2019

8 of 153

JEB’s Disassembler (3)�Development Context

8

  • Developed in-house over the last four years (in Java, like the rest of JEB)�
  • Most of the logic is architecture independant (except for instruction parsing and some heuristics)�
  • Can also parse non-native code (e.g. Ethereum, WebAssembly)

JEB Decompiler - PNF Software - 2019

9 of 153

This Presentation’s Intent

  • Describe what happen in a “real world” disassembler, i.e. go further than the definition of disassemblers in textbooks, where they are usually divided into two families:�
    • Linear disassemblers: sequential disassembly of all bytes belonging to code areas
      • Problem: data mixed with code�
    • Recursive disassemblers: disassemble bytes by following the control flow
      • Problem: following control flow is hard�

9

JEB Decompiler - PNF Software - 2019

10 of 153

This Presentation’s Intent

  • Describe what happen in a “real world” disassembler, i.e. go further than the definition of disassemblers in textbooks, where they are usually divided into two families:�
    • Linear disassemblers: sequential disassembly of all bytes belonging to code area
      • Problem: data mixed with code�
    • Recursive disassemblers: disassemble bytes by following the control flow
      • Problem: following control flow is hard�
  • Hopefully, it will convince you (if needed) that disassembling remains a complex and interesting problem, even in an era of decompilers

10

JEB Decompiler - PNF Software - 2019

11 of 153

Outline

  1. Introduce a simple and naive disassembler algorithm

  • Show some limitations of the simple algorithm

  • Present the way we overcome these limitations in JEB

11

JEB Decompiler - PNF Software - 2019

12 of 153

Disclaimer

This is a research talk (not a sales talk), showing work-in-progress and not intended to present the best ever solution to disassembling.

12

JEB Decompiler - PNF Software - 2019

13 of 153

  1. Manually Disassembling a Toy Example

Setting The Scene

13

14 of 153

14

ROUTINE 1

Visual Studio x86 Compiler

(no optimizations, no inlining, no symbols)

secret.c

secret.exe

CALL ROUTINE 2

ROUTINE 2

…�CALL ROUTINE 3�…

XOR 0x37373737,...

“Ideal”

Disassembler

EXPECTED OUTPUT SKETCH

JEB Decompiler - PNF Software - 2019

15 of 153

How To Get There?

15

?

secret.exe

First, we need some prerequisites...

ROUTINE 1

CALL ROUTINE2

ROUTINE 2

…�CALL ROUTINE 3�…

XOR 0x37373737,...

JEB Decompiler - PNF Software - 2019

16 of 153

Prerequisite 1: Executable File Parsers

  • Executable files formats (PE, ELF, Mach-O,...) provide necessary information for disassemblers:

16

16

Architecture Information

x86, little-endian,...

Memory Mapping

secret.exe

Section 1

Section 2

0x40001000

0x40003000

Entry Point

PE Parser

Section 3

0x40004000

JEB Decompiler - PNF Software - 2019

17 of 153

Interlude: Executable File Parsers in JEB

  • Implement ICodeObjectUnit, an interface providing access to the parsed information (memory mapping, symbols,…) in a unified manner:

17

JEB Decompiler - PNF Software - 2019

18 of 153

Prerequisite 2: Instruction Disassemblers

18

Binary Blobs

55

X86

ARM

MIPS

0F 00 40 10

Parsed Instructions

Mnemonic

PUSH

Operand(s)

EBP

Next Instruction(s)

Fallthrough

SUBNE R0, PC Fallthrough

BEQZ $v0, 0x0F Fallthrough,

+0x0F

JEB Decompiler - PNF Software - 2019

19 of 153

Interlude: Instruction Disassemblers in JEB

  • Implement interface IProcessor:

  • Parsed native instructions all implement IInstruction interface, allowing us to process them generically

19

JEB Decompiler - PNF Software - 2019

20 of 153

Back To Our Toy Example

  • Let’s assume we have a PE parser and a x86 instruction disassembler�
  • How to produce secret.exe disassembly?

20

ROUTINE 1

CALL ROUTINE2

ROUTINE 2

…�CALL ROUTINE 3�…

XOR 0x37373737,...

secret.exe

?

JEB Decompiler - PNF Software - 2019

21 of 153

Picking a First “Intuitive” Strategy

  • Could we use linear disassembly?�

21

JEB Decompiler - PNF Software - 2019

22 of 153

Picking a First “Intuitive” Strategy

  • Could we use linear disassembly?�
  • Visual Studio often puts data in .text section on x86
    • For example: PE data directories, import/export tables, jumptables…

22

JEB Decompiler - PNF Software - 2019

23 of 153

Picking a First “Intuitive” Strategy

  • Could we use linear disassembly?�
  • Visual Studio often puts data in .text section on x86
    • For example: PE data directories, import/export tables, jumptables…
  • Therefore, recursive disassembly might be more suitable: start from the entry-point and follow the code… let’s try!

23

JEB Decompiler - PNF Software - 2019

24 of 153

24

Input Memory Mapping

Disassembler Data Structures

EP

25 of 153

25

PUSH EBP

Disassembler Data Structures

Input Memory Mapping

26 of 153

26

PUSH EBP

MOV EBP, ESP

Disassembler Data Structures

Input Memory Mapping

27 of 153

27

PUSH EBP

MOV EBP, ESP�CMP [EBP+8], 1�

Disassembler Data Structures

Input Memory Mapping

28 of 153

28

Conditional branch: end of block, continue analyzing fallthrough, store target for later analysis

Disassembler Data Structures

PUSH EBP

MOV EBP, ESP

CMP [EBP+8], 1�JNZ 0x40103B

Input Memory Mapping

Addresses To Analyze Later

Intra Routine

0x40103B

29 of 153

29

Fast Forward...

30 of 153

30

Disassembler Data Structures

[...]

CALL 0x4034D0

PUSH EBP

MOV EBP, ESP

CMP [EBP+8], 1�JNZ 0x40103B

Routine call: continue analyzing fallthrough, store target for later analysis

Input Memory Mapping

Addresses To Analyze Later

Intra Routine

0x40103B

Others Routines

0x4034D0

31 of 153

31

Fast Forward...

32 of 153

32

Addresses To Analyze Later

Intra Routine

Disassembler Data Structures

0x40103B

[...]

CALL 0x4034D0

[...]

PUSH EBP

MOV EBP, ESP

CMP [EBP+8], 1�JNZ 0x40103B

RET instruction: end of block, pop next address to analyze

Others Routines

0x4034D0

[...]

RET

Input Memory Mapping

33 of 153

33

Fast Forward...

34 of 153

34

Addresses To Analyze Later

Others Routines

Disassembler Data Structures

[...]

CALL 0x4034D0

[...]

PUSH EBP

MOV EBP, ESP

CMP [EBP+8], 1�JNZ 0x40103B

No more addresses to analyze: current CFG is finished. Analyze next routine.

[...]

RET

[...]

Input Memory Mapping

Intra Routine

0x4034D0

35 of 153

35

Fast Forward...

36 of 153

End Result

36

Routine 1 - starts at 0x401010

Routine 2 - starts at 0x4034D0

MOV EAX, 4

SHL EAX, 0

MOV ECX, [EBP+CH]

MOV EDX, [ECX+EAX]

PUSH EDX

CALL 0x4034D0

ADD ESP, 4

JMP LOC_401044

PUSH EBP

MOV EBP, ESP

CMP [EBP+8], 1�JNZ 0x40103B

POP EBP

RET

XOR EAX, EAX

PUSH EBP

MOV EBP, ESP

MOV EAX, [EBP+8]

PUSH EAX

CALL 0x4034F5

ADD ESP, 4

XOR EAX, [0x414880]

POP EBP

RET

JEB Decompiler - PNF Software - 2019

37 of 153

In the end, we produced the expected disassembly with a simple recursive algorithm!

The magic was in the instruction disassembler...

37

38 of 153

So, it’s not that hard to disassemble whole executables?

How does this algorithm generalize to others Visual Studio executables? To others compilers? To others architectures?

38

39 of 153

2. Some Questionable Assumptions We Made

More Or Less Consciously

39

40 of 153

Assumption 1: CALL Always Return To Caller

40

“Routine call: continue analyzing fallthrough, store target for later analysis”

41 of 153

Counter-Example: Non Returning Calls �Exiting APIs - Visual Studio CRT

41

JEB Decompiler - PNF Software - 2019

42 of 153

Counter-Example: Non Returning Calls �Infinitely Looping Routines - GCC 4.9

42

JEB Decompiler - PNF Software - 2019

43 of 153

Head-Scratching With Non Returning Routines (1)

  • We need to identify non returning CALLs, otherwise our CFG will be incorrect

43

JEB Decompiler - PNF Software - 2019

44 of 153

Head-Scratching With Non Returning Routines (1)

  • We need to identify non returning CALLs, otherwise our CFG will be incorrect

  • Non returning external APIs can be identified from their names, if we have full prototypes with non-returning attribute

44

JEB Decompiler - PNF Software - 2019

45 of 153

Head-Scratching With Non Returning Routines (1)

  • We need to identify non returning CALLs, otherwise our CFG will be incorrect

  • Non returning external APIs can be identified from their names, if we have full prototypes with non-returning attribute

  • Non returning internal routines (e.g. wrappers around non returning APIs) can be identified by analyzing their CFG
    • If CFG has no returning blocks, then it is a non-returning routine

45

JEB Decompiler - PNF Software - 2019

46 of 153

Head-Scratching With Non Returning Routines (2)

  • We can only know an internal routine is non returning after having analyzed it
    • What if we are on a CALL, and we do not have analyzed the target yet?

46

JEB Decompiler - PNF Software - 2019

47 of 153

Head-Scratching With Non Returning Routines (2)

  • We can only know an internal routine is non returning after having analyzed it
    • What if we are on a CALL, and we do not have analyzed the target yet?

  • We could stop analyzing the caller, and analyze the callee first, but:
    • Maintaining the caller’s state could be tricky; possible state explosions with chain of calls
    • Often, we do not know where the callee is!

47

JEB Decompiler - PNF Software - 2019

48 of 153

The JEB Way

  • For external APIs: custom C parser to build type libraries from compiler/SDK header files
    • Type libraries provide non-return attributes for standard APIs�

48

JEB Decompiler - PNF Software - 2019

49 of 153

The JEB Way

  • For external APIs: custom C parser to build type libraries from compiler/SDK header files
    • Type libraries provide non-return attributes for standard APIs�
  • For internal routines:
    • Identify simple cases at the time of the CALL (e.g. trampolines to non-returning API)
    • Otherwise, terminate caller’s analysis assuming callee returns, then analyze callee; if callee found non-returning, re-analyze its callers
      • Tricky: the first caller analysis, without knowing non-returning callees, can be hard to “undo”

49

JEB Decompiler - PNF Software - 2019

50 of 153

Assumption 2: Routine Control Flow Graphs Are Distincts

50

“No more addresses to analyze: current CFG is finished.”

51 of 153

51

Counter-Example:

Routines Sharing Code in Visual Studio 2017 CRT

JEB Decompiler - PNF Software - 2019

52 of 153

52

Then, we parse and found a branch within an existing basic block!

Head-Scratching On Shared Code

Let’s say, we parse first

Do we duplicate instructions into another basic block, or do we split the existing basic block?

1

2

1

2

JEB Decompiler - PNF Software - 2019

53 of 153

Head-Scratching On Shared Code

  • Basic block = series of instructions executed in a row
    • Foundation for later analysis: basic blocks can be processed without dealing with possible control flow changes
    • Exception: exceptions (haha)

53

JEB Decompiler - PNF Software - 2019

54 of 153

Head-Scratching On Shared Code

  • Basic block = series of instructions executed in a row
    • Foundation for later analysis: basic blocks can be processed without dealing with possible control flow changes
    • Exception: exceptions (haha)
  • Duplicating instructions means having different possible basic blocks at the same address
    • Later analysis would become harder to write, as we would need to check all possible basic blocks
    • Likely not a good idea!

54

JEB Decompiler - PNF Software - 2019

55 of 153

The JEB Way

  • During disassembling we build “skeletons” basic blocks (simple containers for instructions, easily modifiable), and split them when we found a branch coming directly within a block

55

JEB Decompiler - PNF Software - 2019

56 of 153

The JEB Way

  • During disassembling we build “skeletons” basic blocks (simple containers for instructions, easily modifiable), and split them when we found a branch coming directly within a block

  • Once disassembly is finished, we build final control flow graphs with proper basic blocks, which means:
    • An address belongs to at most one basick block
    • Basic blocks can be shared among routines

56

JEB Decompiler - PNF Software - 2019

57 of 153

The JEB Way: End Result

57

JEB Decompiler - PNF Software - 2019

58 of 153

Assumption 3: Branch Instructions Immediately End Basic-Blocks

58

Conditional branch: end of block, continue analyzing fallthrough, store target for later analysis”

59 of 153

Counter-Example: MIPS Branch Delay Slots

59

Conditional branch

(if $v0 == $s5)

Branch delay slots (always executed)

Conditional branch

(if $v0 == 0)

JEB Decompiler - PNF Software - 2019

60 of 153

Head-Scratching With Delay Slots (1)

  • Where do we cut the basic block?

60

JEB Decompiler - PNF Software - 2019

61 of 153

Head-Scratching With Delay Slots (1)

  • Where do we cut the basic block?�
  • A basic block = a series of instructions executed successively, i.e. the delay slot belongs to the same block as the branch

61

JEB Decompiler - PNF Software - 2019

62 of 153

Head-Scratching With Delay Slots (1)

  • Where do we cut the basic block?�
  • A basic block = a series of instructions executed successively, i.e. the delay slot belongs to the same block as the branch�
  • But… branch instructions in the middle of basic blocks breaks one of the most common assumption on control flow graphs!�
  • Can we avoid that?

62

JEB Decompiler - PNF Software - 2019

63 of 153

Head-Scratching With Delay Slots (2)

  • A CFG is just a representation, what if we simply revert instructions’ order in the graph?

63

?

JEB Decompiler - PNF Software - 2019

64 of 153

Head-Scratching With Delay Slots (2)

  • A CFG is just a representation, what if we simply revert instructions’ order in the graph?

  • Order of expression evaluation would then be broken
    • Branch condition must be evaluated first
    • Here according to the graph $v0 is set to 1 before being used as the branch condition �=> not a conditional branch anymore

64

?

JEB Decompiler - PNF Software - 2019

65 of 153

Head-Scratching With Delay Slots (3)

  • What if we group the branch and its delay slot into one artificial instruction?

65

?

JEB Decompiler - PNF Software - 2019

66 of 153

Head-Scratching With Delay Slots (3)

  • What if we group the branch and its delay slot into one artificial instruction?

  • It is actually legal to have a branch directly on the delay slot instruction, which we cannot not handle with that representation...

66

?

JEB Decompiler - PNF Software - 2019

67 of 153

The JEB Way

  • Branch instructions are allowed in the middle of basic blocks��
  • Instruction disassemblers provide number of delay slot(s) for branch instructions:

67

JEB Decompiler - PNF Software - 2019

68 of 153

Example of End Result (JEB’s CFG)

68

JEB Decompiler - PNF Software - 2019

69 of 153

Assumption 4: The Instruction Set Remains The Same

69

(never said anything about that, but that was taken for granted, right?)

70 of 153

Counter-Example: ARM/Thumb Switch

70

ARM and Thumb are different instruction sets sharing the same encoding space

Both can be in the same executable...

Thumb ISA

ARM ISA

BLX: Branch with Link and eXchange instruction set

JEB Decompiler - PNF Software - 2019

71 of 153

Head-Scratching With Instruction Set Switching

  • Instruction disassemblers must handle all possible instruction sets for a given architecture�

71

JEB Decompiler - PNF Software - 2019

72 of 153

Head-Scratching With Instruction Set Switching

  • Instruction disassemblers must handle all possible instruction sets for a given architecture�
  • Knowing an address is code (and not data) is not enough, we need to know the actual instruction set�
  • This information can come from various sources:
    • Address is called in a certain way (e.g. ARM’s BLX)
    • Address is referenced in a certain way (e.g. ELF ARM symbol with address LSB set to 1)
    • Address has a specific alignment
    • ...�

72

JEB Decompiler - PNF Software - 2019

73 of 153

The JEB Way

  • JEB’s instruction disassemblers can handle different instructions sets, and switch from one to another based on information provided by the disassembler

  • On unknown code addresses, the disassembler heuristically orders instruction sets by likelihood, and tries all of them until it gets a “correct” result (i.e. a proper routine)

73

JEB Decompiler - PNF Software - 2019

74 of 153

Assumption 5: Control Flow Can Always Be Followed

74

“Routine call: continue analyzing fallthrough address, store target for later analysis

75 of 153

Counter-Example: Jumptables�VS2017 x86 Compact Switch

75

JEB Decompiler - PNF Software - 2019

76 of 153

Counter-Example: Jumptables�VS2017 x86 Compact Switch

76

Discovering control flow here means finding ECX’s possible values

JEB Decompiler - PNF Software - 2019

77 of 153

How To Find Possible Values For Indirect Operands?�(i.e. register or memory operands)

  • We need these values (in particular) to follow indirect branches �
  • For a better control flow we need a better data flow (and vice versa...)�
  • What if… we try to answer that in a syntactic manner?�

77

JEB Decompiler - PNF Software - 2019

78 of 153

The JEB Way: Pattern Matching For Visual Studio x86 Jumptables

78

JEB Decompiler - PNF Software - 2019

79 of 153

The JEB Way: Pattern Matching For Visual Studio x86 Jumptables

79

Get jumptable address

JEB Decompiler - PNF Software - 2019

80 of 153

The JEB Way: Pattern Matching For Visual Studio x86 Jumptables

80

Get jumptable address

Get jumptable size

JEB Decompiler - PNF Software - 2019

81 of 153

The JEB Way: Pattern Matching For Visual Studio x86 Jumptables

81

Get jumptable address

Get jumptable size

Parse jumptable

JEB Decompiler - PNF Software - 2019

82 of 153

The JEB Way: Pattern Matching For Visual Studio x86 Jumptables

82

Get jumptable address

Get jumptable size

Parse jumptable

Control flow improvement

JEB Decompiler - PNF Software - 2019

83 of 153

The JEB Way: Pattern Matching For Visual Studio x86 Jumptables

Specific to Visual Studio x86

Get jumptable address

Get jumptable size

Parse jumptable

Control flow improvement

Generic

(More on how we integrate such compiler-specific logic into the disassembler later)

JEB Decompiler - PNF Software - 2019

84 of 153

Computing control flow with pattern matching... really?

84

JEB Decompiler - PNF Software - 2019

85 of 153

Syntactic solutions might be acceptable, though inelegant, when the target code is very common, because:

  • Development effort is (usually) limited
  • Parsing performances are good

But obviously syntactic solutions cannot scale in the context of a multi-architecture disassembler...

85

JEB Decompiler - PNF Software - 2019

86 of 153

Moar Jumptables Examples�Case 1: ARM GCC

JEB Decompiler - PNF Software - 2019

87 of 153

Moar Jumptables Examples�Case 2: ARM GCC -fPIC

JEB Decompiler - PNF Software - 2019

88 of 153

Moar Jumptables Examples�Case 3: ARM Thumb GCC

JEB Decompiler - PNF Software - 2019

89 of 153

  • These are only a few of the existing ARM jumptables implementations, syntactic solutions seem therefore a long journey to go…

  • And there are cases that are just out of reach!

JEB Decompiler - PNF Software - 2019

90 of 153

MIPS

Position-Independent Code

(System V ABI)

90

Entry Point

?

Syntactic methods clearly not suitable here!

JEB Decompiler - PNF Software - 2019

91 of 153

How To Find Possible Values For Indirect Operands?

Back To The Original Question (And Let’s Forget About Syntactic Solutions)

  • Could we simulate routines execution, such that we would know the machine state (registers and memory) between each instruction?�
    • Need native instructions’ semantics to update the state (that’s the hard part)�
    • Spoiler alert: not always doable in a safe way due to unknown inputs, but could solve simple cases in a “generic” way (i.e. more generic than the syntactic ones)

91

JEB Decompiler - PNF Software - 2019

92 of 153

Interlude: JEB’s Intermediate Language

Definition

Low-level imperative language, made of expressions:

  • 16 different expressions�
  • Expressions can contain sub-expressions (accessed through a tree-like structure)�
  • Highest-level expressions are statements

92

JEB Decompiler - PNF Software - 2019

93 of 153

Interlude: JEB’s Intermediate Language

Example

s32:var1[0:8[ = s8:var2 + s8:var3

93

Textual Representation

JEB Decompiler - PNF Software - 2019

94 of 153

Interlude: JEB’s Intermediate Language

Example

s32:var1[0:8[ = s8:var2 + s8:var3

94

Object Representation

IEAssign

IESlice

IEVar

IERange

IEOperation

IEVar

IEVar

s32:var1

[0:8[

s8:var2

s8:var3

+

Textual Representation

JEB Decompiler - PNF Software - 2019

95 of 153

Interlude: JEB’s Intermediate Language

Main Purpose

The IL’s primary purpose is to allow expressing native instruction’s semantics:

95

X86 Instruction

xor eax, dword ds:[10000h]

s32:_eax = (s32:_eax ^ 32<s16:_ds>[i32:10000h])�s1:_zf = (s32:_eax ? i1:0 : i1:1)�s1:_sf = s32:_eax[31:32[�s1:_pf = PARITY(s32:_eax[0:8[)�s1:_of = i1:0�s1:_cf = i1:0

Semantic Representation in JEB’s IL

JEB Decompiler - PNF Software - 2019

96 of 153

Interlude: JEB’s Intermediate Language

Main Purpose

  • IL semantic representation is the foundation for JEB’s decompilation pipeline:

96

Disassembler

IR Converters

IL Semantic Representation

IR Optimizers

Optimized IL�(higher-level, typed)

AST Builder

AST Optimizers

JEB Decompiler - PNF Software - 2019

97 of 153

Interlude: JEB’s Intermediate Language

Main Purpose

  • Heavy lifting is done by native-to-IL converters:

97

JEB Decompiler - PNF Software - 2019

98 of 153

Having access to native-to-IL converters allows us to implement the simulation at the IL level:

  • We only have to handle the 16 IL expressions (rather than all native instructions)�
  • All architectures for which we have a native-to-IL converter will benefit from it�
  • Drawback: the performance cost of IL conversion

98

JEB Decompiler - PNF Software - 2019

99 of 153

The JEB Way : Intermediate Language Simulation

  1. Convert each native routine into its equivalent semantic representation in JEB’s IL
    • Produces CFG of IL statements (no optimizations at this point)
    • For example, here is secret.exe main() first IL basic block:

99

s32:_esp = (s32:_esp - i32:00000004h)

32<s16:_ss>[s32:_esp] = s32:_ebp

s32:_ebp = s32:_esp

32<s16:_ds>[i32:004152A0h] = i32:00401000h

s1:_zf = ((32<s16:_ss>[(s32:_ebp + i32:00000008h)] - i32:00000002h) ? i1:0 : i1:1)

s1:_sf = (32<s16:_ss>[(s32:_ebp + i32:00000008h)] - i32:00000002h)[31:32[

s1:_pf = PARITY((32<s16:_ss>[(s32:_ebp + i32:00000008h)] - i32:00000002h)[0:8[)

s1:_cf = (32<s16:_ss>[(s32:_ebp + i32:00000008h)] <u i32:00000002h)

s1:_af = ((32<s16:_ss>[(s32:_ebp + i32:00000008h)] ^ i32:00000002h) ^

s32:_eip = (s1:_zf ? i32:00401033h : i32:0040104Dh)

JEB Decompiler - PNF Software - 2019

100 of 153

The JEB Way : Intermediate Language Simulation

2. Simulate IL routine to build the machine state at each instruction:

    • Start from a clean state (pseudo-realistic values in registers, and allocated stack memory)�
    • Implement IL expression evaluation:

100

JEB Decompiler - PNF Software - 2019

101 of 153

The JEB Way : Intermediate Language Simulation

101

IEOperation evaluation (excerpt)

JEB Decompiler - PNF Software - 2019

102 of 153

The JEB Way : Intermediate Language Simulation

  • Routine simulator follows the control flow from the entry-point and evaluates all encountered statements�

102

JEB Decompiler - PNF Software - 2019

103 of 153

The JEB Way : Intermediate Language Simulation

  • Routine simulator follows the control flow from the entry-point and evaluates all encountered statements�
  • Simulator is intended to be “safe”, i.e. to provide only trustable values:�
    • When IL expression evaluation fails (because of missing information, e.g. an uninitialized memory slot), we abort the routine simulation�
    • On subroutine calls, we consider all registers spoiled (i.e. now containing an unknown value)�

103

JEB Decompiler - PNF Software - 2019

104 of 153

The JEB Way : Intermediate Language Simulation

3. Report the values found during simulation to the disassembler

104

JEB Decompiler - PNF Software - 2019

105 of 153

The JEB Way:

MIPS

Position-Independent

Code�(before IL simulation)

105

JEB Decompiler - PNF Software - 2019

106 of 153

The JEB Way:

MIPS

Position-Independent

Code

(after IL simulation)

106

JEB Decompiler - PNF Software - 2019

107 of 153

Interlude: Revisiting Syntactic Solutions With JEB IL

Example: x86/ARM Jumptables

Disassembly

Optimized JEB’s IL

...�if (32[(s32:_SP0 - i32:8h)] >u i32:9h)

goto 0027

...

s32:_eip = 32[((s32:_ecx * i32:4h) + i32:00401440h)]

...

if (s32:_R3 >u i32:9h)

goto 0021�...

s32:_PC = 32[((s32:_R3 * i32:4h) + i32:00010714h)]

JEB Decompiler - PNF Software - 2019

108 of 153

Interlude: Revisiting Syntactic Solutions With JEB IL

Example: x86/ARM Jumptables

Disassembly

Optimized JEB’s IL

...�if (32[(s32:_SP0 - i32:8h)] >u i32:9h)

goto 0027

...

s32:_eip = 32[((s32:_ecx * i32:4h) + i32:00401440h)]

...

if (s32:_R3 >u i32:9h)

goto 0021

...

s32:_PC = 32[((s32:_R3 * i32:4h) + i32:00010714h)]

One pattern catches both implementations!

JEB Decompiler - PNF Software - 2019

109 of 153

IL simulation provides only concrete and trustable values, and therefore does not always work.

So what if we cannot follow the control flow from main()? Do we have another way to find secret_algo()?

109

JEB Decompiler - PNF Software - 2019

110 of 153

Distinguishing Code From Data

  • In theory, intractable problem (like anything interesting in program analysis)��

110

JEB Decompiler - PNF Software - 2019

111 of 153

Distinguishing Code From Data

  • In theory, intractable problem (like anything interesting in program analysis)��
  • In practice, it is indeed a hard problem on most architectures, because:
    • Code and data share the same memory space
    • Almost any series of bytes corresponds to a machine instruction, due to instruction sets encoding “density”�

111

JEB Decompiler - PNF Software - 2019

112 of 153

Distinguishing Code From Data

  • In theory, intractable problem (like anything interesting in program analysis)��
  • In practice, it is indeed a hard problem on most architectures, because:
    • Code and data share the same memory space
    • Almost any series of bytes corresponds to a machine instruction, due to instruction sets encoding “density”�
  • But… context can help!

112

JEB Decompiler - PNF Software - 2019

113 of 153

Distinguishing Code From Data

  • Raw dump of a x86 executable compiled by Visual Studio 2017:

113

CODE or DATA?

push ebp�mov ebp, esp

(classic VS routine prologue)

int 3�int 3

...(classic VS code padding)

likely CODE!

pop ebp

ret

(classic VS routine epilogue)

JEB Decompiler - PNF Software - 2019

114 of 153

Distinguishing Code From Data

  • Memory view of a x86 executable compiled by GCC 4.9:

114

KNOWN CODE

KNOWN CODE

KNOWN DATA

0x8048000

0x8050000

KNOWN DATA

CODE or DATA?

CODE or DATA?

GCC for x86 (usually) does not mix code and data!

likely CODE

likely DATA

JEB Decompiler - PNF Software - 2019

115 of 153

The JEB Way: Compiler-Specific Heuristics

  • Identify the compiler that served to create the target and apply specific heuristics�

115

JEB Decompiler - PNF Software - 2019

116 of 153

The JEB Way: Compiler-Specific Heuristics

  • Identify the compiler that served to create the target and apply specific heuristics�
    • Example: address A is considered to be likely code if:

116

compiler is gcc or clang�and architecture is x86 �and no obfuscations/malformations�and A is within code area �and bytes at A do not look like code padding

JEB Decompiler - PNF Software - 2019

117 of 153

The JEB Way: Compiler-Specific Heuristics�How To Integrate Them Into a Generic Disassembler? (1)

  • At startup, JEB loads different “extensions” for each compiler/file format/architecture:

117

JEB Decompiler - PNF Software - 2019

118 of 153

The JEB Way: Compiler-Specific Heuristics�How To Integrate Them Into a Generic Disassembler? (2)�

  • Disassembler then queries loaded extensions for heuristics, for example:

118

See INativeCodeAnalyzerExtension

JEB Decompiler - PNF Software - 2019

119 of 153

The JEB Way: Compiler-Specific Heuristics�What If... Heuristics Are Wrong?

  • Mistakes happen: misidentified compiler, new/old compiler version, obfuscation…�

119

JEB Decompiler - PNF Software - 2019

120 of 153

The JEB Way: Compiler-Specific Heuristics�What If... Heuristics Are Wrong?

  • Mistakes happen: misidentified compiler, new/old compiler version, obfuscation…�
  • Disassembler feedback loop:
    • Errors are logged (e.g. supposedly-code bytes cannot be disassembled, routine cannot be built,...)
    • If too many errors made, disassembler switches back to “safe mode”, where only conservative heuristics are applied�

120

JEB Decompiler - PNF Software - 2019

121 of 153

The JEB Way: Compiler-Specific Heuristics�What If... Heuristics Are Wrong?

  • Mistakes happen: misidentified compiler, new/old compiler version, obfuscation…�
  • Disassembler feedback loop:
    • Errors are logged (e.g. supposedly-code bytes cannot be disassembled, routine cannot be built,...)
    • If too many errors made, disassembler switches back to “safe mode”, where only conservative heuristics are applied�
  • Last resort: user can change disassembler decisions

121

JEB Decompiler - PNF Software - 2019

122 of 153

Assumption 6: All Code Matters

122

What’s Up With atoi()?

123 of 153

Counter-Example: Statically Linked Library Routines

123

... [large routine] ...

Pretty complex routine, but… it’s “just” atoi()!

JEB Decompiler - PNF Software - 2019

124 of 153

Identifying Library Routines

  • One of the oldest reverse-engineering problem...�

124

JEB Decompiler - PNF Software - 2019

125 of 153

Identifying Library Routines

  • One of the oldest reverse-engineering problem...�
  • Such identification allows users to read documentation rather than analyze code

125

JEB Decompiler - PNF Software - 2019

126 of 153

Identifying Library Routines

  • One of the oldest reverse-engineering problem...�
  • Such identification allows users to read documentation rather than analyze code

  • Such identification allows automatic analysis to have precise information on the routine, in particular its prototype:

126

JEB Decompiler - PNF Software - 2019

127 of 153

The JEB Way: Signatures (Basic) Workflow

127

JEB

Files with named routines

Object Files

Signatures�Generator

Signatures

Disassembler

Executables With Symbols

JEB Projects

Generates

Loads and matches

against routines

...

JEB Decompiler - PNF Software - 2019

128 of 153

Interlude: JEB Native Signatures (1)

  • Signatures are designed as generic containers to identify an “item”, i.e. anything possibly defined in JEB native model (routines, basic blocks, data blobs, etc)�

128

JEB Decompiler - PNF Software - 2019

129 of 153

Interlude: JEB Native Signatures (1)

  • Signatures are designed as generic containers to identify an “item”, i.e. anything possibly defined in JEB native model (routines, basic blocks, data blobs, etc)�
  • For that purpose, they contain two collections:�
    • Features: characteristics of the item serving to identify it
      • Each feature can be matched against another�
    • Attributes: knowledge on the item (name, origin,...)

129

JEB Decompiler - PNF Software - 2019

130 of 153

Interlude: JEB Native Signatures (2)

  • Example: signature for _memcpy_s routine from Visual Studio 2008 /MT standard libraries:�
    • Features:
      • Routine size: 57
      • Routine hash: A2D4D55381F634B0...34F4F0176D181218D
    • Attributes:
      • Routine name: _memcpy_s
      • Original file name: libcmt.lib>memcpy_s.obj
      • Compiler name: Microsoft Visual C++ 2008 (9.0)

130

JEB Decompiler - PNF Software - 2019

131 of 153

Which Features To Identify Compiler Library Routines?

  • Compiler routines signatures should be false positive free, because:
    • JEB’s decompilation pipeline is going to rely on them (e.g. by using the corresponding prototypes)
    • Users are going to rely blindly on them in the UI�

131

JEB Decompiler - PNF Software - 2019

132 of 153

Which Features To Identify Compiler Library Routines?

  • Compiler routines signatures should be false positive free, because:
    • JEB’s decompilation pipeline is going to rely on them (e.g. by using the corresponding prototypes)
    • Users are going to rely blindly on them in the UI�
  • False positives happen when a matched signature does not convey the original routine behavior (or the original prototype):
    • For example, _memmove and _memcpy have the exact same code (and therefore behavior) in VS2013 /MT libraries; identifying one as the other is not a false positive in this context

132

JEB Decompiler - PNF Software - 2019

133 of 153

Feature: Routine Code Hash (1)

  • Hash computed from the routine assembly code:�
    • Using assembly rather than binary code allows to avoid generating signatures for all different endianness (e.g. on MIPS)�
    • We use a custom (and simple) hash algorithm producing a 64-byte value�
  • Hash computation is (almost) the same for x86/ARM/MIPS thanks to IInstruction interface�

133

JEB Decompiler - PNF Software - 2019

134 of 153

Feature: Routine Code Hash (2)

  • The parts of the routine code depending of its actual location are normalized before hash computation :

134

X86 object file snippet

call [address]

xor ecx, ecx

mov [address], eax

More complex normalization cases exist (e.g. ARM relocations can change BL mnemonic to BLX)

Normalization: abstract absolute addresses

(any constant actually)

JEB Decompiler - PNF Software - 2019

135 of 153

It Might Not Be Enough...

  • For example in the case of wrappers:

135

Same routine code hash, but different behaviors

=> another feature: name of called routines

JEB Decompiler - PNF Software - 2019

136 of 153

It Might (Still) Not Be Enough...

136

Same routine code hash and callee routines, but different behaviors

=> another feature: constants

JEB Decompiler - PNF Software - 2019

137 of 153

Signatures Generation Strategy (1)

  • Adding all possible features in each signature is non-optimal, because:
    • Matching performance (during disassembling) depends on the number of features in signatures�
    • Some features can be useless (e.g. a large routine is likely to be uniquely identified only by its hash)�
    • Some features can be a burden (e.g. with called routines name the callees routines have to be matched before the caller)�
  • Therefore, we ideally want the minimal set of features to identify a routine without false positives�

137

JEB Decompiler - PNF Software - 2019

138 of 153

Signatures Generation Strategy

Pragmatic Approach

  • Start with a minimal set of features for each routine:
    • routine code hash
    • for small routines: called routine names�

138

JEB Decompiler - PNF Software - 2019

139 of 153

Signatures Generation Strategy (2)

Pragmatic Approach

  • Start with a minimal set of features for each routine:
    • routine code hash
    • for small routines: called routine names�
  • Compare with the routines of the same library:�
    • If no other routine has the same features, keep the signature as-is�
    • If there exists a routine with a different name (= different behavior) and the same features: insert additional features until the collision is resolved

139

JEB Decompiler - PNF Software - 2019

140 of 153

How To Deal With Indistinguishable Routines?

  • Some routines do not have particular features to distinguish them, e.g. very small routines:

140

JEB Decompiler - PNF Software - 2019

141 of 153

How To Deal With Indistinguishable Routines?

  • Some routines do not have particular features to distinguish them, e.g. very small routines:

  • But their location might help:
    • We generate signatures with “low” confidence for these routines; by default those signatures are not used during matching
    • If a memory area contains a large number of routines identified from the same library, we apply the untrusted signatures to this particular area

141

JEB Decompiler - PNF Software - 2019

142 of 153

JEB Signatures Packages

  • More than 200 packages at this point (mainly x86/x64 Visual Studio, ARM/ARM64 Android NDK)
    • Visual Studio was the primary target, which might explain some biases in the signatures generation process�
  • Based on identified compiler, packages are loaded at runtime, and each disassembled routine is tentatively matched
    • Also, users can generate their own packages

142

Packages List Extract

JEB Decompiler - PNF Software - 2019

143 of 153

3. Enough With Broken Assumptions, What’s The Point?

143

144 of 153

Let’s Sum Up

  • Originally, we successfully disassembled secret.exe with a simplistic recursive algorithm, but we made a lot of assumptions on the way �
  • Then, we showed that many of these assumptions can be broken just by looking at standard compilers’ code
  • You probably have in mind a tons of others broken assumptions we made (instructions do not overlap, code does not modify itself…)
    • It’s the way many obfuscation techniques work, by breaking assumptions made by analysis tools

144

JEB Decompiler - PNF Software - 2019

145 of 153

Pessimistic Realistic Conclusion

There is no such thing as a disassembler able to correctly disassemble all programs for all architectures/compilers

  • No interesting assumptions can hold true on so many diverse programs�
  • If I was still an academic, I would here pedantically mention the link with the halting problem and its generalization, the Rice’s theorem

145

JEB Decompiler - PNF Software - 2019

146 of 153

We cannot disassemble correctly all programs, but we might still be able to do “ok” on a subset of them.

146

JEB Decompiler - PNF Software - 2019

147 of 153

What Can We Do?

  • Divide the universe of programs into “groups” with the following properties:
    • There exists reliable ways to check if a program belongs to the group
    • There are non trivial assumptions holding true for the whole group�

147

JEB Decompiler - PNF Software - 2019

148 of 153

What Can We Do?

  • Divide the universe of programs into “groups” with the following properties:
    • There exists reliable ways to check if a program belongs to the group
    • There are non trivial assumptions holding true for the whole group�
  • When a program does not belong to a known group, apply only conservative assumptions�

148

JEB Decompiler - PNF Software - 2019

149 of 153

The JEB Way

(Very Simplified) Disassembler Workflow

149

Initialization

Executable File Parsing

Compiler Identification

JEB Decompiler - PNF Software - 2019

150 of 153

The JEB Way

(Very Simplified) Disassembler Workflow

150

Strategy

Modifies

Modifies

Initialization

Disassembler Engine

Executable File Parsing

Gaps Processing

CFGs Building

IL Simulation

Signatures Matching

Extensions

Errors Logger

Gap Processors

Recursive Disassembling

Compiler Identification

Instantiates

Queries

Report Errors

JEB Decompiler - PNF Software - 2019

151 of 153

Final Notes

  • Building “groups” of programs is easier (for developers) if: �
    • The disassembler is coded in an “informative” way, i.e. it explicitly reports broken assumptions�
    • The disassembler is thoroughly tested on diverse sample sets �
    • Users report samples breaking the analysis (when they can ;-))
      • But ideally, the disassembler should provide them the way to tweak the broken assumptions

151

JEB Decompiler - PNF Software - 2019

152 of 153

Conclusion

  • Hopefully this presentation convinced you (if it was needed) that disassembly remains a complicated problem, albeit an old one

  • Interestingly, many novel anti-exploitation techniques tend to make disassembly easier, because they provide hints for “code versus data”
    • ELF executable segments, Microsoft Control Flow Guard, Intel Control-flow Enforcement Technology,...�

152

JEB Decompiler - PNF Software - 2019

153 of 153

Thank you!

Contact:

  • join us on slack!
  • twitter @jebdec
  • email us privately at support@pnfsoftware.com

153

JEB Decompiler - PNF Software - 2019