1 of 23

Efficient Binary-Level Coverage Analysis

M. Ammar Ben Khadra, Dominik Stoffel, and Wolfgang Kunz

2 of 23

Why you need to care?

2/23

3 of 23

Why you need to care?

Inspires confidence in software (up to a level)
Compliance in aerospace & automotive, e.g., DO-178C, ISO-26262
Coverage-guided fuzzing, e.g., AFL
Several other uses, e.g., test suite reduction

3/23

4 of 23

The current workflow

4/23

SRC

Compile

BIN

Run

Report

COV

SRC

“coverage instrumentation prevents at least some compiler optimizations. This results in longer test run times, which leads to more timeouts”

Ivanković et. al., “Code Coverage at Google”, ESEC/FSE’19

5 of 23

A simpler alternative

5/23

Compiler

GCC

Clang

ICC

Optimization

Debug

Release

LTO

Language

C

C++

Rust

BIN

Run

Report

COV

6 of 23

Why binary-level coverage?

6/23

a!=0?

b==0?

c!=0?

Detailed report also for tricky code, e.g., exception handling
Flexibility to instrument specific functions only

b+c

0

7 of 23

Introducing bcov

Static instrumentation for x86-64 ELF binaries

Patch. Code segment for trampolines. Data segment for coverage data
Run with our bcov-rt library
Report basic block/instruction coverage

7/23

8 of 23

Basic mechanism

Insert detours to trampolines to track coverage
Single mov instruction updates coverage data

8/23

Original code

36b62: cmp eax,0x140

36b67: sete al

36b6a: jmp 36bce

Patched code

Trampoline

36b62: cmp eax,0x140 36b67: jmp 6002b8

6002b8: mov BYTE PTR [rip+0xadd88],1 6002bf: sete al

6002c2: jmp 36bce

9 of 23

Many techniques, one tool

9/23

Precision

Efficiency

Transparency

Sliced microexecution

Probe pruning

Optimized probe selection

Single-instruction coverage update

Jump table instrumentation

Semantic-preserving instruction rewriting

Non-return analysis

Detour hosting

10 of 23

Focus of the talk

Probe pruning. We bring Agrawal’s technique to binary-level instrumentation

H. Agrawal. “Dominators, super blocks, and program coverage”, POPL ’94

Sliced microexecution. Novel jump table analysis combining backward slicing and microexecution
Detour hosting. Host detour of a short BB (size < 5 bytes) in a large neighbouring BB. Try to choose the host optimally

10/23

Due to the short time available, we focus in the rest of the talk on the following contributions, which we consider to be the highlights of this work. The first is the probe pruning technique previously proposed by Agrawal in his POPL’94 paper. We choose this technique because of the superior efficiency and flexibility it provides compared to alternatives.

Also, we propose a novel and a highly precise jump table analysis. It combines backward program slicing with microexecution. Hence, we simply called it: sliced microexecution.

The final technique is detour hosting. Variable-length ISAs, such as x86, allow basic blocks to be smaller than the size of a detour, which is 5 bytes. Skipping such small basic blocks will incur a loss in the accuracy of the coverage reported by our tool. To address this problem, we try to choose a host for the detour among the neighbouring basic blocks. We try to make this choice optimal using a greedy algorithm.

11 of 23

Probe pruning

Recover CFG from binary and from it derive SB-DG
Leaf-node policy: covering D, E, and H achieves 100% coverage
Any-node policy: probing A or C enables precise coverage tracking

11/23

CFG

SB-DG

Instrumenting all basic blocks would be highly inefficient. Instead, and by leveraging dominance relationships between basic blocks, we can drastically reduce the number of the required probes. Agrawal’s probe pruning technique works in several steps. Basically, we have to recover the pre-dominator and post-dominator trees, and then we merge them in a dominator graph. Later, we can build the superblock dominator graph that is shown in our example. bcov offers two instrumentation policies based on this graph. In the leaf-node policy, we instrument only the leaves. Covering all leaf nodes implies that we have reached 100% coverage.

However, the leaf-node policy may not be precise enough. For example, it is possible to visit the node C without visiting D. This means that the coverage report might miss A and C even after visiting them. Therefore, we implement the any-node policy where we identify superblocks that are critical. In our example, the superblock (A,C) is critical, which means that either A or C must be instrumented in order to track coverage without data loss.

12 of 23

Sliced microexecution (1)

Does it depend on a constant base?
Slice should depend on a single variable, the jump table index.

12/23

9f6a1: lea r15,[rip+0xe69e4] ; set table base

. ; omitted instructions

9f6f0: movzx eax,r12b ; index is r12b

9f6f4: cmp r12b ,0x5b ; bound comparison

9f6f8: mov QWORD PTR [rsp+0x8],rax

. ; omitted instructions

9f707: ja 9f880 ; jump to default case

9f70d: mov rax,QWORD PTR [rsp+0x8]

9f712: movsxd rax,DWORD PTR [r15+rax*4]

9f716: add rax,r15

9f719: jmp rax ; jump to matching case

Consider the indirect jump instruction shown in this example, which represents a jump table. Our goal is to precisely recover its control flow targets. This is important because instrumentation decisions in bcov depend on the precision of the recovered control flow graph. In fact, instrumentation errors can break the binary. So high precision is important.

We propose sliced microexecution as a precise and robust method for jump table analysis. It combines backward program slicing with microexecution. Microexecution is simply the ability to run any code fragment automatically. We use this novel combination to answer, or falsify, a series of hypotheses about each indirect jump.

The first question is whether the jump table depends on a constant base address? by following the dependencies using backward slicing, we can see that the base address of the jump table is constant. As a result, this table depends on one variable only, which is the index.

13 of 23

Sliced microexecution (2)

Does it have a bound condition?
Yes, condition @ 9f707, which is influenced by r12b

13/23

9f6a1: lea r15,[rip+0xe69e4] ; set table base

. ; omitted instructions

9f6f0: movzx eax,r12b ; index is r12b

9f6f4: cmp r12b ,0x5b ; bound comparison

9f6f8: mov QWORD PTR [rsp+0x8],rax

. ; omitted instructions

9f707: ja 9f880 ; jump to default case

9f70d: mov rax,QWORD PTR [rsp+0x8]

9f712: movsxd rax,DWORD PTR [r15+rax*4]

9f716: add rax,r15

9f719: jmp rax ; jump to matching case

14 of 23

Sliced microexecution (3)

What are the bound limits?
Microexecute the slice. Assign values to r12b in [0, 0x5b) and (0x5b, )

14/23

9f6a1: lea r15,[rip+0xe69e4] ; set table base

. ; omitted instructions

9f6f0: movzx eax,r12b ; index is r12b

9f6f4: cmp r12b ,0x5b ; bound comparison

9f6f8: mov QWORD PTR [rsp+0x8],rax

. ; omitted instructions

9f707: ja 9f880 ; jump to default case

9f70d: mov rax,QWORD PTR [rsp+0x8]

9f712: movsxd rax,DWORD PTR [r15+rax*4]

9f716: add rax,r15

9f719: jmp rax ; jump to matching case

15 of 23

Detour hosting

15/23

Original code

Call @ 0xad6800 is 3 bytes. Return address is 0xad6803
Inserted detour to relocated host @ 0xad67f3
Adjusted return address @ 0x1d31b40

ad67f3: jmp ad6803 ; host

ad67f5: nop [multi-byte] ; padding

ad6800: call QWORD PTR [rax+0x58]

Patched code

Trampoline

ad67f3: jmp 1d31afa ; jump to relocated host

ad67f8: call 1d31b39 ; hosted detour

ad67fd: nop DOWRD PTR [rax]

ad6800: jmp ad67f8 ; jump to hosted detour

1d31b39: mov BYTE PTR [rip+0x4f8d01],1 1d31b40: sub QWORD PTR [rsp], -6 1d31b45: jmp QWORD PTR [rax + 0x58]

The size of basic blocks in x86 can be smaller than 5 bytes. In such a case, we cannot directly insert a detour. For example, we cannot insert a detour in place of this call instruction without overwriting the next basic block. Skipping small basic blocks might be an option. However, it will cause a loss in coverage accuracy. The loss can be high depending on the number of dominators in the superblock dominator graph. So what can we do? We know for sure that we should avoid relocating the entire function to a larger area where we can inline coverage update code. Also, we want to avoid using the INT3 instruction, which is a one-byte instruction used by debuggers to raise software interrupts. It is simply too costly.

Our method is detour hosting where for a group of guests and a group of suitable hosts we try to find an optimal mapping between them using a greedy algorithm. For the sake of example, let us assume that we have found a host for this particular call instruction. bcov will generate the following patch. Now, we are using a 2-byte jump instruction instead of the usual 5-byte detour. This short jump will target the detour hosted in the neighbouring basic block. Note that padding bytes were overwritten in the process. The call detour will target the trampoline, where coverage data will be updated, as usual. However, we now have to adjust the return address to match the one of the original binary. Finally, the original call instruction is rewritten to a jump instruction with the same effect. In this way, detour hosting allowed us to reuse the space already available without a significant increase in the overhead.

16 of 23

Experimental setup

16/23

Binary

gas

perl

python

llc

ffmpeg

libmagickcore

libxerces

libopencv

Compiler

gcc-5.5

gcc-7.4

clang-5.0

clang-8.0

Optimization

Debug

Release

LTO

Total

95 binary

1.6 million function

17 of 23

Scalability and transparency

17/23

bcov can analyze and patch llc (4 million instructions) in 30 seconds
No test regressions after replacing original binaries with instrumented versions

18 of 23

Performance overhead

18/23

Replaced original binaries with ones instrumented with bcov
Average performance overhead is 14%.

19 of 23

Overhead compared to DBI tools

19/23

Pin and DynamoRIO (DR) introduced regressions
bcov offers better efficiency, usability, and transparency

20 of 23

Coverage report accuracy (1)

20/23

BIN

bcov

drcov

COV

Non-determinism!

BIN

bcov + drcov

COV

TP = bcov /\ drcov
FP = bcov /\ !drcov
FN = !bcov /\ drcov
TN = N/A

COV

We use the coverage analysis client drcov to measure the accuracy of the coverage reported by bcov. Our initial idea is to run a binary twice using the same input and then compare the reported coverage. However, we soon discovered that this was not such a good idea because our subjects feature non-deterministic behavior. Therefore, we came up with a different solution where we trace a binary that is already instrumented by bcov. This solved the issue and allowed us to reliably compare both reports at basic block level. On this basis, the basic blocks reported by bcov can be classified to True Positives (TP), False Positives (FP), and False Negatives (FN). Note that the True Negative (TN) category is not applicable to our settings since drcov reports basic blocks that are not tracked by our tool. For example, basic blocks in the libc library.

21 of 23

Coverage report accuracy (2)

21/23

Binaries compiled using gcc-7.4 in release build
Table describes the test suite runs. Dump size given in MB

Module	Process #	drcov size	bcov size	BB #	Inst. #
xerces	80	12.34	4.32	116,378	420,096
magick	58	7.71	2.90	125,521	521,107
llc	7,862	3481.97	4176.16	1,067,151	4,343,021
gas	1,235	71.94	38.56	60,511	220,447
ffmpeg	3,309	423.45	762.39	496,404	3,050,228

22 of 23

Coverage report accuracy (3)

22/23

True Positives (TP) are listed as average/maximum
Precision = TP / (TP + FP), Recall = TP / (TP + FN)

Module	TP BB	TP Inst.	Precision	Recall
xerces	9,523.2 / 21,927	40,651.2 / 92,144	99,98%	99.42%
magick	5,689.4 / 20,709	21,614.9 / 83,444	99,98%	99.94%
llc	4,5184.5 / 90,952	257,209.5 / 461,656	99,98%	99.68%
gas	2,916.4 / 5,015	11,045.8 / 19,578	99.93%	99.67%
ffmpeg	9,682.0 / 14,489	41,439.1 / 63,591	99.98%	99.94%

23 of 23

Try it out!

After 20 years of Dyninst, it is time to revisit the trade offs
We implemented bcov from scratch in more than 17 KLoC in C++
The result is outstanding efficiency, transparency, and accuracy
Reach out on github, ESECFSE subreddit, or directly per email

23/23

https://github.com/abenkhadra/bcov