1 of 24

Past and Present of Program Diffing

Joxean Koret

2 of 24

Past and Present of Program Diffing

  1. Introduction
  2. BinDiff
  3. Diaphora+Pigaios and other new tools
  4. Academia
  5. The possible future

3 of 24

Introduction

What is program diffing?

  • Technique(s) used to find differences between 2 or more programs.

What is BinDiffing?

  • The process of finding the differences between 2 or more binary programs.
  • Term coined by Thomas Dullien (aka Halvar Flake).

What are we going to see in this talk:

  • What was invented back in the day, which tools are available, how the field evolved, and where it is most likely going.

4 of 24

BinDiffing

Let’s talk about the first...

  • Time the idea was presented: “More Fun with Graphs”, BlackHat Federal 2003. Halvar Flake.
  • Known prototype: Written somewhere around 2003 by Halvar Flake.
    • There was a paper by Todd Sabin and supposedly another prototype somewhere around 2004: “Comparing Binaries with Graph Isomorphism”.
    • https://securiteam.com/securityreviews/5EP0320CKC/
  • Commercial tool: “Sabre BinDiff 1.X”, in 2004.

5 of 24

BinDiff, the tool

Commercial (closed source) tool written in C++ with GUI in Java, by Halvar. Rolf Rolles joined Sabre (later Zynamics) and rewrote it to allow per-block matching.

Rolf left and the core was rewritten again by Halvar and Sören Meyer-Eppler somewhere around 2007.

The initial version of BinDiff (1.6?) had the following features for finding matches:

  • Heuristics based on graph isomorphism, as well as on strings, recursive functions, prime products, instruction-level comparison, etc.
  • It had no GUI; it just launched 2 WinGraph instances.
  • Ability to port function names between databases.
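
The prime products heuristic listed above can be sketched roughly as follows. This is an illustration of the general idea only, not BinDiff's actual code: every instruction mnemonic gets a distinct prime, and the primes of a function are multiplied together, so two functions whose instructions were merely reordered produce the same value. The names PrimeTable and primes_product, and the sample mnemonic lists, are hypothetical.

```python
from itertools import count

def primes():
    """Yield primes 2, 3, 5, ... by trial division (fine for a small table)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

class PrimeTable:
    """Assigns one distinct prime to each mnemonic the first time it is seen."""
    def __init__(self):
        self._gen = primes()
        self._table = {}

    def prime_for(self, mnemonic):
        if mnemonic not in self._table:
            self._table[mnemonic] = next(self._gen)
        return self._table[mnemonic]

def primes_product(mnemonics, table):
    """Multiply the primes of all mnemonics; the result ignores instruction order."""
    product = 1
    for mnemonic in mnemonics:
        product *= table.prime_for(mnemonic)
    return product

# Hypothetical functions, reduced to their mnemonic sequences.
table = PrimeTable()
original = ["push", "mov", "sub", "call", "ret"]
reordered = ["push", "sub", "mov", "call", "ret"]            # same instructions, new order
patched = ["push", "mov", "sub", "call", "call", "ret"]      # one extra call

same = primes_product(original, table) == primes_product(reordered, table)
changed = primes_product(original, table) != primes_product(patched, table)
```

Because multiplication is commutative, the product survives compiler-induced instruction reordering, while any added or removed instruction changes it.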

6 of 24

BinDiff, the tool

Zynamics times:

  • Versions 2.0 and 2.1 added the first external Java GUI tool and support for Linux.
  • Versions 3.0 to 3.2 added tighter integration with IDA and support for MacOSX.

7 of 24

BinDiff, the tool

Google times (after they bought Zynamics):

  • External-facing BinDiff development was halted.
  • Version 4.0 added call graph views, proximity browsing, importing IDA comments and incremental diffing among many other new features.
  • Versions 4.1 to 4.3 dropped and re-added support for MacOSX, added more architectures and… received minimal to no public changes.
    • Since version 4.1, BinDiff is essentially kept alive by Christian Blichmann working on it as a 20% project.
  • Version 5 was published in 2019, two years after the previous version.

8 of 24

BinDiff, the tool

BinDiff 3.0 was published in 2009.

BinDiff 4.0 was published in 2011.

BinDiff 4.1 was published in 2014.

BinDiff 4.2 was published in 2016.

BinDiff 4.3 was published in 2017.

And BinDiff 5 was published in 2019.

Between BinDiff 4.0 and 5, there was little to no support.

9 of 24

Other tools

Since the first version of BinDiff was published, various other tools have appeared:

  • Turbo Diff.
  • Patch Diff/Patch Diff 2.
  • DarunGrim.
  • Diaphora.
  • YaDiff.

The quality of these alternative tools, all of them Open Source, varies greatly. However, they all have one thing in common: they are (or seem to be) abandoned, with the exceptions of YaDiff and Diaphora.

10 of 24

Diaphora, reviving binary diffing

The initial version of Diaphora was published in 2015, at SyScan. I wrote it out of despair with BinDiff after Zynamics was bought by Google, and with all the other Open Source alternatives being dead and/or unmaintained.

Also, because BinDiff didn’t have many of the features I wanted.

Let’s talk about the features that, in 2015, were the top of the top...

11 of 24

Binary Diffing Features in 2015

BinDiff 4.X offered the following features:

  • Export/import function names and comments.
  • Support for assembly level patch diffing.
  • Support for importing symbols from libraries.
  • Support for diffing across different architectures, interchangeably.
    • E.g.: diffing one x86 program against a PPC one.
  • Call graph matching.
  • An independent GUI, but with tight integration with IDA, the de facto RE tool.
  • Support for Windows, Linux and MacOSX.

And… that’s about it.

12 of 24

Binary Diffing Features in 2015

BinDiff 4.X turned out to be frustrating for my daily job:

  • Take one binary and work on it.
  • When a new version appears, port function names, comments… as well as structs, enums and everything around pseudo-code.
  • All of this had to be done manually.
  • When I had around 10 scripts written for doing it every single time my target updated, I decided it was stupid and that I had to do something.
    • I was researching AVs during that time, and the cores change daily, weekly or monthly if you’re lucky.
  • Asking Zynamics (Google) to implement anything new was not an option.
  • And I started writing Diaphora.

13 of 24

Diaphora as of 2019

Diaphora added many new features not available in other public tools:

  • Pseudo-code based heuristics.
  • Support for exporting and importing pseudo-code comments.
  • Visually diffing changes at pseudo-code level.
  • Support for exporting and importing enums, structs and prototypes.
  • Command line batch exporting and diffing.
  • Export hooks, aka Python scripting support for exporting.
  • Parallel diffing.
  • Direct source code matching and diffing. An external tool for now: Pigaios.
  • ML based heuristics. Only for Pigaios, for now.
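
As a rough idea of what a pseudo-code based heuristic can look like, here is a minimal sketch using only Python's standard difflib module. It is not Diaphora's actual implementation: the normalization rules (replacing auto-generated names such as sub_401000 or v1 with a placeholder) and the function names normalize and similarity are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Auto-generated identifiers that change between builds (hypothetical patterns).
_AUTO_NAMES = re.compile(r"\b(?:sub|loc|off|dword|byte)_[0-9A-Fa-f]+\b|\b[va]\d+\b")

def normalize(pseudocode):
    """Replace auto-generated names with a placeholder and drop blank lines."""
    lines = []
    for line in pseudocode.splitlines():
        line = _AUTO_NAMES.sub("ID", line).strip()
        if line:
            lines.append(line)
    return lines

def similarity(pseudo_a, pseudo_b):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(pseudo_a), normalize(pseudo_b)).ratio()

# Made-up decompiler output for three builds of the same function.
old = """int sub_401000(int a1) {
  int v1 = a1 + 1;
  return sub_402000(v1);
}"""
new = """int sub_501230(int a1) {
  int v3 = a1 + 1;
  return sub_50ff00(v3);
}"""
patched = """int sub_501230(int a1) {
  if (a1 < 0) return -1;
  int v3 = a1 + 1;
  return sub_50ff00(v3);
}"""

unchanged_score = similarity(old, new)       # only addresses and variable names moved
patched_score = similarity(old, patched)     # one real line was added
```

A score of exactly 1.0 suggests the function only moved, while a high but imperfect score flags a candidate for a patched function worth reviewing visually.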

14 of 24

New tools since 2015

Since Diaphora was published, only one more Open Source tool appeared: YaDiff, from the YaCo project (a Collaborative Reverse-Engineering plugin tool for IDA).

YaDiff (the diffing part), however, only focuses on exporting and importing symbols between databases.

Extremely fast. Simplistic heuristics. A good and fast tool when it works.

It lacks support for exporting/importing enums, structs and anything related to pseudo-code.

It seems to be maintained as of today and, probably, will be ported to Ghidra.

15 of 24

Program Diffing in academia

There are various great papers about program diffing in academia (with no accompanying source code or binary whatsoever, with only a few exceptions).

Some of my favourite papers, from which I have extracted many ideas for Diaphora or Pigaios, are shown in the next slides...

16 of 24

Academic Papers

Efficient Features for Function Matching Between Binary Executables

  • Paper by Chariton Karamitas (aka Huku) et al.
  • Explains “a set of carefully chosen features, extracted from a binary's CG and CFG” for initial function matching, as well as algorithms to “propagate approximate matching”.
  • First time I’ve seen anyone researching heuristics “applying Markov lumping techniques to function CFGs”.

A new algorithm for Diaphora (КОКА, from Koret-Karamitas) was implemented based on the awesome ideas of Huku.
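
To make the two steps more concrete (initial matching from call graph and CFG features, then propagation), here is a deliberately simplified sketch. It is neither the algorithm from the paper nor the one implemented in Diaphora: the feature tuple (CFG nodes, CFG edges, number of calls), the function names and the sample data are all assumptions made for this example.

```python
from collections import defaultdict

def unique_feature_matches(feats_a, feats_b):
    """Match functions whose feature tuple occurs exactly once on each side."""
    def index(feats):
        by_feat = defaultdict(list)
        for name, feat in feats.items():
            by_feat[feat].append(name)
        return by_feat

    idx_a, idx_b = index(feats_a), index(feats_b)
    return {
        names[0]: idx_b[feat][0]
        for feat, names in idx_a.items()
        if len(names) == 1 and len(idx_b.get(feat, [])) == 1
    }

def propagate(matches, calls_a, calls_b, feats_a, feats_b):
    """Extend matches through the call graph: compare only callees of matched pairs."""
    matches = dict(matches)
    changed = True
    while changed:
        changed = False
        for fa, fb in list(matches.items()):
            todo_a = {c: feats_a[c] for c in calls_a.get(fa, []) if c not in matches}
            todo_b = {c: feats_b[c] for c in calls_b.get(fb, [])
                      if c not in matches.values()}
            new = unique_feature_matches(todo_a, todo_b)
            if new:
                matches.update(new)
                changed = True
    return matches

# Hypothetical features: (CFG nodes, CFG edges, number of calls made).
feats_a = {"main": (10, 14, 1), "init": (7, 9, 1), "parse": (4, 5, 0), "emit": (4, 5, 0)}
calls_a = {"main": ["parse"], "init": ["emit"]}
feats_b = {"sub_1000": (10, 14, 1), "sub_2000": (7, 9, 1),
           "sub_3000": (4, 5, 0), "sub_4000": (4, 5, 0)}
calls_b = {"sub_1000": ["sub_3000"], "sub_2000": ["sub_4000"]}

initial = unique_feature_matches(feats_a, feats_b)
full = propagate(initial, calls_a, calls_b, feats_a, feats_b)
```

In this toy data, parse and emit have identical features, so they cannot be matched globally; they only become unambiguous once their callers are matched and the search is restricted to each caller's callees.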

17 of 24

Academic Papers

BinPro: A Tool for Binary Source Code Provenance.

  • By Dhaval Miyani et al.
  • Matching functions in source code against binaries. The tool (never published) used the ROSE compiler to extract features from source code, and IDAPython for binaries.
  • It also explains some Machine Learning techniques used to learn what a good and a bad match is.

No code whatsoever was ever released, but the paper caught my attention and served as the basis for Pigaios.

18 of 24

Academic Papers

BinMatch: A Semantics-based Hybrid Approach on Binary Code Clone Analysis

  • By Yikun Hu et al.
  • The idea is clear from the paper’s title: semantics-based heuristics. A really interesting idea. Extremely complex to implement, in my opinion. Even harder when diffing different architectures interchangeably.
  • No code whatsoever was ever published.

19 of 24

Academic Papers

Unsupervised Features Extraction for Binary Similarity Using Graph Embedding Neural Networks

  • By Roberto Baldoni, Giuseppe Antonio Di Luna, Luca Massarelli et al.
  • The idea is to train a NN to learn which features are the best ones to extract from binary functions, and then use them to find matches.
  • No code was ever released or I failed at finding it.

20 of 24

Academic Papers

SAFE: Self-Attentive Function Embeddings for Binary Similarity

  • By Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni et al.
  • Another paper by the same people as the previous one. In this one they research how to use Self-Attentive Neural Networks without having to do feature extraction (manual or otherwise) to find good matches auto-magically.
  • Code was released and is Open Source!
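
Embedding-based approaches like the ones in these two papers reduce each function to a fixed-length vector, so that matching becomes a nearest-neighbour search. The sketch below shows only that final comparison step, with made-up three-dimensional vectors; producing good embeddings is the hard part and is not shown. It is not code from SAFE, and the names cosine and best_match, the 0.9 threshold and the sample vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match(query, candidates, threshold=0.9):
    """Return (name, score) of the most similar candidate, or None below threshold.

    `candidates` maps function names to embedding vectors and must be non-empty.
    """
    name, vec = max(candidates.items(), key=lambda kv: cosine(query, kv[1]))
    score = cosine(query, vec)
    return (name, score) if score >= threshold else None

# Hypothetical embeddings; real ones would have hundreds of dimensions.
candidates = {
    "memcpy_like": [0.88, 0.12, 0.41],
    "parser": [0.10, 0.90, 0.20],
}
hit = best_match([0.90, 0.10, 0.40], candidates)
miss = best_match([0.0, 0.0, 1.0], {"parser": [0.10, 0.90, 0.20]})
```

The threshold is what keeps the tool from reporting a "best" match for a function that has no real counterpart in the other binary.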

21 of 24

Academic Papers

Debin: Predicting Debug Information in Stripped Binaries

  • By Jingxuan He, Pesho Ivanov, Petar Tsankov et al.
  • How to automatically learn good function name candidates by training with binaries that have symbols.
  • Code was released and is now Open Source!
    • https://github.com/eth-sri/debin
  • During my testing, it generated a gigantic number of false positives.
  • It might possibly work after it’s trained with gigantic datasets.
    • Or it might get overtrained...

22 of 24

Academic Papers

DeClassifier: Class-Inheritance Inference Engine for Optimized C++ Binaries.

  • By Rukayat Ayomide Erinfolami and Aravind Prakash.
  • How to find classes and their inheritance in binaries.
  • It might be a cool idea as the basis for also diffing classes and their inheritance between binaries.
    • Idea for the future that makes more sense after Ghidra.
  • No source code or binaries ever released. After I mailed the authors they told me they plan to release a prototype in the future.

23 of 24

The possible future

The following is what I think will be the future of program diffing:

  • Reverse engineering framework independent program diffing tools.
    • Now that Ghidra is here, Diaphora at least will be ported to it.
  • More use of Machine Learning techniques to find proper matches, determine if a match is good or not, learn specific per-use-case matches, etc…
  • Diffing of classes and their inheritance.
  • More heuristics and tools doing source-code-to-binary matching and diffing.
  • Semantic diffing.

24 of 24

Thank you!

Any questions?