Code Vectors

Understanding Programs Through Embedded Abstracted Symbolic Traces

Understanding Programs

Why?

If we can understand programs, maybe we can find bugs or provide better suggestions to developers.

Understanding Programs

How?

We can leverage the vast amount of open source code and transform that code into data. Then we can apply off-the-shelf learning techniques to that data.

Understanding Programs Pictorially

Lots of code → Off-the-shelf learning

Our Plan

First, we’ll turn programs into data

Then, we’ll learn from that data

Finally, we’ll use what we learned to do interesting things

Our Plan (Revised)

First, we’ll turn programs into data

Then, we’ll learn from that data

  • What if it’s in the wrong form to feed to standard learning techniques?

Finally, we’ll use what we learned to do interesting things

Programs are just text

We could just think of a program as a sequence of bytes. Or a stream of characters. Or a sequence of tokens.

We already have well-established learning techniques that work on sequences.

Programs are just text

a == b

!(a != b)
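These two expressions mean the same thing even though, as text, they share almost nothing. As an illustrative aside (not part of the original talk), a solver such as z3 can confirm the equivalence that a purely textual learner cannot see:

    # Check that a == b and !(a != b) are logically equivalent.
    from z3 import Ints, Not, prove

    a, b = Ints("a b")
    prove((a == b) == Not(a != b))  # prints: proved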

If we really want to learn from programs, we need a deeper representation.

The (1st) Problem

The kinds of data we can extract from programs (trees, graphs, logical formulas) do not mesh with the kinds of data we can easily learn from.

Enter Embeddings

Programs → Artifact Extractor → Corpus → Embedding Learner → Vectors
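As a concrete sketch of what the Embedding Learner box could be, here is gensim's word2vec applied to a corpus of traces, one per line. The file names, hyperparameters, and word spellings are illustrative assumptions, not the paper's exact settings:

    # Learn word vectors from a corpus of abstracted symbolic traces.
    # Each line of the corpus is one trace; each token is one abstracted word.
    from gensim.models import Word2Vec

    with open("traces.txt") as f:
        sentences = [line.split() for line in f]

    model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
    model.save("trace_vectors.model")

    # Every abstracted word now has a dense vector (word spelling illustrative):
    vec = model.wv["Called(lock)"]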

The (2nd) Problem

What do we embed?

We could try:

Characters / Tokens / ASTs / CFGs / Def-Use Chains / Dynamic Traces / ...

What do we want?

To answer this question we need to add some constraints:

  • We want a program representation that is semantically rich. (Characters and tokens are out.)
  • We do not want to have to run the program. (No dynamic traces.)
  • We want something that looks like sentences already.

Enter Symbolic Traces

Benefits: symbolic traces are semantically rich, come close to running the program (without actually running it), and generate a lot of data.

The (3rd) Problem

There are too many words in a symbolic trace.

Local variables, user-defined functions, literals, and path conditions combine to create an extraordinarily large vocabulary.

The (4th) Problem

lock(&obj);

… 1000 lines of code …

unlock(&obj);

Related operations, like this lock/unlock pair, can be separated by huge stretches of code, far beyond what a small context window can span.

The Distributional Hypothesis

Similar words appear in similar contexts.

Is this true for artifacts extracted from programs?

Enter Abstraction

To solve both problems 3 and 4, we’ll need to introduce some form of abstraction.

Abstracted Symbolic Traces

Abstracted Symbolic Traces : Example

Start

Called(lock)

Called(alloc)

RetEq(alloc, 0)

ParamShare(unlock, lock)

Called(unlock)

Error

RetError(ENOMEM)

End
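To make the abstraction step concrete, here is a rough sketch in Python. The Event record and helper below are hypothetical stand-ins for whatever the real symbolic executor produces; only the output vocabulary mirrors the example above:

    # Hypothetical sketch: turn symbolic-execution events into abstracted words.
    from dataclasses import dataclass

    @dataclass
    class Event:
        kind: str    # "call", "ret_eq", "param_share", or "ret_error" (assumed)
        args: tuple

    def abstract(event):
        if event.kind == "call":
            return f"Called({event.args[0]})"
        if event.kind == "ret_eq":
            return f"RetEq({event.args[0]}, {event.args[1]})"
        if event.kind == "param_share":
            return f"ParamShare({event.args[0]}, {event.args[1]})"
        if event.kind == "ret_error":
            return f"RetError({event.args[0]})"

    trace = [
        Event("call", ("lock",)),
        Event("call", ("alloc",)),
        Event("ret_eq", ("alloc", 0)),
        Event("param_share", ("unlock", "lock")),
        Event("call", ("unlock",)),
        Event("ret_error", ("ENOMEM",)),
    ]

    words = ["Start"] + [abstract(e) for e in trace] + ["End"]
    print(" ".join(words))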

In Summary

Programs → Symbolic Execution → Abstractions → Corpus → Embedding Learner → Vectors

Embedded Abstracted Symbolic Traces

Do the embeddings work?

We ran several experiments (against the Linux kernel) to understand if the embeddings worked:

  • Analogies
  • Simple Similarity
  • Queries via Word Vector Averaging
  • Ablation Study
  • Syntactic vs Semantic
  • Error Code Misuse

Check out the paper for all the results! Here, we’ll explore Analogies and Error Code Misuse.

Analogies

(mutex_lock_nested, mutex_unlock)
(lock_page, unlock_page)
(fh_lock_nested, fh_unlock)
(xfs_ilock, xfs_iunlock)
(btrfs_tree_lock, btrfs_tree_unlock)
(double_lock_hb, double_unlock_hb)
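Analogies like these can be solved with the usual vector arithmetic: the answer to "a is to b as c is to ?" is the word whose vector is closest to b - a + c. A sketch using the gensim model from earlier (word spellings are illustrative assumptions):

    # Solve "mutex_lock_nested : mutex_unlock :: lock_page : ?"
    # via most_similar(positive=[b, c], negative=[a]).
    from gensim.models import Word2Vec

    model = Word2Vec.load("trace_vectors.model")
    print(model.wv.most_similar(
        positive=["Called(mutex_unlock)", "Called(lock_page)"],
        negative=["Called(mutex_lock_nested)"],
        topn=1,
    ))  # hopefully something like: [('Called(unlock_page)', 0.9...)]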

Analogies

20 different classes of analogies.

Just over 19,000 analogies in the benchmark.

93.9%

17,890/19,042 analogies are solved @Top-1 using the learned embedding

Error Code Misuse

Imagine a trace where something goes wrong:

… kzalloc kzalloc_$EQ_0 $ERR $RET_ENOMEM $END

Can we predict how these failing traces end? (Can we tell if they return the wrong error code?)

Error Code Misuse - Setup

We trained an LSTM model on 20,000 failing traces to predict their error codes.

Then, we manually found 15 functions that returned bad (incorrect) error codes and were later patched. This gave us 68 traces to test with.

… kzalloc kzalloc_$EQ_0 $ERR $RET_EIO $RET_ENOMEM $END
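A minimal sketch of what such a predictor might look like, in PyTorch. The architecture and hyperparameters are illustrative assumptions, not the paper's exact setup:

    # Read the words of a failing trace, then classify which $RET_* error
    # code should end it. Sizes and structure are illustrative.
    import torch.nn as nn

    class ErrorCodePredictor(nn.Module):
        def __init__(self, vocab_size, num_error_codes, dim=300, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.lstm = nn.LSTM(dim, hidden, batch_first=True)
            self.classify = nn.Linear(hidden, num_error_codes)

        def forward(self, trace):        # trace: (batch, trace_len) word ids
            x = self.embed(trace)
            _, (h, _) = self.lstm(x)     # final hidden state summarizes the trace
            return self.classify(h[-1])  # logits over candidate error codes

    # Usage sketch: logits = model(batch); top3 = logits.topk(3).indices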

76.5%

Accuracy @Top-3 for predicting the correct error code

In 57/68 (84%) of tests the Top-3 results do not include the bad error code

Questions?

What about prefix dominance?

When do the analogies not work?

Which abstractions are the most important?

What about more recent word-vector learners?

Is this technique applicable to other languages?

How do you scale to the Linux kernel?

Thanks!