Code Vectors
Understanding Programs Through Embedded Abstracted Symbolic Traces
Understanding Programs
Why? If we can understand programs, maybe we can find bugs or give developers better suggestions.
Understanding Programs
How? We can leverage the vast amount of open-source code and transform that code into data. Then we can apply off-the-shelf learning techniques to that data.
Understanding Programs Pictorially
Lots of code
Off-the-shelf learning
Our Plan
First, we’ll turn programs into data
Then, we’ll learn from that data
Finally, we’ll use what we learned to do interesting things
Programs are just text
We could just think of a program as a sequence of bytes, a stream of characters, or a sequence of tokens.
We already have well-established learning techniques that work on sequences.
Programs are just text
a == b
!(a != b)
If we really want to learn from programs we need a deeper representation.
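The point above can be sketched concretely: the two expressions from the previous slide have entirely different token sequences, yet compute the same result on every input, so a purely textual view misses the equivalence. A minimal check, with the expressions rendered as Python callables:

```python
# The two expressions from the slide, as Python callables
e1 = lambda a, b: a == b
e2 = lambda a, b: not (a != b)

# Their token sequences are completely different...
tokens_e1 = ["a", "==", "b"]
tokens_e2 = ["!", "(", "a", "!=", "b", ")"]
assert tokens_e1 != tokens_e2

# ...yet they agree on every boolean input: text alone misses the equivalence.
assert all(e1(a, b) == e2(a, b) for a in (0, 1) for b in (0, 1))
```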
The (1st) Problem
The kinds of data we can extract from programs (trees, graphs, logical formulas) do not mesh with the kinds of data we can easily learn from.
Enter Embeddings
Programs
Artifact Extractor
Corpus
Embedding Learner
Vectors
The (2nd) Problem
What do we embed?
We could try:
Characters / Tokens / ASTs / CFGs / Def-Use Chains / Dynamic Traces / ...
What do we want?
To answer this question, we need to add some constraints.
Enter Symbolic Traces
Benefits: symbolic traces are semantically rich (almost like running the program, without actually having to run it), and they generate a lot of data.
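The "almost like running the program" intuition can be sketched: symbolic exploration forks at each branch and yields one trace per path, with no concrete execution. This toy enumerator is purely illustrative, and the event names (`RetNeq` in particular) are hypothetical stand-ins for the slide's vocabulary:

```python
def symbolic_paths():
    """Enumerate traces for a toy function:
        lock(); p = alloc(); if (!p) return -ENOMEM; ...; unlock();
    Each branch outcome is explored without running the program."""
    for alloc_is_null in (True, False):          # fork on the branch condition
        trace = ["Called(lock)", "Called(alloc)"]
        if alloc_is_null:
            trace += ["RetEq(alloc, 0)", "RetError(ENOMEM)"]
        else:
            trace += ["RetNeq(alloc, 0)", "Called(unlock)"]
        yield trace

traces = list(symbolic_paths())  # two traces, one per path
```

A single function with a handful of branches yields many traces, which is exactly why this representation "generates a lot of data".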
The (3rd) Problem
A symbolic trace contains too many distinct words.
Local variables, user-defined functions, literals, and path conditions combine to create an extraordinarily large vocabulary.
The (4th) Problem
lock(&obj);
… 1000 lines of code …
unlock(&obj);
Related events (like lock and unlock) can sit arbitrarily far apart in the program text.
The Distributional Hypothesis
Similar words appear in similar contexts.
Is this true for artifacts extracted from programs?
Enter Abstraction
To solve both problem 3 (an enormous vocabulary) and problem 4 (long-range relationships), we'll need to introduce some form of abstraction.
Abstracted Symbolic Traces
Abstracted Symbolic Traces: Example
Start
Called(lock)
Called(alloc)
RetEq(alloc, 0)
ParamShare(unlock, lock)
Called(unlock)
Error
RetError(ENOMEM)
End
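The abstraction step itself is just a mapping from concrete symbolic-execution events to a small vocabulary of abstract words. A minimal sketch, where the concrete event tuples are hypothetical but the abstract words follow the example above:

```python
def abstract_event(event):
    """Map one concrete symbolic-execution event to an abstract word.
    The concrete event tuples are hypothetical; the abstract words
    follow the example's vocabulary (Called, RetEq, ParamShare)."""
    kind, *args = event
    if kind == "call":
        return f"Called({args[0]})"
    if kind == "ret_eq":
        return f"RetEq({args[0]}, {args[1]})"
    if kind == "param_share":
        return f"ParamShare({args[0]}, {args[1]})"
    raise ValueError(f"unknown event kind: {kind}")

concrete = [
    ("call", "lock"),
    ("call", "alloc"),
    ("ret_eq", "alloc", 0),
    ("param_share", "unlock", "lock"),
    ("call", "unlock"),
]
words = ["Start"] + [abstract_event(e) for e in concrete] + ["End"]
# words == ["Start", "Called(lock)", "Called(alloc)", "RetEq(alloc, 0)",
#           "ParamShare(unlock, lock)", "Called(unlock)", "End"]
```

Because variable names and literals are dropped in favor of a fixed set of event shapes, the vocabulary stays small (problem 3), and the abstract words pack related events much closer together (problem 4).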
In Summary
Programs
Symbolic Execution
Abstractions
Corpus
Embedding Learner
Vectors
Embedded Abstracted Symbolic Traces
Do the embeddings work?
We ran several experiments against the Linux kernel to test whether the embeddings work. Check out the paper for all the results! Here, we'll explore the following:
Analogies
(mutex_lock_nested, mutex_unlock)
(lock_page, unlock_page)
(fh_lock_nested, fh_unlock)
(xfs_ilock, xfs_iunlock)
(btrfs_tree_lock, btrfs_tree_unlock)
(double_lock_hb, double_unlock_hb)
Analogies
20 different classes of analogies.
Just over 19,000 analogies in the benchmark.
93.9%
17,890/19,042 analogies are solved @Top-1 using the learned embedding
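Analogy solving reduces to simple vector arithmetic over the learned embedding: "a is to b as c is to ?" becomes a nearest-neighbor lookup around b − a + c. A sketch with made-up toy vectors (the real ones come from the learned embedding):

```python
import numpy as np

# Toy 2-D embedding table; real vectors come from the learned embedding.
vecs = {
    "mutex_lock_nested": np.array([1.0, 1.0]),
    "mutex_unlock":      np.array([1.0, -1.0]),
    "lock_page":         np.array([2.0, 1.0]),
    "unlock_page":       np.array([2.0, -1.0]),
    "kmalloc":           np.array([0.0, 3.0]),
}

def solve_analogy(a, b, c):
    """a : b :: c : ?  via  b - a + c, nearest remaining word by cosine similarity."""
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in (a, b, c)),
               key=lambda w: cos(vecs[w], target))

solve_analogy("mutex_lock_nested", "mutex_unlock", "lock_page")  # → "unlock_page"
```

"Solved @Top-1" means the correct answer is exactly the nearest neighbor returned by this kind of lookup.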
Error Code Misuse
Imagine a trace where something goes wrong:
… kzalloc kzalloc_$EQ_0 $ERR $RET_ENOMEM $END
Can we predict how these failing traces end? (Can we tell if they return the wrong error code?)
Error Code Misuse - Setup
We trained an LSTM model to predict the error codes of 20,000 failing traces.
Then, we manually identified 15 functions that returned bad (incorrect) error codes and were later patched. This gave us 68 traces to test with.
… kzalloc kzalloc_$EQ_0 $ERR $RET_EIO $RET_ENOMEM $END
76.5%
Accuracy @Top-3 for predicting the correct error code.
In 57/68 (84%) of tests, the Top-3 results do not include the bad error code.
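The Top-3 metric used here is straightforward to compute: a test counts as a hit if the true error code appears anywhere in the model's three highest-ranked predictions. A small sketch with hypothetical ranked outputs:

```python
def top_k_accuracy(ranked_predictions, truths, k=3):
    """Fraction of tests whose true error code appears in the model's top-k list."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)

# Hypothetical ranked outputs for three failing traces
preds  = [["ENOMEM", "EIO", "EINVAL"],
          ["EIO", "ENOMEM", "EFAULT"],
          ["EINVAL", "EFAULT", "EIO"]]
truths = ["ENOMEM", "ENOMEM", "ENOMEM"]
top_k_accuracy(preds, truths, k=3)  # → 2/3: the third trace misses ENOMEM entirely
```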
Questions?
What about prefix dominance?
When do the analogies not work?
Which abstractions are the most important?
What about more recent word-vector learners?
Is this technique applicable to other languages?
How do you scale to the Linux kernel?
Thanks!