Checking the boundaries of static analysis
Halvar Flake - Syscan 2013
Latest slides: http://goo.gl/pcDkV
Introduction
Survey of some research on static analysis
This talk: Focus on two big different flavors:
Abstract Interpretation
SMT-centric analysis
Convergence of the approaches
Abstract Interpretation
SMT-based analysis techniques
Topics (Part 1)
Introduction to SMT-based analyses and abstract interpretation
Common weaknesses of AI implementations
Ideal world: How they could be fixed
Possible improvements through SMT solvers
Topics (Part 2)
What bugs are hard to find ?
What bugs are easy to find ?
Why are browsers the "perfect storm" for static analyzers ?
Warning (Part 1)
Topic requires a lot of machinery
Machinery ~ rope when rock-climbing
Needed to climb safely
Talk will focus on description of vistas, not teach detailed knotting techniques
Warning (Part 2)
Survey talk
"Other people's work"
Incomplete bibliography at the end
SMT-based analyses
Idea: Generate set of equations from code
Find solutions to equations
SMT Solvers - Lazy
off-the-shelf SAT to find assignment for boolean "skeleton"
subsolvers deal with individual clauses
combination of solvers for various theories
"bitvectors", "arrays", etc.
SMT Solvers
Lots of progress
SAT revolution + IT Security + MSR
z3
Whitebox fuzzing / input crafting at MS: Solvers very mature, solve huge formulas
SMT Solvers
Formulas based on concrete paths through the program
Size of formulas dependent on number of operations in trace
Heavily influenced by quantity of memory writes
SMT Solvers in practice
Great black-box oracle that will say "one solution is X" or "no solution" to given equation
Ability to solve complex equations quite impressive
Translation from most IRs to SMT input format near-trivial
Abstract Interpretation
"In line X of the program, the property Y will always be true."
Derives such statements in a structured way
How does it work ?
Concrete Execution
Input state:
a = 4, b = 10
a = a + b;
Output state:
a = 14, b = 10
Concrete Execution
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Set of input states
Set of output states
a = a + b;
Concrete vs. Abstract Interpretation
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Set of input states
Set of output states
a = a + b;
Abstract Domain Element
Abstraction
Concrete vs. Abstract Interpretation
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Set of input states
Set of output states
a = a + b;
Abstract Domain Element
Abstraction
Abstraction
Abstract Domain Element
Concrete vs. Abstract Interpretation
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Set of input states
Set of output states
a = a + b;
Abstract Domain Element
Abstraction
Abstraction
Abstraction
abstract
transform
Abstract Domain Element
Concrete vs. Abstract Interpretation
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Input state:
a = 4, b = 10
Output state:
a = 14, b = 10
Set of input states
Set of output states
a = a + b;
Abstract Domain Element
Abstraction
Concretization
Abstraction
abstract
transform
Abstract Domain Element
Abstract vs. concrete domains
Compute on "simpler" structures.
Need "join" ("set union") and "intersect".
Translate effects of code to abstract domain. Call them "transforms".
Example: Intervals
Incoming set of states contains 990 elements
(a = 20, b=5),
(a = 22, b=6),
...,
(a = 2000, b = 5)
Approximate with interval:
(a is in [20...2000], b is in [5...6])
�Intervals can be joined and intersected.
Compute on abstract domain
a in [20..2000], b in [5..6]
a = a + b;
a in [25..2006], b in [5..6]
"Abstract Interpretation"
Interpreting (~executing) the program on an abstract domain instead of on a concrete set of states.
Central Questions
Properties of good abstract domains ?
What abstract domains can we think of ?
Intervals, polyhedral domains etc.
How to lift concrete operations to abstract domains ?
Common pitfalls & problems
Myopic static analysis
Clumsy construction of products of domains
Codependency of intermediate representation and abstract domains
Summarization of heap cells and implications
Improve with solvers ?
Myopic static analysis
Clumsy construction of products of domains
Codependency of intermediate representation and abstract domains
Summarization of heap cells and implications
Myopic static analysis
Sequence of instructions
Each instruction is lifted to abstract domain
Each instruction overapproximates concrete operation
Cascade of imprecision leads to failure
Myopic static analysis
A fancy way of writing NOP
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x = [0,0x1BEEF]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x = [0,0x1BEEF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x = [0,0x1BEEF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x2 = [0,0xFF00]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x = [0,0x1BEEF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x2 = [0,0xFF00]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x = [0,0x1BEEF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x2 = [0,0xFF00]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x4=[0,0]
x = [0,0x1BEEF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x2 = [0,0xFF00]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x4=[0,0x10000]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x4=[0,0],res=[0x,0xFF]
x = [0,0x1BEEF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x2 = [0,0xFF00]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x4=[0,0x10000]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x4=[0,0x10000],res=[0x,0xFF]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]x4=[0,0],res=[0x,0xFFFF]
x = [0,0x1BEEF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x = [0,0x1BEEF]
x1 = [0,0xFF]
x2 = [0,0xFF00]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x4=[0,0x10000]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]
x4=[0,0x10000],res=[0x,0xFF]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]x4=[0,0],res=[0x,0xFFFF]
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]x4=[0,0],res=[0x,0x1FFFF]
Myopic static analysis: Summary
Clumsy product domains
Multiple abstract domains can be combined
Often necessary: More than one property needs to be tracked
Motivating example: Strided intervals for binary analysis
Motivating example: Strided Intervals
Byte array with intervals in 4-byte memory cells
Array read via interval index
x = array[ y ] where y = [0,12]
Without alignment: x = [0, 0x30000000]
[0, 0x10]
[0, 0x20]
[0, 0x30]
[0, 50]
[0, 100]
Strided Intervals
Byte array with intervals in 4-byte memory cells
[0, 0x10]
[0, 0x20]
[0, 0x30]
[0, 50]
[0, 100]
10 00 00 00
20 00 00 00
30 00 00 00
00 00 00 10
20 00 00 00
00 20 00 00
00 00 20 00
00 00 00 20
30 00 00 00
00 30 00 00
00 00 30 00
00 00 00 30
[0, 0x30000000]
Strided Intervals
Alignment information helps
x = array[ y ] where (y = [0,12] and 0 mod 4)
[0, 0x10]
[0, 0x20]
[0, 0x30]
[0, 50]
[0, 100]
00 00 00 10
00 00 00 20
00 00 00 30
[0, 0x30]
Clumsy product domains
Cartesian product is easy to build
We know how to compute on Intervals
We know how to compute on Alignment
Imprecision creeps in
Clumsy product domains
x1 = read_16();
x1 &= 0xFFFFFFF0
if x1 < 110 go right
x1 += 10
x1 = [0,0xFFFF], 0 mod 1
Clumsy product domains
x1 = read_16();
x1 &= 0xFFFFFFF0
if x1 < 110 go right
x1 += 10
x1 = [0,0xBEEF], 0 mod 1
x1 = [0,0xFFFF], 0 mod 16
Clumsy product domains
x1 = read_16();
x1 &= 0xFFFFFFF0
if x1 < 110 go right
x1 += 10
x1 = [0,0xBEEF], 0 mod 1
x1 = [0,0xFFFF], 0 mod 16
x1 = [0,110], 0 mod 16
Clumsy product domains
[0, 110], 0 mod 16 makes no sense
It is a valid element of
It is not a very sane abstraction from the set of possible states
Better: [0, 96], 0 mod 16
Clumsy product domains
Component-wise transforms make things worse:
[0, 110] x (0 mod 16) + [10, 10] x (0 mod 1)
= [0, 120] x (0 mod 1)
We want to compute on the reduced product
Rewrite all transforms :-(
Clumsy product domains: Summary
We want to compute on the reduced product
We can only easily compute on
Leads to cascading imprecision
Codependent IRs and domains
Different parties implement different IRs
REIL, RREIL, (TSL) etc.
Intention: Make it "easy" to implement static analysers
Codependend IRs and domains
ILs and the domains end up "married"
REIL works for bitvector stuff
RREIL works for relational operations (intervals etc.)
TSL similar (but much more generic) to REIL - bitvector oriented
Codependend IRs and domains
Hard to derive the relational operator ">"
Proposal by [RREIL]: Translate to language with relational operators
Useful for certain relational domains
REIL for "ja":
t0 = (ZF == 0)
t1 = (CF == 0)
t2 = (t0 & t1)
if t2:
goto ...
Codependend IRs and domains
Design for REIL was influenced by a bitvector-oriented domain
Design for RREIL was influenced by relational domains
Frequent symptom: Have domain, IL doesn't work well for domain -- need to adopt IL
Defeats the purpose of an IL ?
Recap
Myopic static analysis
Clumsyness of constructing products of domains
Codependency of intermediate representation and abstract domains
Summarization of heap cells and implications
How to improve ? Ideal world
Assume you have an algorithm which ...
... given semantic spec of instructions
... given a description of the abstract domain
computes good / tight transformers for you
How to improve ? Ideal world
Codependency: Solved
Clumsy direct product: Also solved. Algorithm would rewrite transformers.
Myopic analysis: Mostly solved. Treat entire basic block as one instruction. Compute transformers.
Reality
[GulTi06] and [CoCoMa11] deal with reduced product construction (via Nelson-Oppen-like methods)
[BraKi10], [KiSo10] and [Reg04] design algorithms for special abstract domains to compute transforms
[ThaRe12CAV] and [ThaRe12SAS] derive transforms automatically for some generic domains
Practical Home-Brew tips: Myopia
SMT solvers can be great for "post-basic-block-narrowing"
Domain-specific - but use SMT as oracle to "narrow down" results
Example: Bsearch after myopic analysis
x1 = x & 0xFF
x2 = x & 0xFF00
x3 = x & 0xFF0000
x4 = x & 0xFF000000
res = x1
res = res | x2
res = res | x3
res = res | x4
x=[0,0x1BEEF],x1=[0,0xFF], x2=[0,0xFF00],x3=[0,0x10000]x4=[0,0],res=[0x,0x1FFFF]
(assert (bvugt res #x0001FFFF))
(assert (bvugt res #x0000FFFF))
(assert (bvugt res #x00017FFF))
(..)
(assert (bvugt res #x0001BEEF))
Practical Home-Brew tips: Product Domains
Hope your solver can operate on both domains
Use similar narrowing techniques
Have the solver deal with communication between domains
Domain-specific, but black-box oracle really helps
Summarization of heap cells
Heap-allocated objects need to be tracked
In binary analyses, usually summarized by allocation address
In source analyses, often summarized by type
Summarization of heap cells
Common design: "most recently allocated"
One heap cell to represent the most recently allocated cell of type X
One heap cell to represent all other instances
Summarization of heap cells
Forward
Back
0x200
Forward
Back
0x100
Forward
Back
0x400
Forward
Back
0x800
Forward
Back
0x400
Forward
Back
0x200
Forward
Back
[0,800]
Most recently allocated heap cell
Summary Node
Summarization of heap cells
Forward
Back
0x200
Forward
Back
0x100
Forward
Back
0x400
Forward
Back
0x800
Forward
Back
0x400
Forward
Back
0x200
Forward
Back
[0,800]
Most recently allocated heap cell
Summary Node
Summarization of heap cells
Problematic for UAF analysis
Example:
Forward
Back
object *
Forward
Back
object *
Forward
Back
object *
Forward
Back
object *
Forward
Back
object *
bIsValid
bIsValid
bIsValid
bIsValid
bIsValid
Summarization of heap cells
Problematic for UAF analysis
Example:
valid*
valid *
invalid *
valid *
valid *
false
true
true
true
true
Summarization of heap cells
Problematic for UAF analysis
Example:
valid or invalid
Summary Node
true or false
Summarization of heap cells
Fieldwise "join" vs. whole-structure join
Data structures ~ logically grouped data fields
Most analyses treat them as "random bag of data fields"
Summary nodes often have illogical state - leads to more illogical state and false positives
AI becomes more solver-driven
3 of 4 discussed problems can be improved by solver-like techniques
Solvers are "cheap and really fast undergrad calculators"
AI community is adopting solvers (both tools and language) - and benefitting
Solvers become AI driven ?
SMT solving algorithms can be rewritten in terms of abstract interpretation
Unclear if this will lead to benefits for the solver community
DPLL vs. Staalmarck ?
Better solvers through abstraction ?
Abstract Interpretation
SMT-based analysis techniques
Solver-augmented AI
AI-augmented solvers ?
What bugs are hard to find ?
Analysis of memory-copying loops with multiple internal states
crackaddr_vuln.c vs. crackaddr_notvuln.c
Analysis of things that get too imprecise by heap summaries (UAF a good example)
Analysis of situations with unclear control flow or missing data
What bugs are easy to find ?
Unencumbered by heap summarization
Unencumbered by loopy state machines
Things where the entire code flow is easily determined
What bugs are easy to find ?
Straight-from-packet or straight-from-userland integer overflows
Straight-from-packet or straight-from-userland bad API calls
Straight-from-userland double-fetches
Why are browsers a perfect storm for static analyzers ?
Extremely large and C++
Multithreaded
Mix pointers & values in bizarre ways (lowest 3 bits indicate JS type etc.)
Why are browsers a perfect storm for static analyzers ?
Control-flow attacker-driven through Javascript
JITed code
State machines in state machines
That's it :-)
Questions ?