1 of 17

2024710737

Suyeon Lee

Racing on the Negative Force: Efficient Vulnerability

Root-Cause Analysis through Reinforcement

Learning on Counterexamples

USENIX ’24

Dandan Xu, SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences,

China, and School of Cyber Security, University of Chinese Academy of Sciences, China;

Di Tang, Yi Chen, and XiaoFeng Wang, Indiana University Bloomington; Kai Chen,

SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences, China,

and School of Cyber Security, University of Chinese Academy of Sciences, China;

Haixu Tang, Indiana University Bloomington; Longxing Li, SKLOIS, Institute of

Information Engineering, Chinese Academy of Sciences, China, and School of

Cyber Security, University of Chinese Academy of Sciences, China

2 of 17

Background

Statistical RCA framework (sota: Spectrum-based Fault Localization (SFL))�- It estimates the statistical correlation between program elements and crashes in crash inputs and non-crash inputs to find the root cause. �- Elements with a high correlation are more likely to be the Root Cause.�
Predicate�: A program element that acts as a necessary condition for a crash to occur�- All predicates are expressed in the form Var op Thr. (Var: variable, op: operator, Thr: threshold)�ex) eax <= 7 is a predicate that predicts a crash will occur when the value of the eax register is less than or equal to 7.

3 of 17

Background

AFL(American Fuzzy Lop)�: A fuzzer that focuses on code coverage, generating various inputs to find program errors.�- Statistical RCA uses AFL to generate error and normal inputs for analysis.�- AFL is not ideal for root cause analysis due to its focus on code coverage and bias towards error inputs.

Reinforcement Learning�- Adversarial Multi Armed Bandits (AMAB) � : A game where a player maximizes rewards by making the best choices.�- Exploitation & Exploration: Exploitation uses existing knowledge, while Exploration tries new things.

4 of 17

Introduction

Motivation�- Fuzzing is good at finding vulnerabilities, but Root Cause Analysis (RCA) is still difficult.�- Existing methods are too slow and impractical for real-world use.�- This is because they focus too much on crash inputs, making it hard to pinpoint the exact problem in the program.

Contribution�- Showed that using counterexamples can speed up RCA�- Developed RACING, a new RL-based RCA method. RACING learns from fuzzing results to find �the root cause more efficiently.

5 of 17

Counterexample of RCA

Statistical RCA�- It ranks predicates based on their correlation with crashes to find the root cause.�- Existing research focuses on crash inputs, making it hard to distinguish between predicates and requiring many samples.�- This paper proposes using counterexamples to improve sampling strategy and find the root cause faster with fewer samples.

Counterexample?�: An input sample that shows a result different from the expected one�1) CoR(Counterexamples for Rankings) �2) CoP(Counterexamples for Predicates)

6 of 17

Counterexample of RCA

CoR(Counterexample for Ranking)�: Counterexamples that significantly change the ranking of program elements

RCA is modeled as a ranking problem of predicates based on their statistical correlation with crashes. The goal is to find the actual ranking.

Existing Sampling Strategies�- Random Sampling requires many samples and is inefficient�- PoC-based Sampling generates samples by mutating known crash inputs. Efficient, but biased towards crashes, making it difficult to distinguish between predicates

CoR-based Sampling : Samples around CoRs to better distinguish the contribution of each predicate and speed up the convergence of rankings

7 of 17

Counterexample of RCA

2) CoP(Counterexample for Predicate)�: Counterexamples that deviate from the predicted crash correlation of a specific program element

Amplifies the error rate in crash prediction for each predicate, increasing the statistical difference between predicates
Effect: Larger differences between predicates allow for accurate ranking estimation with fewer samples, leading to faster analysis

8 of 17

RACING (Root-cAuse-analysis on Counterexample based reinforcement learnING)

RACING's Goal: Guide the fuzzing process with reinforcement learning to generate effective counterexamples, thereby accelerating root cause analysis.

RL on Counterexamples�- RACING learns from previous fuzzing results to determine the next fuzzing strategy�- Reward Mechanism: Calculates rewards based on the difference between predicted and actual results after each fuzzing, promoting the generation of counterexamples�- Performs two actions - 1) seed selection and 2) mutation method selection. The next action is chosen based on past rewards, with higher rewards leading to higher selection probability.

9 of 17

RACING

Optimizing CoP Generation�- The closer the seed (original input) and the mutated input are, the higher the probability of generating a CoP.�- A crash input can produce either a crash or a non-crash, but a non-crash input almost always produces a non-crash.

Implementation�- Replace AFL's seed selection and mutation strategies with RACING's strategies.�- Code Instrumentation: Collect information about each instruction during program execution for analysis. Insert logging handlers after instructions in the PoC execution path using LLVM Pass.

10 of 17

RACING

Algorithm

11 of 17

Evaluation - Setting

Environment�- Being conducted on an Ubuntu 20.04 server with 32-core CPU, 128GB memory, and 22TB storage

Vulnerabilities�- 30 vulnerabilities in total (19 from Aurora dataset + 11 from real-world)

Vulnerability Diversity�- Evaluated RACING's performance on various programs (21), types (9), and complexities (453 to 49,689 instructions)

Ground Truth�- The root cause of each vulnerability was manually determined through patch code analysis and debugging

12 of 17

Evaluation - Setting

13 of 17

Evaluation - Results

RQ1: How does the effectiveness of RACING compare to current works for locating root causes?�- All root causes were identified within the top 50. (73.33% within the top 10)�- Compared to Aurora, RACING's ranking was on average 3.8 higher.

RQ2: How much does RACING improve efficiency?�- Analysis time: RACING is on average 13.22 times faster than Aurora.�- 26 out of 30 vulnerabilities were analyzed within 1 hour, 4 within 1 minute.�- Aurora wastes a lot of time generating test cases, while RACING quickly converges to the result.

14 of 17

Evaluation - Results

15 of 17

Ablation Study

Reward Mechanism�- Using only CoR is faster than Aurora, but using CoP together further reduces analysis time
CoP Generation Optimization�- Optimization techniques increase the probability of CoP generation and reduce analysis time
Top K: Increasing the value of k increases analysis time, but the ranking of the root cause hardly changes. Setting k=100 is sufficient

16 of 17

Conclusion

Limitations�- Root cause detection can be difficult when counterexample generation is challenging � (e.g., incorrect predicate ranking, accidental normal execution)�- Currently requires source code. This limitation can be overcome with future advancements in binary code analysis tools�- Lacks support for compound predicates. 🡪 RACING+

Future Work�- Extend RACING to support compound predicates (RACING+)

1 of 17

2 of 17

3 of 17

4 of 17

5 of 17

6 of 17

7 of 17

8 of 17

9 of 17

10 of 17

11 of 17

12 of 17

13 of 17

14 of 17

15 of 17

16 of 17

17 of 17