1 of 36

Automated Test Input Generation for Android: Are We There Yet?

Shauvik Roy Choudhary (Georgia Tech)

Alessandra Gorla (IMDEA Software Institute, Spain)

Alessandro Orso (Georgia Tech)

Partly supported by NSF, MSR, IBM Research, Google

2 of 36

Apps on the Play Store

83% of smartphones worldwide are Android-based

3 of 36

4 of 36

Automated Test Input Generation Techniques

  • Monkey (2008)
  • Null IntentFuzzer (2009)
  • ACTEve (FSE’12)
  • GUIRipper (ASE’12)
  • JPF-Android (SENotes’12)
  • Dynodroid (FSE’13)
  • A3E (OOPSLA’13)
  • SwiftHand (OOPSLA’13)
  • DroidFuzzer (MoMM’13)
  • Orbit (FASE’13)
  • PUMA (MobiSys’14)
  • EvoDroid (FSE’14)
  • IntentFuzzer (WODA’14)

5 of 36

Tool Strategies

  1. Instrumentation strategy -- App/Platform
  2. Events generation strategy -- UI/System
  3. Testing strategy -- Black-box/White-box
  4. Exploration strategy -- Random/Model-based/Systematic
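
The four dimensions above can be captured as a small data structure; a minimal sketch (not from the talk), with example rows for Monkey and Dynodroid that follow the classification used later in this deck and in the tools' papers:

    from dataclasses import dataclass

    @dataclass
    class ToolProfile:
        name: str
        instrumentation: str   # what it needs to instrument: "none", "app", "platform", or "app+platform"
        events: str            # events it generates: "UI", "system", or "UI+system"
        testing: str           # "black-box" or "white-box"
        exploration: str       # "random", "model-based", or "systematic"

    TOOLS = [
        ToolProfile("Monkey",    "none",     "UI",        "black-box", "random"),
        ToolProfile("Dynodroid", "platform", "UI+system", "black-box", "random"),
    ]

    for t in TOOLS:
        print(f"{t.name}: {t.exploration} exploration, {t.testing}, events={t.events}")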

6 of 36

1. Random Exploration Strategy

Randomly selects an event for exploration

Tools: Monkey, Dynodroid

Advantages

  • Efficient/fast generation of events
  • Suitable for stress testing/sanity checks

Drawbacks

  • Hard to generate specific inputs
  • App behavior/coverage agnostic
    • typically generates redundant events
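
As a concrete illustration, a random run can be launched through Monkey over adb; a minimal sketch (the package name and option values are illustrative, the flags are standard Monkey options):

    import subprocess

    # Inject 5,000 random UI events into a hypothetical app under test.
    subprocess.run([
        "adb", "shell", "monkey",
        "-p", "com.example.app",   # package under test (illustrative)
        "-s", "42",                # fixed seed, so the random run can be repeated
        "--throttle", "200",       # wait 200 ms between injected events
        "-v",                      # verbose logging
        "5000",                    # number of random events to inject
    ])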

7 of 36

2. Model-based Exploration Strategy

Uses a GUI model of the app to explore it systematically

Typically FSMs (states = activities, edges = events)

Tools: A3E-DF, SwiftHand, GUIRipper, PUMA

Advantages

  • More interpretable and intuitive
  • Fewer redundant events

Drawbacks

  • Events that alter non-GUI state not considered
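
A minimal, self-contained sketch of the idea (the toy app model and its events are invented for illustration; a real tool would drive the actual GUI instead of a dictionary):

    # Model-based (depth-first) exploration over an FSM whose states are
    # activities/screens and whose edges are UI events.
    APP = {                                        # hypothetical GUI of a toy app
        "Main":     {"tap_settings": "Settings", "tap_about": "About"},
        "Settings": {"back": "Main"},
        "About":    {"back": "Main"},
    }

    def explore(state, visited, model):
        if state in visited:
            return                                 # already explored: skip redundant events
        visited.add(state)
        for event, next_state in APP[state].items():
            model.append((state, event, next_state))   # record an FSM edge
            explore(next_state, visited, model)

    model = []
    explore("Main", set(), model)
    for edge in model:
        print(edge)                                # (state, event, next_state)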

8 of 36

3. Systematic Exploration Strategy

Uses sophisticated techniques (e.g., symbolic execution, evolutionary algorithms) to systematically explore the app

Tools: ACTEve, EvoDroid

Advantages

  • Exploration of behaviors that are hard to reach with random techniques

Drawbacks

  • Less scalable than other techniques

Example: the path constraint (xleft < x < xright) ∧ (ytop < y < ybottom) over the tap coordinates is handed to a SAT solver, which returns concrete values such as x = 5, y = 10.
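
A minimal sketch of that last step, using the Z3 solver as an example (the widget bounds are illustrative):

    from z3 import Ints, Solver, And, sat

    x, y = Ints("x y")
    xleft, xright, ytop, ybottom = 0, 100, 0, 200      # hypothetical widget bounds

    s = Solver()
    s.add(And(xleft < x, x < xright, ytop < y, y < ybottom))

    if s.check() == sat:
        m = s.model()
        print("tap at x =", m[x], ", y =", m[y])       # any point inside the widget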

9 of 36

Automated Test Input Generation Techniques

Name        Exploration Strategy   Testing Strategy
Monkey      Random                 Black-box
ACTEve      Systematic             White-box
Dynodroid   Random                 Black-box
A3E-DF      Model-based            Black-box
SwiftHand   Model-based            Black-box
GUIRipper   Model-based            Black-box
PUMA        Model-based            Black-box

(The original table also marks, for each tool, whether it needs platform and/or app instrumentation and whether it generates UI and/or system events.)

10 of 36

Evaluation

Image Credit: Daily Alchemy

11 of 36

Evaluation Criteria

  1. Ease of use
  2. Android framework compatibility
  3. Code coverage achieved
  4. Fault detection ability

12 of 36

Mobile App Benchmarks

68 apps in total: the combined subjects used to evaluate the individual tools, drawn from F-Droid and other open-source repositories

13 of 36

Experimental Setup

  • Debian host running VirtualBox, provisioned via Vagrant
  • Ubuntu guest VM: 2 cores, 6GB RAM
  • Android emulators (4GB RAM): v2.3 (Gingerbread), v4.1 (Jelly Bean), v4.4 (KitKat)
  • Tools installed on the guest:
    • Removed default timeouts
    • Default configuration; no special tuning

14 of 36

Experiment Protocol

  • Run each tool for 1 hour on each benchmark
  • Repeat 10 times to account for non-deterministic behavior
  • Collect results:
    • Coverage report every 5 min: Emma HTML reports, parsed to extract statement coverage
    • Logcat from the device, parsed with regular expressions to extract unique stack traces (failures); a sketch of this step follows
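
A minimal sketch of the logcat post-processing step (the regular expression, grouping heuristic, and file name are illustrative, not the exact ones used in the study):

    import re

    CRASH_LINE = re.compile(r"E/AndroidRuntime\(\s*\d+\):\s(.*)")

    def unique_failures(logcat_lines):
        # Group consecutive AndroidRuntime error lines into stack traces.
        traces, current = [], []
        for line in logcat_lines:
            m = CRASH_LINE.search(line)
            if m:
                current.append(m.group(1))
            elif current:
                traces.append("\n".join(current))
                current = []
        if current:
            traces.append("\n".join(current))
        # De-duplicate by the exception line plus the top stack frame.
        seen, unique = set(), []
        for t in traces:
            key = "\n".join(t.splitlines()[:2])
            if key not in seen:
                seen.add(key)
                unique.append(t)
        return unique

    with open("logcat.txt") as f:                  # hypothetical dump of `adb logcat -d`
        for trace in unique_failures(f):
            print(trace.splitlines()[0])           # e.g. "FATAL EXCEPTION: main"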

15 of 36

Results

Image Credit: ToTheWeb

16 of 36

C1. Ease of Use & C2. Android Compatibility

Name             Ease of Use     OS Compatibility   Emulator/Device
Monkey           NO_EFFORT       Any                Any
ACTEve           MAJOR_EFFORT    v2.3               Emulator (custom)
Dynodroid        NO_EFFORT       v2.3               Emulator (custom)
A3E-Depth-first  LITTLE_EFFORT   Any                Any
SwiftHand        MAJOR_EFFORT    v4.1+              Any
GUIRipper        MAJOR_EFFORT    Any                Emulator
PUMA             LITTLE_EFFORT   v4.3+              Any

17 of 36

C3. Overall Code Coverage Achieved

18 of 36

C3. Coverage Analysis by Benchmark App

(Per-app coverage plots for the benchmark apps, e.g., Divide And Conquer, Random Music Player, k9mail, PasswordMakerPro; axes: number of applications vs. % coverage.)

19 of 36

C3. Code Coverage Achieved Over Time

20 of 36

C4. Fault Detection Ability

21 of 36

Pairwise Comparison: Coverage and Failures

Coverage

22 of 36

Pairwise Comparison: Coverage and Failures

Failures

23 of 36

Pairwise Comparison: Coverage and Failures

Coverage

Failures

24 of 36

Observations and Discussion

25 of 36

1. Random testing can be effective (somewhat surprisingly)

26 of 36

2. Strategy makes a difference (in the behaviors covered)

27 of 36

3. System events matter (in addition to UI events)

Examples: broadcast receivers, intents, SMS, notifications (injection sketched below)
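
A minimal sketch of injecting such system events into an emulator (the broadcast action, phone number, and message are illustrative; the adb and emulator-console commands are standard):

    import subprocess

    def adb(*args):
        subprocess.run(["adb", *args], check=True)

    adb("emu", "sms", "send", "5551234", "hello")      # deliver an incoming SMS
    adb("emu", "power", "capacity", "5")               # drop the battery level to 5%
    adb("emu", "gsm", "call", "5551234")               # simulate an incoming call
    adb("shell", "am", "broadcast",
        "-a", "com.example.app.ACTION_SYNC")           # hypothetical app broadcast receiver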

28 of 36

4. Restarts should be minimized (for efficient exploration)

29 of 36

5. Practical considerations matter (for real-world usefulness)

30 of 36

5.1 Practical considerations: manual inputs

31 of 36

5.2 Practical considerations: initial state

32 of 36

Open Issues for Future Work

Image Credits: Back to the Future (Universal Pictures)

33 of 36

1. Reproducibility (allow reproducing observed behaviors)

Image Source: http://ncp-e.com

34 of 36

2. Mocking and sandboxing (support reproducibility, avoid side effects, ease testing)

Source: http://googletesting.blogspot.com

35 of 36

3. Find problems across platforms (address fragmentation)

Image Credit: OpenSignal

36 of 36

Infrastructure

http://www.cc.gatech.edu/~orso/software/androtest