1 of 36

Automated Test Input Generation for Android: Are We There Yet?

Shauvik Roy Choudhary (Georgia Tech)

Alessandra Gorla (IMDEA Software Institute, Spain)

Alessandro Orso (Georgia Tech)

Partly supported by NSF, MSR, IBM Research, Google

2 of 36

Apps on the Play Store

83% of smartphones worldwide are Android-based

3 of 36

4 of 36

Automated Test Input Generation Techniques

  • Monkey (2008)
  • Null IntentFuzzer (2009)
  • ACTEve (FSE’12)
  • GUIRipper (ASE’12)
  • JPF-Android (SENotes’12)
  • Dynodroid (FSE’13)
  • A3E (OOPSLA’13)
  • SwiftHand (OOPSLA’13)
  • DroidFuzzer (MoMM’13)
  • Orbit (FASE’13)
  • PUMA (MobiSys’14)
  • EvoDroid (FSE’14)
  • IntentFuzzer (WODA’14)

5 of 36

Tool Strategies

  1. Instrumentation strategy -- App/Platform
  2. Events generation strategy -- UI/System
  3. Testing strategy -- Black-box/White-box
  4. Exploration strategy -- Random/Model-based/Systematic
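
The four dimensions above can be captured as a small data structure; a minimal sketch (not from the talk), with example rows for Monkey and Dynodroid that follow the classification used later in this deck and in the tools' papers:

    from dataclasses import dataclass

    @dataclass
    class ToolProfile:
        name: str
        instrumentation: str   # what it needs to instrument: "none", "app", "platform", or "app+platform"
        events: str            # events it generates: "UI", "system", or "UI+system"
        testing: str           # "black-box" or "white-box"
        exploration: str       # "random", "model-based", or "systematic"

    TOOLS = [
        ToolProfile("Monkey",    "none",     "UI",        "black-box", "random"),
        ToolProfile("Dynodroid", "platform", "UI+system", "black-box", "random"),
    ]

    for t in TOOLS:
        print(f"{t.name}: {t.exploration} exploration, {t.testing}, events={t.events}")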

6 of 36

1. Random Exploration Strategy

Randomly selects an event for exploration

Tools: Monkey, Dynodroid

Advantages

  • Efficient/fast generation of events
  • Suitable for stress testing/sanity checks

Drawbacks

  • Hard to generate specific inputs
  • App behavior/coverage agnostic
    • typically generates redundant events
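
As a concrete illustration, a random run can be launched through Monkey over adb; a minimal sketch (the package name and option values are illustrative, the flags are standard Monkey options):

    import subprocess

    # Inject 5,000 random UI events into a hypothetical app under test.
    subprocess.run([
        "adb", "shell", "monkey",
        "-p", "com.example.app",   # package under test (illustrative)
        "-s", "42",                # fixed seed, so the random run can be repeated
        "--throttle", "200",       # wait 200 ms between injected events
        "-v",                      # verbose logging
        "5000",                    # number of random events to inject
    ])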

7 of 36

2. Model-based Exploration Strategy

Uses a GUI model of the app to explore it systematically

Typically FSMs (states = activities, edges = events)

Tools: A3E-DF, SwiftHand, GUIRipper, PUMA

Advantages

  • More interpretable and intuitive
  • Fewer redundant events

Drawbacks

  • Events that alter non-GUI state not considered
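
A minimal, self-contained sketch of the idea (the toy app model and its events are invented for illustration; a real tool would drive the actual GUI instead of a dictionary):

    # Model-based (depth-first) exploration over an FSM whose states are
    # activities/screens and whose edges are UI events.
    APP = {                                        # hypothetical GUI of a toy app
        "Main":     {"tap_settings": "Settings", "tap_about": "About"},
        "Settings": {"back": "Main"},
        "About":    {"back": "Main"},
    }

    def explore(state, visited, model):
        if state in visited:
            return                                 # already explored: skip redundant events
        visited.add(state)
        for event, next_state in APP[state].items():
            model.append((state, event, next_state))   # record an FSM edge
            explore(next_state, visited, model)

    model = []
    explore("Main", set(), model)
    for edge in model:
        print(edge)                                # (state, event, next_state)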

8 of 36

3. Systematic Exploration Strategy

Uses sophisticated techniques (e.g., symbolic execution, evolutionary algorithms) to systematically explore the app

Tools: ACTEve, EvoDroid

Advantages

  • Exploration of behaviors that are hard to reach with random techniques

Drawbacks

  • Less scalable than other techniques

Example: the path constraint (xleft < x < xright) ∧ (ytop < y < ybottom) over the tap coordinates is handed to a SAT solver, which returns concrete values such as x = 5, y = 10.
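
A minimal sketch of that last step, using the Z3 solver as an example (the widget bounds are illustrative):

    from z3 import Ints, Solver, And, sat

    x, y = Ints("x y")
    xleft, xright, ytop, ybottom = 0, 100, 0, 200      # hypothetical widget bounds

    s = Solver()
    s.add(And(xleft < x, x < xright, ytop < y, y < ybottom))

    if s.check() == sat:
        m = s.model()
        print("tap at x =", m[x], ", y =", m[y])       # any point inside the widget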

9 of 36

Automated Test Input Generation Techniques

Name        Exploration Strategy   Testing Strategy
Monkey      Random                 Black-box
ACTEve      Systematic             White-box
Dynodroid   Random                 Black-box
A3E-DF      Model-based            Black-box
SwiftHand   Model-based            Black-box
GUIRipper   Model-based            Black-box
PUMA        Model-based            Black-box

(The original table also marks, for each tool, whether it needs platform and/or app instrumentation and whether it generates UI and/or system events.)

10 of 36

Evaluation

Image Credit: Daily Alchemy

11 of 36

Evaluation Criteria

  1. Ease of use
  2. Android framework compatibility
  3. Code coverage achieved
  4. Fault detection ability

12 of 36

Mobile App Benchmarks

68 apps in total: the combined subjects used to evaluate the individual tools, drawn from F-Droid and other open-source repositories

13 of 36

Experimental Setup

  • Debian host running VirtualBox, provisioned via Vagrant
  • Ubuntu guest VM: 2 cores, 6GB RAM
  • Android emulators (4GB RAM): v2.3 (Gingerbread), v4.1 (Jelly Bean), v4.4 (KitKat)
  • Tools installed on the guest:
    • Removed default timeouts
    • Default configuration; no special tuning

14 of 36

Experiment Protocol

  • Run each tool for 1 hour on each benchmark
  • Repeat 10 times to account for non-deterministic behavior
  • Collect results:
    • Coverage report every 5 min: Emma HTML reports, parsed to extract statement coverage
    • Logcat from the device, parsed with regular expressions to extract unique stack traces (failures); a sketch of this step follows
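
A minimal sketch of the logcat post-processing step (the regular expression, grouping heuristic, and file name are illustrative, not the exact ones used in the study):

    import re

    CRASH_LINE = re.compile(r"E/AndroidRuntime\(\s*\d+\):\s(.*)")

    def unique_failures(logcat_lines):
        # Group consecutive AndroidRuntime error lines into stack traces.
        traces, current = [], []
        for line in logcat_lines:
            m = CRASH_LINE.search(line)
            if m:
                current.append(m.group(1))
            elif current:
                traces.append("\n".join(current))
                current = []
        if current:
            traces.append("\n".join(current))
        # De-duplicate by the exception line plus the top stack frame.
        seen, unique = set(), []
        for t in traces:
            key = "\n".join(t.splitlines()[:2])
            if key not in seen:
                seen.add(key)
                unique.append(t)
        return unique

    with open("logcat.txt") as f:                  # hypothetical dump of `adb logcat -d`
        for trace in unique_failures(f):
            print(trace.splitlines()[0])           # e.g. "FATAL EXCEPTION: main"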

15 of 36

Results

Image Credit: ToTheWeb

16 of 36

C1. Ease of Use & C2. Android Compatibility

Name             Ease of Use     OS Compatibility   Emulator/Device
Monkey           NO_EFFORT       Any                Any
ACTEve           MAJOR_EFFORT    v2.3               Emulator (custom)
Dynodroid        NO_EFFORT       v2.3               Emulator (custom)
A3E-Depth-first  LITTLE_EFFORT   Any                Any
SwiftHand        MAJOR_EFFORT    v4.1+              Any
GUIRipper        MAJOR_EFFORT    Any                Emulator
PUMA             LITTLE_EFFORT   v4.3+              Any

17 of 36

C3. Overall Code Coverage Achieved

18 of 36

C3. Coverage Analysis by Benchmark App

(Per-app coverage plots for the benchmark apps, e.g., Divide And Conquer, Random Music Player, k9mail, PasswordMakerPro; axes: number of applications vs. % coverage.)

19 of 36

C3. Code Coverage Achieved Over Time

20 of 36

C4. Fault Detection Ability

21 of 36

Pairwise Comparison: Coverage and Failures

Coverage

22 of 36

Pairwise Comparison: Coverage and Failures

Failures

23 of 36

Pairwise Comparison: Coverage and Failures

Coverage

Failures

24 of 36

Observations and Discussion

25 of 36

1. Random testing can be effective (somewhat surprisingly)

26 of 36

2. Strategy makes a difference (in the behaviors covered)

27 of 36

3. System events matter (in addition to UI events)

Examples: broadcast receivers, intents, SMS, notifications (injection sketched below)
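
A minimal sketch of injecting such system events into an emulator (the broadcast action, phone number, and message are illustrative; the adb and emulator-console commands are standard):

    import subprocess

    def adb(*args):
        subprocess.run(["adb", *args], check=True)

    adb("emu", "sms", "send", "5551234", "hello")      # deliver an incoming SMS
    adb("emu", "power", "capacity", "5")               # drop the battery level to 5%
    adb("emu", "gsm", "call", "5551234")               # simulate an incoming call
    adb("shell", "am", "broadcast",
        "-a", "com.example.app.ACTION_SYNC")           # hypothetical app broadcast receiver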

28 of 36

4. Restarts should be minimized (for efficient exploration)

29 of 36

5. Practical considerations matter (for real-world usefulness)

30 of 36

5.1 Practical considerations: manual inputs

31 of 36

5.2 Practical considerations: initial state

32 of 36

Open Issues for Future Work

Image Credits: Back to the Future (Universal Pictures)

33 of 36

1. Reproducibility (allow reproducing observed behaviors)

Image Source: http://ncp-e.com

34 of 36

2. Mocking and sandboxing (support reproducibility, avoid side effects, ease testing)

Source: http://googletesting.blogspot.com

35 of 36

3. Find problems across platforms (address fragmentation)

Image Credit: OpenSignal

36 of 36

Infrastructure

http://www.cc.gatech.edu/~orso/software/androtest