1 of 277

Welcome Back!

2 of 277

Developer Experience, FTW!

Niranjan Tulpule

3 of 277

Software development is being democratized

4 of 277

Core computing platforms are more accessible than ever

(Chart: PCs vs. Smartphones & Tablets, 1983–2014, y-axis 0 to 1.5M)

5 of 277

Free developer tools

Open Source building blocks

Free developer education

We’re lowering the barrier to becoming a developer

6 of 277

It’s never been easier to write apps

(Chart: Total Number of Active Apps in the App Store, 2010–2020, from 1M to 5M)
7 of 277

If the combined valuation of these 9 companies were a country's GDP, it would rank in the top 50:

  • Uber: $66B
  • Snapchat: $40B
  • WhatsApp: $16B
  • Airbnb: $25B
  • Flipkart: $15B
  • Pinterest: $11B
  • Lyft: $5.5B
  • Ola Cabs: $5B
  • Gojek: $1.3B

8 of 277

Classification of 1-star reviews (sampling of Play Store reviews, May 2016):

  • 51% stability issues
  • 41% functionality related
  • 7% speed
  • 1% other

Writing high-quality apps is still hard

9 of 277

Compounded complexity

  • 2.5K+ device manufacturers & models
  • 100+ OS versions
  • ~700 carriers
  • 100M permutations

10 of 277

Improving software quality & testability by investing in Developer Experience.

11 of 277

Develop

Release

Monitor

Firebase Test Lab

for Android

12 of 277

Test on your users’ devices

13 of 277

Use with your existing workflow


Android Studio

Command line

Jenkins

Jenkins logo by Charles Lowell and Frontside CC BY-SA 3.0 https://wiki.jenkins-ci.org/display/JENKINS/Logo

14 of 277

Robo crawls your app automatically

15 of 277

Create Espresso tests by just using your app


16 of 277

Millions of Tests, and counting!

After extensive evaluation of the market, we've found that Firebase Test Lab is the best product for writing and running Espresso tests directly from Android Studio, saving us tons of time and effort around automated testing.

- Timothy West, Jet

17 of 277

Get actionable results at your fingertips

Develop

Release

Monitor

Firebase Test Lab

for Android

Play Pre-Launch Report

18 of 277

Pre-launch report

Pre-launch reports summarize issues found when testing your app on a wide range of devices

19 of 277

20 of 277

21 of 277

22 of 277

Apps using the Play Pre-Launch Report show ~20% fewer crashes!

~60% of the crashes seen on Pre-Launch Report are fixed before public rollout.

23 of 277

Get actionable results at your fingertips

Develop

Release

Monitor

Firebase Test Lab

for Android

Play Pre-Launch Report

Firebase Crash Reporting

24 of 277

Firebase Crash Reporting

Get actionable insights and comprehensive analytics whenever your users experience crashes and other errors

25 of 277

  • Integrate Gradle/Pod

  • 0-1 init lines of code

  • Start capturing errors!

26 of 277

Clustering

(Figure: crash clusters with occurrence counts, e.g. fatal error A: 6K/7K, non-fatal error A: 5K/6K, fatal error B: 4K/4.8K, fatal error C: 3K/3K)

27 of 277

28 of 277

Get the big picture with comprehensive metrics on app versions, OS levels and device models

29 of 277

Find the exact line where the error happens

30 of 277

Minimize the time and effort to resolve issues with data about your users’ devices

31 of 277

Log custom events before an error happens

// On Android
FirebaseCrash.log("Activity created.");

// On iOS
FIRCrashLog(@"Button clicked.");

32 of 277

Provide more context with events leading up to an error

33 of 277

Understand the Impact of Crashes on the Bottom Line


34 of 277

Fix the bug, then win them back with a timely push notification


35 of 277

Looking ahead

Machine learning

Compilers

Toolchains

36 of 277

The shift to mobile caught us by surprise...

(Chart: PCs vs. Smartphones & Tablets, 1983–2014, y-axis 0 to 1.5M)

37 of 277

Thank You

38 of 277

Docker Based Geo Dispersed Test Farm - Test Infrastructure Practice in Intel Android Program

Chen Guobing, Yu Jerry


39 of 277

Agenda

  • Test Infrastructure Challenges
  • Test as a Service
  • Docker Based Test Farm
  • Test Distribution
  • Technical Challenges
  • Questions


40 of 277

Taxonomies


41 of 277

Test Infrastructure Challenges

  • Maximize the use of Development Vehicles (engineering samples)
  • Maximize the use of automated tests
  • Minimize the maintenance cost of the test infrastructure, test benches and test assets


42 of 277

Test as a Service – What We Need

Anyone

Any automated Test

Any Device

Anywhere

Anytime


43 of 277

Target Users - Usages

  • Test on demand and automated release testing
  • Re-run failed test cases or reproduce failures
  • Automated pre-commit and post-commit testing
  • Test on demand against a developer’s own build
  • Work with other dev tools, e.g. bisection (dichotomy) checks

(Diagram: target user groups: Continuous Integration Testing, QA Release Testing, Developer Testing)


44 of 277

Docker Based Geo Dispersed Test Farm


45 of 277

Test Distribution

(Diagram: the Test Catalog records each test bench's capability, platform, and location; the Test Distributor matches a test campaign's required capability to a test bench. Example: Campaign A, capability: pmeter, is dispatched to "Run campaign A on XYZ platform in SH".)
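As a rough illustration of the capability matching sketched above, here is a hypothetical Python sketch (the catalog entries, field names and find_benches helper are invented for this example, not Intel's actual implementation):

# Hypothetical test catalog: each bench advertises platform, location, capabilities.
CATALOG = [
    {'bench': 'SH-bench-01', 'platform': 'XYZ', 'location': 'SH',
     'capabilities': {'pmeter', 'wifi'}},
    {'bench': 'BJ-bench-07', 'platform': 'ABC', 'location': 'BJ',
     'capabilities': {'wifi'}},
]

def find_benches(campaign):
    """Return benches whose capabilities, platform and location satisfy the campaign."""
    return [b['bench'] for b in CATALOG
            if campaign['capabilities'] <= b['capabilities']
            and campaign.get('platform') in (None, b['platform'])
            and campaign.get('location') in (None, b['location'])]

# "Run campaign A on XYZ platform in SH" with capability 'pmeter':
campaign_a = {'capabilities': {'pmeter'}, 'platform': 'XYZ', 'location': 'SH'}
print(find_benches(campaign_a))  # ['SH-bench-01']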


46 of 277

Technical Challenges – Anywhere, Any Device

  • DUT and Test Equipment controls

$ docker run … --device=/dev/bus/usb/001/004 --device=/dev/ttySerial0 …

  • DUT state transition management


47 of 277

Technical Challenges – Anyone, Any Automated Test

  • Hierarchical code maintenance

  • Easily customized

  • All-in-one delivery

  • Create once, run anywhere

Release and deliver test suites as Docker images.


48 of 277

Questions?

Contacts:
jerry.yu@intel.com
guobing.chen@intel.com


49 of 277

OpenHTF

an open-source hardware testing framework

https://github.com/google/openhtf

50 of 277

Motivation for OpenHTF

Drastically reduce the amount of boilerplate code needed to:

  • exercise a piece of hardware
  • take measurements along the way
  • generate a record of the whole process

Make operator interactions simple but flexible.

Allow test engineers to focus on authoring actual test logic.

“Simplicity is requisite for reliability.” ~Edsger W. Dijkstra

51 of 277

Google:

A Software Company

...at least, it used to be!

52 of 277

Google:

Now With More Hardware!

53 of 277

Our Solution

A python library that provides a set of convenient abstractions for authoring hardware testing code.

54 of 277

Use Cases

Manufacturing Floor

Automated Lab

Benchtop

55 of 277

Core Abstractions

(Diagram: a Test is composed of Phases; Phases record Measurements and use Plugs to drive the Test Equipment & Device Under Test; each run produces an Output Record handled by an Output Callback: JSON to disk, upload via network, etc.)
56 of 277

Tests & Phases

57 of 277

Plugs
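A minimal sketch of a phase using a plug, assuming OpenHTF's documented plugs.BasePlug / plugs.plug pattern (the FlashlightPlug itself is hypothetical):

import openhtf as htf
from openhtf import plugs

class FlashlightPlug(plugs.BasePlug):
  # Hypothetical plug wrapping a piece of bench equipment or the DUT.
  def turn_on(self):
    self.logger.info('flashlight on')

  def tearDown(self):
    # Plugs get a chance to clean up when the test finishes.
    self.logger.info('flashlight off')

@plugs.plug(flashlight=FlashlightPlug)
def light_phase(test, flashlight):
  # The framework constructs the plug and injects it into the phase.
  flashlight.turn_on()

if __name__ == '__main__':
  htf.Test(light_phase).execute(test_start=lambda: 'dut-001')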

58 of 277

Web GUI

59 of 277

Q&A

60 of 277

Detecting loop inefficiencies automatically

(to appear in FSE 2016)

Monika Dhok (IISc Bangalore, India)*

Murali Krishna Ramanathan (IISc Bangalore, India)

61 of 277

Software efficiency is very important

Performance issues are hard to detect during testing

These issues are found even in well-tested commercial software

They degrade application responsiveness and user experience

62 of 277

Performance bugs are critical

Implementation mistakes that cause inefficiency

Difficult for compiler optimizations to catch

Fixing them can result in large speedups, thereby improving efficiency

63 of 277

Redundant traversal bugs

When a program iterates over a data structure repeatedly without any intermediate modifications

public class A {
    public boolean containsAny(Collection c1, Collection c2) {
        Iterator itr = c1.iterator();
        while (itr.hasNext())
            if (c2.contains(itr.next()))
                return true;
        return false;
    }
}

Complexity: O(size(c1) x size(c2))
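For illustration only (not from the paper), the same inefficiency class and its usual fix in Python:

def contains_any_slow(c1, c2):
    # O(len(c1) * len(c2)): c2 is scanned again for every element of c1.
    for x in c1:
        if x in c2:          # linear scan when c2 is a plain list
            return True
    return False

def contains_any_fast(c1, c2):
    # O(len(c1) + len(c2)): build a hash lookup for c2 once.
    lookup = set(c2)
    return any(x in lookup for x in c1)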

64 of 277

Performance tests are written by developers

65 of 277

Detecting redundant traversals

Toddler [ICSE 13]

66 of 277

Static analysis techniques alone are not effective

Challenges:

  • How to confirm the validity of the bug?
  • How to expose the root cause? An execution trace can be helpful.
  • How to detect that the performance bug is fixed?

67 of 277

Automated tests not effective for performance bugs

Toddler [ICSE 13]

68 of 277

Challenges involved in writing performance tests

  • Virtual call resolution: generating tests for all possible resolutions of a method invocation is not scalable

  • Generating appropriate context: realization of the defect can depend on certain conditions that affect the reachability of the inefficient loop

  • Arrangement of elements: the problem can occur only when the data structure has a large number of elements arranged in a particular fashion

69 of 277

Glider

We propose a novel and scalable approach to automatically generate tests for exposing loop inefficiencies

70 of 277

Glider is available online

https://drona.csa.iisc.ernet.in/~sss/tools/glider

71 of 277

Performance bug caught by Glider

72 of 277

Results

We have implemented our approach on the Soot bytecode framework and evaluated it on a number of libraries.

Our approach detected 46 bugs across 7 Java libraries, including 34 previously unknown bugs.

Tests generated using our approach significantly outperform randomly generated tests.

73 of 277

Questions?

74 of 277

NEED FOR SPEED

accelerate tests from 3 hours to 3 minutes

emo@komfo.com

75 of 277

600 API tests: from 3 hours to 3 minutes

76 of 277

The 3 Minute Goal (before vs. after)

77 of 277

It’s not about the numbers or techniques you’ll see.

It’s all about continuous improvement.

78 of 277

Dedicated Environment

79 of 277

Execution Time in Minutes: 180 → 123 (New Environment)

80 of 277

Empty Databases

81 of 277

The time needed to create data for one test:

  • Call 12 API endpoints
  • Modify data in 11 tables
  • Takes about 1.2 seconds

And then the test starts

82 of 277

Execution Time in Minutes: 180 → 123 → 89 (Empty Databases)

83 of 277

Simulate Dependencies

84 of 277

Stub all external dependencies

(Diagram: the Core API with stubs in place of every external dependency, and some more)

85 of 277

Transparent

Fake SSL certs

Dynamic Responses

Local Storage

Return Binary Data

Regex URL match

Existing Tools (March 2016)

Stubby4J

WireMock

Wilma

soapUI

MockServer

mountebank

Hoverfly

Mirage

We created project Nagual, open source soon.
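Nagual was not yet public at the time, so purely as a hypothetical Python sketch of the stubbing idea above (regex URL matching with canned dynamic responses; the routes and payloads are invented):

import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stub table: URL regex -> (status, JSON body).
STUBS = [
    (re.compile(r'^/users/\d+$'), 200, {'name': 'stubbed user'}),
    (re.compile(r'^/payments/.*$'), 503, {'error': 'dependency down'}),
]

class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        for pattern, status, body in STUBS:
            if pattern.match(self.path):
                payload = json.dumps(body).encode()
                self.send_response(status)
                self.send_header('Content-Type', 'application/json')
                self.end_headers()
                self.wfile.write(payload)
                return
        self.send_response(404)
        self.end_headers()

if __name__ == '__main__':
    # Point the application under test at this address instead of the real dependency.
    HTTPServer(('localhost', 8080), StubHandler).serve_forever()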

86 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 (Stub Dependencies)

87 of 277

Move to Containers

88 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 (Using Containers)

89 of 277

Run Databases in Memory

90 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 → 61 (Run Databases in Memory)

91 of 277

Don’t Clean Test Data

92 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 → 61 → 46 (Don’t delete test data)

93 of 277

Run in Parallel

94 of 277

The Sweet Spot

Parallel workers:            4    6    8    10   12   14   16
Time to execute (minutes):   12   9    7    5    8    12   17

95 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 → 61 → 46 → 5 (Run in Parallel)

96 of 277

Equalize Workload
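One common way to equalize batches is to assign tests greedily from historical timings; a minimal Python sketch (the test names and durations are made up, and this is not necessarily the approach used at Komfo):

import heapq

def make_equal_batches(test_durations, num_batches):
    # Greedy LPT: always give the longest remaining test to the lightest batch.
    batches = [(0.0, i, []) for i in range(num_batches)]  # (total_seconds, id, tests)
    heapq.heapify(batches)
    for name, seconds in sorted(test_durations.items(), key=lambda kv: -kv[1]):
        total, i, tests = heapq.heappop(batches)
        tests.append(name)
        heapq.heappush(batches, (total + seconds, i, tests))
    return [tests for _, _, tests in sorted(batches, key=lambda b: b[1])]

# Hypothetical timings collected from a previous run:
durations = {'test_login': 90, 'test_feed': 75, 'test_upload': 60, 'test_search': 30}
print(make_equal_batches(durations, 2))  # [['test_login', 'test_search'], ['test_feed', 'test_upload']]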

97 of 277

98 of 277

99 of 277

Execution Time in Minutes: 180 → 123 (New Environment) → 89 (Empty Databases) → 65 (Stub Dependencies) → 104 (Using Containers) → 61 (Run Databases in Memory) → 46 (Don’t delete test data) → 5 (Run in Parallel) → 3 (Equal Batches)

100 of 277

The Outcome: 2:15 min. After the hardware upgrade: 1:38 min.

101 of 277

High Level Tests Problems:

  • The tests are slow
  • The tests are unreliable
  • The tests can’t exactly pinpoint the problem

3 Minutes:

  • No external dependencies
  • It’s cheap to run all tests after every change

102 of 277

In a couple of years, running all your automated tests, after every code change, for less than 3 minutes, will be standard development practice.

103 of 277

Recommended Reading

104 of 277

EmanuilSlavov.com

@EmanuilSlavov

105 of 277

Slide #, Photo Credits

1. https://www.flickr.com/photos/thomashawk

5. https://www.flickr.com/photos/100497095@N02

7. https://www.flickr.com/photos/andrewmalone

10. https://www.flickr.com/photos/astrablog

14. https://www.flickr.com/photos/foilman

16. https://www.flickr.com/photos/missusdoubleyou

18. https://www.flickr.com/photos/canonsnapper

20. https://www.flickr.com/photos/anotherangle

23. https://www.flickr.com/photos/-aismist

106 of 277

Code Coverage is a Strong Predictor of Test Suite Effectiveness in the Real World

Rahul Gopinath

Iftekhar Ahmed

107 of 277

When should we stop testing?

108 of 277

How to evaluate test suite effectiveness?

109 of 277

Previous research: Do not trust coverage

(In theory)

GTAC’15 Inozemtseva

110 of 277

Factors affecting test suite quality

(Diagram: test suite quality driven by coverage and assertions)

111 of 277

According to previous research

(Diagram: test suite quality driven by coverage, assertions, and test suite size) [GTAC’15 Inozemtseva]

112 of 277

But...

What is the adequate test suite size?

  • Is there a maximum number of test cases for a given program?
  • Are different test cases equivalent in strength?
  • How do we account for duplicate tests?
  • Test suite sizes are not comparable even for the same program.

113 of 277

Can I use coverage to measure

suite effectiveness?

114 of 277

Statement coverage best predicts mutation score

A fault in a statement has an 87% probability of being detected if an organic test covers it.

M = 0.87 × S,  R² = 0.94

(Scatter plot: dot size follows project size; results from 250 real-world programs, the largest > 100 KLOC, on developer-written test suites)

115 of 277

Statement coverage best predicts mutation score

A fault in a statement has a 61% probability of being detected if a generated test covers it.

M = 0.61 × S,  R² = 0.70

(Scatter plot: dot size follows project size; results from 250 real-world programs, the largest > 100 KLOC, on Randoop-generated test suites)

116 of 277

But

Controlling for test suite size, coverage provides little extra information.

Hence don't use coverage [GTAC’15 Inozemtseva]

Why use mutation?

Mutation score provides little extra information (<6%) compared to coverage.

117 of 277

Does coverage have no extra value?

                               GTAC’15 Inozemtseva          Our Research
# Programs                     5                            250
Selection of programs          Ad hoc                       Systematic sample from GitHub
Tool used                      CodeCover, PIT               Emma, Cobertura, CodeCover, PIT
Test suites                    Random subsets of original   Organic & randomly generated
Removal of influence of size   Ad hoc                       Statistical (new results)

Our study is much larger, systematic (not ad hoc), and follows real-world usage.

Our Research (new results):

M ~ TestsuiteSize                 12.84%
M ~ log(TSize)                    51.26%
residuals(M ~ log(TSize)) ~ S     75.25%

Statement coverage can explain 75% of the variability in mutation score after eliminating the influence of test suite size.

118 of 277

Is mutation analysis better than coverage analysis?

119 of 277

Mutation analysis: High cost of analysis

Δ = b² - 4ac

d = b^2 + 4 * a * c;
d = b^2 * 4 * a * c;
d = b^2 / 4 * a * c;
d = b^2 ^ 4 * a * c;
d = b^2 % 4 * a * c;
d = b^2 << 4 * a * c;
d = b^2 >> 4 * a * c;

d = b^2 * 4 + a * c;
d = b^2 * 4 - a * c;
d = b^2 * 4 / a * c;
d = b^2 * 4 ^ a * c;
d = b^2 * 4 % a * c;
d = b^2 * 4 << a * c;
d = b^2 * 4 >> a * c;

d = b^2 * 4 * a + c;
d = b^2 * 4 * a - c;
d = b^2 * 4 * a / c;
d = b^2 * 4 * a ^ c;
d = b^2 * 4 * a % c;
d = b^2 * 4 * a << c;
d = b^2 * 4 * a >> c;

d = b + 2 - 4 * a * c;
d = b - 2 - 4 * a * c;
d = b * 2 - 4 * a * c;
d = b / 2 - 4 * a * c;
d = b % 2 - 4 * a * c;

d = b^0 - 4 * a * c;
d = b^1 - 4 * a * c;
d = b^-1 - 4 * a * c;
d = b^MAX - 4 * a * c;
d = b^MIN - 4 * a * c;

d = b^2 - 0 * a * c;
d = b^2 - 1 * a * c;
d = b^2 - (-1) * a * c;
d = b^2 - MAX * a * c;
d = b^2 - MIN * a * c;

120 of 277

Mutation score is very costly

121 of 277

Mutation analysis: Equivalent mutants

Δ = b² - 2²ac

d = b^2 - (2^2) * a * c;
d = b^2 - (2*2) * a * c;
d = b^2 - (2+2) * a * c;

(Diagram: the Original and its Mutants, labeled Equivalent Mutant / Normal Mutant)

Or: Do not trust low mutation scores

122 of 277

Low mutation score does not indicate a low quality test suite.

123 of 277

Mutation analysis: Equivalent mutants

Δ = b² - 4ac

d = b^2 - (-4) * a * c;
d = b^2 + 4 * a * c;
d = (-b)^2 - 4 * a * c;

(Diagram: the Original and its Mutants, labeled Equivalent Mutant / Redundant Mutant)

Or: Do not trust low mutation scores

124 of 277

High mutation score does not indicate a high quality test suite.

125 of 277

Mutation Analysis: Different Operators

Δ = b² - 4ac

d = b^2 + 4 * a * c;

>>> dis.dis(d)

2 0 LOAD_FAST 0 (b)

3 LOAD_CONST 1 (2)

6 LOAD_CONST 2 (4)

9 LOAD_FAST 1 (a)

12 BINARY_MULTIPLY

13 LOAD_FAST 2 (c)

16 BINARY_MULTIPLY

17 BINARY_SUBTRACT

18 BINARY_XOR

19 RETURN_VALUE x

[2016 Software Quality Journal]

126 of 277

Mutation score is not a consistent measure

127 of 277

Does a high coverage test suite

actually prevent bugs?

128 of 277

We looked at bugfixes on actual programs

An uncovered line is twice as likely to have a bug fix as a line covered by any test case.

[FSE 2016]

             Covered   Uncovered   p
Statement    0.68      1.20        0.00
Block        0.42      0.83        0.00
Method       0.40      0.87        0.00
Class        0.45      0.32        0.10

Difference in bug-fixes between covered and uncovered program elements

129 of 277

Does a high coverage test suite

actually prevent bugs?

Yes it does

130 of 277

Summary

Do not dismiss coverage lightly

Beware of mutation analysis caveats

Coverage is a pretty good heuristic on where the bugs hide.

  • Coverage is highly correlated with mutation score (92%)
  • Coverage provides 75% more information than just test suite size.

  • Mutation score provides little extra information compared to coverage.
  • Mutation score can be unreliable.

131 of 277

Assume non-equivalent, non-redundant, uniform fault distribution for mutants at one’s own peril.

Beware of theoretical spherical cows…

132 of 277

Backup slides

133 of 277

That is,

  • Coverage is highly correlated with mutation score (92%)
  • Mutation score provides little extra information compared to coverage.
  • Coverage provides 75% more information than just test suite size.
  • Mutation score can be unreliable.
  • Coverage thresholds actually help reduce incidence of bugs.

134 of 277

Mutation X Path Coverage

135 of 277

Mutation X Branch Coverage

136 of 277

Computations

require(Coverage)

data(o.db)

o <- subset(subset(o.db, tloc != 0), select=c('pit.mutation.cov', 'cobertura.line.cov', 'loc', 'tloc'))

o$l.tloc <- log2(o$tloc)

oo <- subset(o, l.tloc != -Inf)

ooo <- na.omit(oo)

> cor.test(pit.mutation.cov,tloc)

t = 1.973, df = 232, p-value = 0.04969

95 percent confidence interval: 0.0002148688 0.2525430013

sample estimates: cor 0.1284574

> cor.test(pit.mutation.cov,l.tloc)

data: pit.mutation.cov and l.tloc

t = 9.0938, df = 232, p-value < 2.2e-16

95 percent confidence interval: 0.4114269 0.6013377

sample estimates: cor 0.5126249

> cor.test(resid(lm(pit.mutation.cov~log(tloc))),cobertura.line.cov)

data: resid(lm(pit.mutation.cov ~ log(tloc))) and cobertura.line.cov

t = 17.406, df = 232, p-value < 2.2e-16

95 percent confidence interval: 0.6909857 0.8032663

sample estimates: cor 0.7525441

> summary(lm(pit.mutation.cov~log(tloc)))

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.13644 0.06031 -2.262 0.0246 *

log(tloc) 0.09950 0.01094 9.094 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2839 on 232 degrees of freedom

Multiple R-squared: 0.2628, Adjusted R-squared: 0.2596

F-statistic: 82.7 on 1 and 232 DF, p-value: < 2.2e-16

> summary(lm(pit.mutation.cov~log(tloc)+cobertura.line.cov))

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.074859 0.031645 -2.366 0.018828 *

log(tloc) 0.023658 0.006487 3.647 0.000328 ***

cobertura.line.cov 0.785488 0.031628 24.836 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1485 on 231 degrees of freedom

Multiple R-squared: 0.7991, Adjusted R-squared: 0.7974

F-statistic: 459.5 on 2 and 231 DF, p-value: < 2.2e-16

137 of 277

Does Mutation score correlate to fixed bugs?

138 of 277

Mutant semiotics (how faults map to failures) is not well understood

Affected by factors of the particular project

  • Style of development, coding guidelines etc
  • Complexity of algorithms
  • Coupling between modules

139 of 277

Can weak mutation analysis help?

Rather than the failure of a test case for a mutant, we only require a change in state. It is easier to compute, but:

  • Does not verify assertions
  • So, just another coverage technique
  • Redundant and equivalent mutants remain

140 of 277

Method

250 real world projects from Github, largest > 100 KLOC.

Tests: developer-written and Randoop-generated

             Statement   Branch   Path   Mutation
Emma         X
Cobertura    X           X
CodeCover    X           X
JMockit      X                    X
PIT          X                           X
Major                                    X
Judy                                     X

141 of 277

Mutation analysis has a number of other problems

  • Mutants are not similar in their difficulty to kill
    • So a test suite that is optimized for killing difficult mutants is at a disadvantage
  • Coupling effect has not been validated for complex systems
    • According to Wah, the coupling will decrease as the system gets larger.

142 of 277

The fault distribution may not be uniform

A majority of mutants are very easy to kill, but some are stubborn.

Do two test suites with, say, 50% mutation score have the same strength?

Test suites optimized for harder-to-detect faults are penalized.

143 of 277

Correlation does not imply causation?

It was pointed out in the previous talk that correlation between coverage and mutation score does not imply a causal relationship between the two. We can counter it by:

Logic

A test suite with zero coverage will not kill any mutants.

A test suite can only kill mutants on the lines it covers.

Statistically

Using additive noise models to identify cause and effect. (ongoing research)

144 of 277

ClusterRunner

Making fast test-feedback easy through horizontal scaling.

Joseph Harrington and Taejun Lee, Productivity Engineering

145 of 277

What is ClusterRunner?

146 of 277

147 of 277

148 of 277

(Test types: Unit Tests, Integration Tests, Functional Tests, Manual Tests)

149 of 277

150 of 277

151 of 277

(Feature cycle: Design → Develop → Test → Release)

152 of 277

(Feature cycle: Design → Develop → Test → Release)

153 of 277

PHPUnit test suite duration at Box

154 of 277

155 of 277

“A problem isn’t a problem if you can throw money at it.”

156 of 277

157 of 277

PHPUnit

Scala SBT

nosetests

QUnit

JUnit

158 of 277

Requirements

Easy to configure and use

Test technology agnostic

Fast test feedback

159 of 277

160 of 277

www.ClusterRunner.com

161 of 277

Our 30-hour test suite: 17 minutes

162 of 277

ClusterRunner in Action

  • Bring up a cluster
  • Set up your project
  • Execute a build
  • Look at the results

163 of 277

Bring up a Cluster

# On master.box.com
clusterrunner master --port 43000

# On slave1.box.com, slave2.box.com
clusterrunner slave --master-url master.box.com:43000

164 of 277

Bring up a Cluster

http://master.box.com:43000/v1/slave/

165 of 277

166 of 277

Set up Your Project

  • Create clusterrunner.yaml at the root of your project repo.
    • Commands to run
    • How to distribute

167 of 277

168 of 277

Set up Your Project

> phpunit ./test/php/EarthTest.php
> phpunit ./test/php/WindTest.php
> phpunit ./test/php/FireTest.php
> phpunit ./test/php/WaterTest.php
> phpunit ./test/php/HeartTest.php

169 of 277

Execute a Build

Now we’re ready to build!

clusterrunner build --master-url master.box.com:43000 \
    git --url http://github.com/myproject --job-name PHPUnit

170 of 277

171 of 277

View Build Results

http://master.box.com:43000/v1/build/1/

172 of 277

173 of 277

View Build Results

http://master.box.com:43000/v1/build/1/subjob/

174 of 277

175 of 277

View Build Results

http://master.box.com:43000/v1/build/1/result

176 of 277

177 of 277

178 of 277

179 of 277

180 of 277

What’s next for ClusterRunner

  • AWS integration with autoscaling
  • Docker support
  • Improvements to deployment mechanism
  • In-place upgrades
  • Web UI

181 of 277

clusterrunner.com

Get Involved!

182 of 277

productivity@box.com

Contact Us

183 of 277

Multi-device Testing

E2E test infra for mobile products of today and tomorrow

angli@google.com

adorokhine@google.com

184 of 277

Overview

E2E testing challenges

Introducing Mobly

Sample test

Controlling Android devices

Custom controller

Demo

185 of 277

E2E Testing

(Testing Pyramid: Unit Tests at the base, Integration/Component Tests in the middle, E2E Tests at the top: where magic dwells)

186 of 277

E2E Testing is Important

Applications involving multiple devices

P2P data transfer, nearby discovery

Product under test is not a conventional device.

Internet-Of-Things, VR

Need to control and vary physical environment

RF: Wi-Fi router, attenuators

Lighting, physical position

Interact with other software/cloud services

iPerf server, cloud service backend, network components

187 of 277

E2E Testing is Hard!

Most test frameworks are for single-device app testing

Need to trigger complex actions on devices

Some may need system privilege

Need to synchronize steps between multiple devices

Logic may be centralized (hard to write) or decentralized (hard to trigger)

Need to drive a wide range of equipment

attenuators, call boxes, power meters, wireless APs, etc.

Need to communicate with cloud services

Need to collect debugging artifacts from many sources

188 of 277

Our Solution - Mobly

Lightweight Python framework (Py2/3 compatible)

Test logic runs on a host machine

Controls a collection of devices/equipment in a test bed

Bundled with controller library for essential equipment

Android device, power meter, etc

Flexible and pluggable

Custom controller module for your own toys

Open source and ready to go!

189 of 277

Mobly Architecture

(Diagram: a Test Harness handles test bed allocation, device provisioning, and results aggregation; inside the Test Bed, a computer runs Mobly and the test script, driving mobile devices, a network switch, an attenuator, a call box, and cloud services)

190 of 277

Sample Tests

Hello from the other side


191 of 277

Describe a Test Bed

{
  'testbed': [{
    'name': 'SimpleTestBed',
    'AndroidDevice': '*'
  }],
  'logpath': '/tmp/mobly_logs'
}

192 of 277

Test Script - Hello!

from mobly import base_test
from mobly import test_runner
from mobly.controllers import android_device


class HelloWorldTest(base_test.BaseTestClass):
  def setup_class(self):
    self.ads = self.register_controller(android_device)
    self.dut1 = self.ads[0]

  def test_hello_world(self):
    self.dut1.sl4a.makeToast('Hello!')


if __name__ == '__main__':
  test_runner.main()

Invocation:
$ ./path/to/hello_world_test.py -c path/to/config.json

193 of 277

Beyond the Basics

Config:

{
  'testbed': [{
    ...
  }],
  'logpath': '/tmp/mobly_logs',
  'toast_text': 'Hey there!'
}

Code:

self.user_params['toast_text']  # 'Hey there!'

194 of 277

Beyond the Basics

Device-specific logger:

self.caller.log.info("I did something.")
# <timestamp> [AndroidDevice|<serial>] I did something

Device-specific info:

In test bed config:
'AndroidDevice': [{'serial': 'xyz', 'label': 'caller'},
                  {'serial': 'abc', 'label': 'callee', 'phone_number': '123456'}]

In code:
self.callee = android_device.get_device(self.ads, label='callee')
self.callee.phone_number  # '123456'

195 of 277

Controlling Android Devices

adb/shell

UI

API Calls

Custom Java Logic

196 of 277

Controlling Android Devices

adb

ad.adb.shell('pm clear com.my.package')

UI automator

ad.uia = uiautomator.Device(serial=ad.serial)

ad.uia(text='Hello World!').wait.exists(timeout=1000)

Android API calls, including system/hidden APIs, via SL4A

ad.sl4a.wifiConnect({'SSID': 'GoogleGuest'})

Custom Java logic

ad.register_snippets('trigger', 'com.my.package.snippets')

ad.trigger.myImpeccableLogic(5)

197 of 277

System API Calls

> self.dut.sl4a.makeToast('Hello World!')

SL4A (Scripting Layer for Android) is an RPC service exposing API calls on Android

self.dut.sl4a is the RPC client for SL4A.

Original version works on regular Android builds.

Fork in AOSP can make direct system privileged calls (system/hidden APIs).

198 of 277

Custom Snippets

SL4A is not sufficient

SL4A methods are mapped to Android APIs, but tests need more than just Android API calls.

Current AOSP SL4A requires system privilege

Custom snippets allow users to define custom methods that do anything they want.

Custom snippets can be used with other useful libs like Espresso

199 of 277

Custom Snippets

package com.mypackage.testing.snippets.example;

public class ExampleSnippet implements Snippet {
  public ExampleSnippet(Context context) {}

  @Rpc(description="Returns a string containing the given number.")
  public String getFoo(Integer input) {
    return "foo " + input;
  }

  @Override
  public void shutdown() {}
}

200 of 277

Custom Snippets

Add your snippet classes to AndroidManifest.xml for the androidTest apk

<meta-data
    android:name='mobly-snippets'
    android:value='com.my.app.test.MySnippet1,
                   com.my.app.test.MySnippet2' />

Compile it into an apk

apply plugin: 'com.android.application'

dependencies {
    androidTestCompile 'com.google.android.mobly:snippetlib:0.0.1'
}

201 of 277

Custom Snippets

Install the apk on your device

Load and call it

ad.load_snippets(name='snippets',
                 package='com.mypackage.testing.snippets.example')
foo = ad.snippets.getFoo(2)  # 'foo 2'

202 of 277

Espresso in Custom Snippets

import static android.support.test.espresso.Espresso.onView;
import static android.support.test.espresso.action.ViewActions.swipeUp;
import static android.support.test.espresso.matcher.ViewMatchers.withId;

public class ExampleSnippet implements Snippet {
  public ExampleSnippet(Context context) {}

  @Rpc(description="Performs a swipe using espresso")
  public void performSwipe() {
    onView(withId(R.id.my_view_id)).perform(swipeUp());
  }
}

203 of 277

Custom Controllers

Plug in your own toys

204 of 277

Loose Controller Interface

def create(configs):
    '''Instantiate controller objects.'''

def destroy(objects):
    '''Destroy controller objects.'''

def get_info(objects):
    '''[optional] Get controller info for the test summary.'''
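A hypothetical controller module that satisfies this interface (the car example from the next slide; the MOBLY_CONTROLLER_CONFIG_NAME constant is an assumption about how Mobly maps testbed config entries to modules):

# my/project/testing/controllers/car.py (hypothetical path, matching the import on the next slide)

# Assumed: the key under which this controller's configs appear in the testbed config.
MOBLY_CONTROLLER_CONFIG_NAME = 'Car'


class Car:
    def __init__(self, config):
        self.name = config['name']

    def drive(self):
        print('%s is driving' % self.name)


def create(configs):
    """Instantiate one controller object per config entry."""
    return [Car(c) for c in configs]


def destroy(cars):
    """Release any resources held by the controller objects."""
    pass


def get_info(cars):
    """[optional] Info to record in the test summary."""
    return [{'name': car.name} for car in cars]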

205 of 277

Using Custom Controllers

from my.project.testing.controllers import car


def setup_class(self):
  self.cars = self.register_controller(car)


def test_something(self):
  self.cars[0].drive()

206 of 277

Video Demo

  • A test bed with two phones and one watch.
  • Phone A gives the voice command to the watch.
  • The watch initiates a call to phone B.
  • Phone B gets a ringing call notification.
  • Phone A hangs up.

207 of 277

Video Demo

208 of 277

Coming Soon

iOS controller libs

Dependent on libimobiledevice

KIFTest, XCTest, XCUITest

Async events in snippets

Standard snippet and python utils for basic Android operations

Support non-Nexus Android devices

209 of 277

Thank You!

Questions?

210 of 277

Scale vs Value

Test Automation at the BBC

David Buckhurst & Jitesh Gosai

211 of 277

212 of 277

213 of 277

214 of 277

215 of 277

Lots of innovation

Chair hive

216 of 277

217 of 277

218 of 277

219 of 277

220 of 277

221 of 277

222 of 277

223 of 277

224 of 277

225 of 277

226 of 277

227 of 277

228 of 277

229 of 277

230 of 277

Live

Insights

&

Operational

Notifications

231 of 277

232 of 277

Scale vs Value

233 of 277

www.bbc.co.uk/opensource

@BBCOpenSource

@davidbuckhurst @JitGo

234 of 277

Finding bugs in

C/C++ libraries using

libFuzzer

Kostya Serebryany, GTAC 2016

235 of 277

Agenda

  • What is fuzzing
  • Why fuzz
  • What to fuzz
  • How to fuzz
    • … with libFuzzer
  • Demo (CVE-2016-5179)

236 of 277

What is Fuzzing

  • Somehow generate a test input�
  • Feed it to the code under test�
  • Repeat

237 of 277

Why fuzz

  • Bugs specific to C/C++ that require the sanitizers to catch:
    • Use-after-free, buffer overflows, Uses of uninitialized memory, Memory leaks
  • Arithmetic bugs:
    • Div-by-zero, Int/float overflows, bitwise shifts by invalid amount
  • Plain crashes:
    • NULL dereferences, Uncaught exceptions
  • Concurrency bugs:
    • Data races, Deadlocks
  • Resource usage bugs:
    • Memory exhaustion, hangs or infinite loops, infinite recursion (stack overflows)
  • Logical bugs:
    • Discrepancies between two implementations of the same protocol (example)
    • Assertion failures

238 of 277

What to fuzz

  • Anything that consumes untrusted or complicated inputs:
    • Parsers of any kind (xml, pdf, truetype, ...)
    • Media codecs (audio, video, raster & vector images, etc)
    • Network protocols, RPC libraries (gRPC)
    • Crypto (boringssl, openssl)
    • Compression (zip, gzip, bzip2, brotli, …)
    • Compilers and interpreters (PHP, Perl, Python, Go, Clang, …)
    • Regular expression matchers (PCRE, RE2, libc’s regcomp)
    • Text/UTF processing (icu)
    • Databases (SQLite)
    • Browsers, text editors/processors (Chrome, OpenOffice)
  • OS Kernels (Linux), drivers, supervisors and VMs
  • UI (Chrome UI)

239 of 277

How to fuzz

  • Generation-based fuzzing
    • Usually a target-specific grammar-based generator�
  • Mutation-based fuzzing
    • Acquire a corpus of test inputs
    • Apply random mutations to the inputs�
  • Guided mutation-based fuzzing
    • Execute mutations with coverage instrumentation
    • If new coverage is observed the mutation is permanently added to the corpus
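To make the guided loop concrete, here is a toy Python sketch of guided mutation-based fuzzing (coverage is faked with a set of branch ids; real engines such as libFuzzer use compiler instrumentation and smarter mutations):

import random

def code_under_test(data, hit):
    # Toy target: 'hit' records which branches executed (stand-in for real coverage).
    if len(data) > 0 and data[0] == ord('F'):
        hit.add('F')
        if len(data) > 1 and data[1] == ord('U'):
            hit.add('U')
            if len(data) > 2 and data[2] == ord('Z'):
                hit.add('Z')
                raise RuntimeError('bug reached')

def mutate(data):
    data = bytearray(data or b'\0')
    data[random.randrange(len(data))] = random.randrange(256)
    if random.random() < 0.3:
        data.append(random.randrange(256))
    return bytes(data)

corpus = [b'seed']
seen = set()
for _ in range(100000):
    candidate = mutate(random.choice(corpus))
    hit = set()
    try:
        code_under_test(candidate, hit)
    except RuntimeError:
        print('crash input:', candidate)
        break
    if not hit <= seen:          # new coverage observed: keep the input in the corpus
        seen |= hit
        corpus.append(candidate)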

240 of 277

Fuzz Target - a C/C++ function worth fuzzing

extern "C"
int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t DataSize) {
  if (DataSize >= 4 &&
      Data[0] == 'F' &&
      Data[1] == 'U' &&
      Data[2] == 'Z' &&
      Data[3] == 'Z')
    DoMoreStuff(Data, DataSize);
  return 0;
}

241 of 277

libFuzzer - an engine for guided in-process fuzzing

  • libFuzzer: a library; provides main()
  • Build your target code with extra compiler flags
  • Link your target with libFuzzer
  • Pass a directory with the initial test corpus and run

% clang++ -g my-code.cc libFuzzer.a -o my-fuzzer \
    -fsanitize=address -fsanitize-coverage=trace-pc-guard
% ./my-fuzzer MY_TEST_CORPUS_DIR

242 of 277

CVE-2016-5179 (c-ares, asynchronous DNS requests)

extern "C"
int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t DataSize) {
  unsigned char *buf;
  int buflen;
  std::string s(reinterpret_cast<const char *>(Data), DataSize);
  ares_create_query(s.c_str(), ns_c_in, ns_t_a, 0x1234, 0, &buf, &buflen, 0);
  free(buf);
  return 0;
}

243 of 277

244 of 277

present perfect => present continuous

  • “The project X has been fuzzed, hence it is somewhat secure”�
  • False:
    • Bug discovery techniques evolve
    • The project X evolves
    • Fuzzing is CPU intensive and needs time to find bugs�
  • “The project X is being continuously fuzzed, the code coverage is monitored.”
    • Much better!

245 of 277

OSS-Fuzz - fuzzing as a service for OSS

Based on ClusterFuzz, the fuzzing backend used for fuzzing Chrome components

Supported engines: libFuzzer, AFL, Radamsa, ...

https://github.com/google/oss-fuzz

246 of 277

Q&A

247 of 277

Can MongoDB Recover from Catastrophe?

How I learned to crash a server

{ name : "Jonathan Abrahams",

title : "Senior Quality Engineer",

location : "New York, NY",

twitter : "@MongoDB",

facebook : "MongoDB" }

248 of 277

Machine crash: a machine may crash for a variety of reasons:

  • Termination of virtual machine or host
  • Hardware failure
  • OS failure

Application crash: unexpected termination of mongod

249 of 277

Why do we need to crash a machine?

We could abort mongod, but this would not fully simulate an unexpected crash of a machine or OS (kernel):

Immediate loss of power may prevent cached I/O from being flushed to disk.

A kernel panic can leave an application (and its data) in an unrecoverable state.

250 of 277

(Flow: system restart → system passes h/w & s/w checks → mongod goes into recovery mode → mongod ready for client connection)

251 of 277

How can we crash a machine?

We started by crashing the machine manually, by pulling the cord.

We evolved to using an appliance timer, which would power the machine off/on every 15 minutes.

We also figured out that setting up a cron job to send an internal crash command (more on this later) to the machine for a random period would do the job.

And then we realized, we need to do it a bit more often.

252 of 277

How did we really crash that machine, and can we do it over and over and over and over...?

253 of 277

Why do we need to do it over and over and over?

A crash of a machine may be catastrophic. In order to uncover any subtle recovery bugs, we want to repeatedly crash a machine and test if it has recovered. A failure may only be encountered 1 out of 100 times!

254 of 277

Ubiquiti mPower PRO to the rescue!

Programmable power device, with ssh access from LAN via WiFi or Ethernet.

255 of 277

How do we turn off and on the power?

ssh admin@mpower

outlet="output1"
# Send a power cycle to the specified mFi mPower outlet
echo 0 > /dev/$outlet
sleep 10
echo 1 > /dev/$outlet

256 of 277

Physical vs. Virtual

It is necessary to test both types of machines, as machine crashes differ and the underlying host OS and hardware may provide different I/O caching and data protection. Virtual machines typically rely on shared resources, while physical machines typically use dedicated resources.

257 of 277

How do we crash a virtual machine?

We can crash it from the VM host:

KVM (Kernel-based VM): virsh destroy <vm>

VmWare: vmrun stop <vm> hard

258 of 277

How do we restart a crashed VM?

We can restart it from the VM host:

KVM (Kernel-based VM): virsh start <vm>

VmWare: vmrun start <vm>

259 of 277

How else can we crash a machine?

We can crash it using the magical SysRq key sequence (Linux only):

echo 1 | sudo tee /proc/sys/kernel/sysrq

echo b | sudo tee /proc/sysrq-trigger

260 of 277

How do we get the machine to restart?

Enable the BIOS setting to boot up after AC power is provided.

261 of 277

Restarting a Windows Machine

To disable a Windows machine from prompting you after unexpected shutdown:

bcdedit /set {default} bootstatuspolicy ignoreallfailures

bcdedit /set {current} bootstatuspolicy ignoreallfailures

bcdedit /timeout 5

262 of 277

The machine is running

Now that we figured out how to get our machine to crash and restart, we restart the mongod and it will go into recovery mode.

263 of 277

Recovery mode of mongod

Performed automatically when mongod starts, if an unclean shutdown is detected.

WiredTiger starts from the last stable copy of the data on disk from the last checkpoint. The journal log is then applied and a new checkpoint is taken.

264 of 277

Before the crash!

Stimulate mongod by running several simultaneous (mongo shell) clients which provide a moderate load utilizing nearly all supported operations. This is important, as CRUD operations will cause mongod to perform I/O operations, which should never lead to file or data corruption.

265 of 277

Options, options

Client operations optionally provide:

Checkpoint document

Write & Read concerns

The mongod process is tested in a variety of modes, including:

Standalone or single node replica set

Storage engine, i.e., mmapv1, wiredTiger

266 of 277

What do we do after mongod has restarted?

After the machine has been restarted, we start mongod on a private port and it goes into recovery mode. Once that completes, we perform further client validation, via mongo (shell):

serverStatus

Optionally, run validate against all databases and collections

Optionally, verify if a checkpoint document exists

Failure to recover, connect to mongod, or perform the other validation steps is considered a test failure.

267 of 277

What do we do after mongod has restarted?

Now that the recovery validation has passed, we will proceed with the pre-crash steps:

Stop and restart mongod on a public port

Start new set of (mongo shell) clients to perform various DB operations

268 of 277

Why do we care about validation?

The validate command checks the structures within a namespace for correctness by scanning the collection’s data and indexes. The command returns information regarding the on-disk representation of the collection.

Failing validation indicates that something has been corrupted, most likely due to an incomplete I/O operation during the unexpected shutdown.

269 of 277

Failure analysis

Since our developers could be local (NYC) or worldwide (Boston, Sydney), we want a self-service application they can use to reproduce reported failures. A bash script has been developed which can execute on both local hardware and in the cloud (AWS).

We save any artifacts useful for our developers to be able to analyze the failure:

Backup data files before starting mongod

Backup data files after mongod completes recovery

mongod and mongo (shell) log files

270 of 277

The crash testing helped to:

Extend our testing to scenarios not previously covered

Provide local and remote teams with tools to reproduce and analyze failures

Improve robustness of the mongod storage layer

271 of 277

Results, results

Storage engine bugs were discovered from the power cycle testing and led to fixes/improvements.

We have plans to incorporate this testing into our continuous integration.

272 of 277

Some bugs discovered

SERVER-20295 Power cycle test - mongod fails to start with invalid object size in storage.bson

SERVER-19774 WT_NOTFOUND: item not found during DB recovery

SERVER-19692 Mongod failed to open connection, remained in hung state, when running WT with LSM

SERVER-18838 DB fails to recover creates and drops after system crash

SERVER-18379 DB fails to recover when specifying LSM, after system crash

SERVER-18316 Database with WT engine fails to recover after system crash

SERVER-16702 Mongod fails during journal replay with mmapv1 after power cycle

SERVER-16021 WT failed to start with "lsm-worker: Error in LSM worker thread 2: No such file or directory"

273 of 277

Open issues?

Can we crash Windows using an internal command (cue the laugh track…)?

274 of 277

Closing remarks

275 of 277

Organizing committee

Alan Myrvold

Amar Amte

Andrea Dawson

Ari Shamash

Carly Schaeffer

Dan Giovannelli

David Aristizabal

Diego Cavalcanti

Jaydeep Mehta

Joe Drummey

Josephine Chandra

Kathleen Li

Lena Wakayama

Lesley Katzen

Madison Garcia

Matt Lowrie

Matthew Halupka

Sonal Shah

Travis Ellett

Yvette Nameth

276 of 277

London 2017

277 of 277

GTAC 2017

testing.googleblog.com