1 of 277

Welcome Back!

2 of 277

Developer Experience, FTW!

Niranjan Tulpule

3 of 277

Software development is being democratized

4 of 277

Core computing platforms are more accessible than ever

(Chart: PCs vs. Smartphones & Tablets, 1983–2014, y-axis 0 to 1.5M)

5 of 277

Free developer tools

Open Source building blocks

Free developer education

We’re lowering the barrier to becoming a developer

6 of 277

It’s never been easier to write apps

(Chart: Total Number of Active Apps in the App Store, 2010–2020, from 1M to 5M)
7 of 277

If the combined valuation of these 9 companies were a country's GDP, it would rank in the top 50:

  • Uber: $66B
  • Snapchat: $40B
  • WhatsApp: $16B
  • Airbnb: $25B
  • Flipkart: $15B
  • Pinterest: $11B
  • Lyft: $5.5B
  • Ola Cabs: $5B
  • Gojek: $1.3B

8 of 277

Classification of 1-star reviews (sampling of Play Store reviews, May 2016):

  • 51% stability issues
  • 41% functionality related
  • 7% speed
  • 1% other

Writing high-quality apps is still hard

9 of 277

Compounded complexity

  • 2.5K+ device manufacturers & models
  • 100+ OS versions
  • ~700 carriers
  • 100M permutations

10 of 277

Improving software quality & testability by investing in Developer Experience.

11 of 277

Develop

Release

Monitor

Firebase Test Lab

for Android

12 of 277

Test on your users’ devices

13 of 277

Use with your existing workflow


Android Studio

Command line

Jenkins

Jenkins logo by Charles Lowell and Frontside CC BY-SA 3.0 https://wiki.jenkins-ci.org/display/JENKINS/Logo

14 of 277

Robo crawls your app automatically

15 of 277

Create Espresso tests by just using your app


16 of 277

Millions of Tests, and counting!

After extensive evaluation of the market, we've found that Firebase Test Lab is the best product for writing and running Espresso tests directly from Android Studio, saving us tons of time and effort around automated testing.

- Timothy West, Jet

17 of 277

Get actionable results at your fingertips

Develop

Release

Monitor

Firebase Test Lab

for Android

Play Pre-Launch Report

18 of 277

Pre-launch report

Pre-launch reports summarize issues found when testing your app on a wide range of devices

19 of 277

20 of 277

21 of 277

22 of 277

Apps using the Play Pre-Launch Report show ~20% fewer crashes!

~60% of the crashes seen on Pre-Launch Report are fixed before public rollout.

23 of 277

Get actionable results at your fingertips

Develop

Release

Monitor

Firebase Test Lab

for Android

Play Pre-Launch Report

Firebase Crash Reporting

24 of 277

Firebase Crash Reporting

Get actionable insights and comprehensive analytics whenever your users experience crashes and other errors

25 of 277

  • Integrate Gradle/Pod

  • 0-1 init lines of code

  • Start capturing errors!

26 of 277

Clustering

(Figure: crash clusters with occurrence counts, e.g. fatal error A: 6K/7K, non-fatal error A: 5K/6K, fatal error B: 4K/4.8K, fatal error C: 3K/3K)

27 of 277

28 of 277

Get the big picture with comprehensive metrics on app versions, OS levels and device models

29 of 277

Find the exact line where the error happens

30 of 277

Minimize the time and effort to resolve issues with data about your users’ devices

31 of 277

Log custom events before an error happens

// On Android
FirebaseCrash.log("Activity created.");

// On iOS
FIRCrashLog(@"Button clicked.");

32 of 277

Provide more context with events leading up to an error

33 of 277

Understand the Impact of Crashes on the Bottom Line


34 of 277

Fix the bug, then win them back with a timely push notification


35 of 277

Looking ahead

Machine learning

Compilers

Toolchains

36 of 277

The shift to mobile caught us by surprise...

(Chart: PCs vs. Smartphones & Tablets, 1983–2014, y-axis 0 to 1.5M)

37 of 277

Thank You

38 of 277

Docker Based Geo Dispersed Test Farm - Test Infrastructure Practice in Intel Android Program

Chen Guobing, Yu Jerry


39 of 277

Agenda

  • Test Infrastructure Challenges
  • Test as a Service
  • Docker Based Test Farm
  • Test Distribution
  • Technical Challenges
  • Questions


40 of 277

Taxonomies


41 of 277

Test Infrastructure Challenges

  • Maximize the use of Development Vehicles (engineering samples)
  • Maximize the use of automated tests
  • Minimize the maintenance cost of the test infrastructure, test benches and test assets


42 of 277

Test as a Service – What We Need

Anyone

Any automated Test

Any Device

Anywhere

Anytime


43 of 277

Target Users - Usages

  • Test on demand and automated release testing
  • Re-run failed test cases or reproduce failures
  • Automated pre-commit and post-commit testing
  • Test on demand against a developer’s own build
  • Work with other dev tools, e.g. bisection (dichotomy) checks

(Diagram: target user groups: Continuous Integration Testing, QA Release Testing, Developer Testing)


44 of 277

Docker Based Geo Dispersed Test Farm


45 of 277

Test Distribution

(Diagram: the Test Catalog records each test bench's capability, platform, and location; the Test Distributor matches a test campaign's required capability to a test bench. Example: Campaign A, capability: pmeter, is dispatched to "Run campaign A on XYZ platform in SH".)
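As a rough illustration of the capability matching sketched above, here is a hypothetical Python sketch (the catalog entries, field names and find_benches helper are invented for this example, not Intel's actual implementation):

# Hypothetical test catalog: each bench advertises platform, location, capabilities.
CATALOG = [
    {'bench': 'SH-bench-01', 'platform': 'XYZ', 'location': 'SH',
     'capabilities': {'pmeter', 'wifi'}},
    {'bench': 'BJ-bench-07', 'platform': 'ABC', 'location': 'BJ',
     'capabilities': {'wifi'}},
]

def find_benches(campaign):
    """Return benches whose capabilities, platform and location satisfy the campaign."""
    return [b['bench'] for b in CATALOG
            if campaign['capabilities'] <= b['capabilities']
            and campaign.get('platform') in (None, b['platform'])
            and campaign.get('location') in (None, b['location'])]

# "Run campaign A on XYZ platform in SH" with capability 'pmeter':
campaign_a = {'capabilities': {'pmeter'}, 'platform': 'XYZ', 'location': 'SH'}
print(find_benches(campaign_a))  # ['SH-bench-01']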


46 of 277

Technical Challenges – Anywhere, Any Device

  • DUT and Test Equipment controls

$ docker run … --device=/dev/bus/usb/001/004 --device=/dev/ttySerial0 …

  • DUT state transition management


47 of 277

Technical Challenges – Anyone, Any Automated Test

  • Hierarchical code maintenance

  • Easily customized

  • All-in-one delivery

  • Create once, run anywhere

Release and deliver test suites as Docker images.


48 of 277

Questions?

Contacts:
jerry.yu@intel.com
guobing.chen@intel.com


49 of 277

OpenHTF

an open-source hardware testing framework

https://github.com/google/openhtf

50 of 277

Motivation for OpenHTF

Drastically reduce the amount of boilerplate code needed to:

  • exercise a piece of hardware
  • take measurements along the way
  • generate a record of the whole process

Make operator interactions simple but flexible.

Allow test engineers to focus on authoring actual test logic.

“Simplicity is requisite for reliability.” ~Edsger W. Dijkstra

51 of 277

Google:

A Software Company

...at least, it used to be!

52 of 277

Google:

Now With More Hardware!

53 of 277

Our Solution

A python library that provides a set of convenient abstractions for authoring hardware testing code.

54 of 277

Use Cases

Manufacturing Floor

Automated Lab

Benchtop

55 of 277

Core Abstractions

(Diagram: a Test is composed of Phases; Phases record Measurements and use Plugs to drive the Test Equipment & Device Under Test; each run produces an Output Record handled by an Output Callback: JSON to disk, upload via network, etc.)
56 of 277

Tests & Phases

57 of 277

Plugs
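A minimal sketch of a phase using a plug, assuming OpenHTF's documented plugs.BasePlug / plugs.plug pattern (the FlashlightPlug itself is hypothetical):

import openhtf as htf
from openhtf import plugs

class FlashlightPlug(plugs.BasePlug):
  # Hypothetical plug wrapping a piece of bench equipment or the DUT.
  def turn_on(self):
    self.logger.info('flashlight on')

  def tearDown(self):
    # Plugs get a chance to clean up when the test finishes.
    self.logger.info('flashlight off')

@plugs.plug(flashlight=FlashlightPlug)
def light_phase(test, flashlight):
  # The framework constructs the plug and injects it into the phase.
  flashlight.turn_on()

if __name__ == '__main__':
  htf.Test(light_phase).execute(test_start=lambda: 'dut-001')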

58 of 277

Web GUI

59 of 277

Q&A

60 of 277

Detecting loop inefficiencies automatically

(to appear in FSE 2016)

Monika Dhok (IISc Bangalore, India)*

Murali Krishna Ramanathan (IISc Bangalore, India)

61 of 277

Software efficiency is very important

Performance issues are hard to detect during testing

These issues are found even in well-tested commercial software

They degrade application responsiveness and user experience

62 of 277

Performance bugs are critical

Implementation mistakes that cause inefficiency

Difficult for compiler optimizations to catch

Fixing them can result in large speedups, thereby improving efficiency

63 of 277

Redundant traversal bugs

When a program iterates over a data structure repeatedly without any intermediate modifications

public class A {
    public boolean containsAny(Collection c1, Collection c2) {
        Iterator itr = c1.iterator();
        while (itr.hasNext())
            if (c2.contains(itr.next()))
                return true;
        return false;
    }
}

Complexity: O(size(c1) x size(c2))
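For illustration only (not from the paper), the same inefficiency class and its usual fix in Python:

def contains_any_slow(c1, c2):
    # O(len(c1) * len(c2)): c2 is scanned again for every element of c1.
    for x in c1:
        if x in c2:          # linear scan when c2 is a plain list
            return True
    return False

def contains_any_fast(c1, c2):
    # O(len(c1) + len(c2)): build a hash lookup for c2 once.
    lookup = set(c2)
    return any(x in lookup for x in c1)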

64 of 277

Performance tests are written by developers

65 of 277

Detecting redundant traversals

Toddler [ICSE 13]

66 of 277

Static analysis techniques alone are not effective

Challenges:

  • How to confirm the validity of the bug?
  • How to expose the root cause? An execution trace can be helpful.
  • How to detect that the performance bug is fixed?

67 of 277

Automated tests not effective for performance bugs

Toddler [ICSE 13]

68 of 277

Challenges involved in writing performance tests

  • Virtual call resolution: generating tests for all possible resolutions of a method invocation is not scalable

  • Generating appropriate context: realization of the defect can depend on certain conditions that affect the reachability of the inefficient loop

  • Arrangement of elements: the problem can occur only when the data structure has a large number of elements arranged in a particular fashion

69 of 277

Glider

We propose a novel and scalable approach to automatically generate tests for exposing loop inefficiencies

70 of 277

Glider is available online

https://drona.csa.iisc.ernet.in/~sss/tools/glider

71 of 277

Performance bug caught by Glider

72 of 277

Results

We have implemented our approach on the Soot bytecode framework and evaluated it on a number of libraries.

Our approach detected 46 bugs across 7 Java libraries, including 34 previously unknown bugs.

Tests generated using our approach significantly outperform randomly generated tests.

73 of 277

Questions?

74 of 277

NEED FOR SPEED

accelerate tests from 3 hours to 3 minutes

emo@komfo.com

75 of 277

600 API tests: from 3 hours to 3 minutes

76 of 277

The 3 Minute Goal (before vs. after)

77 of 277

It’s not about the numbers or techniques you’ll see.

It’s all about continuous improvement.

78 of 277

Dedicated Environment

79 of 277

Execution Time in Minutes: 180 → 123 (New Environment)

80 of 277

Empty Databases

81 of 277

The time needed to create data for one test:

  • Call 12 API endpoints
  • Modify data in 11 tables
  • Takes about 1.2 seconds

And then the test starts

82 of 277

Execution Time in Minutes: 180 → 123 → 89 (Empty Databases)

83 of 277

Simulate Dependencies

84 of 277

Stub all external dependencies

(Diagram: the Core API with stubs in place of every external dependency, and some more)

85 of 277

Transparent

Fake SSL certs

Dynamic Responses

Local Storage

Return Binary Data

Regex URL match

Existing Tools (March 2016)

Stubby4J

WireMock

Wilma

soapUI

MockServer

mountebank

Hoverfly

Mirage

We created project Nagual, open source soon.
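Nagual was not yet public at the time, so purely as a hypothetical Python sketch of the stubbing idea above (regex URL matching with canned dynamic responses; the routes and payloads are invented):

import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stub table: URL regex -> (status, JSON body).
STUBS = [
    (re.compile(r'^/users/\d+$'), 200, {'name': 'stubbed user'}),
    (re.compile(r'^/payments/.*$'), 503, {'error': 'dependency down'}),
]

class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        for pattern, status, body in STUBS:
            if pattern.match(self.path):
                payload = json.dumps(body).encode()
                self.send_response(status)
                self.send_header('Content-Type', 'application/json')
                self.end_headers()
                self.wfile.write(payload)
                return
        self.send_response(404)
        self.end_headers()

if __name__ == '__main__':
    # Point the application under test at this address instead of the real dependency.
    HTTPServer(('localhost', 8080), StubHandler).serve_forever()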

86 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 (Stub Dependencies)

87 of 277

Move to Containers

88 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 (Using Containers)

89 of 277

Run Databases in Memory

90 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 → 61 (Run Databases in Memory)

91 of 277

Don’t Clean Test Data

92 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 → 61 → 46 (Don’t delete test data)

93 of 277

Run in Parallel

94 of 277

The Sweet Spot

Parallel workers:            4    6    8    10   12   14   16
Time to execute (minutes):   12   9    7    5    8    12   17

95 of 277

Execution Time in Minutes: 180 → 123 → 89 → 65 → 104 → 61 → 46 → 5 (Run in Parallel)

96 of 277

Equalize Workload
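One common way to equalize batches is to assign tests greedily from historical timings; a minimal Python sketch (the test names and durations are made up, and this is not necessarily the approach used at Komfo):

import heapq

def make_equal_batches(test_durations, num_batches):
    # Greedy LPT: always give the longest remaining test to the lightest batch.
    batches = [(0.0, i, []) for i in range(num_batches)]  # (total_seconds, id, tests)
    heapq.heapify(batches)
    for name, seconds in sorted(test_durations.items(), key=lambda kv: -kv[1]):
        total, i, tests = heapq.heappop(batches)
        tests.append(name)
        heapq.heappush(batches, (total + seconds, i, tests))
    return [tests for _, _, tests in sorted(batches, key=lambda b: b[1])]

# Hypothetical timings collected from a previous run:
durations = {'test_login': 90, 'test_feed': 75, 'test_upload': 60, 'test_search': 30}
print(make_equal_batches(durations, 2))  # [['test_login', 'test_search'], ['test_feed', 'test_upload']]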

97 of 277

98 of 277

99 of 277

Execution Time in Minutes: 180 → 123 (New Environment) → 89 (Empty Databases) → 65 (Stub Dependencies) → 104 (Using Containers) → 61 (Run Databases in Memory) → 46 (Don’t delete test data) → 5 (Run in Parallel) → 3 (Equal Batches)

100 of 277

The Outcome: 2:15 min. After the hardware upgrade: 1:38 min.

101 of 277

High Level Tests Problems:

  • The tests are slow
  • The tests are unreliable
  • The tests can’t exactly pinpoint the problem

3 Minutes:

  • No external dependencies
  • It’s cheap to run all tests after every change

102 of 277

In a couple of years, running all your automated tests, after every code change, for less than 3 minutes, will be standard development practice.

103 of 277

Recommended Reading

104 of 277

EmanuilSlavov.com

@EmanuilSlavov

105 of 277

Slide #, Photo Credits

1. https://www.flickr.com/photos/thomashawk

5. https://www.flickr.com/photos/100497095@N02

7. https://www.flickr.com/photos/andrewmalone

10. https://www.flickr.com/photos/astrablog

14. https://www.flickr.com/photos/foilman

16. https://www.flickr.com/photos/missusdoubleyou

18. https://www.flickr.com/photos/canonsnapper

20. https://www.flickr.com/photos/anotherangle

23. https://www.flickr.com/photos/-aismist

106 of 277

Code Coverage is a Strong Predictor of Test Suite Effectiveness in the Real World

Rahul Gopinath

Iftekhar Ahmed

107 of 277

When should we stop testing?

108 of 277

How to evaluate test suite effectiveness?

109 of 277

Previous research: Do not trust coverage

(In theory)

GTAC’15 Inozemtseva

110 of 277

Factors affecting test suite quality

(Diagram: test suite quality driven by coverage and assertions)

111 of 277

According to previous research

(Diagram: test suite quality driven by coverage, assertions, and test suite size) [GTAC’15 Inozemtseva]

112 of 277

But...

What is the adequate test suite size?

  • Is there a maximum number of test cases for a given program?
  • Are different test cases equivalent in strength?
  • How do we account for duplicate tests?
  • Test suite sizes are not comparable even for the same program.

113 of 277

Can I use coverage to measure

suite effectiveness?

114 of 277

Statement coverage best predicts mutation score

A fault in a statement has an 87% probability of being detected if an organic test covers it.

M = 0.87 × S,  R² = 0.94

(Scatter plot: dot size follows project size; results from 250 real-world programs, the largest > 100 KLOC, on developer-written test suites)

115 of 277

Statement coverage best predicts mutation score

A fault in a statement has a 61% probability of being detected if a generated test covers it.

M = 0.61 × S,  R² = 0.70

(Scatter plot: dot size follows project size; results from 250 real-world programs, the largest > 100 KLOC, on Randoop-generated test suites)

116 of 277

But

Controlling for test suite size, coverage provides little extra information.

Hence don't use coverage [GTAC’15 Inozemtseva]

Why use mutation?

Mutation score provides little extra information (<6%) compared to coverage.

117 of 277

Does coverage have no extra value?

                               GTAC’15 Inozemtseva          Our Research
# Programs                     5                            250
Selection of programs          Ad hoc                       Systematic sample from GitHub
Tool used                      CodeCover, PIT               Emma, Cobertura, CodeCover, PIT
Test suites                    Random subsets of original   Organic & randomly generated
Removal of influence of size   Ad hoc                       Statistical (new results)

Our study is much larger, systematic (not ad hoc), and follows real-world usage.

Our Research (new results):

M ~ TestsuiteSize                 12.84%
M ~ log(TSize)                    51.26%
residuals(M ~ log(TSize)) ~ S     75.25%

Statement coverage can explain 75% of the variability in mutation score after eliminating the influence of test suite size.

118 of 277

Is mutation analysis better than coverage analysis?

119 of 277

Mutation analysis: High cost of analysis

Δ = b² - 4ac

d = b^2 + 4 * a * c;
d = b^2 * 4 * a * c;
d = b^2 / 4 * a * c;
d = b^2 ^ 4 * a * c;
d = b^2 % 4 * a * c;
d = b^2 << 4 * a * c;
d = b^2 >> 4 * a * c;

d = b^2 * 4 + a * c;
d = b^2 * 4 - a * c;
d = b^2 * 4 / a * c;
d = b^2 * 4 ^ a * c;
d = b^2 * 4 % a * c;
d = b^2 * 4 << a * c;
d = b^2 * 4 >> a * c;

d = b^2 * 4 * a + c;
d = b^2 * 4 * a - c;
d = b^2 * 4 * a / c;
d = b^2 * 4 * a ^ c;
d = b^2 * 4 * a % c;
d = b^2 * 4 * a << c;
d = b^2 * 4 * a >> c;

d = b + 2 - 4 * a * c;
d = b - 2 - 4 * a * c;
d = b * 2 - 4 * a * c;
d = b / 2 - 4 * a * c;
d = b % 2 - 4 * a * c;

d = b^0 - 4 * a * c;
d = b^1 - 4 * a * c;
d = b^-1 - 4 * a * c;
d = b^MAX - 4 * a * c;
d = b^MIN - 4 * a * c;

d = b^2 - 0 * a * c;
d = b^2 - 1 * a * c;
d = b^2 - (-1) * a * c;
d = b^2 - MAX * a * c;
d = b^2 - MIN * a * c;

120 of 277

Mutation score is very costly

121 of 277

Mutation analysis: Equivalent mutants

Δ = b² - 2²ac

d = b^2 - (2^2) * a * c;
d = b^2 - (2*2) * a * c;
d = b^2 - (2+2) * a * c;

(Diagram: the Original and its Mutants, labeled Equivalent Mutant / Normal Mutant)

Or: Do not trust low mutation scores

122 of 277

Low mutation score does not indicate a low quality test suite.

123 of 277

Mutation analysis: Equivalent mutants

Δ = b² - 4ac

d = b^2 - (-4) * a * c;
d = b^2 + 4 * a * c;
d = (-b)^2 - 4 * a * c;

(Diagram: the Original and its Mutants, labeled Equivalent Mutant / Redundant Mutant)

Or: Do not trust low mutation scores

124 of 277

High mutation score does not indicate a high quality test suite.

125 of 277

Mutation Analysis: Different Operators

Δ = b² - 4ac

d = b^2 + 4 * a * c;

>>> dis.dis(d)

2 0 LOAD_FAST 0 (b)

3 LOAD_CONST 1 (2)

6 LOAD_CONST 2 (4)

9 LOAD_FAST 1 (a)

12 BINARY_MULTIPLY

13 LOAD_FAST 2 (c)

16 BINARY_MULTIPLY

17 BINARY_SUBTRACT

18 BINARY_XOR

19 RETURN_VALUE x

[2016 Software Quality Journal]

126 of 277

Mutation score is not a consistent measure

127 of 277

Does a high coverage test suite

actually prevent bugs?

128 of 277

We looked at bugfixes on actual programs

An uncovered line is twice as likely to have a bug fix as a line covered by any test case.

[FSE 2016]

             Covered   Uncovered   p
Statement    0.68      1.20        0.00
Block        0.42      0.83        0.00
Method       0.40      0.87        0.00
Class        0.45      0.32        0.10

Difference in bug-fixes between covered and uncovered program elements

129 of 277

Does a high coverage test suite

actually prevent bugs?

Yes it does

130 of 277

Summary

Do not dismiss coverage lightly

Beware of mutation analysis caveats

Coverage is a pretty good heuristic on where the bugs hide.

  • Coverage is highly correlated with mutation score (92%)
  • Coverage provides 75% more information than just test suite size.

  • Mutation score provides little extra information compared to coverage.
  • Mutation score can be unreliable.

131 of 277

Assume non-equivalent, non-redundant, uniform fault distribution for mutants at one’s own peril.

Beware of theoretical spherical cows…

132 of 277

Backup slides

133 of 277

That is,

  • Coverage is highly correlated with mutation score (92%)
  • Mutation score provides little extra information compared to coverage.
  • Coverage provides 75% more information than just test suite size.
  • Mutation score can be unreliable.
  • Coverage thresholds actually help reduce incidence of bugs.

134 of 277

Mutation X Path Coverage

135 of 277

Mutation X Branch Coverage

136 of 277

Computations

require(Coverage)

data(o.db)

o <- subset(subset(o.db, tloc != 0), select=c('pit.mutation.cov', 'cobertura.line.cov', 'loc', 'tloc'))

o$l.tloc <- log2(o$tloc)

oo <- subset(o, l.tloc != -Inf)

ooo <- na.omit(oo)

> cor.test(pit.mutation.cov,tloc)

t = 1.973, df = 232, p-value = 0.04969

95 percent confidence interval: 0.0002148688 0.2525430013

sample estimates: cor 0.1284574

> cor.test(pit.mutation.cov,l.tloc)

data: pit.mutation.cov and l.tloc

t = 9.0938, df = 232, p-value < 2.2e-16

95 percent confidence interval: 0.4114269 0.6013377

sample estimates: cor 0.5126249

> cor.test(resid(lm(pit.mutation.cov~log(tloc))),cobertura.line.cov)

data: resid(lm(pit.mutation.cov ~ log(tloc))) and cobertura.line.cov

t = 17.406, df = 232, p-value < 2.2e-16

95 percent confidence interval: 0.6909857 0.8032663

sample estimates: cor 0.7525441

> summary(lm(pit.mutation.cov~log(tloc)))

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.13644 0.06031 -2.262 0.0246 *

log(tloc) 0.09950 0.01094 9.094 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2839 on 232 degrees of freedom

Multiple R-squared: 0.2628, Adjusted R-squared: 0.2596

F-statistic: 82.7 on 1 and 232 DF, p-value: < 2.2e-16

> summary(lm(pit.mutation.cov~log(tloc)+cobertura.line.cov))

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.074859 0.031645 -2.366 0.018828 *

log(tloc) 0.023658 0.006487 3.647 0.000328 ***

cobertura.line.cov 0.785488 0.031628 24.836 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1485 on 231 degrees of freedom

Multiple R-squared: 0.7991, Adjusted R-squared: 0.7974

F-statistic: 459.5 on 2 and 231 DF, p-value: < 2.2e-16

137 of 277

Does Mutation score correlate to fixed bugs?

138 of 277

Mutant semiotics (how faults map to failures) is not well understood

Affected by factors of the particular project

  • Style of development, coding guidelines etc
  • Complexity of algorithms
  • Coupling between modules

139 of 277

Can weak mutation analysis help?

Rather than the failure of a test case for a mutant, we only require a change in state. It is easier to compute, but:

  • Does not verify assertions
  • So, just another coverage technique
  • Redundant and equivalent mutants remain

140 of 277

Method

250 real world projects from Github, largest > 100 KLOC.

Tests: developer-written and Randoop-generated

             Statement   Branch   Path   Mutation
Emma         X
Cobertura    X           X
CodeCover    X           X
JMockit      X                    X
PIT          X                           X
Major                                    X
Judy                                     X

141 of 277

Mutation analysis has a number of other problems

  • Mutants are not similar in their difficulty to kill
    • So a test suite that is optimized for killing difficult mutants is at a disadvantage
  • Coupling effect has not been validated for complex systems
    • According to Wah, the coupling will decrease as the system gets larger.

142 of 277

The fault distribution may not be uniform

A majority of mutants are very easy to kill, but some are stubborn.

Do two test suites with, say, 50% mutation score have the same strength?

Test suites optimized for harder-to-detect faults are penalized.

143 of 277

Correlation does not imply causation?

It was pointed out in the previous talk that correlation between coverage and mutation score does not imply a causal relationship between the two. We can counter it by:

Logic

A test suite with zero coverage will not kill any mutants.

A test suite can only kill mutants on the lines it covers.

Statistically

Using additive noise models to identify cause and effect. (ongoing research)

144 of 277

ClusterRunner

Making fast test-feedback easy through horizontal scaling.

Joseph Harrington and Taejun Lee, Productivity Engineering

145 of 277

What is ClusterRunner?

146 of 277

147 of 277

148 of 277

(Test types: Unit Tests, Integration Tests, Functional Tests, Manual Tests)

149 of 277

150 of 277

151 of 277

(Feature cycle: Design → Develop → Test → Release)

152 of 277

(Feature cycle: Design → Develop → Test → Release)

153 of 277

PHPUnit test suite duration at Box

154 of 277

155 of 277

“A problem isn’t a problem if you can throw money at it.”

156 of 277

157 of 277

PHPUnit

Scala SBT

nosetests

QUnit

JUnit

158 of 277

Requirements

Easy to configure and use

Test technology agnostic

Fast test feedback

159 of 277

160 of 277

www.ClusterRunner.com

161 of 277

Our 30-hour test suite: 17 minutes

162 of 277

ClusterRunner in Action

  • Bring up a cluster
  • Set up your project
  • Execute a build
  • Look at the results

163 of 277

Bring up a Cluster

# On master.box.com
clusterrunner master --port 43000

# On slave1.box.com, slave2.box.com
clusterrunner slave --master-url master.box.com:43000

164 of 277

Bring up a Cluster

http://master.box.com:43000/v1/slave/

165 of 277

166 of 277

Set up Your Project

  • Create clusterrunner.yaml at the root of your project repo.
    • Commands to run
    • How to distribute

167 of 277

168 of 277

Set up Your Project

> phpunit ./test/php/EarthTest.php
> phpunit ./test/php/WindTest.php
> phpunit ./test/php/FireTest.php
> phpunit ./test/php/WaterTest.php
> phpunit ./test/php/HeartTest.php

169 of 277

Execute a Build

Now we’re ready to build!

clusterrunner build --master-url master.box.com:43000 \
    git --url http://github.com/myproject --job-name PHPUnit

170 of 277

171 of 277

View Build Results

http://master.box.com:43000/v1/build/1/

172 of 277

173 of 277

View Build Results

http://master.box.com:43000/v1/build/1/subjob/

174 of 277

175 of 277

View Build Results

http://master.box.com:43000/v1/build/1/result

176 of 277

177 of 277

178 of 277

179 of 277

180 of 277

What’s next for ClusterRunner

  • AWS integration with autoscaling
  • Docker support
  • Improvements to deployment mechanism
  • In-place upgrades
  • Web UI

181 of 277

clusterrunner.com

Get Involved!

182 of 277

productivity@box.com

Contact Us

183 of 277

Multi-device Testing

E2E test infra for mobile products of today and tomorrow

angli@google.com

adorokhine@google.com

184 of 277

Overview

E2E testing challenges

Introducing Mobly

Sample test

Controlling Android devices

Custom controller

Demo

185 of 277

E2E Testing

(Testing Pyramid: Unit Tests at the base, Integration/Component Tests in the middle, E2E Tests at the top: where magic dwells)

186 of 277

E2E Testing is Important

Applications involving multiple devices

P2P data transfer, nearby discovery

Product under test is not a conventional device.

Internet-Of-Things, VR

Need to control and vary physical environment

RF: Wi-Fi router, attenuators

Lighting, physical position

Interact with other software/cloud services

iPerf server, cloud service backend, network components

187 of 277

E2E Testing is Hard!

Most test frameworks are for single-device app testing

Need to trigger complex actions on devices

Some may need system privilege

Need to synchronize steps between multiple devices

Logic may be centralized (hard to write) or decentralized (hard to trigger)

Need to drive a wide range of equipment

attenuators, call boxes, power meters, wireless APs, etc.

Need to communicate with cloud services

Need to collect debugging artifacts from many sources

188 of 277

Our Solution - Mobly

Lightweight Python framework (Py2/3 compatible)

Test logic runs on a host machine

Controls a collection of devices/equipment in a test bed

Bundled with controller library for essential equipment

Android device, power meter, etc

Flexible and pluggable

Custom controller module for your own toys

Open source and ready to go!

189 of 277

Mobly Architecture

(Diagram: a Test Harness handles test bed allocation, device provisioning, and results aggregation; inside the Test Bed, a computer runs Mobly and the test script, driving mobile devices, a network switch, an attenuator, a call box, and cloud services)

190 of 277

Sample Tests

Hello from the other side


191 of 277

Describe a Test Bed

{
  'testbed': [{
    'name': 'SimpleTestBed',
    'AndroidDevice': '*'
  }],
  'logpath': '/tmp/mobly_logs'
}

192 of 277

Test Script - Hello!

from mobly import base_test
from mobly import test_runner
from mobly.controllers import android_device


class HelloWorldTest(base_test.BaseTestClass):
  def setup_class(self):
    self.ads = self.register_controller(android_device)
    self.dut1 = self.ads[0]

  def test_hello_world(self):
    self.dut1.sl4a.makeToast('Hello!')


if __name__ == '__main__':
  test_runner.main()

Invocation:
$ ./path/to/hello_world_test.py -c path/to/config.json

193 of 277

Beyond the Basics

Config:

{
  'testbed': [{
    ...
  }],
  'logpath': '/tmp/mobly_logs',
  'toast_text': 'Hey there!'
}

Code:

self.user_params['toast_text']  # 'Hey there!'

194 of 277

Beyond the Basics

Device-specific logger:

self.caller.log.info("I did something.")
# <timestamp> [AndroidDevice|<serial>] I did something

Device-specific info:

In test bed config:
'AndroidDevice': [{'serial': 'xyz', 'label': 'caller'},
                  {'serial': 'abc', 'label': 'callee', 'phone_number': '123456'}]

In code:
self.callee = android_device.get_device(self.ads, label='callee')
self.callee.phone_number  # '123456'

195 of 277

Controlling Android Devices

adb/shell

UI

API Calls

Custom Java Logic

196 of 277

Controlling Android Devices

adb

ad.adb.shell('pm clear com.my.package')

UI automator

ad.uia = uiautomator.Device(serial=ad.serial)

ad.uia(text='Hello World!').wait.exists(timeout=1000)

Android API calls, including system/hidden APIs, via SL4A

ad.sl4a.wifiConnect({'SSID': 'GoogleGuest'})

Custom Java logic

ad.register_snippets('trigger', 'com.my.package.snippets')

ad.trigger.myImpeccableLogic(5)

197 of 277

System API Calls

> self.dut.sl4a.makeToast('Hello World!')

SL4A (Scripting Layer for Android) is an RPC service exposing API calls on Android

self.dut.sl4a is the RPC client for SL4A.

Original version works on regular Android builds.

Fork in AOSP can make direct system privileged calls (system/hidden APIs).

198 of 277

Custom Snippets

SL4A is not sufficient

SL4A methods are mapped to Android APIs, but tests need more than just Android API calls.

Current AOSP SL4A requires system privilege

Custom snippets allow users to define custom methods that do anything they want.

Custom snippets can be used with other useful libs like Espresso

199 of 277

Custom Snippets

package com.mypackage.testing.snippets.example;

public class ExampleSnippet implements Snippet {
  public ExampleSnippet(Context context) {}

  @Rpc(description="Returns a string containing the given number.")
  public String getFoo(Integer input) {
    return "foo " + input;
  }

  @Override
  public void shutdown() {}
}

200 of 277

Custom Snippets

Add your snippet classes to AndroidManifest.xml for the androidTest apk

<meta-data
    android:name='mobly-snippets'
    android:value='com.my.app.test.MySnippet1,
                   com.my.app.test.MySnippet2' />

Compile it into an apk

apply plugin: 'com.android.application'

dependencies {
    androidTestCompile 'com.google.android.mobly:snippetlib:0.0.1'
}

201 of 277

Custom Snippets

Install the apk on your device

Load and call it

ad.load_snippets(name='snippets',
                 package='com.mypackage.testing.snippets.example')
foo = ad.snippets.getFoo(2)  # 'foo 2'

202 of 277

Espresso in Custom Snippets

import static android.support.test.espresso.Espresso.onView;
import static android.support.test.espresso.action.ViewActions.swipeUp;
import static android.support.test.espresso.matcher.ViewMatchers.withId;

public class ExampleSnippet implements Snippet {
  public ExampleSnippet(Context context) {}

  @Rpc(description="Performs a swipe using espresso")
  public void performSwipe() {
    onView(withId(R.id.my_view_id)).perform(swipeUp());
  }
}

203 of 277

Custom Controllers

Plug in your own toys

204 of 277

Loose Controller Interface

def create(configs):
    '''Instantiate controller objects.'''

def destroy(objects):
    '''Destroy controller objects.'''

def get_info(objects):
    '''[optional] Get controller info for the test summary.'''
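A hypothetical controller module that satisfies this interface (the car example from the next slide; the MOBLY_CONTROLLER_CONFIG_NAME constant is an assumption about how Mobly maps testbed config entries to modules):

# my/project/testing/controllers/car.py (hypothetical path, matching the import on the next slide)

# Assumed: the key under which this controller's configs appear in the testbed config.
MOBLY_CONTROLLER_CONFIG_NAME = 'Car'


class Car:
    def __init__(self, config):
        self.name = config['name']

    def drive(self):
        print('%s is driving' % self.name)


def create(configs):
    """Instantiate one controller object per config entry."""
    return [Car(c) for c in configs]


def destroy(cars):
    """Release any resources held by the controller objects."""
    pass


def get_info(cars):
    """[optional] Info to record in the test summary."""
    return [{'name': car.name} for car in cars]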

205 of 277

Using Custom Controllers

from my.project.testing.controllers import car


def setup_class(self):
  self.cars = self.register_controller(car)


def test_something(self):
  self.cars[0].drive()

206 of 277

Video Demo

  • A test bed with two phones and one watch.
  • Phone A gives the voice command to the watch.
  • The watch initiates a call to phone B.
  • Phone B gets a ringing call notification.
  • Phone A hangs up.

207 of 277

Video Demo

208 of 277

Coming Soon

iOS controller libs

Dependent on libimobiledevice

KIFTest, XCTest, XCUITest

Async events in snippets

Standard snippet and python utils for basic Android operations

Support non-Nexus Android devices

209 of 277

Thank You!

Questions?

210 of 277

Scale vs Value

Test Automation at the BBC

David Buckhurst & Jitesh Gosai

211 of 277

212 of 277

213 of 277

214 of 277

215 of 277

Lots of innovation

Chair hive

216 of 277

217 of 277

218 of 277

219 of 277

220 of 277

221 of 277

222 of 277

223 of 277

224 of 277

225 of 277

226 of 277

227 of 277

228 of 277

229 of 277

230 of 277

Live

Insights

&

Operational

Notifications

231 of 277

232 of 277

Scale vs Value

233 of 277

www.bbc.co.uk/opensource

@BBCOpenSource

@davidbuckhurst @JitGo

234 of 277

Finding bugs in

C/C++ libraries using

libFuzzer

Kostya Serebryany, GTAC 2016

235 of 277

Agenda

  • What is fuzzing
  • Why fuzz
  • What to fuzz
  • How to fuzz
    • … with libFuzzer
  • Demo (CVE-2016-5179)

236 of 277

What is Fuzzing

  • Somehow generate a test input�
  • Feed it to the code under test�
  • Repeat

237 of 277

Why fuzz

  • Bugs specific to C/C++ that require the sanitizers to catch:
    • Use-after-free, buffer overflows, Uses of uninitialized memory, Memory leaks
  • Arithmetic bugs:
    • Div-by-zero, Int/float overflows, bitwise shifts by invalid amount
  • Plain crashes:
    • NULL dereferences, Uncaught exceptions
  • Concurrency bugs:
    • Data races, Deadlocks
  • Resource usage bugs:
    • Memory exhaustion, hangs or infinite loops, infinite recursion (stack overflows)
  • Logical bugs:
    • Discrepancies between two implementations of the same protocol (example)
    • Assertion failures

238 of 277

What to fuzz

  • Anything that consumes untrusted or complicated inputs:
    • Parsers of any kind (xml, pdf, truetype, ...)
    • Media codecs (audio, video, raster & vector images, etc)
    • Network protocols, RPC libraries (gRPC)
    • Crypto (boringssl, openssl)
    • Compression (zip, gzip, bzip2, brotli, …)
    • Compilers and interpreters (PHP, Perl, Python, Go, Clang, …)
    • Regular expression matchers (PCRE, RE2, libc’s regcomp)
    • Text/UTF processing (icu)
    • Databases (SQLite)
    • Browsers, text editors/processors (Chrome, OpenOffice)
  • OS Kernels (Linux), drivers, supervisors and VMs
  • UI (Chrome UI)

239 of 277

How to fuzz

  • Generation-based fuzzing
    • Usually a target-specific grammar-based generator�
  • Mutation-based fuzzing
    • Acquire a corpus of test inputs
    • Apply random mutations to the inputs�
  • Guided mutation-based fuzzing
    • Execute mutations with coverage instrumentation
    • If new coverage is observed the mutation is permanently added to the corpus
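To make the guided loop concrete, here is a toy Python sketch of guided mutation-based fuzzing (coverage is faked with a set of branch ids; real engines such as libFuzzer use compiler instrumentation and smarter mutations):

import random

def code_under_test(data, hit):
    # Toy target: 'hit' records which branches executed (stand-in for real coverage).
    if len(data) > 0 and data[0] == ord('F'):
        hit.add('F')
        if len(data) > 1 and data[1] == ord('U'):
            hit.add('U')
            if len(data) > 2 and data[2] == ord('Z'):
                hit.add('Z')
                raise RuntimeError('bug reached')

def mutate(data):
    data = bytearray(data or b'\0')
    data[random.randrange(len(data))] = random.randrange(256)
    if random.random() < 0.3:
        data.append(random.randrange(256))
    return bytes(data)

corpus = [b'seed']
seen = set()
for _ in range(100000):
    candidate = mutate(random.choice(corpus))
    hit = set()
    try:
        code_under_test(candidate, hit)
    except RuntimeError:
        print('crash input:', candidate)
        break
    if not hit <= seen:          # new coverage observed: keep the input in the corpus
        seen |= hit
        corpus.append(candidate)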

240 of 277

Fuzz Target - a C/C++ function worth fuzzing

extern "C"
int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t DataSize) {
  if (DataSize >= 4 &&
      Data[0] == 'F' &&
      Data[1] == 'U' &&
      Data[2] == 'Z' &&
      Data[3] == 'Z')
    DoMoreStuff(Data, DataSize);
  return 0;
}

241 of 277

libFuzzer - an engine for guided in-process fuzzing

  • libFuzzer: a library; provides main()
  • Build your target code with extra compiler flags
  • Link your target with libFuzzer
  • Pass a directory with the initial test corpus and run

% clang++ -g my-code.cc libFuzzer.a -o my-fuzzer \
    -fsanitize=address -fsanitize-coverage=trace-pc-guard
% ./my-fuzzer MY_TEST_CORPUS_DIR

242 of 277

CVE-2016-5179 (c-ares, asynchronous DNS requests)

extern "C"
int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t DataSize) {
  unsigned char *buf;
  int buflen;
  std::string s(reinterpret_cast<const char *>(Data), DataSize);
  ares_create_query(s.c_str(), ns_c_in, ns_t_a, 0x1234, 0, &buf, &buflen, 0);
  free(buf);
  return 0;
}

243 of 277

244 of 277

present perfect => present continuous

  • “The project X has been fuzzed, hence it is somewhat secure”�
  • False:
    • Bug discovery techniques evolve
    • The project X evolves
    • Fuzzing is CPU intensive and needs time to find bugs�
  • “The project X is being continuously fuzzed, the code coverage is monitored.”
    • Much better!

245 of 277

OSS-Fuzz - fuzzing as a service for OSS

Based on ClusterFuzz, the fuzzing backend used for fuzzing Chrome components

Supported engines: libFuzzer, AFL, Radamsa, ...

https://github.com/google/oss-fuzz

246 of 277

Q&A

247 of 277

Can MongoDB Recover from Catastrophe?

How I learned to crash a server

{ name : "Jonathan Abrahams",

title : "Senior Quality Engineer",

location : "New York, NY",

twitter : "@MongoDB",

facebook : "MongoDB" }

248 of 277

Machine crash: a machine may crash for a variety of reasons:

  • Termination of virtual machine or host
  • Hardware failure
  • OS failure

Application crash: unexpected termination of mongod

249 of 277

Why do we need to crash a machine?

We could abort mongod, but this would not fully simulate an unexpected crash of a machine or OS (kernel):

Immediate loss of power may prevent cached I/O from being flushed to disk.

A kernel panic can leave an application (and its data) in an unrecoverable state.

250 of 277

(Flow: system restart → system passes h/w & s/w checks → mongod goes into recovery mode → mongod ready for client connection)

251 of 277

How can we crash a machine?

We started by crashing the machine manually, by pulling the cord.

We evolved to using an appliance timer, which would power the machine off/on every 15 minutes.

We also figured out that setting up a cron job to send an internal crash command (more on this later) to the machine for a random period would do the job.

And then we realized, we need to do it a bit more often.

252 of 277

How did we really crash that machine, and can we do it over and over and over and over...?

253 of 277

Why do we need to do it over and over and over?

A crash of a machine may be catastrophic. In order to uncover any subtle recovery bugs, we want to repeatedly crash a machine and test if it has recovered. A failure may only be encountered 1 out of 100 times!

254 of 277

Ubiquiti mPower PRO to the rescue!

Programmable power device, with ssh access from LAN via WiFi or Ethernet.

255 of 277

How do we turn off and on the power?

ssh admin@mpower

outlet="output1"
# Send a power cycle to the specified mFi mPower outlet
echo 0 > /dev/$outlet
sleep 10
echo 1 > /dev/$outlet

256 of 277

Physical vs. Virtual

It is necessary to test both types of machines, as machine crashes differ and the underlying host OS and hardware may provide different I/O caching and data protection. Virtual machines typically rely on shared resources, while physical machines typically use dedicated resources.

257 of 277

How do we crash a virtual machine?

We can crash it from the VM host:

KVM (Kernel-based VM): virsh destroy <vm>

VmWare: vmrun stop <vm> hard

258 of 277

How do we restart a crashed VM?

We can restart it from the VM host:

KVM (Kernel-based VM): virsh start <vm>

VmWare: vmrun start <vm>

259 of 277

How else can we crash a machine?

We can crash it using the magical SysRq key sequence (Linux only):

echo 1 | sudo tee /proc/sys/kernel/sysrq

echo b | sudo tee /proc/sysrq-trigger

260 of 277

How do we get the machine to restart?

Enable the BIOS setting to boot up after AC power is provided.

261 of 277

Restarting a Windows Machine

To disable a Windows machine from prompting you after unexpected shutdown:

bcdedit /set {default} bootstatuspolicy ignoreallfailures

bcdedit /set {current} bootstatuspolicy ignoreallfailures

bcdedit /timeout 5

262 of 277

The machine is running

Now that we figured out how to get our machine to crash and restart, we restart the mongod and it will go into recovery mode.

263 of 277

Recovery mode of mongod

Performed automatically when mongod starts, if an unclean shutdown is detected.

WiredTiger starts from the last stable copy of the data on disk from the last checkpoint. The journal log is then applied and a new checkpoint is taken.

264 of 277

Before the crash!

Stimulate mongod by running several simultaneous (mongo shell) clients which provide a moderate load utilizing nearly all supported operations. This is important, as CRUD operations will cause mongod to perform I/O operations, which should never lead to file or data corruption.

265 of 277

Options, options

Client operations optionally provide:

Checkpoint document

Write & Read concerns

The mongod process is tested in a variety of modes, including:

Standalone or single node replica set

Storage engine, i.e., mmapv1, wiredTiger

266 of 277

What do we do after mongod has restarted?

After the machine has been restarted, we start mongod on a private port and it goes into recovery mode. Once that completes, we perform further client validation, via mongo (shell):

serverStatus

Optionally, run validate against all databases and collections

Optionally, verify if a checkpoint document exists

Failure to recover, connect to mongod, or perform the other validation steps is considered a test failure.

267 of 277

What do we do after mongod has restarted?

Now that the recovery validation has passed, we will proceed with the pre-crash steps:

Stop and restart mongod on a public port

Start new set of (mongo shell) clients to perform various DB operations

268 of 277

Why do we care about validation?

The validate command checks the structures within a namespace for correctness by scanning the collection’s data and indexes. The command returns information regarding the on-disk representation of the collection.

Failing validation indicates that something has been corrupted, most likely due to an incomplete I/O operation during the unexpected shutdown.

269 of 277

Failure analysis

Since our developers could be local (NYC) or worldwide (Boston, Sydney), we want a self-service application they can use to reproduce reported failures. A bash script has been developed which can execute on both local hardware and in the cloud (AWS).

We save any artifacts useful for our developers to be able to analyze the failure:

Backup data files before starting mongod

Backup data files after mongod completes recovery

mongod and mongo (shell) log files

270 of 277

The crash testing helped to:

Extend our testing to scenarios not previously covered

Provide local and remote teams with tools to reproduce and analyze failures

Improve robustness of the mongod storage layer

271 of 277

Results, results

Storage engine bugs were discovered from the power cycle testing and led to fixes/improvements.

We have plans to incorporate this testing into our continuous integration.

272 of 277

Some bugs discovered

SERVER-20295 Power cycle test - mongod fails to start with invalid object size in storage.bson

SERVER-19774 WT_NOTFOUND: item not found during DB recovery

SERVER-19692 Mongod failed to open connection, remained in hung state, when running WT with LSM

SERVER-18838 DB fails to recover creates and drops after system crash

SERVER-18379 DB fails to recover when specifying LSM, after system crash

SERVER-18316 Database with WT engine fails to recover after system crash

SERVER-16702 Mongod fails during journal replay with mmapv1 after power cycle

SERVER-16021 WT failed to start with "lsm-worker: Error in LSM worker thread 2: No such file or directory"

273 of 277

Open issues?

Can we crash Windows using an internal command (cue the laugh track…)?

274 of 277

Closing remarks

275 of 277

Organizing committee

Alan Myrvold

Amar Amte

Andrea Dawson

Ari Shamash

Carly Schaeffer

Dan Giovannelli

David Aristizabal

Diego Cavalcanti

Jaydeep Mehta

Joe Drummey

Josephine Chandra

Kathleen Li

Lena Wakayama

Lesley Katzen

Madison Garcia

Matt Lowrie

Matthew Halupka

Sonal Shah

Travis Ellett

Yvette Nameth

276 of 277

London 2017

277 of 277

GTAC 2017

testing.googleblog.com