第 1 页，共 277 页

Welcome Back!

第 2 页，共 277 页

Developer Experience, FTW!

Niranjan Tulpule

第 3 页，共 277 页

Software development is being democratized

第 4 页，共 277 页

Core computing platforms are more accessible than ever

PCs

Smartphones & Tablets

2014

1.5M

1983

0

第 5 页，共 277 页

Free developer tools

Open Source building blocks

Free developer education

We’re lowering barrier to becoming a developer

第 6 页，共 277 页

2015

2016

2017

2018

2019

2020

2014

2013

2012

2011

Total Number of Active Apps in the App Store

2010

It’s never been easier to write apps

5M

3M

1M

第 7 页，共 277 页

Valuation of these 9 companies as a country's GDP would be in the top 50

Uber: $66B
Snapchat: $40B
Whatsapp: $16B
Airbnb: $25B
Flipkart: $15B
Pinterest: $11B
Lyft: $5.5B
Ola Cabs: $5B
Gojek: $1.3B

Valuation of these companies as a country's GDP would be in the top 50

第 8 页，共 277 页

51%�Stability Issues

41%�Functionality Related

7%�Speed

1%�Other

Classification of 1-star reviews (Sampling of Play Store reviews, May 2016)

Writing high quality apps is still hard

第 9 页，共 277 页

2.5K+

100+

~700

100M

Crittercism Mobile Experience Benchmark

Compounded complexity

Device

Manufacturer

model

OS versions

Carriers

Permutations

第 10 页，共 277 页

Improving software quality & testability by investing in Developer Experience.

第 11 页，共 277 页

Develop

Release

Monitor

Firebase Test Lab

for Android

第 12 页，共 277 页

Test on your users’ devices

第 13 页，共 277 页

Use with your existing workflow

Ahmed to place product shot here

>

_

Android Studio

Command line

Jenkins

Jenkins logo by Charles Lowell and Frontside CC BY-SA 3.0 https://wiki.jenkins-ci.org/display/JENKINS/Logo

第 14 页，共 277 页

Robo crawls your app automatically

第 15 页，共 277 页

Create Espresso tests by just using your app

Ahmed to place product shot here

第 16 页，共 277 页

Millions of Tests, and counting!

After extensive evaluation of the market, we've found that Firebase Test Lab is the best product for writing and running Espresso tests directly from Android Studio, saving us tons of time and effort around automated testing.

- Timothy West, Jet

第 17 页，共 277 页

Actionable Results at your fingertips

Get actionable results at your fingertips

Develop

Release

Monitor

Firebase Test Lab

for Android

Play Pre-Launch Report

第 18 页，共 277 页

Pre-launch report

Pre-launch reports summarize issues found when testing your app on a wide range of devices

第 19 页，共 277 页

第 20 页，共 277 页

第 21 页，共 277 页

第 22 页，共 277 页

Apps using the Play Pre-Launch Report show ~20% fewer crashes!

~60% of the crashes seen on Pre-Launch Report are fixed before public rollout.

第 23 页，共 277 页

Actionable Results at your fingertips

Get actionable results at your fingertips

Develop

Release

Monitor

Firebase Test Lab

for Android

Play Pre-Launch Report

Firebase Crash Reporting

第 24 页，共 277 页

Firebase Crash Reporting

Get actionable insights and comprehensive analytics whenever your users experience crashes and other errors

第 25 页，共 277 页

Integrate Gradle/Pod

0-1 init lines of code

Start capturing errors!

第 26 页，共 277 页

fatal error A

6K

7K

non-fatal error A

5K

6K

fatal error B

4K

4.8K

fatal error C

3K

Clustering

第 27 页，共 277 页

第 28 页，共 277 页

Get the big picture with comprehensive metrics on app versions, OS levels and device models

第 29 页，共 277 页

Find the exact line where the error happens

第 30 页，共 277 页

Minimize the time and effort to

resolve issues with data about your users’ devices

第 31 页，共 277 页

Log custom events before an error happens

//On Android

FirebaseCrash.log("Activity created.");

//On iOS

FIRCrashLog(@"Button clicked.");

第 32 页，共 277 页

Provide more context with events leading up to an error

第 33 页，共 277 页

Understand the Impact of Crashes on the Bottom Line

Confidential + Proprietary

第 34 页，共 277 页

Fix the bug, then win them back with

a timely push notification

Confidential + Proprietary

第 35 页，共 277 页

Looking ahead

Machine learning

Compilers

Toolchains

第 36 页，共 277 页

The shift to mobile caught us by surprise...

PCs

Smartphones & Tablets

2014

1.5M

1983

0

第 37 页，共 277 页

Thank You

第 38 页，共 277 页

Docker Based Geo Dispersed Test Farm �- Test Infrastructure Practice in Intel Android Program

Chen Guobing, Yu Jerry

38

第 39 页，共 277 页

Agenda

Test Infrastructure Challenges
Test as a Service
Docker Based Test Farm
Test Distribution
Technical Challenges
Questions

39

第 40 页，共 277 页

Taxonomies

40

第 41 页，共 277 页

Test Infrastructure Challenges

Maximize the use of Development Vehicles (Engineering samples)
Maximize the use of automated test
Minimize the maintenance cost of the Test Infra, test benches and test assets

41

第 42 页，共 277 页

Test as a Service – What We Need

Anyone

Any automated Test

Any Device

Anywhere

Anytime

42

第 43 页，共 277 页

Target Users - Usages

Test on demand and automated release testing
Failed test cases Re-run or failure reproduce

Automated pre-commit and post-commit testing

Test on demand, developer’s own build
Work with other dev tool, e.g. dichotomy check

Continuous Integration

Testing

QA

Release Testing

Developer

Testing

43

第 44 页，共 277 页

Docker Based Geo Dispersed Test Farm

44

第 45 页，共 277 页

Test Distribution

Test Catalog

Capability:

Platform:

Location:

Campaign A

capability: pmeter

“Run campaign A on XYZ platform in SH”

Test Distributor

Test Campaign ← Capability → Test Bench

45

第 46 页，共 277 页

Technical Challenges – Anywhere, Any Device

DUT and Test Equipment controls

$ docker run … --device=/dev/bus/usb/001/004

--device=/dev/ttySerial0 …

DUT state transition management

46

第 47 页，共 277 页

Technical Challenges – Anyone, Any Automated Test

Hierarchal code maintain

Easily customized

All-in-one in delivery

Create once, run anywhere

Release and deliver test suites in the way of docker image.

47

第 48 页，共 277 页

Questions?��Contacts: �jerry.yu@intel.com�guobing.chen@intel.com�

48

第 49 页，共 277 页

OpenHTF

an open-source hardware testing framework

https://github.com/google/openhtf

第 50 页，共 277 页

Motivation for OpenHTF

Drastically reduce the amount of boilerplate code needed to:

exercise a piece of hardware

take measurements along the way

generate a record of the whole process

Make operator interactions simple but flexible.

Allow test engineers to focus on authoring actual test logic.

“Simplicity is requisite for reliability.” ~Edsger W. Dijkstra

第 51 页，共 277 页

Google:

A Software Company

...at least, it used to be!

第 52 页，共 277 页

Google:

Now With More Hardware!

madsci

With the addition of teams such as Android and ChromeOS, new types of hardware testing became necessary.

Due to the dearth of existing internal tools (and because This Is Google™), each team developed their own internal tools.

Different teams have different requirements, some are running tests from Google’s internal source repo, some need to ship tests to Vendors/CMs.

Even with those teams, different projects are at different development stages, and have different preferred tradeoffs between test infrastructure and development speed.

The origins of OpenHTF are in manufacturing (from the Glass team), a space not well covered by these other tools.

Tests written using the other test frameworks often look similar to software unit tests with assert statements frequently used to control pass/fail outcomes.

In some cases they even use certain unit test frameworks’ base classes under the hood.

This makes test authoring familiar to software engineers, but much less so to test engineers, especially vendors/CMs.

It also makes these tests difficult to run on a manufacturing floor.

A notable exception, ChromeOS Factory, is designed for manufacturing, but uses a test-on-device paradigm, which wasn’t suitable for Glass.

fahhem

main difference between hw and sw testing is fundamentally “what do you trust?”

jethier

OpenHTF came from a team that included both hardware engineers and software engineers working closely together, and was written (and rewritten!) with NON-software-engineers in mind as the primary consumers. Specifically, we wanted to make something that catered to:

Hardware test engineers

Electrical Engineers

Contract manufacturing partners

Vendors

Over the course of several iterations, it grew to be lightweight, focused, and platform agnostic.

The intention is for test authors to write hardware test logic only, and to ignore other concerns like deployment and scheduling.

When it comes to deployment and scheduling, everyone has different needs, and different systems they need to integrate with, so a one-size-fits-all solution to these things would be riddled with compromises. Instead, we leave that to each team to independently support with their own Infrastructure engineering team.

Platform agnostic in the sense that it works on all 3 major OS’s (Windows, Mac, and Linux), and doesn’t care what hardware you’re testing, whether it’s a phone, a watch, a camera, a sensor, or another computer.

(extra notes)

Other test frameworks

ACTS (Android Connectivity Test Suite), which is part of the Android Open Source Project.

ChromeOS Factory, which is part of The Chromium Projects.

Autotest, originally out of Platforms for Linux Kernel testing.

Avocado

第 53 页，共 277 页

Our Solution

A python library that provides a set of convenient abstractions for authoring hardware testing code.

第 54 页，共 277 页

Use Cases

Manufacturing Floor

Automated Lab

Benchtop

第 55 页，共 277 页

Core Abstractions

Test

Plug

Test Equipment &

Device Under Test

Output Callback:

JSON to disk,

upload via network, etc.

Phase

Output

Record

Measurement

jethier

OpenHTF tests are broken down into phases, kind of similar to “test cases”.

Phases are the atomic unit of testing, and can be as simple as regular Python functions.

Measurements are, as they sound, the individual pieces of data collected during the test.

Sometimes you want to store an artifact like a JPEG or a .WAV that was generated during the test. Attachments are the abstraction we provide for doing that.

Plugs are interfaces to hardware (more info on that later).

And output callbacks let you do something useful with the test results once a test run is complete.

madsci

An important thing to reiterate when describing these concepts is that the intention is to lower the barrier to entry of test development, following a “Keep it simple” mantra.

fahhem

Python structure of these abstractions

第 56 页，共 277 页

Tests & Phases

free-for-all!!!!

This is a pretty distilled example of an OpenHTF test script.

The overall flow is to:

Import the framework, define some phases (functions).

Instantiate a test, passing in those phases.

Add some output callbacks to do something useful with the test record at the end of the test.

And run the test’s ‘execute’ method.

PASS/FAIL is determined by the values assigned to the various measurements.

You declare a measurement by decorating your phase with the @measures decorator.

You can specify validators (numerical limits being the most common) in the declaration.

Then, inside the phase, you set the measurement as you would a normal variable in Python.

And OpenHTF takes care of running the validator you’ve declared on the value you set, and determining whether the measurement is a PASS or a FAIL.

When the phase is done, all the measurement PASS/FAIL outcomes are rolled up into an overall PASS/FAIL for the phase, which in turn rolls up to an overall PASS/FAIL for the test itself.

Once your test finishes, you have a structured record of test execution.

Different teams and users want to do different things with their test records.

Someone who’s running test on the bench in the lab might just want to dump the record to a human-readable JSON file on the local filesystem.

Whereas someone who is setting up a test station at the tail end of a manufacturing line in the factory will want to upload the test records to some central data server where the data can be aggregated, queried, etc.

To stay with our goal of versatility, OpenHTF lets you set an arbitrary series of output callbacks for your test.

Output callbacks are just Python functions (or callables) that take in a test record and do what you want to do with it.

At the end of the test run, OpenHTF calls all the output callbacks in turn, passing each one the completed test record from the test run.

We provide some useful ones (like OutputToJSON), but you can use those as a template and write your own as well.

This is a great way to integrate with, say, an MES system on the manufacturing floor.

第 57 页，共 277 页

Plugs

jethier

For interfacing with hardware, OpenHTF uses an abstraction called “plugs”.

They’re called “plugs” because “plug-ins” was too indicative of pure software extensions, and the term “plugs” in contrast called physical hardware to mind.

This is one of the areas where we’re hoping to get some help from you guys, the wider testing community!

Our desire is to have a large collection of plugs available for interfacing with tons of different off-the-shelf test equipment, whether it’s power supplies, scopes, analyzers, sensors, etc.

But right now our collection of plugs is pretty thin, mostly because we’ve been focused on the framework itself, but also because a lot of our internal use cases just use their own proprietary plug code.

But the short explanation of plugs is that you subclass our BasePlug class, and add the logic you need to control your specific DUT or test equipment.

第 58 页，共 277 页

Web GUI

第 59 页，共 277 页

Q&A

第 60 页，共 277 页

Detecting loop inefficiencies automatically

(to appear in FSE 2016)

Monika Dhok (IISc Bangalore, India)*

Murali Krishna Ramanathan (IISc Bangalore, India)

第 61 页，共 277 页

Software efficiency is very important

Performance issues are hard to detect during testing �

These issues are found even in well tested commercial softwares�

Degrade application responsiveness and user experience

第 62 页，共 277 页

Performance bugs are critical

Implementation mistakes that cause inefficiency

Difficult to catch them during compiler optimizations

Fixing them can result in large speedups, thereby improving efficiency

第 63 页，共 277 页

Redundant traversal bugs

When program iterates over a data structure repeatedly without any intermediate modifications

Public class A{

1. Public boolean containsAny(Collection c1, Collection c2){

2. Iterator itr = c1.iterator();

3. while(itr.hasNext())

4. if(c2.contains(itr.next()))

5. Return true;

6. Return false;

}

Complexity : O(size(c1) x size(c2))

第 64 页，共 277 页

Performance tests are written by developers

第 65 页，共 277 页

Detecting redundant traversals

Toddler [ICSE 13]

第 66 页，共 277 页

Static analysis techniques alone are not effective

Challenges :

How to confirm the validity of the bug?�

How to expose the root cause?

Execution trace can be helpful�

How to detect that the performance bug is fixed?

第 67 页，共 277 页

Automated tests not effective for performance bugs

Toddler[ICSE 13]

第 68 页，共 277 页

Challenges involved in writing performance tests

Virtual call resolution � Generating tests for all possible resolutions of method � invocation is not scalable

Generating appropriate context� Realization of the defect can be dependent on certain � conditions that affect the reachability of the inefficient loop

Arrangement of elements � Problem can only occur when data structure has large � elements arranged in particular fashion

第 69 页，共 277 页

Glider

We propose a novel and scalable approach to automatically generate tests for exposing loop inefficiencies

第 70 页，共 277 页

Glider is available online

https://drona.csa.iisc.ernet.in/~sss/tools/glider

第 71 页，共 277 页

Performance bug caught by glider

第 72 页，共 277 页

Results

We have implemented our approach on SOOT bytecode framework�and evaluated it on number of libraries

Our approach detected 46 bugs across 7 java libraries including 34 �previously unknown bugs.

Tests generated using our approach significantly outperform the �randomly generated tests.

第 73 页，共 277 页

Questions?

第 74 页，共 277 页

NEED FOR SPEED

accelerate tests from 3 hours to 3 minutes

emo@komfo.com

第 75 页，共 277 页

3

hours

3

minutes

600 API tests

第 76 页，共 277 页

Before

After

The

3 Minute

Goal

第 77 页，共 277 页

It’s not about the numbers or techniques you’ll see.

It’s all about continuous improvement.

第 78 页，共 277 页

Dedicated

Environment

第 79 页，共 277 页

Execution Time in Minutes

180

123

New Environment

第 80 页，共 277 页

Empty Databases

第 81 页，共 277 页

The time needed to create data for one test:

And then the test starts

Call 12 API endpoints

Modify data in 11 tables

Takes about 1.2 seconds

第 82 页，共 277 页

180

123

Execution Time in Minutes

89

Empty Databases

第 83 页，共 277 页

Simulate

Dependencies

第 84 页，共 277 页

+Some More

STUB

Stub all external dependencies

Core API

第 85 页，共 277 页

Transparent

Fake SSL certs

Dynamic Responses

Local Storage

Return Binary Data

Regex URL match

Existing Tools (March 2016)

Stubby4J

WireMock

Wilma

soapUI

MockServer

mounteback

Hoverfly

Mirage

We created project Nagual,

open source soon.

第 86 页，共 277 页

180

123

89

Execution Time in Minutes

65

Stub Dependencies

第 87 页，共 277 页

Move to Containers

第 88 页，共 277 页

180

123

89

65

Execution Time in Minutes

104

Using Containers

第 89 页，共 277 页

Run Databases

in Memory

第 90 页，共 277 页

180

123

89

65

104

Execution Time in Minutes

61

Run Databases in Memory

第 91 页，共 277 页

Don’t Clean

Test Data

第 92 页，共 277 页

180

123

89

65

104

61

Execution Time in Minutes

46

Don’t delete test data

第 93 页，共 277 页

Run in Parallel

第 94 页，共 277 页

4 6 8 10 12 14 16

Time to execute 12 9 7 5 8 12 17

The Sweet Spot

第 95 页，共 277 页

180

123

89

65

104

61

46

Execution Time in Minutes

5

Run in Parallel

第 96 页，共 277 页

Equalize Workload

第 97 页，共 277 页

第 98 页，共 277 页

第 99 页，共 277 页

180

123

89

65

104

61

46

5

Execution Time in Minutes

3

Equal Batches

Run in Parallel

Don’t delete test data

Run Databases in Memory

Using Containers

Stub Dependencies

Empty Databases

New Environment

第 100 页，共 277 页

After Hardware Upgrade

The Outcome

2:15 min.

1:38 min.

第 101 页，共 277 页

The tests are slow

The tests are unreliable

The tests can’t exactly pinpoint the problem

High Level Tests Problems

3 Minutes

No external dependencies

It’s cheap to run all tests after every change

第 102 页，共 277 页

In a couple of years, running all your automated tests, after every code change, for less than 3 minutes, will be standard development practice.

第 103 页，共 277 页

第 104 页，共 277 页

EmanuilSlavov.com

@EmanuilSlavov

第 105 页，共 277 页

Slide #, Photo Credits

1. https://www.flickr.com/photos/thomashawk

5. https://www.flickr.com/photos/100497095@N02

7. https://www.flickr.com/photos/andrewmalone

10. https://www.flickr.com/photos/astrablog

14. https://www.flickr.com/photos/foilman

16. https://www.flickr.com/photos/missusdoubleyou

18. https://www.flickr.com/photos/canonsnapper

20. https://www.flickr.com/photos/anotherangle

23. https://www.flickr.com/photos/-aismist

第 106 页，共 277 页

Code Coverage is a Strong Predictor of

Test Suite Effectiveness

in the Real World

Rahul Gopinath

Iftekhar Ahmed

第 107 页，共 277 页

When should we stop testing?

第 108 页，共 277 页

How to evaluate test suite effectiveness?

第 109 页，共 277 页

Previous research: Do not trust coverage

(In theory)

GTAC’15 Inozemtseva

第 110 页，共 277 页

Factors affecting test suite quality

Test suite quality

Coverage

Assertions

第 111 页，共 277 页

According to previous research

Test suite quality

Coverage

Assertions

Test suite size

GTAC’15 Inozemtseva

第 112 页，共 277 页

But...

What is the adequate test suite size?

Is there a maximum number of test cases for a given program?
Are different test cases equivalent in strength?
How do we account for duplicate tests?
Test suite sizes are not comparable even for the same program.

第 113 页，共 277 页

Can I use coverage to measure

suite effectiveness?

第 114 页，共 277 页

Statement coverage best predicts mutation score

A fault in a statement has 87% probability of being detected

if an organic test covers it.

M = 0.87xS

Size of dots follow size of projects

R² = 0.94

Results from 250 real world programs

largest > 100 KLOC

On Developer written test suites

第 115 页，共 277 页

Statement coverage best predicts mutation score

A fault in a statement has 61% probability of being detected

if a generated test covers it.

M = 0.61xS

Size of dots follow size of projects

R² = 0.70

Results from 250 real world programs

largest > 100 KLOC

On Randoop generated test suites

第 116 页，共 277 页

But

Controlling for test suite size, coverage provides little extra information.

Hence don't use coverage [GTAC’15 inozemtseva]

Why use mutation?

Mutation score provides little extra information (<6%) compared to coverage.

第 117 页，共 277 页

Does coverage have no extra value?

	GTAC’15 Inozemtseva	Our Research
# Programs	5	250
Selection of programs	Ad hoc	Systematic sample from Github
Tool used	CodeCover, PIT	Emma, Cobertura, CodeCover, PIT
Test suites	Random subsets of original	Organic & Randomly generated

(New results)
Removal of influence of size	Ad hoc	Statistical

Our study is much larger, systematic (not ad hoc), and follows the real world usage

	Our Research (New results)
M~TestsuiteSize	12.84%
M~log(TSize)	51.26%
residuals(M~log(TSize))~S	75.25%

Statement coverage can explain 75% variability in mutation score after eliminating influence of test suite size.

第 118 页，共 277 页

Is mutation analysis better than coverage analysis?

第 119 页，共 277 页

Mutation analysis: High cost of analysis

Δ=b² – 4ac

d = b^2 + 4 * a * c;�d = b^2 * 4 * a * c;�d = b^2 / 4 * a * c;�d = b^2 ^ 4 * a * c;�d = b^2 % 4 * a * c;

d = b^2 << 4 * a * c;

d = b^2 >> 4 * a * c;

d = b^2 * 4 + a * c;�d = b^2 * 4 - a * c;�d = b^2 * 4 / a * c;�d = b^2 * 4 ^ a * c;�d = b^2 * 4 % a * c;

d = b^2 * 4 << a * c;

d = b^2 * 4 >> a * c;

d = b^2 * 4 * a + c;�d = b^2 * 4 * a - c;�d = b^2 * 4 * a / c;�d = b^2 * 4 * a ^ c;�d = b^2 * 4 * a % c;

d = b^2 * 4 * a << c;

d = b^2 * 4 * a >> c;

d = b + 2 - 4 * a * c;�d = b - 2 - 4 * a * c;�d = b * 2 - 4 * a * c;�d = b / 2 - 4 * a * c;�d = b % 2 - 4 * a * c;

d = b^0 - 4 * a * c;�d = b^1 - 4 * a * c;

d = b^-1 - 4 * a * c;

d = b^MAX - 4 * a * c;

d = b^MIN - 4 * a * c;

d = b^2 - 0 * a * c;�d = b^2 - 1 * a * c;�d = b^2 – (-1) * a * c;�d = b^2 - MAX * a * c;�d = b^2 - MIN * a * c;�

第 120 页，共 277 页

Mutation score is very costly

第 121 页，共 277 页

Mutation analysis: Equivalent mutants

Δ=b² – 2²ac

d = b^2 - (2^2) * a * c;�d = b^2 - (2*2) * a * c;�d = b^2 - (2+2) * a * c;

Mutants

Original

Equivalent Mutant

Normal Mutant

Or: Do not trust low mutation scores

第 122 页，共 277 页

Low mutation score does not indicate a low quality test suite.

第 123 页，共 277 页

Mutation analysis: Equivalent mutants

Δ=b² – 2²ac

d = b^2 - (-4) * a * c;�d = b^2 + 4 * a * c;�d = (-b)^2 - 4 * a * c;

Mutants

Original

Equivalent Mutant

Redundant Mutant

Or: Do not trust low mutation scores

第 124 页，共 277 页

High mutation score does not indicate a high quality test suite.

第 125 页，共 277 页

Mutation Analysis: Different Operators

Δ=b² – 4ac

d = b^2 + 4 * a * c;

>>> dis.dis(d)

2 0 LOAD_FAST 0 (b)

3 LOAD_CONST 1 (2)

6 LOAD_CONST 2 (4)

9 LOAD_FAST 1 (a)

12 BINARY_MULTIPLY

13 LOAD_FAST 2 (c)

16 BINARY_MULTIPLY

17 BINARY_SUBTRACT

18 BINARY_XOR

19 RETURN_VALUE x

[2016 Software Quality Journal]

第 126 页，共 277 页

Mutation score is not a consistent measure

第 127 页，共 277 页

Does a high coverage test suite

actually prevent bugs?

第 128 页，共 277 页

We looked at bugfixes on actual programs

An uncovered line is twice as likely to have a bug fix

as that of a line covered by any test case.

[FSE 2016]	Covered	Uncovered	p
Statement	0.68	1.20	0.00
Block	0.42	0.83	0.00
Method	0.40	0.87	0.00
Class	0.45	0.32	0.10

Difference in bug-fixes between covered and

Uncovered program elements

第 129 页，共 277 页

Does a high coverage test suite

actually prevent bugs?

Yes it does

第 130 页，共 277 页

Summary

Do not dismiss coverage lightly

Beware of mutation analysis caveats

Coverage is a pretty good heuristic on where the bugs hide.

Coverage is highly correlated with mutation score (92%)
Coverage provides 75% more information than just test suite size.

Mutation score provides little extra information compared to coverage.
Mutation score can be unreliable.

第 131 页，共 277 页

Assume non-equivalent, non-redundant, uniform fault distribution for mutants

at one’s own peril.

Beware of theoretical spherical cows…

第 132 页，共 277 页

Backup slides

第 133 页，共 277 页

That is,

Coverage is highly correlated with mutation score (92%)
Mutation score provides little extra information compared to coverage.
Coverage provides 75% more information than just test suite size.
Mutation score can be unreliable.
Coverage thresholds actually help reduce incidence of bugs.

第 134 页，共 277 页

Mutation X Path Coverage

第 135 页，共 277 页

Mutation X Branch Coverage

第 136 页，共 277 页

Computations

require(Coverage)

data(o.db)

o <- subset(subset(o.db, tloc != 0), select=c('pit.mutation.cov', 'cobertura.line.cov', 'loc', 'tloc'))

o$l.tloc <- log2(o$tloc)

oo <- subset(o, l.tloc != -Inf)

ooo <- na.omit(oo)

> cor.test(pit.mutation.cov,tloc)

t = 1.973, df = 232, p-value = 0.04969

95 percent confidence interval: 0.0002148688 0.2525430013

sample estimates: cor 0.1284574

> cor.test(pit.mutation.cov,l.tloc)

data: pit.mutation.cov and l.tloc

t = 9.0938, df = 232, p-value < 2.2e-16

95 percent confidence interval: 0.4114269 0.6013377

sample estimates: cor 0.5126249

> cor.test(resid(lm(pit.mutation.cov~log(tloc))),cobertura.line.cov)

data: resid(lm(pit.mutation.cov ~ log(tloc))) and cobertura.line.cov

t = 17.406, df = 232, p-value < 2.2e-16

95 percent confidence interval: 0.6909857 0.8032663

sample estimates: cor 0.7525441

> summary(lm(pit.mutation.cov~log(tloc)))

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.13644 0.06031 -2.262 0.0246 *

log(tloc) 0.09950 0.01094 9.094 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2839 on 232 degrees of freedom

Multiple R-squared: 0.2628, Adjusted R-squared: 0.2596

F-statistic: 82.7 on 1 and 232 DF, p-value: < 2.2e-16

> summary(lm(pit.mutation.cov~log(tloc)+cobertura.line.cov))

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.074859 0.031645 -2.366 0.018828 *

log(tloc) 0.023658 0.006487 3.647 0.000328 ***

cobertura.line.cov 0.785488 0.031628 24.836 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1485 on 231 degrees of freedom

Multiple R-squared: 0.7991, Adjusted R-squared: 0.7974

F-statistic: 459.5 on 2 and 231 DF, p-value: < 2.2e-16

第 137 页，共 277 页

Does Mutation score correlate to fixed bugs?

第 138 页，共 277 页

Mutant semiotics (how faults map to failures) is not well understood

Affected by factors of the particular project

Style of development, coding guidelines etc
Complexity of algorithms
Coupling between modules

第 139 页，共 277 页

Can weak mutation analysis help?

Rather than the failure of a test case for a mutant, we only require a change in state. It is easier to compute, but:

Does not verify assertions
So, Just another coverage technique
Redundant and Equivalent mutants remain

第 140 页，共 277 页

Method

250 real world projects from Github, largest > 100 KLOC.

Tests

Developer written

Randoop generated

	Statement	Branch	Path	Mutation
Emma	X
Cobertura	X	X
Codecover	X	X
JMockit	X		X
PIT	X			X
Major				X
Judy				X

第 141 页，共 277 页

Mutation analysis has a number of other problems

Mutants are not similar in their difficulty to kill

So a test suite that is optimized for killing difficult mutants is at a disadvantage

Coupling effect has not been validated for complex systems

According to Wah, the coupling will decrease as the system gets larger.

第 142 页，共 277 页

The fault distribution may not be uniform

A majority of mutants are very easy to kill, but some are stubborn.

Does two test suites with say 50% mutation score have the same strength?

Testsuites optimized for harder to detect faults are penalized.

第 143 页，共 277 页

Correlation does not imply causation?

It was pointed out in the previous talk that correlation between coverage and mutation score does not imply a causal relationship between the two. We can counter it by:

Logic

A test suite with zero coverage will not kill any mutants.

A test suite can only kill mutants on the lines it covers.

Statistically

Using additive noise models to identify cause and effect. (ongoing research)

Given two random variables X and Y , X is assumed to cause Y if

(i) Y can be obtained as a function of X plus a noise term independent of X, but

(ii) X cannot be obtained as a function of Y plus independent noise,

then we infer that X causes Y .

In this case, where (i) and (ii) hold simultaneously, the CAM is termed identifiable

(Consistency of Causal Inference under the Additive Noise Model)

(Shimizu et al., 2006; Hoyer et al., 2009; Tillman et al., 2009; Peters et al., 2011a;b)

Fit Y as a function f(X),

obtain the residuals η(Y,f) = Y − f(X),

fit X as a function g(Y ),

obtain the residuals η(X,g) = X − g(Y ),

decide X → Y if

η(Y,f) ⊥ X

but

η(X,g) !⊥ Y ,

decide Y → X if the reverse holds true,

abstain otherwise.

> mu_line <- lm(ooo$pit.mutation.cov~ooo$cobertura.line.cov)$residuals

> line_mu <- lm(ooo$cobertura.line.cov~ooo$pit.mutation.cov)$residuals

> boxplot(mu_line,ooo$cobertura.line.cov)

> boxplot(line_mu,ooo$pit.mutation.cov)

> t.test(mu_line,ooo$cobertura.line.cov)

Welch Two Sample t-test

data: mu_line and ooo$cobertura.line.cov

t = -17.209, df = 318.55, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.4768821 -0.3790312

sample estimates:

mean of x mean of y

-1.700321e-18 4.279566e-01

> t.test(line_mu,ooo$pit.mutation.cov)

Welch Two Sample t-test

data: line_mu and ooo$pit.mutation.cov

t = -16.062, df = 337.66, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.4325586 -0.3381731

sample estimates:

mean of x mean of y

-2.321884e-18 3.853659e-01

> require(CAM)

> x <- subset(ooo, select=c(pit.mutation.cov, cobertura.line.cov))

> CAM(x)$Adj

第 144 页，共 277 页

ClusterRunner

Making fast test-feedback easy through horizontal scaling.

Joseph Harrington and Taejun Lee Productivity Engineering

第 145 页，共 277 页

What is ClusterRunner?

第 146 页，共 277 页

第 147 页，共 277 页

第 148 页，共 277 页

Functional

Tests

Integration Tests

Unit Tests

Manual

Tests

第 149 页，共 277 页

第 150 页，共 277 页

第 151 页，共 277 页

Develop

Test

Feature

Design

Release

第 152 页，共 277 页

Develop

Test

Feature

Design

Release

第 153 页，共 277 页

PHPUnit testsuite duration at Box

第 154 页，共 277 页

第 155 页，共 277 页

“A problem isn’t a problem if you can throw money at it.”

第 156 页，共 277 页

第 157 页，共 277 页

PHPUnit

Scala SBT

nosetests

QUnit

JUnit

第 158 页，共 277 页

Requirements

Easy to configure and use

Test technology agnostic

Fast test feedback

第 159 页，共 277 页

第 160 页，共 277 页

www.ClusterRunner.com

第 161 页，共 277 页

Our 30-hour testsuite

17

minutes

We have been using ClusterRunner at Box since late 2014, and is a service that is as irreplaceable as Jenkins here at Box.

The 30 hour testsuite I mentioned earlier—takes on average 17 minutes to run today using ClusterRunner.

For comparison, 30 hours is more time than I get to spend writing code in a week. 17 minutes is how long I spend writing one email.

30 hours, 17 minutes--that magnitude of build time difference totally changes how developers work.

And I’d like to note that the reason why it takes 17 minutes to run, is because the longest single test file in this testsuite takes 17 minutes to run. Assuming you have enough hardware, ClusterRunner is able to efficiently allocate load such that your builds take only as long as the longest single test in your testsuite.

Ok, that’s enough context. I’m going to hand it off to Joey to actually show you ClusterRunner in action.

第 162 页，共 277 页

ClusterRunner in Action

Bring up a cluster
Set up your project
Execute a build
Look at the results

第 163 页，共 277 页

Bring up a Cluster

�# On master.box.com�clusterrunner master � --port 43000

# On slave1.box.com, slave2.box.com�clusterrunner slave � --master-url master.box.com:43000

第 164 页，共 277 页

Bring up a Cluster

http://master.box.com:43000/v1/slave/

第 165 页，共 277 页

第 166 页，共 277 页

Set up Your Project

Create clusterrunner.yaml at the root of your project repo.

Commands to run
How to distribute

第 167 页，共 277 页

第 168 页，共 277 页

Set up Your Project

�> phpunit ./test/php/EarthTest.php

> phpunit ./test/php/WindTest.php

> phpunit ./test/php/FireTest.php

> phpunit ./test/php/WaterTest.php

> phpunit ./test/php/HeartTest.php

第 169 页，共 277 页

Execute a Build

Now we’re ready to build!

�clusterrunner build� --master-url master.box.com:43000

git � --url http://github.com/myproject� --job-name PHPUnit

第 170 页，共 277 页

第 171 页，共 277 页

View Build Results

http://master.box.com:43000/v1/build/1/

第 172 页，共 277 页

第 173 页，共 277 页

View Build Results

http://master.box.com:43000/v1/build/1/subjob/

第 174 页，共 277 页

第 175 页，共 277 页

View Build Results

http://master.box.com:43000/v1/build/1/result

第 176 页，共 277 页

第 177 页，共 277 页

第 178 页，共 277 页

第 179 页，共 277 页

第 180 页，共 277 页

What’s next for ClusterRunner

AWS integration with autoscaling
Docker support
Improvements to deployment mechanism
In-place upgrades
Web UI

第 181 页，共 277 页

clusterrunner.com

Get Involved!

第 182 页，共 277 页

productivity@box.com

Contact Us

第 183 页，共 277 页

Multi-device Testing

E2E test infra for mobile products of today and tomorrow

angli@google.com

adorokhine@google.com

第 184 页，共 277 页

Overview

E2E testing challenges

Introducing Mobly

Sample test

Controlling Android devices

Custom controller

Demo

第 185 页，共 277 页

E2E Testing

Unit Tests

Integration/Component Tests

E2E Tests

Testing Pyramid

Where magic dwells

第 186 页，共 277 页

E2E Testing is Important

Applications involving multiple devices

P2P data transfer, nearby discovery

Product under test is not a conventional device.

Internet-Of-Things, VR

Need to control and vary physical environment

RF: Wi-Fi router, attenuators

Lighting, physical position

Interact with other software/cloud services

iPerf server, cloud service backend, network components

第 187 页，共 277 页

E2E Testing is Hard!

Most test frameworks are for single-device app testing

Need to trigger complex actions on devices

Some may need system privilege

Need to synchronize steps between multiple devices

Logic may be centralized (hard to write) or decentralized (hard to trigger)

Need to drive a wide range of equipment

attenuator, call box, power meter, wireless AP etc

Need to communicate with cloud services

Need to collect debugging artifacts from many sources

第 188 页，共 277 页

Our Solution - Mobly

Lightweight Python framework (Py2/3 compatible)

Test logic runs on a host machine

Controls a collection of devices/equipment in a test bed

Bundled with controller library for essential equipment

Android device, power meter, etc

Flexible and pluggable

Custom controller module for your own toys

Open source and ready to go!

第 189 页，共 277 页

Mobly Architecture

Test Bed

Computer

Mobly

Test Script

Mobile

Device

Network Switch

Attenuator

Call Box

Cloud Service

Test Harness

Test bed allocation, device provisioning, and results aggregation

第 190 页，共 277 页

Sample Tests

Hello from the other side

HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO

HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLo�HELLO�HELLO�HELLO�HELLO

HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO�HELLO

第 191 页，共 277 页

Describe a Test Bed

{� 'testbed': [{� 'name': 'SimpleTestBed',� 'AndroidDevice': '*'� }],� 'logpath': '/tmp/mobly_logs'�}

第 192 页，共 277 页

Test Script - Hello!

from mobly import base_test�from mobly import test_runner��class HelloWorldTest(base_test.BaseTestClass):� def setup_class(self):� self.ads = self.register_controller(android_device)� self.dut1 = self.ads[0]�� def test_hello_world(self):� self.dut1.sl4a.makeToast('Hello!')

if __name__ == '__main__':� test_runner.main()

Invocation:�$ ./path/to/hello_world_test.py -c path/to/config.json

第 193 页，共 277 页

Beyond the Basics

Config:�{� 'testbed': [{� ...� }],� 'logpath': '/tmp/mobly_logs',� 'toast_text': 'Hey there!'�}

�Code:

self.user_params['toast_text'] # 'Hey there!'

第 194 页，共 277 页

Beyond the Basics

Device specific logger

self.caller.log.info("I did something.")�# <timestamp> [AndroidDevice|<serial>] I did something

In test bed config:�'AndroidDevice': [{'serial': 'xyz', 'label': 'caller'},� {'serial': 'abc', 'label': 'callee',� 'phone_number': '123456'}]

In code:�self.callee = android_device.get_device(self.ads, label='callee')�self.callee.phone_number # '123456'

Specific device info

第 195 页，共 277 页

Controlling Android Devices

adb/shell

UI

API Calls

Custom Java Logic

第 196 页，共 277 页

Controlling Android Devices

adb

ad.adb.shell('pm clear com.my.package')

UI automator

ad.uia = uiautomator.Device(serial=ad.serial)

ad.uia(text='Hello World!').wait.exists(timeout=1000)

Android API calls, including system/hidden APIs, via SL4A

ad.sl4a.wifiConnect({'SSID': 'GoogleGuest'})

Custom Java logic

ad.register_snippets('trigger', 'com.my.package.snippets')

ad.trigger.myImpeccableLogic(5)

第 197 页，共 277 页

System API Calls

> self.dut.sl4a.makeToast('Hello World!')

SL4A (Scripting Layer for Android) is an RPC service exposing API calls on Android

self.dut.api is the RPC client for SL4A.

Original version works on regular Android builds.

Fork in AOSP can make direct system privileged calls (system/hidden APIs).

第 198 页，共 277 页

Custom Snippets

SL4A is not sufficient

SL4A methods are mapped to Android APIs, but tests need more than just Android API calls.

Current AOSP SL4A requires system privilege

Custom snippets allows users to define custom method that does anything they want.

Custom snippets can be used with other useful libs like Espresso

第 199 页，共 277 页

Custom Snippets

package com.mypackage.testing.snippets.example;

public class ExampleSnippet implements Snippet {� public ExampleSnippet(Context context) {}�� @Rpc(description='Returns a string containing the given number.')� public String getFoo(Integer input) {� return 'foo ' + input;� }�� @Override� public void shutdown() {}�}

第 200 页，共 277 页

Custom Snippets

Add your snippet classes to AndroidManifest.xml for the androidTest apk

Compile it into an apk

apply plugin: 'com.android.application'� dependencies {� androidTestCompile 'com.google.android.mobly:snippetlib:0.0.1'� }

第 201 页，共 277 页

Custom Snippets

Install the apk on your device

Load and call it

ad.load_snippets(name='snippets',� package='com.mypackage.testing.snippets.example')�foo = ad.snippets.getFoo(2) # 'foo 2'

第 202 页，共 277 页

Espresso in Custom Snippets

import static android.support.test.espresso.Espresso.onView;�import static android.support.test.espresso.action.ViewActions.swipeUp;�import static android.support.test.espresso.matcher.ViewMatchers.withId;

public class ExampleSnippet implements Snippet {� public ExampleSnippet(Context context) {}�� @Rpc(description='Performs a swipe using espresso')� public void performSwipe() {� onView(withId(R.id.my_view_id)).perform(swipeUp());� }�}

第 203 页，共 277 页

Custom Controllers

Plug in your own toys

第 204 页，共 277 页

Loose Controller Interface

def create(configs):� '''Instantiate controller objects'''��def destroy(objects):� '''Destroy controller objects'''

def get_info(objects):� '''[optional] Get controller info for test summary'''

第 205 页，共 277 页

Using Custom Controllers

from my.project.testing.controllers import car��def setup_class(self):� self.cars = self.register_controller(car)��def test_something(self):� self.cars[0].drive()

第 206 页，共 277 页

Video Demo

A test bed with two phones and one watch.
Phone A gives the voice command to watch.
Watch initiates a call to phone B.
Phone B gets a ringing call notification.
Phone A hangs up.

第 207 页，共 277 页

Video Demo

第 208 页，共 277 页

Coming Soon

iOS controller libs

Dependent on libimobiledevice

KIFTest, XCTest, XCUITest

Async events in snippets

Standard snippet and python utils for basic Android operations

Support non-Nexus Android devices

第 209 页，共 277 页

Thank You!

Questions?

Resources:

Mobly on Github

SL4A code link

Snippet Lib on Github

Google group

第 210 页，共 277 页

Scale vs Value

Test Automation at the BBC

David Buckhurst & Jitesh Gosai

第 211 页，共 277 页

第 212 页，共 277 页

第 213 页，共 277 页

第 214 页，共 277 页

JG: Notes

So we started looking at behaviour driven development or BDD as a way to understand what our app should be doing

this would involve developers, tester and product owners coming together to discuss how a new feature should work and describe it in plain english - no technical jargon. - know as a feature file - you may have heard these session called 3 amigos sessions

We then used this feature file as a starting point in what to automate .

the idea being if the automated test passed then we had satisfied that accepted criteria and built what was decided in the 3 amigo session

and we started to get pretty good at automating feature files

and running them on a real device was pretty simple as well

but we were still very depend on massively long running manual regression test cycles before a release of our product could actually be made

so we wanted to see if we could run more of these automated tests on more devices to try and cut our regression test cycles

Original notes

JG

BDD and automation

Got really good at writing UI tests -- running on a single device. Following bdd practices

But we were still dependent on massive, time consuming regression cycles

Wanted a way to scale our automation to cut regression

第 215 页，共 277 页

Lots of innovation

Chair hive

第 216 页，共 277 页

第 217 页，共 277 页

第 218 页，共 277 页

第 219 页，共 277 页

第 220 页，共 277 页

第 221 页，共 277 页

第 222 页，共 277 页

第 223 页，共 277 页

第 224 页，共 277 页

第 225 页，共 277 页

第 226 页，共 277 页

第 227 页，共 277 页

第 228 页，共 277 页

第 229 页，共 277 页

第 230 页，共 277 页

Live

Insights

&

Operational

Notifications

第 231 页，共 277 页

第 232 页，共 277 页

Scale vs Value

第 233 页，共 277 页

www.bbc.co.uk/opensource

@BBCOpenSource

@davidbuckhurst @JitGo

第 234 页，共 277 页

Finding bugs in

C/C++ libraries using

libFuzzer

Kostya Serebryany, GTAC 2016

第 235 页，共 277 页

Agenda

What is fuzzing
Why fuzz
What to fuzz
How to fuzz

… with libFuzzer

Demo (CVE-2016-5179)

第 236 页，共 277 页

What is Fuzzing

Somehow generate a test input�
Feed it to the code under test�
Repeat

第 237 页，共 277 页

Why fuzz

Bugs specific to C/C++ that require the sanitizers to catch:

Use-after-free, buffer overflows, Uses of uninitialized memory, Memory leaks

Arithmetic bugs:

Div-by-zero, Int/float overflows, bitwise shifts by invalid amount

Plain crashes:

NULL dereferences, Uncaught exceptions

Concurrency bugs:

Data races, Deadlocks

Resource usage bugs:

Memory exhaustion, hangs or infinite loops, infinite recursion (stack overflows)

Logical bugs:

Discrepancies between two implementations of the same protocol (example)
Assertion failures

第 238 页，共 277 页

What to fuzz

Anything that consumes untrusted or complicated inputs:

Parsers of any kind (xml, pdf, truetype, ...)
Media codecs (audio, video, raster & vector images, etc)
Network protocols, RPC libraries (gRPC)
Crypto (boringssl, openssl)
Compression (zip, gzip, bzip2, brotli, …)
Compilers and interpreters (PHP, Perl, Python, Go, Clang, …)
Regular expression matchers (PCRE, RE2, libc’s regcomp)
Text/UTF processing (icu)
Databases (SQLite)
Browsers, text editors/processors (Chrome, OpenOffice)

OS Kernels (Linux), drivers, supervisors and VMs
UI (Chrome UI)

第 239 页，共 277 页

How to fuzz

Generation-based fuzzing

Usually a target-specific grammar-based generator�

Mutation-based fuzzing

Acquire a corpus of test inputs
Apply random mutations to the inputs�

Guided mutation-based fuzzing

Execute mutations with coverage instrumentation
If new coverage is observed the mutation is permanently added to the corpus

第 240 页，共 277 页

Fuzz Target - a C/C++ function worth fuzzing

extern "C"

int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t DataSize) {

if (DataSize >= 3 &&

Data[0]=='F' &&

Data[1]=='U' &&

Data[2]=='Z' &&

Data[3]=='Z')

DoMoreStuff(Data, DataSize);

return 0;

}

第 241 页，共 277 页

libFuzzer - an engine for guided in-process fuzzing

libFuzzer: a library; provides main()
Build your target code with extra compiler flags
Link your target with libFuzzer
Pass a directory with the initial test corpus and run

% clang++ -g my-code.cc libFuzzer.a -o my-fuzzer \

-fsanitize=address -fsanitize-coverage=trace-pc-guard

% ./my-fuzzer MY_TEST_CORPUS_DIR

第 242 页，共 277 页

CVE-2016-5179 (c-ares, asynchronous DNS requests)

extern "C"

int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t DataSize) {

unsigned char *buf; int buflen;

std::string s(reinterpret_cast<const char *>(Data), DataSize);

ares_create_query(s.c_str(), ns_c_in, ns_t_a, 0x1234, 0, &buf,

&buflen, 0);

free(buf);

return 0;

}

第 243 页，共 277 页

Demo

第 244 页，共 277 页

present perfect => present continuous

“The project X has been fuzzed, hence it is somewhat secure”�
False:

Bug discovery techniques evolve
The project X evolves
Fuzzing is CPU intensive and needs time to find bugs�

“The project X is being continuously fuzzed, the code coverage is monitored.”

Much better!

第 245 页，共 277 页

Oss-fuzz - fuzzing as a service for OSS

Based on ClusterFuzz, the fuzzing backend used for fuzzing Chrome components

Supported engines: libFuzzer, AFL, Radamsa, ...

https://github.com/google/oss-fuzz

第 246 页，共 277 页

Q&A

libFuzzer.info

tutorial.libFuzzer.info

第 247 页，共 277 页

Can MongoDB Recover from Catastrophe?

How I learned to crash a server

{ name : "Jonathan Abrahams",

title : "Senior Quality Engineer",

location : "New York, NY",

twitter : "@MongoDB",

facebook : "MongoDB" }

第 248 页，共 277 页

A machine may crash for a variety of reasons:

Termination of virtual machine or host
Hardware failure
OS failure

Machine crash

Unexpected termination of mongod

Application crash

第 249 页，共 277 页

Why do we need to crash a machine?

We could abort mongod, but this would not fully simulate an unexpected crash of a machine or OS (kernel):

Immediate loss of power may prevent cached I/O from being flushed to disk.

A kernel panic can leave an application (and its data) in an unrecoverable state.

第 250 页，共 277 页

System passes h/w & s/w checks

mongod goes into recovery mode

mongod ready for client connection

System restart

第 251 页，共 277 页

How can we crash a machine?

We started by crashing the machine manually, by pulling the cord.

We evolved to using an appliance timer, which would power the machine off/on every 15 minutes.

We also figured out that setting up a cron job to send an internal crash command (more on this later) to the machine for a random period would do the job.

And then we realized, we need to do it a bit more often.

第 252 页，共 277 页

How did we really crash that machine, and can we do over and over and over and over...?

第 253 页，共 277 页

Why do we need to do it over and over and over?

A crash of a machine may be catastrophic. In order to uncover any subtle recovery bugs, we want to repeatedly crash a machine and test if it has recovered. A failure may only be encountered 1 out of 100 times!

第 254 页，共 277 页

Ubiquiti mPower PRO to the rescue!

Programmable power device, with ssh access from LAN via WiFi or Ethernet.

第 255 页，共 277 页

How do we turn off and on the power?

ssh admin@mpower

local outlet="output1"

# Send power cycle to mFi mPower to specified outlet

echo 0 > /dev/$outlet

sleep 10

echo 1 > /dev/$outlet

第 256 页，共 277 页

Physical vs. Virtual

It is necessary to test both type of machines as machine crashes are different and the underlying host OS and hardware may provide different I/O caching and data protection. Virtual machines typically rely on shared resources and physical machines typically use dedicated resources.

第 257 页，共 277 页

How do we crash a virtual machine?

We can crash it from the VM host:

KVM (Kernel-based VM): virsh destroy <vm>

VmWare: vmrun stop <vm> hard

第 258 页，共 277 页

How do we restart a crashed VM?

We can restart it from the VM host:

KVM (Kernel-based VM): virsh start <vm>

VmWare: vmrun start <vm>

第 259 页，共 277 页

How else can we crash a machine?

We can crash it using the magical SysRq key sequence (Linux only):

echo 1 | sudo tee /proc/sys/kernel/sysrq

echo b | sudo tee /proc/sysrq-trigger

第 260 页，共 277 页

How do we get the machine to restart?

Enable the BIOS setting to boot up after AC power is provided.

第 261 页，共 277 页

Restarting a Windows Machine

To disable a Windows machine from prompting you after unexpected shutdown:

bcdedit /set {default} bootstatuspolicy ignoreallfailures

bcdedit /set {current} bootstatuspolicy ignoreallfailures

bcdedit /timeout 5

第 262 页，共 277 页

The machine is running

Now that we figured out how to get our machine to crash and restart, we restart the mongod and it will go into recovery mode.

第 263 页，共 277 页

Recovery mode of mongod

Performed automatically when mongod starts, if there was an unclean shutdown detected.

WiredTiger starts from the last stable copy of the data on disk from the last checkpoint. The journal log is then applied and a new checkpoint is applied.

第 264 页，共 277 页

Before the crash!

Stimulate mongod by running several simultaneous (mongo shell) clients which provide a moderate load utilizing nearly all supported operations. This is important, as CRUD operations will cause mongod to perform I/O operations, which should never lead to file or data corruption.

第 265 页，共 277 页

Options, options

Client operations optionally provide:

Checkpoint document

Write & Read concerns

The mongod process is tested in a variety modes, including:

Standalone or single node replica set

Storage engine, i.e., mmapv1, wiredTiger

第 266 页，共 277 页

What do we do after mongod has restarted?

After the machine has been restarted, we start mongod on a private port and it goes into recovery mode. Once that completes, we perform further client validation, via mongo (shell):

serverStatus

Optionally, run validate against all databases and collections

Optionally, verify if a checkpoint document exists

Failure to recover, connect to mongod, or perform the other validation steps is considered a test failure.

第 267 页，共 277 页

What do we do after mongod has restarted?

Now that the recovery validation have passed, we will proceed with the pre-crash steps:

Stop and restart mongod on a public port

Start new set of (mongo shell) clients to perform various DB operations

第 268 页，共 277 页

Why do we care about validation?

The validate command checks the structures within a namespace for correctness by scanning the collection’s data and indexes. The command returns information regarding the on-disk representation of the collection.

Failing validation indicates that something has been corrupted, most likely due to an incomplete I/O operation during the unexpected shutdown.

第 269 页，共 277 页

Failure analysis

Since our developers could be local (NYC) or worldwide (Boston, Sydney), we want a self-service application they can use to reproduce reported failures. A bash script has been developed which can execute on both local hardware and in the cloud (AWS).

We save any artifacts useful for our developers to be able to analyze the failure:

Backup data files before starting mongod

Backup data files after mongod completes recovery

mongod and mongo (shell) log files

第 270 页，共 277 页

The crash testing helped to:

Extend our testing to scenarios not previously covered

Provide local and remote teams with tools to reproduce and analyze failures

Improve robustness of the mongod storage layer

第 271 页，共 277 页

Results, results

Storage engine bugs were discovered from the power cycle testing and led to fixes/improvements.

We have plans to incorporate this testing into our continuous integration.

第 272 页，共 277 页

Some bugs discovered

SERVER-20295 Power cycle test - mongod fails to start with invalid object size in storage.bson

SERVER-19774 WT_NOTFOUND: item not found during DB recovery

SERVER-19692 Mongod failed to open connection, remained in hung state, when running WT with LSM

SERVER-18838 DB fails to recover creates and drops after system crash

SERVER-18379 DB fails to recover when specifying LSM, after system crash

SERVER-18316 Database with WT engine fails to recover after system crash

SERVER-16702 Mongod fails during journal replay with mmapv1 after power cycle

SERVER-16021 WT failed to start with "lsm-worker: Error in LSM worker thread 2: No such file or directory"

第 273 页，共 277 页

Open issues?

Can we crash Windows using an internal command (cue the laugh track…)?

第 274 页，共 277 页

Closing remarks

第 275 页，共 277 页

Organizing committee

Alan Myrvold

Amar Amte

Andrea Dawson

Ari Shamash

Carly Schaeffer

Dan Giovannelli

David Aristizabal

Diego Cavalcanti

Jaydeep Mehta

Joe Drummey

Josephine Chandra

Kathleen Li

Lena Wakayama

Lesley Katzen

Madison Garcia

Matt Lowrie

Matthew Halupka

Sonal Shah

Travis Ellett

Yvette Nameth

第 276 页，共 277 页

London 2017

第 277 页，共 277 页

GTAC 2017

testing.googleblog.com