1 of 20

Preempting Flaky Tests via Non-Idempotent-Outcome Tests

Anjiang Wei, Pu Yi, Zhengxi Li, Tao Xie, Darko Marinov, Wing Lam

anjiang@stanford.edu

Funding acknowledgments

CCF-1763788, CCF-1956374

62161146003

2 of 20

Developer Anecdote

Servers

…
- static int add() {
+ static int add(r) {
-   ts.addRow("");
+   ts.addRow(r);
    return ts.size();
…

4:15 PM

Build code

Run tests: test₀, test₁, test₂, …, test_n

3 of 20

Developer Anecdote

Servers

4:15 PM

Build code

Run tests: test₀, test₁, test₂, …, test_n

Pass

Merge Changes

4 of 20

Developer Anecdote

Servers

4:15 PM

Build code

Run tests: test₀, test₁, test₂, …, test_n

Fail

Debug Changes

5 of 20

Developer Anecdote

[Figure: the servers repeat "Build code" and "Run tests" on the same change; timeline marks at 4:15 PM, 5:00 PM, 5:30 PM, and 6:15 PM]

6 of 20

Developer Anecdote

[Figure: the servers repeat "Build code" and "Run tests" on the same change; timeline marks at 4:15 PM, 5:00 PM, 5:30 PM, and 6:15 PM]

Developer wastes time debugging and running tests, and goes home 1 hour and 15 min later

Flaky Test: a test that can non-deterministically pass and fail when run on the same code version

7 of 20


Public Outcry About Flaky Tests

8 of 20

What are Flaky Tests?

A test is flaky if it passes and fails for the same code version

Misleads developers to debug nonexistent faults in recent changes
Reduces trust in tests

Order-dependent tests are a prominent category of flaky tests

An order-dependent test deterministically passes or fails in any given test order, passes in at least one order, and fails in at least one other order

9 of 20

Background: Victim and Polluter

// shared variable x is initialized to 0
void t1() { assert x == 0; } // victim
void t2() { x = 1; } // polluter

TestOrder1

TestOrder2
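The victim/polluter relationship can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the `state` dictionary stands in for the shared variable x, `run_order` is a hypothetical helper, and the two orders shown are chosen here for illustration.

```python
# Shared state that both tests touch (stands in for the slide's variable x).
state = {"x": 0}


def t1():
    # Victim: passes only if x still has its initial value.
    assert state["x"] == 0


def t2():
    # Polluter: leaves x modified for whichever test runs next.
    state["x"] = 1


def run_order(tests):
    """Run the tests in the given order from a fresh state.

    Returns a dict mapping test name to "pass" or "fail".
    """
    state["x"] = 0  # fresh state, as if a new process had started
    outcomes = {}
    for test in tests:
        try:
            test()
            outcomes[test.__name__] = "pass"
        except AssertionError:
            outcomes[test.__name__] = "fail"
    return outcomes


order_a = run_order([t1, t2])  # victim runs before the polluter
order_b = run_order([t2, t1])  # polluter runs before the victim
print(order_a)  # {'t1': 'pass', 't2': 'pass'}
print(order_b)  # {'t2': 'pass', 't1': 'fail'}
```

The same code version yields a different outcome for t1 depending only on the order, which is what makes t1 order-dependent.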

10 of 20

Background: Latent-Victim, Latent-Polluter

// shared variables x, y, z are initialized to 0
void t1() { assert x == 0; } // victim
void t2() { x = 1; } // polluter
void t3() { assert y == 0; } // latent-victim
void t4() { z = 1; } // latent-polluter
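One way to see why t3 and t4 are only "latent" is to check every order of today's suite and then every order after new tests arrive. The sketch below assumes two hypothetical future tests, `new_polluter` and `new_victim`, which are not on the slide; `failing_tests` is likewise an illustrative helper.

```python
from itertools import permutations

state = {"x": 0, "y": 0, "z": 0}


def t3():
    assert state["y"] == 0  # latent-victim: nothing writes y yet


def t4():
    state["z"] = 1  # latent-polluter: nothing reads z yet


# Hypothetical tests a developer might add later (assumptions, not on the slide).
def new_polluter():
    state["y"] = 1


def new_victim():
    assert state["z"] == 0


def failing_tests(suite):
    """Return the names of tests that fail in at least one order of the suite."""
    failed = set()
    for order in permutations(suite):
        for key in state:
            state[key] = 0  # fresh state for every order
        for test in order:
            try:
                test()
            except AssertionError:
                failed.add(test.__name__)
    return failed


today = failing_tests([t3, t4])
later = failing_tests([t3, t4, new_polluter, new_victim])
print(today)          # set()
print(sorted(later))  # ['new_victim', 't3']
```

With today's suite no order fails, so neither test is flagged. Once the hypothetical tests are added, t3 becomes an actual victim and t4 an actual polluter.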

11 of 20

Non-Idempotent-Outcome (NIO) Test

// shared variables x, y, z, w are initialized to 0
void t1() { assert x == 0; } // victim
void t2() { x = 1; } // polluter
void t3() { assert y == 0; } // latent-victim
void t4() { z = 1; } // latent-polluter
void t5() { assert w == 0; w = 1; } // NIO
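Because t5 both reads and writes w, running it twice in a row without resetting state is enough to expose it. The following is a minimal sketch of that check; `is_nio`, `outcome`, and `t_readonly` are hypothetical names, and only `AssertionError` is treated as a failure here.

```python
state = {"w": 0}


def t5():
    # NIO: asserts on w, then pollutes w for its own next run.
    assert state["w"] == 0
    state["w"] = 1


def t_readonly():
    # Contrast case (hypothetical): reads w but never writes it.
    assert state["w"] == 0


def outcome(test):
    try:
        test()
        return "pass"
    except AssertionError:
        return "fail"


def is_nio(test):
    """True if the test passes, then fails when rerun immediately in the same state."""
    first = outcome(test)
    second = outcome(test)
    return first == "pass" and second == "fail"


state["w"] = 0
nio_result = is_nio(t5)
state["w"] = 0
readonly_result = is_nio(t_readonly)
print(nio_result, readonly_result)  # True False
```

No second test is needed to observe the failure: t5 acts as its own polluter and its own victim.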

12 of 20

Why should we detect NIOs?

Typically, tests are not run twice
To preempt/prevent flaky tests

Why not fix latent-polluter?
Why not fix latent-victim?

Prior work

Gyori et al.¹ detect 575 latent-polluters

Manually filter 381 (66%) false positives (cannot reasonably become polluters)

Huo and Clause² detect latent-victims with dynamic taint analysis

Do not report how many can reasonably become victims

They do NOT fix any tests

NIOs are more worth fixing

Both latent-victims and latent-polluters at the same time
Easy to detect, no false positives
Well-accepted fixes

¹ Gyori et al., “Reliable testing: Detecting state-polluting tests to prevent test dependency”. ISSTA 2015

² Huo and Clause, “Improving oracle quality by detecting brittle assertions and unused inputs in tests”. FSE 2014

13 of 20

Contributions

Definition of NIO tests

Deterministically change from pass to fail when run twice

Effective detection & empirical evaluation

Propose 3 modes for detection
127 Java test suites 🡪 223 NIO tests
1006 Python projects 🡪 138 NIO tests

Well-accepted fixes

Inspect every NIO test (no false positives)
Open pull requests for 268 tests
192 accepted, 70 pending, only 6 rejected

14 of 20

Real Example of NIO

  def cmd_mock():
      def _cmd_mock(name: str):
          cmd.__overrides__[name] = ['/bin/true']
      yield _cmd_mock
-     cmd.__overrides__ = []
+     cmd.__overrides__ = {}

  def test_slurm_command(tmp_path, cmd_mock):
      cmd_mock('srun')

Buggy Cleaning Code

TypeError: list indices must be integers or slices, not str
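The failure mode can be reproduced without the project or pytest. In the sketch below, `cmd` is a hypothetical stand-in object, and `run_test_once` is a hand-rolled driver that mimics how a yield-style fixture is set up and torn down; it is not pytest's actual machinery.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the project's `cmd` module.
cmd = SimpleNamespace(__overrides__={})


def make_fixture(cleanup_factory):
    """Build a yield-style fixture whose teardown resets overrides with cleanup_factory()."""
    def cmd_mock():
        def _cmd_mock(name):
            cmd.__overrides__[name] = ["/bin/true"]
        yield _cmd_mock
        cmd.__overrides__ = cleanup_factory()  # teardown after the test body
    return cmd_mock


def run_test_once(fixture):
    gen = fixture()
    mock = next(gen)  # setup: run the fixture up to its yield
    try:
        mock("srun")  # test body
    finally:
        next(gen, None)  # teardown: run the code after the yield


def run_twice(fixture):
    cmd.__overrides__ = {}  # state at process start
    outcomes = []
    for _ in range(2):
        try:
            run_test_once(fixture)
            outcomes.append("pass")
        except TypeError:
            outcomes.append("fail")
    return outcomes


buggy = run_twice(make_fixture(list))  # teardown leaves a list behind
fixed = run_twice(make_fixture(dict))  # teardown restores a dict
print(buggy)  # ['pass', 'fail']
print(fixed)  # ['pass', 'pass']
```

The first run passes because the mapping starts as a dict; the buggy teardown then swaps in a list, so the second run's string-keyed assignment raises `TypeError`.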

15 of 20

Real Example of NIO

  def to_zero(tvd, northing, easting,
              surface_northing, surface_easting):
      # perform some checking
-     northing -= surface_northing
-     easting -= surface_easting
+     northing = northing - surface_northing
+     easting = easting - surface_easting
      return tvd, northing, easting

  # initialization for global variables: g1,…,g5
  g1 = ...

  def test_zero():
      # global variables passed in as arguments
      v1, v2, v3 = to_zero(g1, g2, g3, g4, g5)
      np.testing.assert_equal(...)  # assertion

Fix: Avoid Function Side Effect

AssertionError: Mismatched elements: 121 / 121 (100%)
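The difference between `a -= b` and `a = a - b` is what makes the original test non-idempotent. The sketch below uses `Vec`, a tiny hypothetical stand-in for a NumPy array (so that it runs without NumPy), and a simplified two-argument `to_zero`; both are assumptions for illustration.

```python
class Vec:
    """Tiny stand-in for an array type: `-=` mutates in place, `-` returns a new object."""

    def __init__(self, values):
        self.values = list(values)

    def __sub__(self, other):
        return Vec(a - b for a, b in zip(self.values, other.values))

    def __isub__(self, other):
        self.values = [a - b for a, b in zip(self.values, other.values)]
        return self


def to_zero_buggy(northing, surface_northing):
    northing -= surface_northing  # mutates the caller's object
    return northing


def to_zero_fixed(northing, surface_northing):
    northing = northing - surface_northing  # rebinds the local name only
    return northing


def run_twice(to_zero):
    # "Global" inputs shared by both runs of the test.
    g_northing, g_surface = Vec([10, 20]), Vec([1, 2])
    results = []
    for _ in range(2):
        results.append(to_zero(g_northing, g_surface).values == [9, 18])
    return results


buggy = run_twice(to_zero_buggy)
fixed = run_twice(to_zero_fixed)
print(buggy)  # [True, False]
print(fixed)  # [True, True]
```

In the buggy version the shared input is shifted on every call, so the second run compares against already-modified data and every element mismatches.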

16 of 20

Prevalence of NIO Tests

Conclusion:

NIO tests are prevalent enough that every project should run NIO detection at least once

	Java	Python
# Test Suites (total)	127	1006
# Test Suites w/ NIO	34	138
% Test Suites w/ NIO	26%	9%
# NIO Tests	223	138

17 of 20

Different Detection Modes

Three Different Modes

Isolated-method

Run1: t1, t1
Run2: t2, t2
Run3: t3, t3

Isolated-class

Run1: t1, t1, t2, t2
Run2: t3, t3

Entire-suite

Run1: t1, t1, t2, t2, t3, t3
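The three modes differ only in how the doubled tests are grouped into runs. A minimal sketch of that grouping follows; `plan_runs` and the `suite` layout (class A holding t1 and t2, class B holding t3) are assumptions made for illustration, and the sketch presumes each run starts from fresh state.

```python
def plan_runs(suite, mode):
    """Group tests into runs for a detection mode.

    suite: dict mapping a test class name to its list of test names.
    Returns a list of runs; in each run, every test is immediately rerun once.
    """
    def doubled(tests):
        return [name for test in tests for name in (test, test)]

    if mode == "isolated-method":
        return [doubled([test]) for tests in suite.values() for test in tests]
    if mode == "isolated-class":
        return [doubled(tests) for tests in suite.values()]
    if mode == "entire-suite":
        return [doubled([test for tests in suite.values() for test in tests])]
    raise ValueError(f"unknown mode: {mode}")


suite = {"A": ["t1", "t2"], "B": ["t3"]}
by_method = plan_runs(suite, "isolated-method")
by_class = plan_runs(suite, "isolated-class")
by_suite = plan_runs(suite, "entire-suite")
print(by_method)  # [['t1', 't1'], ['t2', 't2'], ['t3', 't3']]
print(by_class)   # [['t1', 't1', 't2', 't2'], ['t3', 't3']]
print(by_suite)   # [['t1', 't1', 't2', 't2', 't3', 't3']]
```

Fewer runs mean fewer fresh starts, which is consistent with entire-suite having the lowest overhead; the trade-off is that state left by earlier tests in the same run can affect what the rerun observes.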

Conclusion

All three modes detect similar tests

Isolated-method (223) > Isolated-class (212) > Entire-suite (210)

Entire-suite has the lowest overhead
Why differ? See paper for details

TestClass A

TestClass B

Test Suite

18 of 20

Experience with Fixing NIO Tests

We detect 361 (223 Java + 138 Python) NIO tests
We fix 268 NIO tests by opening Pull Requests

192 tests accepted
70 tests pending
6 tests are rejected

We do not fix 51 NIO tests

Cannot localize pollution
Difficult to clean the pollution

42 tests are N/A

Not NIO in the latest version (fixed/deleted/etc)

Conclusion

Developers are generally positive about fixes for NIO tests
Providing reproduction steps and explaining the motivation help

19 of 20

NIO vs. Polluter vs. Victim

NIO tests are related to but not subsumed by polluters and victims
Detecting NIO tests can be an effective way to preempt polluters and victims

20 of 20

Conclusions

We focus on Non-Idempotent-Outcome (NIO) tests

Deterministically change from pass to fail when run twice

Detect and fix NIO tests

Preempt order-dependent flaky tests
Importance: NIO tests lie in the intersection of latent-polluters and latent-victims
Detect 361 NIO tests (223 Java + 138 Python)

Opened pull requests for 268 tests, with 192 accepted

Dataset publicly available:

https://sites.google.com/view/nio-tests
IDoFT dataset (all flaky tests): https://github.com/TestingResearchIllinois/idoft

Questions? Email: Anjiang Wei <anjiang@stanford.edu>