1 of 20

Preempting Flaky Tests via Non-Idempotent-Outcome Tests

Anjiang Wei, Pu Yi, Zhengxi Li, Tao Xie, Darko Marinov, Wing Lam

Funding acknowledgments

CCF-1763788, CCF-1956374

62161146003

2 of 20

Developer Anecdote

Servers

…
- static int add() {
+ static int add(r) {
-   ts.addRow("");
+   ts.addRow(r);
    return ts.size();

4:15 PM: build code, then run tests (test0, test1, test2, …, testn)

3 of 20

Developer Anecdote

Merge Changes

Pass


4 of 20

Developer Anecdote

Fail

Debug Changes


5 of 20

Developer Anecdote

4:15 PM

5:00 PM

5:30 PM

6:15 PM


6 of 20

Developer Anecdote

Developer wastes time debugging and running tests, and goes home 1 hour and 15 min later

1 hour

15 min

Flaky Test: a test that can non-deterministically pass or fail when run on the same code version

7 of 20


Public Outcry About Flaky Tests

8 of 20

What are Flaky Tests?

  • A test is flaky if it can both pass and fail on the same code version
    • Misleads developers into debugging nonexistent faults in recent changes
    • Reduces trust in tests
  • Order-dependent tests are a prominent category of flaky tests
    • An order-dependent test deterministically passes or fails in any given test order, passes in at least one order, and fails in at least one other order


9 of 20

Background: Victim and Polluter

// shared variable x is initialized to 0
void t1() { assert x == 0; } // victim
void t2() { x = 1; } // polluter

TestOrder1: t1, t2

TestOrder2: t2, t1
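The two test orders can be replayed with a minimal Python sketch. The `state` dictionary and the `run_order` helper are hypothetical names introduced here for illustration; they stand in for the shared variable x and for a test runner that starts each order from a fresh state.

```python
# Hypothetical shared state standing in for the slide's variable x.
state = {"x": 0}

def t1():
    # Victim: passes only while x still holds its initial value.
    assert state["x"] == 0

def t2():
    # Polluter: changes the shared state and never resets it.
    state["x"] = 1

def run_order(tests):
    """Run the tests in the given order from a fresh state.

    Returns a dict mapping each test name to "pass" or "fail".
    """
    state["x"] = 0  # simulate a fresh process before each order
    results = {}
    for test in tests:
        try:
            test()
            results[test.__name__] = "pass"
        except AssertionError:
            results[test.__name__] = "fail"
    return results

order1 = run_order([t1, t2])  # TestOrder1: t1 runs before the polluter
order2 = run_order([t2, t1])  # TestOrder2: t1 runs after the polluter
print(order1)  # {'t1': 'pass', 't2': 'pass'}
print(order2)  # {'t2': 'pass', 't1': 'fail'}
```

The victim's outcome depends only on whether the polluter ran before it, which is what makes the failure deterministic for a given order.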

10 of 20

Background: Latent-Victim, Latent-Polluter

// shared variables x, y, z are initialized to 0
void t1() { assert x == 0; } // victim
void t2() { x = 1; } // polluter
void t3() { assert y == 0; } // latent-victim
void t4() { z = 1; } // latent-polluter

11 of 20

Non-Idempotent-Outcome (NIO) Test

// shared variables x, y, z, w are initialized to 0
void t1() { assert x == 0; } // victim
void t2() { x = 1; } // polluter
void t3() { assert y == 0; } // latent-victim
void t4() { z = 1; } // latent-polluter
void t5() { assert w == 0; w = 1; } // NIO
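A rough Python sketch of the idea behind t5 follows: run a test twice in the same process and flag it when the first run passes and the second fails. The names `state`, `t_clean`, `outcome`, and `is_nio` are assumptions made for this illustration, not part of the authors' tooling.

```python
# Hypothetical shared state standing in for the slide's variable w.
state = {"w": 0}

def t5():
    # NIO test: checks w, then pollutes w without cleaning up.
    assert state["w"] == 0
    state["w"] = 1

def t_clean():
    # Same check and mutation, but the test restores w before returning.
    assert state["w"] == 0
    state["w"] = 1
    state["w"] = 0  # cleanup

def outcome(test):
    """Run a test once and report "pass" or "fail"."""
    try:
        test()
        return "pass"
    except AssertionError:
        return "fail"

def is_nio(test):
    """Run the test twice back to back; NIO means pass, then fail."""
    first = outcome(test)
    second = outcome(test)
    return first == "pass" and second == "fail"

nio_t5 = is_nio(t5)
state["w"] = 0  # reset the shared state before checking the next test
nio_clean = is_nio(t_clean)
print(nio_t5, nio_clean)  # True False
```

Because t5 both reads and writes the same state, it plays the polluter and the victim roles for itself on the second run.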

12 of 20

Why should we detect NIOs?

  • Typically, tests are not run twice
  • To preempt/prevent flaky tests
    • Why not fix latent-polluter?
    • Why not fix latent-victim?
  • Prior work
    • Gyori et al.¹ detect 575 latent-polluters
      • Manually filter 381 (66%) false positives (cannot reasonably become polluters)
    • Huo and Clause² detect latent-victims with dynamic taint analysis
      • Do not report how many can reasonably become victims
    • They do NOT fix any tests
  • NIO tests are more worthwhile to fix
    • Each NIO test is both a latent-victim and a latent-polluter
    • Easy to detect, no false positives
    • Well-accepted fixes

¹ Gyori et al., “Reliable testing: Detecting state-polluting tests to prevent test dependency”. ISSTA 2015

² Huo and Clause, “Improving oracle quality by detecting brittle assertions and unused inputs in tests”. FSE 2014

13 of 20

Contributions

  • Definition of NIO tests
    • Deterministically change from pass to fail when run twice
  • Effective detection & empirical evaluation
    • Propose 3 modes for detection
    • 127 Java test suites → 223 NIO tests
    • 1006 Python projects → 138 NIO tests
  • Well-accepted fixes
    • Inspect every NIO test (no false positives)
    • Open pull requests for 268 tests
    • 192 accepted, 70 pending, only 6 rejected


14 of 20

Real Example of NIO

def cmd_mock():
    def _cmd_mock(name: str):
        cmd.__overrides__[name] = ['/bin/true']
    yield _cmd_mock
-   cmd.__overrides__ = []
+   cmd.__overrides__ = {}

def test_slurm_command(tmp_path, cmd_mock):
    cmd_mock('srun')

Buggy Cleaning Code

TypeError: list indices must be integers or slices, not str
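The failure can be reproduced without a test framework by driving the fixture generator by hand. In the sketch below, `cmd` is a hypothetical stand-in object for the project's module, and the `buggy` flag and the `run_test_once` and `run_twice` helpers are illustration-only names; the setup and teardown calls only approximate what a fixture-based runner would do.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the project's `cmd` module.
cmd = SimpleNamespace(__overrides__={})

def cmd_mock(buggy):
    def _cmd_mock(name):
        cmd.__overrides__[name] = ["/bin/true"]
    yield _cmd_mock
    # Teardown: the buggy version resets to a list, the fix resets to a dict.
    cmd.__overrides__ = [] if buggy else {}

def run_test_once(buggy):
    """Emulate setup, test body, and teardown for one test run."""
    fixture = cmd_mock(buggy)
    mock = next(fixture)  # setup: run the fixture up to its yield
    try:
        mock("srun")  # test body
        result = "pass"
    except TypeError as exc:
        result = f"fail: {exc}"
    finally:
        next(fixture, None)  # teardown: run the code after the yield
    return result

def run_twice(buggy):
    cmd.__overrides__ = {}  # fresh state, as in a new process
    return [run_test_once(buggy), run_test_once(buggy)]

buggy_runs = run_twice(buggy=True)
fixed_runs = run_twice(buggy=False)
print(buggy_runs)
print(fixed_runs)
```

With the buggy teardown, the first run passes and the second fails because a list cannot be indexed by the string 'srun'; with the fixed teardown both runs pass.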

15 of 20

Real Example of NIO

def to_zero(tvd, northing, easting,
            surface_northing, surface_easting):
    # perform some checking
-   northing -= surface_northing
-   easting -= surface_easting
+   northing = northing - surface_northing
+   easting = easting - surface_easting
    return tvd, northing, easting

# initialization for global variables: g1,…,g5
g1 = ...

def test_zero():
    # global variables passed in as arguments
    v1, v2, v3 = to_zero(g1, g2, g3, g4, g5)
    np.testing.assert_equal(...)  # assertion

Fix: Avoid Function Side Effect

AssertionError: Mismatched elements: 121 / 121 (100%)
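Why the augmented assignment matters can be shown with a small self-contained sketch. It assumes the global arguments are mutable array-like objects (the use of np.testing suggests NumPy arrays, for which `-=` updates the array in place). To stay dependency-free, the hypothetical `Vec` class below imitates that in-place behavior, and the two-argument `to_zero_*` functions are simplified stand-ins for the real function.

```python
class Vec:
    """Tiny mutable vector; like a NumPy array, `-=` mutates it in place."""

    def __init__(self, values):
        self.values = list(values)

    def __isub__(self, other):
        # In-place subtraction: the caller's object is modified.
        self.values = [a - b for a, b in zip(self.values, other.values)]
        return self

    def __sub__(self, other):
        # Out-of-place subtraction: a new object is returned.
        return Vec(a - b for a, b in zip(self.values, other.values))

def to_zero_buggy(northing, surface_northing):
    northing -= surface_northing  # mutates the caller's global
    return northing

def to_zero_fixed(northing, surface_northing):
    northing = northing - surface_northing  # rebinds a local name only
    return northing

# Hypothetical globals shared across test runs.
g_northing = Vec([10, 20])
g_surface = Vec([1, 2])

def check(to_zero):
    """Stand-in for the test body: compare the result with the expected value."""
    return to_zero(g_northing, g_surface).values == [9, 18]

buggy_runs = [check(to_zero_buggy), check(to_zero_buggy)]
g_northing = Vec([10, 20])  # reset the global before trying the fix
fixed_runs = [check(to_zero_fixed), check(to_zero_fixed)]
print(buggy_runs)  # [True, False]
print(fixed_runs)  # [True, True]
```

The buggy version subtracts the surface offset from the shared global on every call, so the second run compares against an already shifted input.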

16 of 20

Prevalence of NIO Tests

Conclusion:

    • NIO tests are prevalent enough that every project should run NIO detection at least once

                        Java    Python
# Test Suites (total)   127     1006
# Test Suites w/ NIO    34      138
% Test Suites w/ NIO    26%     9%
# NIO Tests             223     138

17 of 20

Different Detection Modes

  • Three Different Modes
    • Isolated-method
      • Run1: t1, t1
      • Run2: t2, t2
      • Run3: t3, t3
    • Isolated-class
      • Run1: t1, t1, t2, t2
      • Run2: t3, t3
    • Entire-suite
      • Run1: t1, t1, t2, t2, t3, t3
  • Conclusion
    • All three modes detect similar tests
      • Isolated-method (223) > Isolated-class (212) > Entire-suite (210)
    • Entire-suite has the lowest overhead
    • Why do the counts differ? See the paper for details

Test Suite: TestClass A (t1, t2), TestClass B (t3)
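The run schedules for the three modes can be generated mechanically from the suite layout. The sketch below is an illustration under assumptions: the `suite` dictionary and the three function names are hypothetical, and each inner list is taken to mean one run in a fresh process.

```python
# Hypothetical suite layout matching the slide: class name -> test methods.
suite = {"A": ["t1", "t2"], "B": ["t3"]}

def isolated_method(suite):
    """One run per test method; each method is executed twice."""
    return [[t, t] for tests in suite.values() for t in tests]

def isolated_class(suite):
    """One run per test class; each method in the class is executed twice."""
    return [[t for test in tests for t in (test, test)]
            for tests in suite.values()]

def entire_suite(suite):
    """A single run for the whole suite; each method is executed twice."""
    return [[t for tests in suite.values() for test in tests
             for t in (test, test)]]

print(isolated_method(suite))  # [['t1', 't1'], ['t2', 't2'], ['t3', 't3']]
print(isolated_class(suite))   # [['t1', 't1', 't2', 't2'], ['t3', 't3']]
print(entire_suite(suite))     # [['t1', 't1', 't2', 't2', 't3', 't3']]
```

Fewer runs means fewer process start-ups, which is consistent with the entire-suite mode having the lowest overhead.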

18 of 20

Experience with Fixing NIO Tests

  • We detect 361 (223 Java + 138 Python) NIO tests
  • We fix 268 NIO tests by opening pull requests
    • 192 tests accepted
    • 70 tests pending
    • 6 tests rejected
  • We do not fix 51 NIO tests
    • Cannot localize pollution
    • Difficult to clean the pollution
  • 42 tests are N/A
    • Not NIO in the latest version (fixed/deleted/etc)
  • Conclusion
    • Developers are generally positive about fixes for NIO tests
    • Providing reproduction steps and explaining the motivation helps


19 of 20

NIO vs. Polluter vs. Victim

  • NIO tests are related to but not subsumed by polluters and victims
  • Detecting NIO tests can be an effective way to preempt polluters and victims


20 of 20

Conclusions

  • We focus on Non-Idempotent-Outcome (NIO) tests
    • Deterministically change from pass to fail when run twice
  • Detect and fix NIO tests
    • Preempt order-dependent flaky tests
    • Importance: in the intersection of latent-polluters and latent-victims
    • Detect 361 NIO tests (223 Java + 138 Python)
      • Opened pull requests for 268 tests, with 192 accepted
  • Dataset publicly available:


Questions? Email: Anjiang Wei <anjiang@stanford.edu>