1 of 75

An Introduction to Property-Based Testing

Zac Hatfield-Dodds & Ryan Soklaski

2 of 75

The Plan

  • Welcome, introductions ← you are here
  • Basic Property-Based Testing
    • exercises!
  • Describing your data
    • exercises!
    • break time!
  • Common Test Tactics
    • exercises!
  • Putting it into Practice
    • exercises!

3 of 75

Property-Based Testing 101

4 of 75

What is testing, anyway?

“Testing is the art and science of running your code and then checking that it did the right thing.”

Cool things that aren’t testing: assertions, type-checkers, linters, code review, coffee or sleep...

5 of 75

A few kinds of tests:

“Testing is the art and science of running your code and then checking that it did the right thing.”

Cool things that aren’t testing: assertions, type-checkers, linters, code review, coffee or sleep...

  • Unit tests
  • Integration tests
  • Snapshot tests
  • Parameterized tests
  • Fuzz tests
  • Property (-based) tests
  • Stateful model tests

See hillelwayne.com/a-bunch-of-tests/

6 of 75

7 of 75

8 of 75

9 of 75

10 of 75

11 of 75

12 of 75

13 of 75

These are properties!

Sorting is fully specified by just two properties:

  1. Output is in order
  2. Same elements in both

partial specs are still very useful for finding bugs :-)
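
As a sketch in Hypothesis (the test name is ours; the two properties are the slide's):

```python
from collections import Counter

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_properties(xs):
    result = sorted(xs)
    # 1. Output is in order
    assert all(a <= b for a, b in zip(result, result[1:]))
    # 2. Same elements in both
    assert Counter(result) == Counter(xs)
```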

14 of 75

15 of 75

16 of 75

Summary

  • Property-based testing lets us
    • Generate input data that we wouldn’t think of
    • Check that the result is not wrong, even if we don’t know the right answer
    • Discover bugs in our understanding, not just our code

  • We often don’t need assertions in the test
    • Generating ‘weird’ input data is surprisingly effective
    • If possible, put assertions in the code you’re testing

17 of 75

Exercises!

  1. github.com/rsokl/testing-tutorial/

  • $ pip install notebook numpy pytest hypothesis[cli]

  • ....

  • practice!

18 of 75

19 of 75

Describing your Data

20 of 75

An overview of Hypothesis strategies

  • Kinds of strategies
    • Scalar values - none, booleans, integers, datetimes, etc.
    • Collections - lists, tuples, dictionaries, fixed_dictionaries, etc.
    • Modifying strategies with map() and filter()
    • Specials - just, sampled_from, one_of, nothing, builds, composite, flatmap, data
    • Recursive data, three ways
    • Inferred strategies - from_type, from_dtype, mutually_broadcastable_shapes, etc.
    • By location - core/stdlib, extras, third-party extensions
  • Useful recipes
  • Exercises for this part focus on data-gen, not properties

21 of 75

Scalar values

  • None and booleans
  • Numbers - including nan support
    • min_value and max_value
  • Strings - characters, text, bytestrings
    • min_size and max_size
  • Date, time, and timezones
  • (others omitted for space)

You name it, Hypothesis can generate it. Literally.
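
For instance (the bounds and sizes below are arbitrary choices, not defaults):

```python
from hypothesis import strategies as st

st.none()                                          # only None
st.booleans()                                      # True / False
st.integers(min_value=0, max_value=100)            # bounded integers
st.floats(allow_nan=False, allow_infinity=False)   # finite floats only
st.text(min_size=1, max_size=20)                   # unicode strings
st.datetimes()                                     # naive datetimes by default
```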

22 of 75

Collections

st.lists(elements, min_size=0, max_size=None, unique_by=None, unique=False)

  • Want a variable-size collection, with similar elements?
  • Foundation for dictionaries(), sets(), iterables(), etc.

st.tuples(...)

  • Fixed-length, with a different strategy for each element

st.fixed_dictionaries()

  • Specify a strategy for each (known) key
  • Keys can be required, or optional
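
A quick sketch of all three:

```python
from hypothesis import strategies as st

# Variable-size collection of similar elements, without duplicates
st.lists(st.integers(), min_size=1, max_size=10, unique=True)

# Fixed length, with a different strategy for each element
st.tuples(st.text(), st.integers())

# A strategy per known key; keys can be required or optional
st.fixed_dictionaries(
    {"name": st.text()},
    optional={"age": st.integers(min_value=0)},
)
```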

23 of 75

Modifying strategies with .map() and .filter()

.map()

  • generate an example, then apply the function
    e.g. st.integers().map(str) → numbers as strings

.filter()

  • great for rejecting rare-ish bad inputs
  • see hypothesis.assume() - usable inside map(), strategies, and tests!
  • If more than ~20% of examples are rejected, try to find an alternative
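
A minimal sketch of both, plus the construct-don't-reject alternative:

```python
from hypothesis import strategies as st

# map(): generate an integer, then turn it into a string
numbers_as_strings = st.integers().map(str)

# filter(): fine for rejecting rare-ish bad inputs...
nonzero = st.integers().filter(lambda n: n != 0)

# ...but if too many examples get rejected, construct instead of filtering
also_nonzero = st.integers(min_value=1) | st.integers(max_value=-1)
```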

24 of 75

25 of 75

Special: just() and sampled_from()

just() “lifts” a value into a strategy that will only ever generate that value
 e.g. `timezones=just(UTC)` - if you only want to vary other args, use just()

sampled_from() chooses an element of a sequence
 e.g. `join=sampled_from(["inner", "outer"])`

works well with enums, including flag enums
 i.e. sampled_from(Permissions) can generate R, W, X, R|W, R|X, R|W|X, …
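
For example, with a made-up Permissions flag enum:

```python
import enum

from hypothesis import strategies as st

class Permissions(enum.Flag):
    R = enum.auto()
    W = enum.auto()
    X = enum.auto()

# For flag enums, sampled_from() also generates combinations like R|W
permissions = st.sampled_from(Permissions)
```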

26 of 75

Special: one_of() and nothing()

one_of() takes the union of strategies, like adding sets

nothing() is like the empty set

  • one_of(integers(), nothing()) == integers()
  • nicer to work with than “a strategy or None”

Impossible to “subtract” strategies or take the intersection 😭

27 of 75

Special: builds()

Construct custom objects - you’ll use this a lot

  • Started as error-handling and syntactic sugar over .map()
  • Has some nice type-based inference for missing args
    • more on this later
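
A sketch with a made-up dataclass - note the inferred argument:

```python
from dataclasses import dataclass

from hypothesis import strategies as st

@dataclass
class Order:
    item: str
    quantity: int

# Pass strategies for some args; `quantity` is inferred from its annotation
orders = st.builds(Order, item=st.sampled_from(["apple", "pear"]))
```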

28 of 75

Recursive data: recursive() or deferred() ?

Simple rules of thumb:

  • In @composite, writing recursive function calls just works
  • st.recursive() for simple tree-structured data like JSON
  • st.deferred() when you get a NameError
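
Sketches of the latter two (the JSON example follows the Hypothesis docs):

```python
from hypothesis import strategies as st

# st.recursive(): simple tree-structured data like JSON
json_values = st.recursive(
    st.none() | st.booleans() | st.floats(allow_nan=False) | st.text(),
    lambda children: st.lists(children)
    | st.dictionaries(st.text(), children),
)

# st.deferred(): when a strategy refers to names that don't exist yet
a = st.deferred(lambda: st.booleans() | st.tuples(b, b))
b = st.deferred(lambda: st.none() | st.tuples(a, a))
```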

29 of 75

Inferred strategies - from_type()

  • Integrated with builds(), but available standalone
  • Designed as a timesaver
    • If it’s not working, write a strategy by hand
    • Combines well; not an all-or-nothing option
  • st.register_type_strategy()
    • The automatic introspection is often good enough
    • If not, you can specify a strategy to use for your type (or even an introspection function)
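
For instance, with a made-up domain type:

```python
from hypothesis import given, strategies as st

class UserId(int):
    """Made-up domain type: valid ids are positive."""

# Teach Hypothesis about the type once...
st.register_type_strategy(UserId, st.integers(min_value=1).map(UserId))

# ...and from_type() (and builds()) will use it everywhere
@given(st.from_type(UserId))
def test_user_id(uid):
    assert uid > 0
```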

30 of 75

Inferred strategies - other

  • from_regex()
    • phonenums = from_regex(r"^\(\d{3}\) ?\d{4} ?\d{4}$")
    • varnames = from_regex(r"[a-z_A-Z0-9]+", fullmatch=True).filter(str.isidentifier)
  • from_dtype() and mutually_broadcastable_shapes(signature=...)
    • Ergonomic tools for Numpy, from basic to esoteric.
  • from_field() and from_model()
    • Django user? We’ve got you covered.

31 of 75

Inferred strategies - design tips

  • It’s a useful pattern to save time and make testing easier
  • Err on the side of too general
    • “missed alarms” are the worst test problem
    • ‘Explicit is better than implicit’
  • Support a smooth path from fully inferred to hand-written
    • e.g. builds() makes it easy to specify some args and have others inferred.

32 of 75

Special: @composite, .flatmap(), and data()

Three ways to generate data with internal dependencies and no filters (e.g. “a tuple of a list, and a valid index into the list”)

Flatmap works for simple cases: from_type(type).flatmap(from_type)

@composite is semantically equivalent, with better UX for nontrivial things
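
Both, sketched for the list-and-valid-index example above:

```python
from hypothesis import strategies as st

@st.composite
def list_and_index(draw):
    # Later draws can depend on earlier ones
    xs = draw(st.lists(st.integers(), min_size=1))
    i = draw(st.integers(min_value=0, max_value=len(xs) - 1))
    return xs, i  # use as list_and_index()

# The .flatmap() version of the same idea
list_then_index = st.lists(st.integers(), min_size=1).flatmap(
    lambda xs: st.tuples(
        st.just(xs), st.integers(min_value=0, max_value=len(xs) - 1)
    )
)
```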

33 of 75

The ‘inner composite’ trick

34 of 75

The ‘inner composite’ trick

35 of 75

Special: data()

data() allows you to draw from strategies inside your test - like @composite plus awareness of the test-so-far

Upside: incredibly flexible and powerful; arbitrary state and dependencies
Downside: can be too flexible and powerful, complicated failure reports

Summary: use data() if you need it
… but if @composite would also work, use the simpler tool instead.
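
A minimal sketch - note that the draws happen inside the test body:

```python
from hypothesis import given, strategies as st

@given(st.data())
def test_draws_inside_the_test(data):
    # draw() can depend on anything the test has computed so far
    xs = data.draw(st.lists(st.integers(), min_size=1), label="xs")
    i = data.draw(st.integers(min_value=0, max_value=len(xs) - 1), label="index")
    assert xs[i] in xs
```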

36 of 75

Where to look for strategies

  • Core strategies
    • No dependencies outside the standard library, except for backports, e.g. zoneinfo for timezones()
    • Found in `hypothesis.strategies`
  • Extras strategies
    • Support for libraries such as Numpy, Pandas, Django, dateutil, etc.
    • Found in `hypothesis.extra.<libname>`
  • Third-party extensions

37 of 75

Exercises!

Aiming to teach a way of thinking

  • strategies are primitives
  • combine them however you like

“Duct tape mindset”: if it’s not working yet, use more!

38 of 75

Break time!

39 of 75

40 of 75

The Plan

  • Welcome, introductions
  • Basic Property-Based Testing
    • exercises!
  • Describing your data
    • exercises!
    • break time!
  • Common Test Tactics ← you are here
    • exercises!
  • Putting it into Practice
    • exercises!

41 of 75

Common Test Tactics

42 of 75

Common properties you can test

  • Common properties
    • Fuzzing / “does not crash”
    • Roundtrip pairs
    • Equivalent functions
    • Metamorphic properties
  • Situational properties
    • Checking the output
    • Idempotent, commutative, associative, etc
    • Stateful / model-based tests
  • ‘Ghostwriting’ tests

43 of 75

this works shockingly well

(especially with assertions in your code)

[code screenshot: the test just calls the function under test - no assert in sight]
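
Something like this sketch (mypackage and do_something are stand-ins for your own code):

```python
from hypothesis import given, strategies as st

from mypackage import do_something  # stand-in for the code under test

@given(st.text())
def test_does_not_crash(s):
    # No assert here - we're checking "doesn't raise", plus whatever
    # assertions do_something() makes internally
    do_something(s)
```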

44 of 75

Roundtrips

Every codebase has roundtrips:

  • save/load
  • encode/decode
  • send/receive
  • converting between data formats
  • logical inverses

They’re critical to our code, they have complicated inputs and outputs, errors are common, and their logic bugs are prone to silent failure.

Property-test all your round-trips!
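
e.g. a sketch for the stdlib json round-trip:

```python
import json

from hypothesis import given, strategies as st

json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda kids: st.lists(kids) | st.dictionaries(st.text(), kids),
)

@given(json_values)
def test_json_roundtrip(value):
    assert json.loads(json.dumps(value)) == value
```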

45 of 75

Equivalent functions

Exactly equivalent:

  • Single-thread vs. multi-thread
  • Old version vs. new version
  • foo(bar(x)) vs. bar(foo(x))

Sometimes equivalent:

  • “same for a subset of inputs”
  • “same, unless FoobarError is raised”
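
A sketch of the old-vs-new pattern (both functions here are trivial stand-ins):

```python
from hypothesis import given, strategies as st

def old_sort(xs):  # stand-in for the battle-tested version
    return sorted(xs)

def new_sort(xs):  # stand-in for the shiny rewrite
    return sorted(xs)

@given(st.lists(st.integers()))
def test_new_matches_old(xs):
    assert new_sort(xs) == old_sort(xs)
```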

46 of 75

Validate the output

I always feel silly writing these checks, but sometimes they catch a bug

  • Numbers are in the valid range (or at least finite)
  • Got the expected type
  • No empty strings, no null characters

Best to write these assertions in your code, not tests

  • Think “fast feedback for future changes”
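
A toy example of assertions living in the code itself:

```python
def post_tax_income(income: float, rate: float) -> float:
    assert income >= 0 and 0 <= rate <= 1, (income, rate)
    result = income * (1 - rate)
    # Feels silly, sometimes catches a bug - and it's fast feedback
    # for future changes
    assert 0 <= result <= income, (income, rate, result)
    return result
```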

47 of 75

Idempotent, commutative, associative, etc

Thanks to Haskell, property-based testing is named for “algebraic properties”

  • They’re pretty rare in Python code
  • But if you have them, might as well test them!

More common for set-like than number-like operations, e.g.
blog.developer.atlassian.com/programming-with-algebra/ found them very useful in merging event streams
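
Two quick sketches, using a toy normalize() for idempotence:

```python
from hypothesis import given, strategies as st

def normalize(s: str) -> str:  # toy example: collapse whitespace
    return " ".join(s.split())

@given(st.text())
def test_normalize_is_idempotent(s):
    once = normalize(s)
    assert normalize(once) == once

@given(st.sets(st.integers()), st.sets(st.integers()))
def test_set_union_is_commutative(a, b):
    assert a | b == b | a
```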

48 of 75

Model based / stateful testing

  • Get Hypothesis to choose sequences of actions, as well as input data
  • Very powerful, great for exploring APIs
  • TBH this would be a whole workshop, so I’m just telling you they exist (minimal sketch below)
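
A minimal sketch, testing a toy queue against a trivially-correct model:

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class QueueMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.queue = []  # stand-in for the system under test
        self.model = []  # trivially-correct model to compare against

    @rule(x=st.integers())
    def push(self, x):
        self.queue.append(x)
        self.model.append(x)

    @invariant()
    def agrees_with_model(self):
        assert self.queue == self.model

TestQueue = QueueMachine.TestCase  # collected by pytest/unittest
```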

49 of 75

Metamorphic relations

i.e. between two related inputs

  • Modify uncovered code → same result
  • Double inputs, double outputs (and elements in same order)
  • Negate + sort = sort + reverse + negate
  • +timedelta, to UTC = to UTC + timedelta
  • > income → >= post-tax income

e.g. add noops → function is equivalent, or known change → known change
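
e.g. the negate-and-sort relation as a test:

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_negate_sort_metamorphic(xs):
    # Negate + sort = sort + reverse + negate
    assert sorted(-x for x in xs) == [-x for x in reversed(sorted(xs))]
```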

50 of 75

assert, fuzz, roundtrip,

and then relax according to the 80/20 rule.

51 of 75

$ hypothesis write my.tests

an interactive live demo of the Ghostwriter.

52 of 75

Exercises!

53 of 75

54 of 75

Putting it into Practice

55 of 75

Beyond the principles

You’ve learned the principles. Now, some tips for the real world

  • Design patterns for property-based test suites
    • Including when not to use them
    • Writing custom strategies for your project
  • Settings and settings profiles
  • Whether and how to share the example database
  • Coverage-guided fuzzing (Atheris or HypoFuzz)
  • Hypothesis’ release cadence, when to update

and then our final exercises will be real-world bughunting :-)

56 of 75

Designing PBT suites

PBT is part of a more general test plan - not a panacea!

  • Depending on the project, we typically use 10%-90% property-based tests

Custom strategies for your project

  • Single source of truth for “what weird edge cases do we need to handle”
  • Updating tests or strategies independently is much nicer
  • Some patterns (sketched below):
    • functions which return strategies (possibly @composite)
    • assign commonly-used strategies to module-level globals
    • use register_type_strategy() for custom types
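
A sketch of those patterns in a project-local strategies module (Money is a stand-in type):

```python
from dataclasses import dataclass

from hypothesis import strategies as st

@dataclass(frozen=True)
class Money:  # stand-in for a real domain type
    value: object
    currency: str

# Pattern: useful module-level globals
usernames = st.text(alphabet="abcdefghijklmnopqrstuvwxyz_", min_size=1, max_size=32)

# Pattern: functions which return strategies
def amounts(currency="USD", max_value=1_000_000):
    return st.builds(
        Money,
        value=st.decimals(min_value=0, max_value=max_value, places=2),
        currency=st.just(currency),
    )

# Pattern: register for custom types, so from_type()/builds() find them
st.register_type_strategy(Money, amounts())
```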

57 of 75

Better print()-debugging with note() and event()

note()

  • “print, but only during the final/minimal example”

event()

  • What proportion of inputs had this event?
  • Shown in statistics, not printed
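
Both in one sketch:

```python
from hypothesis import event, given, note, strategies as st

@given(st.lists(st.integers()))
def test_with_debugging_output(xs):
    note(f"sorted: {sorted(xs)}")  # printed only for the failing example
    event("empty" if not xs else "nonempty")  # tallied in the statistics
    assert sorted(sorted(xs)) == sorted(xs)
```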

58 of 75

Runtime Statistics

[example statistics output here, showing an event() and target report]

59 of 75

Dealing with external randomness

Random number generators:

  • Naive-random testing, e.g. Faker (no shrinking or search!)
  • Scheduling, e.g. backoff-with-jitter or async internals
  • Simulations

Best option: pass a random.Random() from the st.randoms() strategy

(non-PRNG randomness like thread timings is basically out of scope, sorry)
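
e.g. a backoff-with-jitter sketch (the function under test is our own toy):

```python
import random

from hypothesis import given, strategies as st

def jittered_backoff(attempt: int, rng: random.Random) -> float:
    # toy implementation: exponential backoff with jitter, capped at 60s
    return min(60.0, (2 ** attempt) * rng.random())

@given(st.integers(min_value=0, max_value=10), st.randoms())
def test_backoff_in_range(attempt, rng):
    # rng is a seeded random.Random, so failures shrink and replay
    assert 0.0 <= jittered_backoff(attempt, rng) <= 60.0
```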

60 of 75

Dealing with global PRNGs

If you can’t pass a Random() instance…

random_module() will vary the seeds of all known ‘global’ PRNGs

hypothesis.register_random() can add to the list

Consider requesting upstream integration via a plugin (Zac is usually happy to write these)
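
A sketch of both tools (the global PRNG here is a stand-in for a library's hidden one):

```python
import random

from hypothesis import given, register_random, strategies as st

_hidden_rng = random.Random()  # stand-in for a library's global PRNG
register_random(_hidden_rng)   # Hypothesis will now seed and restore it

@given(st.random_module())     # also reseeds the stdlib `random` module per test
def test_uses_global_randomness(seed):
    assert 0.0 <= _hidden_rng.random() < 1.0
```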

61 of 75

Settings

Profiles

  • Set from code, so you can use env vars or config files or whatever

As a decorator on a test function (quick and dirty)

From the pytest command-line (inc. profile selection)
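
A typical conftest.py sketch (the profile names are ours):

```python
import os

from hypothesis import settings

# Define profiles once, in code...
settings.register_profile("ci", max_examples=1000, derandomize=True)
settings.register_profile("dev", max_examples=10)

# ...then pick one via an env var here, or --hypothesis-profile=ci on the CLI
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```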

62 of 75

Settings - performance

  • per-test-case deadline
    • If 200ms is too short, increase it to whatever
    • If VM performance is flaky, disable it entirely on VMs.
  • max_examples
    • 100 by default. Higher (or lower) takes proportionally more (or less) time

Check --hypothesis-show-statistics to see timings, including proportion of time spent generating data vs executing your test function

63 of 75

Settings - determinism

Maybe you only want to know about new bugs in CI

  • derandomize=True
  • --hypothesis-seed=N

And then run in nondeterministic mode everywhere else.

See blog.nelhage.com/post/two-kinds-of-testing/

64 of 75

Reproducing failures

  • Just re-run the test, and the database will do it
  • Add an explicit example
  • Use print_blob=True and @reproduce_failure

65 of 75

Reproducing failures

  • Just re-run the test, and the database will do it
  • Add an explicit example
  • Use print_blob=True and @reproduce_failure

Temporary decorator, but great in CI when printing doesn’t work
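
All three, sketched (the @reproduce_failure arguments are placeholders - paste what CI prints):

```python
from hypothesis import example, given, reproduce_failure, settings, strategies as st

@example("")  # an explicit example, always run
# @reproduce_failure("6.0.0", b"...")  # temporary: paste the blob from CI
@settings(print_blob=True)
@given(st.text())
def test_encode_decode_roundtrip(s):
    assert s.encode("utf-8").decode("utf-8") == s
```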

66 of 75

Sharing the database

You could share the directory-based DB, but it’s much better to use our native tools (a sketch follows):

(and it’s easy to implement a Hypothesis DB on any key-value store)
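
A sketch using the built-in database classes (the paths are illustrative):

```python
from hypothesis import settings
from hypothesis.database import (
    DirectoryBasedExampleDatabase,
    MultiplexedDatabase,
    ReadOnlyDatabase,
)

# Read known failures from a shared location, but only ever write locally
shared = ReadOnlyDatabase(DirectoryBasedExampleDatabase("/shared/hypothesis"))
local = DirectoryBasedExampleDatabase(".hypothesis/examples")
settings.register_profile("shared-db", database=MultiplexedDatabase(local, shared))
```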

67 of 75

target()-guided testing

Hypothesis is mostly “blackbox” - using heuristics and diversity-sampling.
This is better than random, but a directed search is better again.

hypothesis.target(score_to_maximise, label="for multi-objective optimisation")

  • chance of finding a bug improves over time
  • known uses:
    • `target(abs(a-b)) < error_threshold`
    • number of elements in a collection, tasks in a queue, steps executed
    • mean or maximum runtime of a task (or both, if you use `label`)
    • compression ratio for data (perhaps per-algorithm or per-level)
    • `1 if was_valid_input else 0` (avoids filtering problems)
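
A self-contained sketch of the first pattern:

```python
import math

from hypothesis import given, target, strategies as st

@given(st.floats(min_value=1.0, max_value=1e6))
def test_sqrt_roundtrip_error_is_small(x):
    error = abs(math.sqrt(x) ** 2 - x)
    target(error)  # steer generation towards inputs with larger error
    assert error < 1e-4
```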

68 of 75

Coverage-guided fuzzing: Atheris

No targets? No problem - target “executed this line of code”!

Atheris is Google’s libFuzzer wrapper for Python

  • Designed to run a single function for hours or days
  • Great for C-extensions and native code
  • Hypothesis integrates well with traditional fuzzers
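
A sketch of that integration, via the .hypothesis.fuzz_one_input hook (Atheris usage per its README):

```python
import sys

import atheris

from hypothesis import given, strategies as st

@given(st.text())
def test_encode_decode(s):
    assert s.encode("utf-8").decode("utf-8") == s

if __name__ == "__main__":
    # Any @given test exposes a bytes-in interface for external fuzzers
    atheris.Setup(sys.argv, test_encode_decode.hypothesis.fuzz_one_input)
    atheris.Fuzz()
```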

69 of 75

Coverage-guided fuzzing: HypoFuzz

HypoFuzz is Zac’s fuzzing engine for Hypothesis test suites

  • pure-python
  • target() and event()
  • runs all the tests at once

Better workflow integration, great database support*, etc.

*but not better than .fuzz_one_input 😇

70 of 75

Where to go for support

hypothesis.readthedocs.io/en/latest/support.html

We do not promise free support, but you can try:

  • StackOverflow
  • mailing list
  • IRC

If you have a support or training budget, email us!

71 of 75

Updating Hypothesis

We do continuous deployment - every PR is a new release

Update on the schedule that works for you, e.g. weekly, monthly, or to get a new feature or perf improvement

We take stability very seriously
...but you should still pin all your transitive dependencies

72 of 75

Exercises!

73 of 75

74 of 75

Q&A time

last chance before we wrap up

75 of 75

Thanks for coming!

Now go forth and test everything :-)