1 of 75

An Introduction to Property-Based Testing

Zac Hatfield-Dodds & Ryan Soklaski

2 of 75

The Plan

  • Welcome, introductions ← you are here
  • Basic Property-Based Testing
    • exercises!
  • Describing your data
    • exercises!
    • break time!
  • Common Test Tactics
    • exercises!
  • Putting it into Practice
    • exercises!

3 of 75

Property-Based Testing 101

4 of 75

What is testing, anyway?

“Testing is the art and science of running your code and then checking that it did the right thing.”

Cool things that aren’t testing: assertions, type-checkers, linters, code review, coffee or sleep...

5 of 75

A few kinds of tests:

“Testing is the art and science of running your code and then checking that it did the right thing.”

Cool things that aren’t testing: assertions, type-checkers, linters, code review, coffee or sleep...

  • Unit tests
  • Integration tests
  • Snapshot tests
  • Parameterized tests
  • Fuzz tests
  • Property (-based) tests
  • Stateful model tests

See hillelwayne.com/a-bunch-of-tests/

6 of 75

7 of 75

8 of 75

9 of 75

10 of 75

11 of 75

12 of 75

13 of 75

These are properties!

Sorting is fully specified by just two properties:

  1. Output is in order
  2. Same elements in both

partial specs are still very useful for finding bugs :-)
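
As a sketch in Hypothesis (the test name is ours; the two properties are the slide's):

```python
from collections import Counter

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_properties(xs):
    result = sorted(xs)
    # 1. Output is in order
    assert all(a <= b for a, b in zip(result, result[1:]))
    # 2. Same elements in both
    assert Counter(result) == Counter(xs)
```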

14 of 75

15 of 75

16 of 75

Summary

  • Property-based testing lets us
    • Generate input data that we wouldn’t think of
    • Check that the result is not wrong, even if we don’t know the right answer
    • Discover bugs in our understanding, not just our code

  • We often don’t need assertions in the test
    • Generating ‘weird’ input data is surprisingly effective
    • If possible, put assertions in the code you’re testing

17 of 75

Exercises!

  1. github.com/rsokl/testing-tutorial/

  • $ pip install notebook numpy pytest hypothesis[cli]

  • ....

  • practice!

18 of 75

19 of 75

Describing your Data

20 of 75

An overview of Hypothesis strategies

  • Kinds of strategies
    • Scalar values - none, booleans, integers, datetimes, etc.
    • Collections - lists, tuples, dictionaries, fixed_dictionaries, etc.
    • Modifying strategies with map() and filter()
    • Specials - just, sampled_from, one_of, nothing, builds, composite, flatmap, data
    • Recursive data, three ways
    • Inferred strategies - from_type, from_dtype, mutually_broadcastable_shapes, etc.
    • By location - core/stdlib, extras, third-party extensions
  • Useful recipes
  • Exercises for this part focus on data-gen, not properties

21 of 75

Scalar values

  • None and booleans
  • Numbers - including nan support
    • min_value and max_value
  • Strings - characters, text, bytestrings
    • min_size and max_size
  • Date, time, and timezones
  • (others omitted for space)

You name it, Hypothesis can generate it. Literally.
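
For instance (the bounds and sizes below are arbitrary choices, not defaults):

```python
from hypothesis import strategies as st

st.none()                                          # only None
st.booleans()                                      # True / False
st.integers(min_value=0, max_value=100)            # bounded integers
st.floats(allow_nan=False, allow_infinity=False)   # finite floats only
st.text(min_size=1, max_size=20)                   # unicode strings
st.datetimes()                                     # naive datetimes by default
```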

22 of 75

Collections

st.lists(elements, min_size=0, max_size=None, unique_by=None, unique=False)

  • Want a variable-size collection, with similar elements?
  • Foundation for dictionaries(), sets(), iterables(), etc.

st.tuples(...)

  • Fixed-length, with a different strategy for each element

st.fixed_dictionaries()

  • Specify a strategy for each (known) key
  • Keys can be required, or optional
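
A quick sketch of all three:

```python
from hypothesis import strategies as st

# Variable-size collection of similar elements, without duplicates
st.lists(st.integers(), min_size=1, max_size=10, unique=True)

# Fixed length, with a different strategy for each element
st.tuples(st.text(), st.integers())

# A strategy per known key; keys can be required or optional
st.fixed_dictionaries(
    {"name": st.text()},
    optional={"age": st.integers(min_value=0)},
)
```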

23 of 75

Modifying strategies with .map() and .filter()

.map()

  • generate an example, then apply the function
    e.g. st.integers().map(str) → numbers as strings

.filter()

  • great for rejecting rare-ish bad inputs
  • see hypothesis.assume() - usable inside map(), strategies, and tests!
  • If more than ~20% of examples are rejected, try to find an alternative
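
A minimal sketch of both, plus the construct-don't-reject alternative:

```python
from hypothesis import strategies as st

# map(): generate an integer, then turn it into a string
numbers_as_strings = st.integers().map(str)

# filter(): fine for rejecting rare-ish bad inputs...
nonzero = st.integers().filter(lambda n: n != 0)

# ...but if too many examples get rejected, construct instead of filtering
also_nonzero = st.integers(min_value=1) | st.integers(max_value=-1)
```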

24 of 75

25 of 75

Special: just() and sampled_from()

just() “lifts” a value into a strategy that will only ever generate that value
 e.g. `timezones=just(UTC)` - if you only want to vary other args, use just()

sampled_from() chooses an element of a sequence
 e.g. `join=sampled_from(["inner", "outer"])`

works well with enums, including flag enums
 i.e. sampled_from(Permissions) can generate R, W, X, R|W, R|X, R|W|X, …
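
For example, with a made-up Permissions flag enum:

```python
import enum

from hypothesis import strategies as st

class Permissions(enum.Flag):
    R = enum.auto()
    W = enum.auto()
    X = enum.auto()

# For flag enums, sampled_from() also generates combinations like R|W
permissions = st.sampled_from(Permissions)
```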

26 of 75

Special: one_of() and nothing()

one_of() takes the union of strategies, like adding sets

nothing() is like the empty set

  • one_of(integers(), nothing()) == integers()
  • nicer to work with than “a strategy or None”

Impossible to “subtract” strategies or take the intersection 😭

27 of 75

Special: builds()

Construct custom objects - you’ll use this a lot

  • Started as error-handling and syntactic sugar over .map()
  • Has some nice type-based inference for missing args
    • more on this later
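
A sketch with a made-up dataclass - note the inferred argument:

```python
from dataclasses import dataclass

from hypothesis import strategies as st

@dataclass
class Order:
    item: str
    quantity: int

# Pass strategies for some args; `quantity` is inferred from its annotation
orders = st.builds(Order, item=st.sampled_from(["apple", "pear"]))
```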

28 of 75

Recursive data: recursive() or deferred() ?

Simple rules of thumb:

  • In @composite, writing recursive function calls just works
  • st.recursive() for simple tree-structured data like JSON
  • st.deferred() when you get a NameError
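
Sketches of the latter two (the JSON example follows the Hypothesis docs):

```python
from hypothesis import strategies as st

# st.recursive(): simple tree-structured data like JSON
json_values = st.recursive(
    st.none() | st.booleans() | st.floats(allow_nan=False) | st.text(),
    lambda children: st.lists(children)
    | st.dictionaries(st.text(), children),
)

# st.deferred(): when a strategy refers to names that don't exist yet
a = st.deferred(lambda: st.booleans() | st.tuples(b, b))
b = st.deferred(lambda: st.none() | st.tuples(a, a))
```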

29 of 75

Inferred strategies - from_type()

  • Integrated with builds(), but available standalone
  • Designed as a timesaver
    • If it’s not working, write a strategy by hand
    • Combines well; not an all-or-nothing option
  • st.register_type_strategy()
    • The automatic introspection is often good enough
    • If not, you can specify a strategy to use for your type (or even an introspection function)
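
For instance, with a made-up domain type:

```python
from hypothesis import given, strategies as st

class UserId(int):
    """Made-up domain type: valid ids are positive."""

# Teach Hypothesis about the type once...
st.register_type_strategy(UserId, st.integers(min_value=1).map(UserId))

# ...and from_type() (and builds()) will use it everywhere
@given(st.from_type(UserId))
def test_user_id(uid):
    assert uid > 0
```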

30 of 75

Inferred strategies - other

  • from_regex()
    • phonenums = from_regex(r"^\(\d{3}\) ?\d{4} ?\d{4}$")
    • varnames = from_regex(r"[a-z_A-Z0-9]+", fullmatch=True).filter(str.isidentifier)
  • from_dtype() and mutually_broadcastable_shapes(signature=...)
    • Ergonomic tools for Numpy, from basic to esoteric.
  • from_field() and from_model()
    • Django user? We’ve got you covered.

31 of 75

Inferred strategies - design tips

  • It’s a useful pattern to save time and make testing easier
  • Err on the side of too general
    • “missed alarms” are the worst test problem
    • ‘Explicit is better than implicit’
  • Support a smooth path from fully inferred to hand-written
    • e.g. builds() makes it easy to specify some args and have others inferred.

32 of 75

Special: @composite, .flatmap(), and data()

Three ways to generate data with internal dependencies and no filters (e.g. “a tuple of a list, and a valid index into the list”)

Flatmap works for simple cases: from_type(type).flatmap(from_type)

@composite is semantically equivalent, with better UX for nontrivial things
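
Both, sketched for the list-and-valid-index example above:

```python
from hypothesis import strategies as st

@st.composite
def list_and_index(draw):
    # Later draws can depend on earlier ones
    xs = draw(st.lists(st.integers(), min_size=1))
    i = draw(st.integers(min_value=0, max_value=len(xs) - 1))
    return xs, i  # use as list_and_index()

# The .flatmap() version of the same idea
list_then_index = st.lists(st.integers(), min_size=1).flatmap(
    lambda xs: st.tuples(
        st.just(xs), st.integers(min_value=0, max_value=len(xs) - 1)
    )
)
```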

33 of 75

The ‘inner composite’ trick

34 of 75

The ‘inner composite’ trick

35 of 75

Special: data()

data() allows you to draw from strategies inside your test - like @composite plus awareness of the test-so-far

Upside: incredibly flexible and powerful; arbitrary state and dependencies
Downside: can be too flexible and powerful, complicated failure reports

Summary: use data() if you need it
… but if @composite would also work, use the simpler tool instead.
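
A minimal sketch - note that the draws happen inside the test body:

```python
from hypothesis import given, strategies as st

@given(st.data())
def test_draws_inside_the_test(data):
    # draw() can depend on anything the test has computed so far
    xs = data.draw(st.lists(st.integers(), min_size=1), label="xs")
    i = data.draw(st.integers(min_value=0, max_value=len(xs) - 1), label="index")
    assert xs[i] in xs
```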

36 of 75

Where to look for strategies

  • Core strategies
    • No dependencies outside the standard library, except for backports, e.g. zoneinfo for timezones()
    • Found in `hypothesis.strategies`
  • Extras strategies
    • Support for libraries such as Numpy, Pandas, Django, dateutil, etc.
    • Found in `hypothesis.extra.<libname>`
  • Third-party extensions

37 of 75

Exercises!

Aiming to teach a way of thinking

  • strategies are primitives
  • combine them however you like

“Duct tape mindset”: if it’s not working yet, use more!

38 of 75

Break time!

39 of 75

40 of 75

The Plan

  • Welcome, introductions
  • Basic Property-Based Testing
    • exercises!
  • Describing your data
    • exercises!
    • break time!
  • Common Test Tactics ← you are here
    • exercises!
  • Putting it into Practice
    • exercises!

41 of 75

Common Test Tactics

42 of 75

Common properties you can test

  • Common properties
    • Fuzzing / “does not crash”
    • Roundtrip pairs
    • Equivalent functions
    • Metamorphic properties
  • Situational properties
    • Checking the output
    • Idempotent, commutative, associative, etc
    • Stateful / model-based tests
  • ‘Ghostwriting’ tests

43 of 75

this works shockingly well

(especially with assertions in your code)

[code screenshot: the test just calls the function under test - no assert in sight]
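
Something like this sketch (mypackage and do_something are stand-ins for your own code):

```python
from hypothesis import given, strategies as st

from mypackage import do_something  # stand-in for the code under test

@given(st.text())
def test_does_not_crash(s):
    # No assert here - we're checking "doesn't raise", plus whatever
    # assertions do_something() makes internally
    do_something(s)
```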

44 of 75

Roundtrips

Every codebase has roundtrips:

  • save/load
  • encode/decode
  • send/receive
  • converting between data formats
  • logical inverses

They’re critical to our code, they have complicated inputs and outputs, errors are common, and their logic bugs are prone to silent failure.

Property-test all your round-trips!
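
e.g. a sketch for the stdlib json round-trip:

```python
import json

from hypothesis import given, strategies as st

json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda kids: st.lists(kids) | st.dictionaries(st.text(), kids),
)

@given(json_values)
def test_json_roundtrip(value):
    assert json.loads(json.dumps(value)) == value
```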

45 of 75

Equivalent functions

Exactly equivalent:

  • Single-thread vs. multi-thread
  • Old version vs. new version
  • foo(bar(x)) vs. bar(foo(x))

Sometimes equivalent:

  • “same for a subset of inputs”
  • “same, unless FoobarError is raised”
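
A sketch of the old-vs-new pattern (both functions here are trivial stand-ins):

```python
from hypothesis import given, strategies as st

def old_sort(xs):  # stand-in for the battle-tested version
    return sorted(xs)

def new_sort(xs):  # stand-in for the shiny rewrite
    return sorted(xs)

@given(st.lists(st.integers()))
def test_new_matches_old(xs):
    assert new_sort(xs) == old_sort(xs)
```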

46 of 75

Validate the output

I always feel silly writing these checks, but sometimes they catch a bug

  • Numbers are in the valid range (or at least finite)
  • Got the expected type
  • No empty strings, no null characters

Best to write these assertions in your code, not tests

  • Think “fast feedback for future changes”
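
A toy example of assertions living in the code itself:

```python
def post_tax_income(income: float, rate: float) -> float:
    assert income >= 0 and 0 <= rate <= 1, (income, rate)
    result = income * (1 - rate)
    # Feels silly, sometimes catches a bug - and it's fast feedback
    # for future changes
    assert 0 <= result <= income, (income, rate, result)
    return result
```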

47 of 75

Idempotent, commutative, associative, etc

Thanks to Haskell, property-based testing is named for “algebraic properties”

  • They’re pretty rare in Python code
  • But if you have them, might as well test them!

More common for set-like than number-like operations, e.g.
blog.developer.atlassian.com/programming-with-algebra/ found them very useful in merging event streams
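
Two quick sketches, using a toy normalize() for idempotence:

```python
from hypothesis import given, strategies as st

def normalize(s: str) -> str:  # toy example: collapse whitespace
    return " ".join(s.split())

@given(st.text())
def test_normalize_is_idempotent(s):
    once = normalize(s)
    assert normalize(once) == once

@given(st.sets(st.integers()), st.sets(st.integers()))
def test_set_union_is_commutative(a, b):
    assert a | b == b | a
```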

48 of 75

Model based / stateful testing

  • Get Hypothesis to choose sequences of actions, as well as input data
  • Very powerful, great for exploring APIs
  • TBH this would be a whole workshop, so I’m just telling you they exist (minimal sketch below)
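
A minimal sketch, testing a toy queue against a trivially-correct model:

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class QueueMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.queue = []  # stand-in for the system under test
        self.model = []  # trivially-correct model to compare against

    @rule(x=st.integers())
    def push(self, x):
        self.queue.append(x)
        self.model.append(x)

    @invariant()
    def agrees_with_model(self):
        assert self.queue == self.model

TestQueue = QueueMachine.TestCase  # collected by pytest/unittest
```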

49 of 75

Metamorphic relations

i.e. between two related inputs

  • Modify uncovered code → same result
  • Double inputs, double outputs (and elements in same order)
  • Negate + sort = sort + reverse + negate
  • +timedelta, to UTC = to UTC + timedelta
  • > income → >= post-tax income

e.g. add noops → function is equivalent, or known change → known change
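
e.g. the negate-and-sort relation as a test:

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_negate_sort_metamorphic(xs):
    # Negate + sort = sort + reverse + negate
    assert sorted(-x for x in xs) == [-x for x in reversed(sorted(xs))]
```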

50 of 75

assert, fuzz, roundtrip,

and then relax according to the 80/20 rule.

51 of 75

$ hypothesis write my.tests

an interactive live demo of the Ghostwriter.

52 of 75

Exercises!

53 of 75

54 of 75

Putting it into Practice

55 of 75

Beyond the principles

You’ve learned the principles. Now, some tips for the real world

  • Design patterns for property-based test suites
    • Including when not to use them
    • Writing custom strategies for your project
  • Settings and settings profiles
  • Whether and how to share the example database
  • Coverage-guided fuzzing (Atheris or HypoFuzz)
  • Hypothesis’ release cadence, when to update

and then our final exercises will be real-world bughunting :-)

56 of 75

Designing PBT suites

PBT is part of a more general test plan - not a panacea!

  • Depending on the project, we typically use 10%-90% property-based tests

Custom strategies for your project

  • Single source of truth for “what weird edge cases do we need to handle”
  • Updating tests or strategies independently is much nicer
  • Some patterns (sketched below):
    • functions which return strategies (possibly @composite)
    • assign commonly-used strategies to module-level globals
    • use register_type_strategy() for custom types
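
A sketch of those patterns in a project-local strategies module (Money is a stand-in type):

```python
from dataclasses import dataclass

from hypothesis import strategies as st

@dataclass(frozen=True)
class Money:  # stand-in for a real domain type
    value: object
    currency: str

# Pattern: useful module-level globals
usernames = st.text(alphabet="abcdefghijklmnopqrstuvwxyz_", min_size=1, max_size=32)

# Pattern: functions which return strategies
def amounts(currency="USD", max_value=1_000_000):
    return st.builds(
        Money,
        value=st.decimals(min_value=0, max_value=max_value, places=2),
        currency=st.just(currency),
    )

# Pattern: register for custom types, so from_type()/builds() find them
st.register_type_strategy(Money, amounts())
```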

57 of 75

Better print()-debugging with note() and event()

note()

  • “print, but only during the final/minimal example”

event()

  • What proportion of inputs had this event?
  • Shown in statistics, not printed
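
Both in one sketch:

```python
from hypothesis import event, given, note, strategies as st

@given(st.lists(st.integers()))
def test_with_debugging_output(xs):
    note(f"sorted: {sorted(xs)}")  # printed only for the failing example
    event("empty" if not xs else "nonempty")  # tallied in the statistics
    assert sorted(sorted(xs)) == sorted(xs)
```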

58 of 75

Runtime Statistics

[example statistics output here, showing an event() and target report]

59 of 75

Dealing with external randomness

Random number generators:

  • Naive-random testing, e.g. Faker (no shrinking or search!)
  • Scheduling, e.g. backoff-with-jitter or async internals
  • Simulations

Best option: pass a random.Random() from the st.randoms() strategy

(non-PRNG randomness like thread timings is basically out of scope, sorry)
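
e.g. a backoff-with-jitter sketch (the function under test is our own toy):

```python
import random

from hypothesis import given, strategies as st

def jittered_backoff(attempt: int, rng: random.Random) -> float:
    # toy implementation: exponential backoff with jitter, capped at 60s
    return min(60.0, (2 ** attempt) * rng.random())

@given(st.integers(min_value=0, max_value=10), st.randoms())
def test_backoff_in_range(attempt, rng):
    # rng is a seeded random.Random, so failures shrink and replay
    assert 0.0 <= jittered_backoff(attempt, rng) <= 60.0
```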

60 of 75

Dealing with global PRNGs

If you can’t pass a Random() instance…

random_module() will vary the seeds of all known ‘global’ PRNGs

hypothesis.register_random() can add to the list

Consider requesting upstream integration via a plugin (Zac is usually happy to write these)
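
A sketch of both tools (the global PRNG here is a stand-in for a library's hidden one):

```python
import random

from hypothesis import given, register_random, strategies as st

_hidden_rng = random.Random()  # stand-in for a library's global PRNG
register_random(_hidden_rng)   # Hypothesis will now seed and restore it

@given(st.random_module())     # also reseeds the stdlib `random` module per test
def test_uses_global_randomness(seed):
    assert 0.0 <= _hidden_rng.random() < 1.0
```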

61 of 75

Settings

Profiles

  • Set from code, so you can use env vars or config files or whatever

As a decorator on a test function (quick and dirty)

From the pytest command-line (inc. profile selection)
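
A typical conftest.py sketch (the profile names are ours):

```python
import os

from hypothesis import settings

# Define profiles once, in code...
settings.register_profile("ci", max_examples=1000, derandomize=True)
settings.register_profile("dev", max_examples=10)

# ...then pick one via an env var here, or --hypothesis-profile=ci on the CLI
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```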

62 of 75

Settings - performance

  • per-test-case deadline
    • If 200ms is too short, increase it to whatever
    • If VM performance is flaky, disable it entirely on VMs.
  • max_examples
    • 100 by default. Higher (or lower) takes proportionally more (or less) time

Check --hypothesis-show-statistics to see timings, including proportion of time spent generating data vs executing your test function

63 of 75

Settings - determinism

Maybe you only want to know about new bugs in CI

  • derandomize=True
  • --hypothesis-seed=N

And then run in nondeterministic mode everywhere else.

See blog.nelhage.com/post/two-kinds-of-testing/

64 of 75

Reproducing failures

  • Just re-run the test, and the database will do it
  • Add an explicit example
  • Use print_blob=True and @reproduce_failure

65 of 75

Reproducing failures

  • Just re-run the test, and the database will do it
  • Add an explicit example
  • Use print_blob=True and @reproduce_failure

Temporary decorator, but great in CI when printing doesn’t work
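
All three, sketched (the @reproduce_failure arguments are placeholders - paste what CI prints):

```python
from hypothesis import example, given, reproduce_failure, settings, strategies as st

@example("")  # an explicit example, always run
# @reproduce_failure("6.0.0", b"...")  # temporary: paste the blob from CI
@settings(print_blob=True)
@given(st.text())
def test_encode_decode_roundtrip(s):
    assert s.encode("utf-8").decode("utf-8") == s
```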

66 of 75

Sharing the database

You could share the directory-based DB, but it’s much better to use our native tools (a sketch follows):

(and it’s easy to implement a Hypothesis DB on any key-value store)
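
A sketch using the built-in database classes (the paths are illustrative):

```python
from hypothesis import settings
from hypothesis.database import (
    DirectoryBasedExampleDatabase,
    MultiplexedDatabase,
    ReadOnlyDatabase,
)

# Read known failures from a shared location, but only ever write locally
shared = ReadOnlyDatabase(DirectoryBasedExampleDatabase("/shared/hypothesis"))
local = DirectoryBasedExampleDatabase(".hypothesis/examples")
settings.register_profile("shared-db", database=MultiplexedDatabase(local, shared))
```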

67 of 75

target()-guided testing

Hypothesis is mostly “blackbox” - using heuristics and diversity-sampling.
This is better than random, but a directed search is better again.

hypothesis.target(score_to_maximise, label="for multi-objective optimisation")

  • chance of finding a bug improves over time
  • known uses:
    • `target(abs(a-b)) < error_threshold`
    • number of elements in a collection, tasks in a queue, steps executed
    • mean or maximum runtime of a task (or both, if you use `label`)
    • compression ratio for data (perhaps per-algorithm or per-level)
    • `1 if was_valid_input else 0` (avoids filtering problems)
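
A self-contained sketch of the first pattern:

```python
import math

from hypothesis import given, target, strategies as st

@given(st.floats(min_value=1.0, max_value=1e6))
def test_sqrt_roundtrip_error_is_small(x):
    error = abs(math.sqrt(x) ** 2 - x)
    target(error)  # steer generation towards inputs with larger error
    assert error < 1e-4
```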

68 of 75

Coverage-guided fuzzing: Atheris

No targets? No problem - target “executed this line of code”!

Atheris is Google’s libFuzzer wrapper for Python

  • Designed to run a single function for hours or days
  • Great for C-extensions and native code
  • Hypothesis integrates well with traditional fuzzers
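
A sketch of that integration, via the .hypothesis.fuzz_one_input hook (Atheris usage per its README):

```python
import sys

import atheris

from hypothesis import given, strategies as st

@given(st.text())
def test_encode_decode(s):
    assert s.encode("utf-8").decode("utf-8") == s

if __name__ == "__main__":
    # Any @given test exposes a bytes-in interface for external fuzzers
    atheris.Setup(sys.argv, test_encode_decode.hypothesis.fuzz_one_input)
    atheris.Fuzz()
```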

69 of 75

Coverage-guided fuzzing: HypoFuzz

HypoFuzz is Zac’s fuzzing engine for Hypothesis test suites

  • pure-python
  • target() and event()
  • runs all the tests at once

Better workflow integration, great database support*, etc.

*but not better than .fuzz_one_input 😇

70 of 75

Where to go for support

hypothesis.readthedocs.io/en/latest/support.html

We do not promise free support, but you can try:

  • StackOverflow
  • mailing list
  • IRC

If you have a support or training budget, email us!

71 of 75

Updating Hypothesis

We do continuous deployment - every PR is a new release

Update on the schedule that works for you, e.g. weekly, monthly, or to get a new feature or perf improvement

We take stability very seriously
...but you should still pin all your transitive dependencies

72 of 75

Exercises!

73 of 75

74 of 75

Q&A time

last chance before we wrap up

75 of 75

Thanks for coming!

Now go forth and test everything :-)