1 of 22

Property Testing Pandas with Bulwark

Zax Rosenberg, CFA

Senior Data Scientist @ SPINS

github.com/ZaxR

www.zaxrosenberg.com

2 of 22

Agenda

  • About Me
  • About Property Testing
  • About Bulwark
  • Demo
  • Contributing

3 of 22

About Me

  • Senior Data Scientist @ SPINS
  • Director of ChiPy’s Mentorship Program
  • Co-host of ChiPy’s Project Night
  • Blogger (zaxrosenberg.com/blog)
  • Speaker (github.com/ZaxR/talks)

4 of 22

About Me

  • NOT a testing expert

5 of 22

What is Property Testing?

Checking that some object has certain properties. For example:

some_list = [1, 2, 3, 4]

One property of some_list is that all of its values fall in the range 1 to 4. Another is that it’s mutable.
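As a rough illustration, the two properties above can be written as plain Python checks. The helper names values_in_range and is_mutable_sequence are made up for this sketch and are not part of any library.

```python
from collections.abc import MutableSequence

some_list = [1, 2, 3, 4]


def values_in_range(values, low, high):
    """Return True when every value lies in the closed interval [low, high]."""
    return all(low <= value <= high for value in values)


def is_mutable_sequence(obj):
    """Return True for list-like objects that support in-place modification."""
    return isinstance(obj, MutableSequence)


# Neither check needs the exact contents of the list in advance,
# only a statement about what any valid list must look like.
in_range = values_in_range(some_list, 1, 4)
mutable = is_mutable_sequence(some_list)
```

The same checks keep working if some_list later holds different values, which is the point of testing properties rather than exact data.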

6 of 22

Why is it valuable?

  • Testing reduces bugs/lowers development cost
  • We don’t always have the exact data up front
  • Property tests can be relatively fast to run
  • Easy to include domain knowledge

7 of 22

Introducing Bulwark

8 of 22

Introducing Bulwark

  • Bulwark is an open-source library that lets you easily property test pandas dataframes.
  • It’s designed to make it easier for data analysts and data scientists to test our own code.

9 of 22

Bulwark’s Design

  • Property tests are available as functions (“checks”) and decorators.
  • Each check:
    • Takes a pd.DataFrame and optional additional arguments,
    • Makes an assertion about the pd.DataFrame, and
    • Returns the original, unaltered pd.DataFrame.
  • A failed check raises an AssertionError, printing an informative message.
  • Each check has an automatically generated decorator counterpart, which lets you make your assertions outside the actual logic of your code. This is a core benefit of Bulwark.
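The design above can be sketched with pandas alone. This is not Bulwark's source code: has_no_nans, as_decorator, and add_b are hypothetical stand-ins written for this example, and the sketch assumes the decorator applies the check to the decorated function's return value.

```python
import functools

import pandas as pd


def has_no_nans(df, columns=None):
    """Assert the selected columns hold no missing values; return df unchanged.

    `columns` is expected to be a list of column names, or None for all columns.
    """
    subset = df if columns is None else df[columns]
    missing = subset.isna()
    if missing.to_numpy().any():
        # Collect (row label, column) pairs so the message says where it failed.
        bad = [
            (idx, col)
            for col in missing.columns
            for idx in missing.index[missing[col].to_numpy()]
        ]
        raise AssertionError(f"Missing values at (row, column): {bad}")
    return df


def as_decorator(check, **params):
    """Turn a check function into a decorator (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            check(result, **params)  # raises AssertionError on failure
            return result
        return wrapper
    return decorator


@as_decorator(has_no_nans)
def add_b(df):
    # The business logic stays free of assertions.
    return df.assign(b=[1.0, 2.0, None])


try:
    add_b(pd.DataFrame({"a": [1, 2, 3]}))
    message = None
except AssertionError as exc:
    message = str(exc)
```

Because the check returns the DataFrame it was given, the same function can also be chained with DataFrame.pipe.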

10 of 22

Quickstart - Input

[Code screenshot not preserved. Its annotations: “import convention”, “a check decorator”, and “uh oh...”]

11 of 22

Quickstart - Result

row index 2, column ‘b’ fails

12 of 22

What if I have multiple checks?

Pass a dict of check: param pairs.

13 of 22

What if I have multiple checks?

collects all the errors
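A minimal sketch of the idea, assuming a dict that maps each check function to its keyword arguments. The multi_check decorator and both checks are hypothetical and written only for this illustration; they are not Bulwark's implementation.

```python
import functools

import pandas as pd


def has_no_nans(df):
    """Assert that df holds no missing values; return df unchanged."""
    if df.isna().to_numpy().any():
        raise AssertionError("DataFrame contains missing values")
    return df


def is_shape(df, shape):
    """Assert that df has exactly the given (rows, columns) shape."""
    if df.shape != shape:
        raise AssertionError(f"Expected shape {shape}, got {df.shape}")
    return df


def multi_check(checks):
    """Run every check in a {check: params} dict and report all failures at once."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            errors = []
            for check, params in checks.items():
                try:
                    check(result, **params)
                except AssertionError as exc:
                    # Keep going so later checks still get a chance to run.
                    errors.append(f"{check.__name__}: {exc}")
            if errors:
                raise AssertionError("\n".join(errors))
            return result
        return wrapper
    return decorator


@multi_check({has_no_nans: {}, is_shape: {"shape": (2, 2)}})
def load():
    return pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})


try:
    load()
    report = ""
except AssertionError as exc:
    report = str(exc)
```

Collecting failures before raising means one run surfaces every broken property instead of only the first.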

14 of 22

What if I don’t want to raise errors?

prints instead of raises
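One way such a switch could work is sketched below. The checked decorator and its warn parameter are assumptions made for this example, modeled on the behavior described on this slide rather than copied from Bulwark.

```python
import contextlib
import functools
import io

import pandas as pd


def is_shape(df, shape):
    """Assert that df has exactly the given (rows, columns) shape."""
    if df.shape != shape:
        raise AssertionError(f"Expected shape {shape}, got {df.shape}")
    return df


def checked(check, warn=False, **params):
    """Apply a check to a function's result; print the failure when warn=True."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            try:
                check(result, **params)
            except AssertionError as exc:
                if not warn:
                    raise
                print(f"{check.__name__} failed: {exc}")
            return result
        return wrapper
    return decorator


@checked(is_shape, warn=True, shape=(2, 2))
def load():
    return pd.DataFrame({"a": [1, 2, 3]})


# Capture stdout only to show that the failure is printed, not raised.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    df = load()
printed = buffer.getvalue()
```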

15 of 22

What about when I go to production?

Pro tip: set a centralized config variable that toggles all decorators’ statuses.

[Code screenshot not preserved. Its annotations: “turn off this check” and “doesn’t raise an error”]
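The pro tip could look something like the sketch below. CHECKS_ENABLED, checked, and its enabled parameter are hypothetical names chosen for this example; in a real project the flag would more likely live in a settings module or an environment variable.

```python
import functools

import pandas as pd

# Hypothetical central switch: flip it once to change every decorated function.
CHECKS_ENABLED = False


def has_no_nans(df):
    """Assert that df holds no missing values; return df unchanged."""
    if df.isna().to_numpy().any():
        raise AssertionError("DataFrame contains missing values")
    return df


def checked(check, enabled=True, **params):
    """Apply a check to a function's result unless enabled is False."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if enabled:
                check(result, **params)
            return result
        return wrapper
    return decorator


@checked(has_no_nans, enabled=CHECKS_ENABLED)
def load():
    return pd.DataFrame({"a": [1.0, None]})


# With the switch off, the missing value passes through without an error.
df = load()
```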

16 of 22

Demo Time!

17 of 22

Where should I use Bulwark?

  • On ETL pipeline functions, especially the extract (E) and load (L) steps
    • Helps you understand the data up front, even if you don’t do full EDA
    • Acts as an integration test by checking output
  • In unit tests

18 of 22

How should I use Bulwark?

  • Favor the decorator version within core code
    • Lets you disable/switch to warnings
    • Separates checks from code/business logic
  • Use check version in unit tests
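For the unit-test case, a check that returns its input can be chained with DataFrame.pipe. The sketch below uses a hypothetical has_no_nans check and a made-up clean function, not Bulwark's own API.

```python
import pandas as pd


def has_no_nans(df):
    """Assert that df holds no missing values; return df unchanged."""
    if df.isna().to_numpy().any():
        raise AssertionError("DataFrame contains missing values")
    return df


def clean(df):
    """Function under test: drop rows with missing values."""
    return df.dropna().reset_index(drop=True)


def test_clean_removes_missing_values():
    raw = pd.DataFrame({"a": [1.0, None, 3.0]})
    # The check function is called directly, so a failure fails the test.
    result = clean(raw).pipe(has_no_nans)
    assert len(result) == 2


test_clean_removes_missing_values()
```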

19 of 22

When should I use Bulwark?

  • During development
  • Maybe at run time
    • If it’s acceptable for runs to fail
    • If the state should never be reached and you can’t handle the error

20 of 22

Who’s using Bulwark?

  • 6,123 total downloads
  • 256 downloads/month (excluding mirrors)
  • Folks in > 45 countries

21 of 22

Contributing is easy, too!

  • Very friendly to folks new to open source! Adding a new check is as easy as writing a single function.
  • Full instructions available at: https://bulwark.readthedocs.io/en/latest/contributing.html

22 of 22

Find out more