1 of 14

Test All The Data!

Eric J. Ma

BP Lightning Talks, June 2015

2 of 14

About Me

Just finished the 4th year of my PhD at MIT Biological Engineering.

Self-taught Pythonista; my journey really began here at Boston Python, with a master named Giles.

3 of 14

About Me

I play with statistics and biological data, and I think you need to test your data.

4 of 14

Why?

  1. You have data.
  2. You have assumptions about the data.
  3. You will modify the data, making more assumptions about the resulting data.

  • The data do not always follow your assumptions.

5 of 14

Wat?

THE DATA DO NOT ALWAYS FOLLOW YOUR ASSUMPTIONS

6 of 14

Example

Data: one column needs to be log10-transformed.

Assumptions:

  • dtype: int or float
  • greater than zero
  • a log10-transformed version of the column should not already exist in the original data file.
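
A minimal sketch of these three assumptions written as explicit checks. The in-memory DataFrame stands in for the real file, and the names `column`, `log10_col`, and `check_assumptions` are illustrative assumptions, not part of any library.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical stand-in for the real data file.
data = pd.DataFrame({"column": [1.0, 10.0, 100.0]})


def check_assumptions(df):
    """Return a list of violated assumptions (empty if all hold)."""
    problems = []
    col = df["column"]
    # Assumption 1: dtype is int or float.
    if not (is_integer_dtype(col) or is_float_dtype(col)):
        problems.append("column is not int or float")
    # Assumption 2: every value is greater than zero.
    elif not (col > 0).all():
        problems.append("column has values <= 0")
    # Assumption 3: the transformed column does not exist yet.
    if "log10_col" in df.columns:
        problems.append("log10_col already exists")
    return problems


problems = check_assumptions(data)
if not problems:
    # Only transform once the assumptions have been checked.
    data["log10_col"] = np.log10(data["column"])
```

Running the check a second time on the transformed frame reports that `log10_col` already exists, which is the kind of accidental double transform the third assumption guards against.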

7 of 14

HOW?!

8 of 14

Step 1: Install pytest

$ pip install pytest

9 of 14

Step 2: Create your test script

$ touch test_data.py

$ pico test_data.py

10 of 14

Step 3: Make your script read data

import pandas as pd

data = pd.read_csv('data.csv')

11 of 14

Step 4: Write your test functions.

def test_column_is_correct():
    assert data['column'].dtype == float
    assert data['column'].min() > 0

def test_data_state_integrity():
    assert 'log10_col' not in data.columns
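
The same pattern can cover the data after you modify it, since the transform adds assumptions of its own. This is a rough sketch only: the inline CSV replaces `data.csv` so the example runs by itself, and `load_and_transform` and both test names are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical inline CSV standing in for data.csv.
RAW_CSV = "column\n1.0\n10.0\n100.0\n"


def load_and_transform(source):
    """Read the data and add the log10-transformed column."""
    df = pd.read_csv(source)
    df["log10_col"] = np.log10(df["column"])
    return df


transformed = load_and_transform(io.StringIO(RAW_CSV))


def test_log10_col_is_finite():
    # Zeros or negatives in the input would show up here as -inf or NaN.
    assert np.isfinite(transformed["log10_col"]).all()


def test_log10_col_round_trips():
    # Undoing the transform should recover the original values.
    assert np.allclose(10 ** transformed["log10_col"], transformed["column"])
```

Saved in a file whose name starts with `test_`, these functions are collected by pytest alongside the tests on the raw data.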

12 of 14

Step 5: Run your tests

$ py.test

13 of 14

Step 6: Scream at your data provider

  • Actually, talk nicely :)
  • Discuss why the data test failed.
  • Fix data.
  • Test again.

Rinse and repeat.

You’ll never have enough data tests!

14 of 14

t:@ericmjl w:ericmajinglong.com g:github.com/ericmjl