Test All The Data!
Eric J. Ma
BP Lightning Talks, June 2015
About Me
Just finished 4th year PhD, MIT Biological Engineering.
Self-taught Pythonista; my journey really began here at Boston Python, with one master named Giles.
About Me
I play with statistics and biological data, and I think you need to test your data.
Why?
Wat?
THE DATA DO NOT ALWAYS FOLLOW YOUR ASSUMPTIONS
Example
Data: one column needs to be log10-transformed.
Assumptions:
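A minimal sketch of what goes wrong when a log10 transform meets data that breaks its assumptions. The column name "column" and the values are made up for illustration; they are not from the talk's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two well-behaved values, one zero, one negative.
data = pd.DataFrame({"column": [10.0, 100.0, 0.0, -5.0]})

# log10 is only defined for positive numbers. Silence NumPy's warnings
# so the damage shows up in the result instead of on the console.
with np.errstate(divide="ignore", invalid="ignore"):
    transformed = np.log10(data["column"])

# Zeros silently become -inf and negatives silently become NaN.
n_neg_inf = int((transformed == -np.inf).sum())
n_nan = int(transformed.isna().sum())

print(transformed)
print("values turned into -inf:", n_neg_inf)
print("values turned into NaN:", n_nan)
```

Nothing raises an exception here, which is the point: the bad values only surface later unless a test catches them first.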
HOW?!
Step 1: Install pytest
$ pip install pytest
Step 2: Create your test script
$ touch test_data.py
$ pico test_data.py
Step 3: Make your script read data
import pandas as pd
data = pd.read_csv('data.csv')
Step 4: Write your test functions.
def test_column_is_correct():
    assert data['column'].dtype == float
    assert data['column'].min() > 0

def test_data_state_integrity():
    assert 'log10_col' not in data.columns
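Steps 3 and 4 together might look like the sketch below. To keep it runnable on its own, it reads from an in-memory string instead of data.csv; CSV_TEXT and load_data are hypothetical names introduced here, and the values are invented.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv so the sketch is self-contained.
CSV_TEXT = "column\n1.5\n20.0\n300.0\n"


def load_data():
    # In a real test script this would be pd.read_csv('data.csv').
    return pd.read_csv(io.StringIO(CSV_TEXT))


# Loaded once at module level so every test function can see it.
data = load_data()


def test_column_is_correct():
    # The column must be numeric and strictly positive before log10.
    assert data["column"].dtype == float
    assert data["column"].min() > 0


def test_data_state_integrity():
    # The transformed column must not exist yet in the raw data.
    assert "log10_col" not in data.columns
```

Saved as test_data.py, the functions prefixed with test_ are picked up by pytest automatically. Loading the data at module level is one simple choice; a pytest fixture would be another.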
Step 5: Run your tests
$ py.test
Step 6: Scream at your data provider
Rinse and repeat.
You’ll never have enough data tests!
t:@ericmjl w:ericmajinglong.com g:github.com/ericmjl