1 of 10

J. Duncan

C. Singh

A. Agarwal

R. Kapoor

B. Yu

2 of 10

Veridical Flow

Yu & Kumbier, “Veridical Data Science” (2020)

A Python data science productivity package, heavily influenced by the PCS framework for the data science life-cycle (DSLC)

3 of 10

Coding PCS analysis from scratch is tedious

[Diagram: Data perturbation → Modeling → Evaluation, with branches multiplying at every step]

helpful but not enough

Data cleaning?

Choice of eval metric?

Interpretation?

Prediction screening?

Solution: collapse perturbations down to a single step

[Diagram: Data perturbations → Model perturbations → Evaluation]

Distributed computation, tracking, caching, and saving with minimal code

4 of 10

Main abstraction: Vset

A “veridical set” (Vset) is a set of arbitrary perturbations for some step in the DSLC.

Vset('my_vset', [func1, func2, ...])  # provide a list of perturbations

build_vset('my_vset', func, params)  # provide a function and parameter variations

Perturbations can be any callable object, or any object that implements one of the unified interface methods below.

Vsets provide a unified interface for data science:

resample_set(X_train, y_train) # call a Vset directly to apply functions to input

impute_set.transform(X_train) # e.g., fill missing values

model_set.fit(X_train, y_train) # fit all learners in a Vset

model_set.predict(X_test) # classification / regression

model_set.predict_proba(X_test) # predict class probabilities

eval_vset.evaluate(preds, y_test) # evaluate predictions
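The core idea behind a Vset — one call fans out over every perturbation and tracks results by name — can be sketched in plain Python. This is a toy stand-in for illustration only, not vflow's implementation; `ToyVset`, `double`, and `square` are invented names:

```python
# Toy sketch of the Vset idea (illustrative only, not vflow's implementation):
# one call fans out over every perturbation and keys results by name.
class ToyVset:
    def __init__(self, name, funcs):
        self.name = name
        self.funcs = {f.__name__: f for f in funcs}  # track perturbations by name

    def __call__(self, *args):
        # apply every perturbation to the same input
        return {k: f(*args) for k, f in self.funcs.items()}

def double(x):
    return [v * 2 for v in x]

def square(x):
    return [v * v for v in x]

perturb_set = ToyVset("toy_vset", [double, square])
print(perturb_set([1, 2, 3]))
# {'double': [2, 4, 6], 'square': [1, 4, 9]}
```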

5 of 10

Veridical Flow constructs an implicit pipeline

X_train, y_train = init_args(X_train, y_train)
X_val, y_val = init_args(X_val, y_val)
X_test, y_test = init_args(X_test, y_test)

resample_set = build_vset(resample, ...)
impute_set = Vset([mean_impute, median_impute], ...)
model_set = Vset([lr, dt, rf], ...)
eval_set = Vset([acc, auroc], ...)

X_trains, y_trains = resample_set(X_train, y_train)
X_trains = impute_set.fit(X_trains).transform(X_trains)
model_set.fit(X_trains, y_trains)
preds = model_set.predict_proba(impute_set(X_val))
metrics = eval_set.evaluate(y_val, preds)

[Pipeline graph: init → resample → impute → model → eval]
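The fan-out the implicit pipeline manages can be illustrated with toy counts (the perturbation names and sizes below are assumptions for illustration, not vflow internals): each downstream step runs once per combination of upstream perturbation choices, so branches multiply.

```python
from itertools import product

# Toy illustration of pipeline fan-out (not vflow internals): each step
# runs once per combination of upstream perturbation choices.
resamplers = ["boot0", "boot1"]              # 2 resampling perturbations
imputers = ["mean_impute", "median_impute"]  # 2 imputation perturbations
models = ["lr", "dt", "rf"]                  # 3 modeling perturbations

branches = list(product(resamplers, imputers, models))
print(len(branches))  # 2 * 2 * 3 = 12 fitted-model branches
```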

6 of 10

Inspect metrics and perturbation intervals

df_metrics = dict_to_df(metrics)

perturbation_stats(df_metrics, 'eval', 'impute')
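What a perturbation summary computes can be sketched with plain Python (toy metric values and names, not vflow's `dict_to_df` / `perturbation_stats`): group a metric by one perturbation axis and report its spread across the others.

```python
import statistics

# Toy stand-in for perturbation summaries (illustrative values, not vflow
# internals): group accuracy by the imputation perturbation and report
# mean and stdev across resamples.
metrics = {
    ("mean_impute", "boot0"): 0.81,
    ("mean_impute", "boot1"): 0.79,
    ("median_impute", "boot0"): 0.85,
    ("median_impute", "boot1"): 0.83,
}

by_impute = {}
for (impute, _resample), acc in metrics.items():
    by_impute.setdefault(impute, []).append(acc)

stats = {k: (statistics.mean(v), statistics.stdev(v)) for k, v in by_impute.items()}
for name, (mean, sd) in stats.items():
    print(f"{name}: mean={mean:.3f} sd={sd:.3f}")
```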

7 of 10

Prediction screening

best_impute_set, best_model_set = \
    filter_vset_by_metric(metrics, impute_set, model_set, ...)

X_trainval = best_impute_set.fit(X_trainval).transform(X_trainval)
best_model_set.fit(X_trainval, y_trainval)
test_preds = best_model_set.predict(best_impute_set.transform(X_test))
eval_set.evaluate(y_test, test_preds)

[Pipeline graph: init → resample → impute → model → eval, with screened best_impute → best_model branch feeding the final test evaluation]
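The screening step can be sketched as a filter on validation metrics (a toy stand-in, not vflow's `filter_vset_by_metric`; the metric values, `knn_impute` name, and threshold are assumptions for illustration):

```python
# Toy sketch of prediction screening (illustrative, not vflow internals):
# drop perturbations whose validation metric misses a threshold before
# any test-set evaluation happens.
val_acc = {"mean_impute": 0.78, "median_impute": 0.86, "knn_impute": 0.84}
threshold = 0.80  # screening cutoff (assumed for illustration)

best = {name: acc for name, acc in val_acc.items() if acc >= threshold}
print(sorted(best))  # ['knn_impute', 'median_impute']
```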

8 of 10

Computation, tracking, & sharing

Distributed compute on heterogeneous resources

Hyperparam tracking + model saving

Caching
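Why caching pays off in a perturbation pipeline can be shown with a toy sketch using `functools.lru_cache` (vflow's own caching layer is a separate feature, not shown here; `fit_model` and its arguments are invented for illustration):

```python
from functools import lru_cache

# Toy caching sketch: repeated perturbation branches reuse an
# already-computed fit instead of recomputing it.
fit_count = {"n": 0}

@lru_cache(maxsize=None)
def fit_model(resample_seed, imputer):
    fit_count["n"] += 1  # counts only cache misses (real work)
    return f"model[{imputer}@{resample_seed}]"

requests = [(0, "mean"), (1, "mean"), (0, "mean"), (1, "mean"), (0, "mean")]
for seed, imp in requests:
    fit_model(seed, imp)

print(fit_count["n"])  # 2 real fits for 5 requests
```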

9 of 10

How vflow incorporates PCS

P - makes evaluation simple, provides prediction screening

C - distributed computation, caching, tracking, saving

S - varies over large sets of perturbations; utilities to compute perturbation intervals

10 of 10

Thank you!